Getting Faster Writes with Riak

While preparing for an upcoming presentation about Riak for the Columbus Ruby Brigade, I wrote a simple data loader. When I initially ran the load, it took about 4 minutes to load the data on the worst run. When you’re waiting to test your data load and write a presentation, 4 minutes is an eternity. Needless to say, I got frustrated pretty quickly with the speed of my data loader, so I hit up the Riak channel on IRC and started digging into the Ruby driver’s source code.

The Results

              user     system      total        real
defaults 63.660000   3.270000  66.930000 (166.475535)
dw => 1  50.940000   2.720000  53.660000 (128.470094)
dw => 0  52.350000   2.740000  55.090000 (120.151827)
n => 2   52.850000   2.790000  55.640000 (132.023310)

The Defaults

Our default load uses no customizations. Riak is going to write data to three nodes in the cluster (n = 3). Since we’re using the default configuration, we can safely assume that Riak will use quorum for write confirmation (w = n/2 + 1). Finally, we can also assume that the durable write value is to use a quorum, since that’s the default for riak-client.

Because we’re writing to n (3) nodes and we’re waiting for w (2) nodes to respond, writes were slower than I’d like. Thankfully, Riak makes it easy to tune how it will respond to writes.
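
To make the quorum arithmetic concrete, here is a minimal Ruby sketch of the calculation. The `quorum` helper is a name I made up for illustration; it is not part of riak-client.

```ruby
# Quorum as described above: more than half of the N replicas.
# Integer division is intentional (3 / 2 == 1 in Ruby).
def quorum(n_val)
  n_val / 2 + 1
end

[1, 2, 3, 5].each do |n|
  puts "n = #{n}: a quorum write waits for #{quorum(n)} replica(s)"
end
```

With the default N value of 3 this gives 2, which is the w (2) mentioned above.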

Changing the N Value

The first change that we can make is to the N value (replication factor). Lowering the N value should yield a huge improvement on my test machine – Riak is only on one of my hard drives. Even solid state drives can only write to one place at a time. When we create the bucket we can change the bucket’s properties and set the N value. Note: it’s important that you set bucket properties when you ‘create’ the bucket. Buckets are created when keys are added to them and they are deleted when the last key is deleted.

b1 = client.bucket('animals_dw1',
                   :keys => false)
b1.props = { :n_val => 1, :dw => 1 }

In this chunk of code we set the N value to 1 and set the durable writes to 1. This means that only 1 replica will have to commit the record to durable storage in order for the write to be considered a success.

On the bright side, this approach is considerably faster. Here’s the bummer: by setting the N value to 1, we’ve removed any hope of durability from our cluster – the data will never be replicated. Any server failure will result in data loss. For our testing purposes, it’s okay because we’re trying to see how fast we can make things, not how safe we can make them.

How much faster? Our run with all defaults enabled took 166 seconds. Only writing to 1 replica shaved 38 seconds off of our write time. The other thing that I changed was setting returnbody to false. By default, the Ruby Riak client will return the object that was saved. Turning this setting off should make things faster – fewer bytes are flying around the network.

Forget About Durability

What happens when we turn down durability? That’s the dw => 0 result in the table at the beginning of the article. We get an 8 second performance boost over our last load.

What did we change? We set both the dw and w parameters to 0. This means that our client has told Riak that we’re not going to wait for a response from any replicas before deciding that a write has succeeded. This is a fire and forget write – we’re passing data as quickly as possible to the client and to hell with the consequences.

So, by eliminating any redundancy, not asking the database to return the saved record, and refusing to wait for any acknowledgement from the server, we’re able to get a 46.3 second performance improvement over our default values. This is impressive, but it’s roughly akin to throwing our data at a bucket, not into the bucket.

What if I Care About My Data?

What if you care about your data? After all, we got our performance improvement from setting the number of replicas to 1 and turning off write acknowledgement. The fourth and final run looked at what would happen if we kept the number of replicas at a quorum (an N value of 2) and ignored write responses. If we’re just streaming data into a database, we may not care if a record gets missed here and there. It turns out that this is only slightly slower than running with scissors. It takes 132 seconds to write the data: only about 4 seconds slower than with durable writes set to 1 and still nearly 34.5 seconds faster than using the defaults.
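
As a quick way to double-check the differences quoted above, here is a small Ruby sketch that recomputes them from the "real" column of the results table. The `saved` helper and the hash layout are just for illustration.

```ruby
# Wall-clock ("real") seconds copied from the results table above.
REAL_SECONDS = {
  'defaults' => 166.475535,
  'dw => 1'  => 128.470094,
  'dw => 0'  => 120.151827,
  'n => 2'   => 132.023310
}.freeze

# Seconds saved by moving from one run to another, rounded to one decimal.
def saved(from, to)
  (REAL_SECONDS[from] - REAL_SECONDS[to]).round(1)
end

puts "defaults -> dw => 1: #{saved('defaults', 'dw => 1')}s saved"
puts "dw => 1  -> dw => 0: #{saved('dw => 1', 'dw => 0')}s saved"
puts "defaults -> dw => 0: #{saved('defaults', 'dw => 0')}s saved"
puts "defaults -> n => 2:  #{saved('defaults', 'n => 2')}s saved"
```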

The most recent version of this sample code can be found on github at https://github.com/peschkaj/riak_intro. Sample data was located through Infochimps.com. The data load uses the Taxobox data set.

A Technical Plan for 2011

The Last Ten Years

My career has been particularly interesting. I’ve been very fortunate to work with a variety of different languages, platforms, databases, frameworks, and people. I started off working with Perl on HP-UX. As I started automating more of my job, I added ASP.NET to the mix. Eventually I learned about databases, first with Oracle, then SQL Server, then with PostgreSQL, and finally back to SQL Server. Along the way I’ve held a variety of different job titles – system administrator, system engineer, developer, consultant, architect, and database administrator.

I’ve worked with a lot of different systems, architectures, and design philosophies. The one thing that’s stuck with me is that there is no one size fits all answer. That extends beyond languages and design patterns – it goes right down to the way we’re storing data. One of the most interesting things going on right now is that it’s easier than ever to pick the right tools for the job.

The Next Twelve Months

Over the next twelve months, I’m going to be digging into hybrid database solutions. Some people call it polyglot persistence. You can call it what you like, but the fact remains that it is no longer necessary to store all of our data in a relational database. Frankly, I’m encouraging people to look into ways to store their data outside of a relational database. Not because RDBMSes are bad or wrong, but because there is a lot of data that doesn’t need to be in a relational database – session data, cached data, and flexible data.

Why Focus on Hybrid Data?

The idea behind hybrid data is that we use multiple databases instead of one database. Let’s say that we have an online store where we sell musical equipment. We want to store customer data in a relational database, but where should we store the rest of our information? Conventional thinking says that we should keep storing our data in a relational database. Sessions might be stored in memory somewhere and shopping carts might get stored in the database, but they’ll end up on faster, more expensive, solid state drives.

There are other ways to store data.

Let’s think about all of this for a minute. Why do we force our databases into existing paradigms? Why aren’t we thinking about new and interesting ways to store our data?

Sessions are a great place to start. Sure, we could use something like memcached, but why not examine Redis or App Fabric Cache? Both of these databases have support for strongly typed data. They both allow the data to be persisted to disk, if needed, and they allow for data to be expired over time. This is perfect for working with any kind of cached data – it stays in a format our applications need but we can expire it or save it as needed.

The flexibility to store our data the way that applications use it is important. Session data should be rapidly accessible. Other applications don’t need to read it. It doesn’t need to be reportable. It merely needs to be fast.
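
To illustrate what "expire it or save it as needed" means for session data, here is a toy Ruby sketch of an expiring key/value store. It is an in-memory stand-in that I made up for this example, not a Redis or App Fabric client, and the `ExpiringStore` name, the injectable clock, and the session key are all assumptions.

```ruby
# Toy expiring cache. The clock is injectable so expiry can be shown
# without actually sleeping.
class ExpiringStore
  def initialize(clock = -> { Time.now.to_f })
    @clock = clock
    @entries = {}
  end

  # Store a value that should disappear after ttl_seconds.
  def set(key, value, ttl_seconds)
    @entries[key] = [value, @clock.call + ttl_seconds]
  end

  # Return the value, or nil if it is missing or has expired.
  def get(key)
    value, expires_at = @entries[key]
    return nil if expires_at.nil?

    if @clock.call >= expires_at
      @entries.delete(key)
      return nil
    end
    value
  end
end

fake_now = 0.0
sessions = ExpiringStore.new(-> { fake_now })
sessions.set('session:abc', { 'user_id' => 7 }, 30)
puts sessions.get('session:abc').inspect  # still live

fake_now = 31.0                           # pretend 31 seconds have passed
puts sessions.get('session:abc').inspect  # expired
```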

Shopping carts are different. Amazon’s own use cases and needs drove the development of Dynamo to be a durable, eventually consistent, distributed key-value store. Shopping carts are write heavy environments. It’s rare that users need to view everything that’s in a shopping cart, but they need to be able to review it quickly when the time comes. Likewise, when the time comes to review a shopping cart, any delay or slowdown means there’s a chance the user will simply abandon the cart. Dynamo fills these requirements quite well.

Since Dynamo is only available inside of Amazon, how are we supposed to work with it ourselves? Riak is a clone of Dynamo that meets our need for a shopping cart. It’s a key/value database; it’s fault tolerant, and it’s fast.

Why not store a shopping cart in a relational database? It is, after all, a pretty simple collection of a user identifier, an item number, an item description, price, and quantity. Shopping carts are highly transient. Once an order has been placed, the shopping cart is cleared out and the data in the cart moves into the ordering system. Most shopping carts will be active for a very short period of time – a matter of minutes at most. Over their short lives shopping carts will almost entirely be written to and only read a few times. Instead of building complex sharding mechanisms to spread load out across a number of database servers, why not use a database designed to handle large load spread across a number of servers?
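
As a rough sketch of the key/value approach, the whole cart can live as one value under one key. The `CartStore` class below is hypothetical and uses a plain Hash as a stand-in for a store like Riak, so it runs without a cluster; the key format and item fields are made up, and it ignores the conflict handling a real distributed store would need for concurrent updates.

```ruby
require 'json'

# Stand-in for a key/value store: one opaque JSON value per cart key.
class CartStore
  def initialize
    @data = {}
  end

  # Adding an item is a read-modify-write of a single value.
  def add_item(user_id, item)
    cart = fetch(user_id)
    cart << item
    @data[key_for(user_id)] = JSON.generate(cart)
  end

  # Reviewing the cart is a single key lookup.
  def fetch(user_id)
    raw = @data[key_for(user_id)]
    raw ? JSON.parse(raw) : []
  end

  # Once the order has been placed, the cart is simply deleted.
  def clear(user_id)
    @data.delete(key_for(user_id))
  end

  private

  def key_for(user_id)
    "cart:#{user_id}"
  end
end

carts = CartStore.new
carts.add_item(42, { 'item_number' => 'GTR-100', 'description' => 'Electric guitar',
                     'price' => 499.0, 'quantity' => 1 })
carts.add_item(42, { 'item_number' => 'STR-010', 'description' => 'Strings',
                     'price' => 7.5, 'quantity' => 3 })
puts carts.fetch(42).length
```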

Where Does This Fit Into the Enterprise?

Enterprises should be adopting these technologies as fast as they can. Not because they are replacing the relational database, but because they free the relational database from things it’s bad at and leave it to perform tasks that it excels at. Relational databases are great for core business logic – they have a lot of baked in functionality like data integrity and validation. As we’ve already discussed, relational databases are not well suited to storing highly volatile data.

By moving volatile data into better suited types of database, enterprises can increase the capacity of their database systems, provide redundancy, and increase scalability by using off the shelf solutions. The trick, of course, lies in integrating them. And that is what I’m going to be playing around with this year.

Three Mistakes I Made With MongoDB

When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I started progressing through working with MongoDB, I started running into more and more problems. They weren’t problems with MongoDB, they were problems with the way I was thinking about MongoDB.

I Know How Stuff Works

The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it’s a good thing. It gets tricky when you start to switch around your frame of reference.

As a former DBA and alleged database expert, I’m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it’s a heck of a lot like an RDBMS (sorry guys, that’s just the truth of it). It has collections, that look and act a lot like tables. There are indexes, there’s a query engine, there’s even replication. There are even more features and functionality that map really well between MongoDB and relational databases. It’s close enough that it’s painless to make a switch. The paradigms don’t match up exactly, but there are similarities.

Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy was I ever wrong. I’ve had to stop thinking that I know anything about MongoDB and look up the answers to questions, rather than assume I know anything. It’s been frustrating, but it’s also been educational.

I Know How Data is Modeled

It’s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers. Once you get used to normalization, it’s easy to fall into that pattern.

Document databases, like MongoDB, don’t work to their full potential when you normalize your data. Document databases, in my experience, do the opposite; they work best when the data is stored as a document. Document data is similar to the way data is used in the application: child data is stored with a parent. An order’s line items are just a child collection of the order; there is no OrderHeader and OrderDetails table. Years of working within the same set of rules made it easy to slip into old habits.

Let’s say I have a number of user created documents in a Documents collection. Documents have an author as well as a set of tags created by the author. In a relational database, we’d have something like:

CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(30) NOT NULL,
    password VARCHAR(50) NOT NULL
);

CREATE TABLE documents (
    document_id INT PRIMARY KEY,
    title VARCHAR(30) NOT NULL,
    body VARCHAR(MAX) NOT NULL,
    user_id INT NOT NULL REFERENCES users(user_id)
);

CREATE TABLE tags (
   tag_id INT PRIMARY KEY,
   name VARCHAR(30) NOT NULL
);

CREATE TABLE document_tags (
   document_id INT NOT NULL,
   tag_id INT NOT NULL
);

For a lot of things, this makes perfect sense. With MongoDB, we’d have a documents collection and a users collection. We might have a separate tags collection as well, but that would be used as an inverted index for searching. The users collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn’t store only a pointer to the appropriate record in the users collection. The appropriate thing to do would be to cache the data locally as well as store a pointer to the users collection. The same holds true for the document’s tags – why create a join construct between the two collections when we can store all of the tag data we need in the appropriate document? If we decide that we need to find documents by their tags we have two choices:

  1. Create an index on the document.tags property
  2. Create a tags inverted index.

If you’re interested in indexes, the MongoDB documentation on the subject is a great place to start. There is a specific kind of index called a multikey that allows you to index arrays of values. Inverted indexes are interesting and I covered them while talking about building secondary indexes in Riak.

With a document database we’re trying to minimize the number of reads that we’re performing at any given time. A document should be a logical construct of whatever application entity we’re saving. A document would be a record of the document at the time it was saved – there will be cached information from the user, the document and its associated metadata, and a list of tags.
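
Here is a rough Ruby sketch of what such documents might look like, along with the tags inverted index from the second option above. Everything in it is illustrative: the field names, the sample values, and the plain Ruby hashes standing in for MongoDB collections are my assumptions, not MongoDB or MongoMapper API calls.

```ruby
# Illustrative documents: field names and values are made up for this sketch.
documents = [
  {
    '_id'    => 1,
    'title'  => 'Faster writes',
    # Cached copy of the author's data plus a pointer back to the users collection.
    'author' => { 'user_id' => 10, 'username' => 'alice' },
    # Tags are embedded in the document instead of joined through document_tags.
    'tags'   => %w[riak performance]
  },
  {
    '_id'    => 2,
    'title'  => 'Modeling documents',
    'author' => { 'user_id' => 11, 'username' => 'bob' },
    'tags'   => %w[mongodb modeling performance]
  }
]

# A tags inverted index: each tag maps to the ids of the documents carrying it.
tag_index = Hash.new { |index, tag| index[tag] = [] }
documents.each do |doc|
  doc['tags'].each { |tag| tag_index[tag] << doc['_id'] }
end

puts tag_index['performance'].inspect
```

Rendering a document with its author and tags takes one read, and finding documents by tag is one lookup in the index followed by fetching the matching ids.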

Of course, since we’re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don’t like it, feel free to comment about it.

I Can Just Drop This In

Despite appearances and claims to the contrary, MongoDB is not a drop in replacement for an RDBMS. A relational database provides a phenomenal number of features for free – indexing, declarative referential integrity, transaction support, multi-version concurrency control, and multi-statement transactions to name a few. These features come with a price. Likewise, MongoDB provides a different set of features and they also have a price.

Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used MongoMapper to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easy. Using an O/R-M made it feel like I was using a relational database. It made things trickier, especially when I ran into some of the blurry places I’ve already talked about. In hindsight, I should have used the stock drivers, built my own abstractions, and then replaced them when it became necessary. I don’t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself for the first time and then replace it when you’re building more plumbing than functionality.

Would I Use MongoDB Again?

Sure, if the project needed it. I don’t believe that MongoDB is a drop in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that way does a disservice to both MongoDB and the RDBMS: they both have their own strengths and weaknesses.

What You’re Missing About Stateless Computing

Once, long ago, men carved their knowledge into the walls of caves so that it would be available for all time. Unfortunately, their knowledge was tied to one place. Eventually, a forward thinking cave dweller thought about carving his knowledge into a clay tablet. He was savagely beaten to death and his family burned as witches. Eventually the other cave dwellers realized that it was probably a good idea to have a more portable way to store their knowledge and they too adopted this portable clay-based knowledge transfer system.

Fast forward to the tail end of 2010 and people are saying that Microsoft has got it wrong with the Azure VM role. People are already lambasting it as a laughable concept that’s needlessly complex to patch. I can’t argue with that, it is complex to patch, but there’s a reason for that complexity and it’s called stateless computing.

It’s like the switch from procedural/object-oriented programming to functional programming. When you first switch, you get pissed off that you can’t reassign variables and that functions can’t have side effects. You get used to that pretty quickly and start doing crazy things with tail recursion and other functional paradigms that ultimately save you memory, remove debugging headaches, and give you an incredible amount of computing stability.

With the last paragraph in mind, let’s look at Azure VMs again – we can’t patch the VM directly. It’s stateless. What does a stateless VM buy you?

  • It’s easy to spin up additional, identical VMs. There’s no worrying if some master image is the same: it is.
  • It’s easy to back out incompatible patches – just remove the differencing VHD.
  • There are no side effects because of errant garbage living on the C: drive.
  • Security – if a virus infects your VM, just reboot.
  • Complex, time consuming patches can be applied once and quickly moved into place.
  • There’s a load balancer in front of every Azure instance. Operations must be idempotent, even when executed against different instances.

Managing state isn’t a component of your operating system in Azure, it’s a component of the storage tier. New paradigms require new ways of thinking. Sometimes a new way of thinking seems broken, wrong, or foolish.

If you’re looking to customize your Azure deployment stack without sacrificing the flexibility of using Azure, then Azure VM roles are for you.

If you’re looking for a replacement for your current VMware installation, Microsoft’s Azure VM roles aren’t for you. But while you’re fiddling around with VM settings, I’m going to be playing Scrabble.

Querying Hive with Toad for Cloud Databases

I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance.

This 7 minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive.

Querying HBase with Toad for Cloud Databases

We recently released a new version of Toad for Cloud Databases as an Eclipse plug-in. While this functionality has been in Toad for Cloud since the last release, this video shows Toad for Cloud running in Eclipse and demonstrates some basic querying against HBase using Toad for Cloud’s ability to translate between ANSI compliant SQL code and native database calls.

[Embedded video: Querying HBase with Toad for Cloud Databases]

If the video above doesn’t work for you, everything is available full size at NoSQLPedia – Querying HBase.

New Uses for NoSQL

We all know that you can use NoSQL databases to store data. And that’s cool, right? After all, NoSQL databases can be massively distributed, are redundant, and really, really fast. But some of the things that make NoSQL databases really interesting aren’t just the redundancy, performance, or their ability to use all of those old servers in the closet. Under the covers, NoSQL databases are supported by complex code that makes these features possible – things like distributed file systems.

What’s a Brackup?

Brackup is a backup tool. There are a lot of backup tools on the market, what makes this one special?

First, it’s free.

Second, it’s open source; which means it’s always going to be free.

Third, it can chunk your files – files will be crammed into chunks for faster access and distributed across your backup servers. Did you know that opening a filehandle is one of the single most expensive things you can ever do in programming?

Fourth, it supports different backends.

It Can Backup to Riak

I’ve mentioned Riak a few times around here. Quick summary: Riak is a distributed key-value database.

So?

So, this means that when you take a backup, Brackup is going to split your data into different chunks. These chunks are going to be sent to the backup location. In this case, the backup location is going to be your Riak cluster. As Brackup goes along and does its work, it sends the chunks off to Riak.

Unlike sending your data to an FTP server or Amazon S3, it’s going to get magically replicated in the background by Riak. If you lose a backup server, it’s not a big deal because Riak will have replicated that data across multiple servers in the cluster. Backing up your backups just got a lot easier.

Why Is the NoSQL Part Important?

NoSQL can be used for different things. It’s not just a potential replacement for an RDBMS (and the beginning of another nerd holy war). Depending on the data store and your purpose, you can use a NoSQL database for a lot of different things – most notably as a distributed file system. This saves time and money since you don’t have to buy a special purpose product; you can use what’s already there.

Comparing MongoDB and SQL Server Replication

MongoDB has replication built in. So do SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What’s the difference? What makes MongoDB a unique and special snowflake?

I recently read a three part series on MongoDB replication (Replication Internals, Getting to Know Your Oplog, Bending the Oplog to Your Will) in an effort to better understand MongoDB’s replication compared to SQL Server’s replication.

Logging Sidebar

Before we get started, it’s important to distinguish between the oplog and MongoDB’s regular log. By default, MongoDB pipes its log to STDOUT… unless you supply the --logpath command line flag. Logging to STDOUT is fine for development, but you’ll want to make sure you log to a file for production use. The MongoDB log file is not like SQL Server’s log. It isn’t used for recovery playback. It’s an activity log. Sort of like the logs for your web server.

What’s The Same?

Both MongoDB and SQL Server store replicated data in a central repository. SQL Server stores transactions to be replicated in the distribution database. MongoDB stores replicated writes in the oplog collection. The most immediate difference between the two mechanisms is that SQL Server uses the transaction as the demarcation point while MongoDB uses the individual command as the demarcation point.

All of our transactions (MongoDB has transactions… they’re just only applied to a single command) are logged. That log is used to ship commands over to a subscriber. Both SQL Server and MongoDB support having multiple subscribers to a single database. In MongoDB, this is referred to as a replica set – every member of the set will receive all of the commands from the master. MongoDB adds some additional features: any member of a replica set may be promoted to the master server if the original master server dies. This can be configured to happen automatically.

The Ouroboros

The Ouroboros is a mythical creature that devours its own tail. Like the Ouroboros, the MongoDB oplog devours its own tail. In ideal circumstances, this isn’t a problem. The oplog will happily write away. The replica servers will happily read away and, in general, keep up with the writing to the oplog.

The oplog file is a fixed size so, like the write ahead log in most RDBMSes, it will begin to eat itself again. This is fine… most of the time.

Unfortunately, if the replicas fall far enough behind, the oplog will overwrite the transactions that the replicas are reading. Yes, you read that correctly – your database will overwrite undistributed transactions. DBAs will most likely recoil in horror. Why is this bad? Well, under extreme circumstances you may have no integrity.
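
The failure mode is easier to see with a toy model. The Ruby sketch below is a made-up fixed-size circular log with a replica that has fallen behind; it only illustrates the "eats its own tail" behavior described above and is not how MongoDB actually implements the oplog. The `CircularLog` class and its methods are hypothetical names.

```ruby
# A toy fixed-size, circular log of operations.
class CircularLog
  attr_reader :next_seq # sequence number the next write will receive

  def initialize(capacity)
    @capacity = capacity
    @slots = Array.new(capacity)
    @next_seq = 0
  end

  # Writes wrap around and overwrite the oldest slot once the log is full.
  def write(op)
    @slots[@next_seq % @capacity] = op
    @next_seq += 1
  end

  # Oldest sequence number that has not been overwritten yet.
  def oldest_seq
    [@next_seq - @capacity, 0].max
  end

  # Returns the operation at seq, or raises if the log has already eaten it.
  def read(seq)
    raise IndexError, "op #{seq} was overwritten" if seq < oldest_seq
    raise IndexError, "op #{seq} not written yet" if seq >= @next_seq

    @slots[seq % @capacity]
  end
end

log = CircularLog.new(3)
replica_position = 0 # the replica has not read anything yet

# The master keeps writing while the replica is stalled.
5.times { |i| log.write("insert #{i}") }

begin
  log.read(replica_position)
rescue IndexError => e
  puts "replica must resync: #{e.message}"
end
```

A replica that keeps its position at or above `oldest_seq` can always catch up; one that falls below it has lost operations for good and needs a full resync.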

Let’s repeat that, just in case you missed it the first time:

There is no guarantee of replica integrity.

Now, before you put on your angry pants and look at SQL Server Books Online to prove me wrong, this is also entirely possible with transactional replication in SQL Server. It’s a little bit different, but the principle still applies. When you set up transactional replication in SQL Server, you also need to set up a retention period. If your replication is down for longer than X hours, SQL Server is going to tell you to cram it up your backside and rebuild your replication from scratch.

Falling Behind

Falling behind is easy to do when a server is under heavy load. But, since MongoDB avoids writing to disk to increase performance, that’s not a problem, right?

Theoretically yes. In reality that’s not always the case.

When servers are under a heavy load, a lot of weird things can happen. Heavy network traffic can result in TCP/IP offloading – the network card can offload work to the CPU. When you’re using commodity hardware with commodity storage, you might be using software RAID instead of hardware RAID to simulate one giant drive for data. Software RAID can be computationally expensive, especially if you encounter a situation where you start swapping to disk. Before you know it, you have a perfect storm of one off factors that have brought your shiny new server to its knees.

In the process, your oplog is happily writing away. The replica is falling further behind because you’re reading from your replica and writing to the master (that’s what we’re supposed to do, after all). Soon enough, your replicas are out of sync and you’ve lost data.

Falling Off a Cliff

Unfortunately, in this scenario, you might have problems recovering because the full resync also uses a circular oplog to determine where to start up replication again. The only way you could resolve this nightmare storm would be to shut down your forward facing application, kill incoming requests, and bring the database back online slowly and carefully.

Stopping I/O from incoming writes will make it easy for the replicas to catch up to the master and perform any shard reallocation that you need to split the load up more effectively.

Climbing Gear, Please

I’ve bitched a lot in this article about MongoDB’s replication. As a former DBA, it’s a scary model. But I’ve bitched a lot in the past about SQL Server’s transactional replication – logs can grow out of control if a subscriber falls behind or dies – but it happens with good reason. The SQL Server dev team made the assumption that a replica should be consistent with the master. In order to keep a replica consistent, all of the undistributed commands need to be kept somewhere (in a log file) until all of the subscribers/replicas can be brought up to speed. This does result in a massive hit to your disk usage, but it also keeps your replicated databases in sync with the master.

Just like with MongoDB, there are times when a SQL Server subscriber may fall so far behind that you need to rebuild the replication. This is never an easy choice, no matter which platform you’re using, and it’s a decision that should not be taken lightly. MongoDB makes this choice a bit easier because MongoDB might very well eat its own oplog. Once that happens, you have no choice but to rebuild replication.

Replication is hard to administer and hard to get right. Be careful and proceed with caution, no matter what your platform.

At Least There is a Ladder

You can climb out of this hole and, realistically, it’s not that bad of a hole. In specific circumstances you may end up in a situation where you will have to take the front end application offline in order to resync your replicas. It’s not the best option, but at least there is a solution.

Every feature has a trade off. Relational databases trade performance for integrity (in this case) whereas MongoDB gains immediate performance at the cost of potential maintenance and recovery problems.

Open Sourcing Sawzall – What Does It Mean?

For Data Analytics or automotive modification, you will find no finer tool.

While perusing twitter, I saw that Google has open sourced Sawzall, one of their internal tools for data processing. WTF does this mean?

Sawzall, WTF?

Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall?

Sawzall is a procedural language for analyzing excessively large data sets. When I say “excessively large data sets”, think Google Voice logs, utility meter readings, or the network traffic logs for the Chicago Public Library. You could also think of anything where you’re going to be crunching a lot of data over the course of many hours on your monster Dell R910 SQL Server.

There’s a lengthy paper about how Sawzall works, but I’ll summarize it really quickly. If you really want to read up on all the internal Sawzall goodness, you can check it out on Google code – Interpreting the Data: Parallel Analysis with Sawzall.

Spell It Out for Me

At its most basic, Sawzall is a MapReduce engine, although the Google documentation goes to great pains to not use the word MapReduce, so maybe it’s not actually MapReduce. It smells oddly like MapReduce to me.

I’ll go into more depth on the ideas behind MapReduce in the future, but here are the basics of MapReduce as far as Sawzall is concerned:

  1. Data is split into partitions.
  2. Each partition is filtered. (This is the Map.)
  3. The results of the filtering operation are used by an aggregation phase. (This is the Reduce.)
  4. The results of the aggregation are saved to a file.
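
The four steps above can be sketched in a few lines of plain Ruby. This is not Sawzall and not a real MapReduce framework: the records, the partition size, and the "total bytes per path" aggregation are all made up for illustration, and everything runs in a single process.

```ruby
# Made-up log records standing in for a large data set.
records = [
  { 'path' => '/index',   'bytes' => 120 },
  { 'path' => '/about',   'bytes' => 80 },
  { 'path' => '/index',   'bytes' => 200 },
  { 'path' => '/missing', 'bytes' => 0 },
  { 'path' => '/about',   'bytes' => 20 }
]

# 1. Split the data into partitions.
partitions = records.each_slice(2).to_a

# 2. Filter each partition independently (the Map). Each surviving record
#    emits a [key, value] pair; empty responses are dropped.
emitted = partitions.map do |partition|
  partition.select { |r| r['bytes'] > 0 }.map { |r| [r['path'], r['bytes']] }
end

# 3. Aggregate the emitted pairs (the Reduce): total bytes per path.
totals = Hash.new(0)
emitted.flatten(1).each { |path, bytes| totals[path] += bytes }

# 4. Save the results (built as a string here rather than written to a file).
output = totals.sort.map { |path, bytes| "#{path}\t#{bytes}" }.join("\n")
puts output
```

Because step 2 looks at one record at a time and never shares state between partitions, each partition could be handed to a different machine, which is where the parallelism comes from.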

It’s pretty simple. That simplicity makes it possible to massively parallelize the analysis of data. If you’re in the RDBMS world, think Vertica, SQL Server Parallel Data Warehouse, or Oracle Exadata. If you are already entrenched and in love with NoSQL, you already know all about MapReduce and probably think I’m an idiot for dumbing it down so much.

The upside to Sawzall’s approach is that rather than write a Map program and a Reduce program and a job driver and maybe some kind of intermediate aggregator, you just write a single program in the Sawzall language and compile it.

… And Then?

I don’t think anyone is sure, yet. One of the problems with internal tools is that they’re part of a larger stack. Sawzall is part of Google’s internal infrastructure. It may emit compiled code, but how do we go about making use of those compiled programs in our own applications? Your answer is better than mine, most likely.

Sawzall uses something called Protocol Buffers – PB is a cross language way to efficiently move objects and data around between programs. It looks like Twitter is already using Protocol Buffers for some of their data storage needs, so it might only be a matter of time before they adopt Sawzall – or before some blogger opines that they might adopt Sawzall ;) .

So far nobody has a working implementation of Sawzall running on top of any MapReduce implementations – Hadoop, for instance. At a cursory glance, it seems like Sawzall could be used in Hadoop Streaming jobs. In fact, Section 10 of the Sawzall paper seems to point out that Sawzall is a record by record analytical language – your aggregator needs to be smart enough to handle the filtered records.

Why Do I Need Another Language?

This is a damn good question. I don’t program as much as I used to, but I can reasonably write code in C#, Ruby, JavaScript, numerous SQL dialects, and Java. I can read and understand at least twice as many languages. What’s the point of another language?

One advantage of a special purpose language is that you don’t have to worry about shoehorning domain specific functionality into existing language constructs. You’re free to write the language the way it needs to be written. You can achieve a wonderful brevity by baking features into the language. Custom languages let developers focus on the problems at hand and ignore implementation details.

What Now?

You could download the code from the Google Code repository, compile it, and start playing around with it. It should be pretty easy to get up and running on Linux systems. OS X developers should look at these instructions from a helpful Hacker News reader. Windows developers should install Linux on a VM or buy a Mac.

Outside of downloading and installing Sawzall yourself to see what the fuss is about, the key is to keep watching the sky and see what happens.

CloudDBPedia is Changing Its Name to NoSQLPedia

That’s right, we’re changing the name from CloudDBPedia to NoSQLPedia.

When we originally started the site, the focus in the technology world was on cloud computing and cloud databases. Over time, the industry has changed and people have started focusing on NoSQL as the terminology of choice for not-so-relational databases. And, let’s face it, NotSoRelationalDBPedia.com is a bit lengthy for quick and easy typing.

The name change doesn’t reflect any change in focus for the site. We’re still going to be bringing you the best in blogging and community driven technical content about emerging database technology. As time goes on, we’ll be adding more and more into the mix and I think that you’ll be happy about what we’ve got in the pipeline.
