Introduction to Riak … TONIGHT!

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak tonight at 6:30PM!

There will be pizza and soda and Ruby and me. You can even stick around afterwards while we all go next door for drinks (you can buy my Diet Cokes all night if you really like the presentation).

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

Protecting Your Content – Copyright, Licensing, and You

Why Should I Worry About Licensing?

You probably just have a blog, or maybe you haven’t even started blogging yet. Maybe you’re just sharing your thoughts on Facebook Notes or Google Pages. However you look at it, you’re probably certain that you don’t need to worry about your writing on the web. It is, after all, your content.

Think again.

In all fairness, Facebook’s terms of use are some of the more consumer-friendly terms of service out there. Facebook does not claim copyright over your content, but using Facebook immediately grants Facebook “a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (‘IP License’)”. Basically, Facebook can use any picture or thought you’ve posted at any time with no notice to you. Their ability to use your content continues until you delete the content or your account – and even then the license survives if you’ve shared your content with others who haven’t deleted it.

Google’s terms of service are not as friendly as Facebook’s. To start with, Google’s terms use legal language while Facebook’s terms are in something reasonably close to plain English. Google’s terms get worse from there. Instead of allowing you to remove your content by permanently deleting it (and all copies), Google’s terms state that you’re giving them the right to use your thoughts until the end of time, or until you stop using all of Google’s services (see sections 11 and 13.2). Both companies’ terms of service contain my favorite legal provision: the terms are subject to change at any time. In short, if you aren’t hosting your own content, you don’t own it. Not completely. You can claim you’re copyrighting it, but someone else can use it because you’ve implicitly given them permission, and that permission may change.

Back to my earlier question: why should you worry about licensing? You should worry about licensing because you want to be able to control your own content. There’s nothing that stops a third party from changing the terms of service to require their permission if you republish something. If you wanted to republish a blog post on another site, syndicate your content, or print something you wrote in a book you could suddenly find yourself in a legal mess. What if you don’t want pictures of yourself, your friends, or your children to appear in ads?

Licensing comes down to control over your content and maintaining that control into the future. If you want to keep control, you need to examine the license that you have chosen for your content. This doesn’t just apply to the written word; it applies to your presentations, your photographs, and your code samples.

Written Licensing

What’s the best way to protect content you’ve written? That all depends on how you want people to be able to use and re-use your content.

The Well Worn Path: Copyrighting Your Work

The strictest way to protect your content is to copyright it. Stanford University has compiled a great list of copyright resources, and it’s important to understand the rights around your work. A copyrighted work doesn’t need to be marked as such, but marking it will make it much easier to enforce your copyright. Very few people actively want to steal your work; by including a copyright notice with contact information, you make it easier for other authors to track you down and get your permission to use part of your work. The best part is that, because of international treaties, there is very little difference in copyright law between countries.

Keep in mind that copyrighting your work does not prevent others from reusing portions of your work under fair use principles. Fair use is a tricky thing and is subject to some vague criteria. If you aren’t competing with the original author, aren’t copying wholesale, are building a new work, and aren’t motivated by a desire for commercial gain, you’re on your way to falling under fair use rules. When in doubt, ask the original author for permission. If you can’t find the original author, find an attorney.

While there are some nuances to copyright law, it is fairly straightforward. You mark your work as copyrighted and that’s it. Others can make use of portions of your work under fair use guidelines and they should ask for permission, but it isn’t strictly necessary.

Flexible Designs for the Future: Creative Commons

A Creative Commons license is, on the surface, not so different from traditional copyright. It’s a more flexible copyright. Rather than have a single, restrictive agreement between the copyright holder and the rest of the world, the Creative Commons license makes it easy for copyright holders to expressly allow certain behaviors.

It all boils down to a few questions:

  • Do you want to allow commercial use of your work?
  • Do you want to allow adaptations of your work?
  • Do you want the terms of your license to carry forward to works that build on yours?

Saying “no” to any of these questions doesn’t prevent anyone from using your work in those ways; they just need to obtain your express permission first. A Creative Commons license page clearly explains the terms of a particular license, making it very easy for readers and other authors to learn how they can or cannot use your work.

One of the most important aspects of the Creative Commons license is the ability to require future authors to share alike. Adding the share alike provision to your Creative Commons license requires future collaborators to distribute their derivative work under the same license; your work and all work that builds on it will always be available under the same terms you envisioned when you created the content.

Learn more.

The Public Domain

I’ll admit it freely: when I started writing this, I didn’t know a whole lot about how the public domain worked and what it meant. I knew that everything in the public domain was free and couldn’t be taken under someone else’s control, but I didn’t know much more than that.

A work enters the public domain when the intellectual property rights on the work expire or when those rights are forfeited. Basically, I can take anything I’ve previously written and decree that my work is now in the public domain. It belongs to everyone at that point. Work that has entered the public domain is free for anyone else to build upon. In many ways, the public domain is crucial for the advancement of science and the arts. It makes it possible to build on earlier works, to examine and expand upon the work of Isaac Newton or to re-arrange a symphony to be performed by kazoos and barking dogs. Works in the public domain carry no restrictions on their use.

Unfortunately, the definition of public domain varies from country to country so there’s no reliable guarantee or best guess that you can make about how something can be used or re-used, even if the author states their work is in the public domain. When in doubt, consult an attorney (or Google).

The biggest thing to remember about putting your own work in the public domain is that it’s out there for anyone to use and re-use. A less scrupulous person could collect your blog posts, arrange them into a coherent narrative, and then publish it as a book. They could also make as many changes as they wanted and there would be nothing you could do to correct the situation.

Software Licensing

Why should we even be talking about software licensing? Software licensing is important if you want to release software for people to use, or even if you want to put sample code on your blog for others to re-use. Of course, you could state in your blog’s copyright that all of your source code is covered under the same restrictive copyright as the rest of your blog, but where’s the fun in that?

Proprietary Software

This is software that is exclusively licensed by the copyright holder. The copyright holder says “here, you can use this because you gave me money, but you have to abide by these rules.” After which they drop a license document the size of a phone book on your desk with an invoice stapled to the top.

So it’s not really like that. How does it work?

With proprietary software, the copyright holder grants you the right to use their software under certain conditions – you can’t modify it, sell it on to your buddies, or reverse engineer it to make your own version. License terms vary from vendor to vendor. Some are incredibly permissive and others are very strict. It’s important to look at your software license if you’re ever in doubt about what you can or can’t do with your software.

Likewise, if you’re going to be creating software, you need to be aware of what the terms of your license mean. Commercial software is best licensed under a proprietary license. After all, if I can download and compile your source code free of charge, why should I pay you for your software?

Open Source Software

There are some people who will take issue and say that I should call this section Free and Open Source Software (FOSS) or Free, Libre, and Open Source Software (FLOSS). To these people I say, “get your own blog!”

Open Source Software (OSS) is a contentious area of software. In practice, OSS is software that is released under a specific license and the source code is distributed with the software. In fact, OSS usually comes as nothing but source code with a license attached. Helpful people often provide compiled versions of the software for various hardware and software platforms.

One of the greatest strengths of open source software is that future users of the software are given specific rights that are normally reserved for copyright holders. This is a lot like Creative Commons licensing in some ways. There are far too many open source licenses to examine them in any detail, so be sure to do your homework if you ever need to choose one.

Now, why should you choose an open source license? I choose to release my demo code under an open source license because I want people to be able to use it, re-use it, and feel free to contribute back. Demo code should stand on its own, but it’s important to remember that your demo code is part of your reputation – keep it safe.

Some good options for open source licenses include the Apache License, the MIT License, and the LGPL. Make sure you read the licenses and understand them before using them. Some licenses have more provisions than others, some restrict future commercial use, and some have almost no provisions at all (the MIT and BSD licenses are like this).

Public Domain

The public domain isn’t specific to the written word, art, and music – software can be placed in the public domain as well. The same legal ramifications apply to software released into the public domain. One of the more famous pieces of public domain software is SQLite.

Why Should I Be Worried About This?

Anyone worried about maintaining control of their own work should be worried about copyright and software licensing. Maintaining control of your work and how it can be distributed is an important part of producing content. If you want to be permissive about how your work is used, you can grant rights to people in advance through the Creative Commons or through open source licensing. If you want people to request permission, you can use stricter copyright requirements and proprietary software licensing. There are many choices available.

Database Restores – Where’s my Transaction Log Backup?

Developers! DBAs! Has this ever happened to you?

Surprise! It's a database migration error!

You’re chugging along on a Friday night getting ready for your weekend deployment. Your 2 liter of Shasta is ice cold, you have your all Rush mix tape, and you’re wearing tube socks with khakis. Things are looking up. You open up your deployment script. You’re confident because you’ve tested it in the QA environment and everything worked. You press F5 and lean back in your chair, confident that the script is going to fly through all of the changes. Suddenly, there’s an error and you’re choking in surprise on Shasta.

In an ideal world, you could pull out your trusty log backups and do a point in time restore, right? What if you’ve never taken a transaction log backup? What if you only have full database backups? Can you still recover from this situation? The answer, thankfully, is yes.

Let’s break something!

USE ftgu;
GO

-- at midnight, we took our initial backup
BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-1.bak'
GO

-- customer data from the business is inserted
-- more customer data is inserted

-- some kind of migration goes here

-- insert a bad value
INSERT INTO Bins (Shelf, Bin)
VALUES ('B', 9)
GO

SELECT GETDATE();

SELECT * FROM Bins WHERE Shelf = 'B' ORDER BY BinID DESC;
GO

-- wait for a bit
WAITFOR DELAY '00:01:00';
GO

-- do something dumb
DELETE p 
FROM Products p
JOIN Bins b ON p.BinID = b.BinID
WHERE b.Shelf = 'B';

DELETE FROM Bins WHERE Shelf = 'B'
GO

SELECT GETDATE();
GO

We have a starting backup, no t-log backups, and we’ve gone and deleted some important data from the production database. How do we get it back? If we restored the database from our first backup we might lose a lot of data. Who knows when the last database backup was taken? Oh, midnight. So, in this case, we’d lose a day of data. Well, bugger. In a panic, we save the state of our broken database.

-- ack!
BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-2.bak';

And then we realize that we also need our transaction log:

-- ah crap, I need to back up my log to get point in time recovery!
BACKUP LOG ftgu
TO DISK = 'C:\ftgu-log-1.trn';
GO

Here’s the kicker – the transaction log has never been backed up. (In my experience, this is all too common.) This database has been running for a week or a year or three years without any kind of transaction log backups. We’re screwed right? I mean, wouldn’t we have to apply all of the transactions from the log to the very first full backup we have? No.

Let’s get started and restore our last good backup. We still have our backup of the broken database (the one with the missing data), just in case we need it for some reason.

-- switch to master (need to make sure nobody else is using that database)
USE master;
GO

-- restore the last full backup with known good data
-- make sure to specify NORECOVERY so we can 
-- apply our transaction log backup
RESTORE DATABASE ftgu
FROM DISK = 'C:\ftgu-1.bak'
WITH REPLACE, NORECOVERY;
GO

SQL Server is cunning and records the log sequence number (LSN) from the last full backup (technically it’s the start and end LSN of that full backup). If we have a log backup that covers the relevant LSNs, we’re good to go. Since our transaction log was never backed up before today, it was never truncated either – the log backup we just took contains every transaction since the full backup, so we’re safe.

We’re going to use something called point in time recovery – restoring the log backup and stopping just before things went wrong:

-- restore the log backup until right before we started
-- this is called "point in time recovery"
RESTORE LOG ftgu
FROM DISK = 'C:\ftgu-log-1.trn'
WITH STOPAT = '2011-02-13 10:03:55.653';

Even though we never took a transaction log backup before today, we’re able to take a backup and recover from what initially seemed like a bad situation.

Introduction to Riak – Next Monday

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak next Monday, February 21, at 6:30PM.

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

SQL Saturday 60 Resources

SQL Saturday 60 was a week ago and I completely failed to post resources from the presentation in a timely manner.

The SQL Server Internals resources have been available for a while: http://facility9.com/resources/sql-server-internals… You just had to know to look for them.

The Modeling Muddy Data talk is available on GitHub: https://github.com/peschkaj/Muddy-Data. This presentation is released under a Creative Commons Attribution-ShareAlike license which means that we can all make things better by collaborating on the presentation materials. I’ll slowly be adding more information to the write up of the talk that is in the README.

Introduction to Riak at Columbus Ruby Brigade

I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak on February 21 at 6:30PM.

Riak: An Overview

This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

Add it to your calendar!

Querying Riak – Key Filters and MapReduce

A while back we talked about getting faster writes with Riak. Since then, I’ve been quiet on the Riak front. Let’s take a look at how we can get data out of Riak, especially since I went to great pains to throw all of that data into Riak as fast as my little laptop could manage.

Key filtering is a new feature in Riak that makes it much easier to restrict queries to a subset of the data. Prior to Riak 0.13, it was necessary to write MapReduce jobs that would scan through all of the keys in a bucket. The problem is that the MapReduce jobs end up loading both the key and the value into memory. If we have a lot of data, this can cause a huge performance hit. Instead of loading all of the data, key filtering lets us look at the keys themselves. We’re pre-processing the data before we get to our actual query. This is good because 1) software should do as little as possible and 2) Riak doesn’t have secondary indexing to make querying faster.

Here’s how it works: Riak holds all keys in memory, but the data remains on disk. The key filtering code scans the keys in memory on the nodes in our cluster. If any keys match our criteria, Riak will pass them along to any map phases that are waiting down the pipe. I’ve written the sample code in Ruby but this functionality is available through any client.

The Code

We’re using data loaded with load_animal_data.rb. The test script itself can be found in mr_filter.rb. Once again, we’re using the taxoboxes data set.

The Results

             user     system      total        real
mr       0.060000   0.030000   0.090000 ( 20.580278)
filter   0.000000   0.000000   0.000000 (  0.797387)

MapReduce

First, the MapReduce query:

{"inputs":"animals",
 "query":[{"map":{"language":"javascript",
                  "keep":false,
                  "source":"function(o) { if (o.key.indexOf('spider') != -1) return [1]; else return []; }"}},
          {"reduce":{"language":"javascript",
                     "keep":true,
                     "name":"Riak.reduceSum"}}]}

We’re going to iterate over every key value pair in the animals bucket and look for a key that contains the word ‘spider’. Once we find that key, we’re going to return a single element array containing the number 1. Once the map phase is done, we use the built-in function Riak.reduceSum to give us a sum of the values from the previous map phase. We’re generating a count of the records that match our data – how many spiders do we really have?
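
Under the hood, every client library is just POSTing this JSON document to Riak’s /mapred HTTP endpoint. Here’s a minimal Ruby sketch of doing that by hand, assuming a default node listening on 127.0.0.1:8098; the real test script is mr_filter.rb.

require 'net/http'
require 'json'

# build the same MapReduce job as a Ruby hash
job = {
  'inputs' => 'animals',
  'query'  => [
    { 'map'    => { 'language' => 'javascript',
                    'keep'     => false,
                    'source'   => "function(o) { if (o.key.indexOf('spider') != -1) return [1]; else return []; }" } },
    { 'reduce' => { 'language' => 'javascript',
                    'keep'     => true,
                    'name'     => 'Riak.reduceSum' } }
  ]
}

# POST the job to the MapReduce endpoint on a default local node
http = Net::HTTP.new('127.0.0.1', 8098)
response = http.post('/mapred', job.to_json, 'Content-Type' => 'application/json')

puts response.body   # the reduce output, e.g. [3]

The key filter job in the next section can be submitted the same way – only the JSON document changes.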

Key Filtering

The key filtering query doesn’t look that much different:

{"inputs":{"bucket":"animals",
           "key_filters":[["matches","spider"]]},
 "query":[{"map":{"language":"javascript",
                  "keep":false,
                  "source":"function(o) { return [1]; }"}},
          {"reduce":{"language":"javascript",
                     "keep":true,
                     "name":"Riak.reduceSum"}}]}

It’s not that much different – the map query has been greatly simplified to just return [1] on success, and the search criteria have been moved into the inputs portion of the query. The big difference is in the performance: the key filter query is 26 times faster.

This is a simple example, but a 26x improvement is nothing to scoff at. What it really means is that the rest of our MapReduce needs to work on a smaller subset of the data which, ultimately, makes things faster for us.

A Different Way to Model Data

Now that we have our querying basics out of the way, let’s look at this problem from a different perspective; let’s say we’re tracking stock performance over time. In a relational database we might have a number of tables, notably a table to track stocks and a table to track daily_trade_volume. Theoretically, we could do the same thing in Riak with some success, but it would incur a lot of overhead. Instead we can use a natural key to locate our data. Depending on how we want to store the data, this could look something like YYYY-MM-DD-ticker_symbol. I’ve created a script to load data from stock exchange data. For my tests, I only loaded the data for stocks that began with Q. There’s a lot of data in this data set, so I kept things to a minimum in order to make this quick.

Since our data also contains the stock exchange identifier, we could even go one step further and include the exchange in our key. That would be helpful if we were querying based on the exchange.
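
To make the key scheme concrete, here’s a rough riak-client sketch of storing one day of trading data under such a key. The bucket name, key, and field values here are made up for illustration – the real loader is the script mentioned above – but stock_volume matches the field the map phase reads later.

require 'riak'

client = Riak::Client.new                 # defaults to a local node
stocks = client.bucket('stocks')          # illustrative bucket name

# key follows the YYYY-MM-DD-ticker_symbol convention
daily = stocks.new('2010-02-12-QTM')
daily.data = { 'stock_symbol' => 'QTM',   # illustrative fields and values
               'stock_volume' => 4608300 }
daily.store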

If you take a look at mr_stocks.rb you’ll see that we’re setting up a query to filter stocks by the symbol QTM and then aggregate the total trade volume by month. The map phase creates a single cell array with the stock volume traded in the month and returns it. We use the Riak.mapValuesJson function to map the raw data coming in from Riak to a proper JavaScript object. We then get the month that we’re looking at by parsing the key. This is easy enough to do because we have a well-defined key format.

function(o, keyData, arg) {
  var data = Riak.mapValuesJson(o)[0];
  var month = o.key.split('-').slice(0,2).join('-');
  var obj = {};
  obj[month] = data.stock_volume;
  return [ obj ];
}

If we were to look at this output we would see a lot of rows of unaggregated data. While that is interesting, we want to look at trending for stock trades for QTM over all time. To do this we create a reduce function that will sum up the output of the map function. This is some pretty self-explanatory JavaScript:

function(values, arg) {
  // values is an array of objects from the map phase, each shaped like
  // { "2010-02": 12345 }; fold them all into one object keyed by month
  return [ values.reduce(function(acc, item) {
             for (var month in item) {
               if (acc[month]) { acc[month] += parseInt(item[month]); }
               else { acc[month] = parseInt(item[month]); }
             }
             return acc;
           })
  ];
}

Okay, so that might not actually be as self-explanatory as anyone would like. The JavaScript reduce method is a relatively recent addition to the language. It accumulates a single result (the acc variable) across all elements in the array. You could use this to get a sum, an average, or whatever you want.

One other thing to note is that we use parseInt. We probably don’t have to use it, but it’s a good idea. Why? Riak is not aware of our data structures. We just store arrays of bytes in Riak – it could be a picture, it could be text, it could be a gzipped file – Riak doesn’t care. JavaScript only knows that it’s a string. So, when we want to do mathematical operations on our data, it’s probably wise to use parseInt and parseFloat.

Where to Now?

Right now you probably have a lot of data loaded. You have a couple of options. There are two scripts on github to remove the stock data and the animal data from your Riak cluster. That’s a pretty boring option. What can you learn from deleting your data and shutting down your Riak cluster? Not a whole lot.

You should open up mr_stocks.rb and take a look at how it works. It should be pretty easy to modify the map and reduce functions to output total trade volume for the month, average volume per day, and average price per day. Give it a shot and see what you come up with.

If you have questions or run into problems, you can hit up the comments, the Riak Developer Mailing List, or the #riak IRC room on irc.freenode.net if you need immediate, real-time help with your problem.

Data Durability

A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods of writing data to a replay log on disk as well as caching writes in memory. Whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write ahead log is that we can always recover from a crash (so long as we have the log files, of course).

How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code will have to call the method to force the sync.
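
For example, with the Ruby driver of this era, asking for acknowledgement from two replica set members might look roughly like this – the database, collection, and document are made up, and it assumes the driver’s :safe option accepts a :w value:

require 'mongo'

connection = Mongo::Connection.new('localhost', 27017)
events     = connection.db('demo').collection('events')

# :safe => { :w => 2 } makes the driver issue getLastError({w: 2}) after the
# insert, so the write isn't reported as successful until two members of the
# replica set have acknowledged it
events.insert({ 'action' => 'page_view' }, :safe => { :w => 2 })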

Setting a w-value for writes is something that was mentioned in Getting Faster Writes with Riak, although in that article we were decreasing durability to increase write performance. In Amazon Dynamo-inspired systems, writes are not considered complete until multiple replicas have responded. The advantage is that durable replication is enforced at the database, and clients have to explicitly elect to use weaker guarantees for their data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo-inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time read/write random-access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – there’s a commit log (RDBMS people will know this as a write-ahead log). Essentially, when a write comes in from a client, the write is first written to the commit log (which is stored using HDFS), then it’s written in memory and when the in-memory structure fills up, that structure is flushed to the filesystem. Here’s something cunning: since the commit log is being written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down it’s easy enough to recover from – that region server’s commit log is split apart and distributed to other region servers which then take up the load of the failed region server.

There are plenty of HBase details that have been grossly oversimplified or blatantly ignored here for the sake of brevity. Additional details can be found in HBase Architecture 101 – Storage as well as this Advanced HBase presentation. As HBase is inspired by Google’s Bigtable, additional information can be found in Chang et al.’s Bigtable: A Distributed Storage System for Structured Data and The Google File System.

Interestingly enough, there is a proposed feature for PostgreSQL 9.1 to add synchronous replication. Current replication in PostgreSQL is more like asynchronous database mirroring in SQL Server, or the default replica set write scenario with MongoDB. Synchronous replication makes it possible to ensure that data is being written to every node in the RDBMS cluster. Robert Haas discusses some of the pros and cons of replication in PostgreSQL in his post What Kind of Replication Do You Need?.

Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability are baked into Azure at the filesystem level. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team has put together an excellent overview. Needless to say, Microsoft has implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow for data to be seamlessly geographically replicated.

Even SQL Azure, Microsoft’s cloud based relational database, takes advantage of this replication. In SQL Azure when a row is written in the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team are addressing single server durability concerns.

Why is single server durability important for any database? Aside from guaranteeing that data is correct in the event of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.

Goals for 2011 – Early Update

It’s a bit early to be updating my goals for 2011, but I’m really excited about this one. Over the course of last week, I wrote an article about loading data into Riak. I had a brief conversation with Mark Phillips (blog | twitter) about adding some of the code to the Riak function contrib.

This is where a sane person would say “Yeah, sure Mark, do whatever you want with my code.” Instead I said something like “I’d be happy to share. How about I make a generic tool?” About 40 minutes later I had a working chunk of code. Thirty minutes after that I had refactored the code into a driver and a library. I wrote up some documentation and sent everything off to be included in the main Riak function contrib repository. A couple of days and a documentation correction later, you can now see my first code contribution to the open source world on the internet: Importing YAML.

While I’m really excited about this, and it’s very important to me, there’s more to take away from this than just “Yay, I did something!” We’re all able to give something back to our community. In this case I took code I had written to perform benchmarks and extracted a useful piece of demonstration code from it. Share your knowledge with the world around you – it’s how we get smarter.

Getting Faster Writes with Riak

While preparing for an upcoming presentation about Riak for the Columbus Ruby Brigade, I wrote a simple data loader. When I initially ran the load, it took about 4 minutes to load the data on the worst run. When you’re waiting to test your data load and write a presentation, 4 minutes is an eternity. Needless to say, I got frustrated pretty quickly with the speed of my data loader, so I hit up the Riak channel on IRC and started digging into the Ruby driver’s source code.

The Results

              user     system      total        real
defaults 63.660000   3.270000  66.930000 (166.475535)
dw => 1  50.940000   2.720000  53.660000 (128.470094)
dw => 0  52.350000   2.740000  55.090000 (120.151827)
n => 2   52.850000   2.790000  55.640000 (132.023310)

The Defaults

Our default load uses no customizations. Riak is going to write data to three nodes in the cluster (n = 3). Since we’re using the default configuration, we can safely assume that Riak will use quorum for write confirmation (w = n/2 + 1). Finally, we can also assume that the durable write value is to use a quorum, since that’s the default for riak-client.

Because we’re writing to n (3) nodes and we’re waiting for w (2) nodes to respond, writes were slower than I’d like. Thankfully, Riak makes it easy to tune how it will respond to writes.

Changing the N Value

The first change that we can make is to the N value (replication factor). Lowering the N value should make a big difference on my test machine – Riak is only on one of my hard drives, and even solid state drives can only write to one place at a time. When we create the bucket we can change the bucket’s properties and set the N value. Note: it’s important that you set bucket properties when you ‘create’ the bucket. Buckets are created when keys are added to them and they are deleted when the last key is deleted.

b1 = client.bucket('animals_dw1',
                   :keys => false)
b1.props = { :n_val => 1, :dw => 1 }

In this chunk of code we set the N value to 1 and set the durable writes to 1. This means that only 1 replica will have to commit the record to durable storage in order for the write to be considered a success.

On the bright side, this approach is considerably faster. Here’s the bummer: by setting the N value to 1, we’ve removed any hope of durability from our cluster – the data will never be replicated. Any server failure will result in data loss. For our testing purposes, it’s okay because we’re trying to see how fast we can make things, not how safe we can make them.

How much faster? Our run with all defaults enabled took 166 seconds. Only writing to 1 replica shaved 38 seconds off of our write time. The other thing that I changed was setting returnbody to false. By default, the Ruby Riak client will return the object that was saved. Turning this setting off should make things faster – fewer bytes are flying around the network.
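
In code, that means passing returnbody along with the store call – a sketch using the b1 bucket created above, with a made-up key and value, and assuming riak-client accepts :returnbody as a per-request option:

obj = b1.new('aardvark')               # made-up key
obj.data = { 'name' => 'aardvark' }    # made-up value
obj.store(:returnbody => false)        # don't send the stored object back over the wire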

Forget About Durability

What happens when we turn down durability? That’s the dw => 0 result in the table at the beginning of the article. We get an 8 second performance boost over our last load.

What did we change? We set both the dw and w parameters to 0. This means that our client has told Riak that we’re not going to wait for a response from any replicas before deciding that a write has succeeded. This is a fire and forget write – we’re passing data as quickly as possible to Riak and to hell with the consequences.
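
Expressed as code, the fire and forget write looks something like this sketch – the bucket, key, and value are made up, and it assumes riak-client passes :w and :dw through as per-request options:

require 'riak'

client = Riak::Client.new
b0 = client.bucket('animals_dw0')      # made-up bucket for this run
b0.props = { :n_val => 1 }             # single replica, as in the previous run

obj = b0.new('zebra')                  # made-up key and value
obj.data = { 'name' => 'zebra' }

# w => 0, dw => 0: don't wait for any replica to acknowledge the write,
# let alone commit it to durable storage, before returning to the client
obj.store(:w => 0, :dw => 0, :returnbody => false)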

So, by eliminating any redundancy, skipping the copy of the stored record that Riak would normally send back, and refusing to wait for any write acknowledgement from the server, we’re able to get a 46.3 second performance improvement over our default values. This is impressive, but it’s roughly akin to throwing our data at a bucket, not into the bucket.

What if I Care About My Data?

What if you care about your data? After all, we got our performance improvement from setting the number of replicas to 1 and turning off write acknowledgement. The fourth, and final run, that I performed took a look at what would happen if we kept the number of replicas at a quorum (an N value of 2) and ignored write responses. If we’re just streaming data into a database, we may not care if a record gets missed here and there. It turns out that this is only slightly slower than running with scissors. It takes 132 seconds to write the data; only 4 seconds slower than with durable writes set to 1 and still nearly 34.5 seconds faster than using the defaults.

The most recent version of this sample code can be found on github at https://github.com/peschkaj/riak_intro. Sample data was located through Infochimps.com. The data load uses the Taxobox data set.
