October 2010
Mon Tue Wed Thu Fri Sat Sun
« Sep   Nov »
 123
45678910
11121314151617
18192021222324
25262728293031

Month October 2010

Upcoming Talks – Next Week

Next week I’ll be in the San Francisco Bay area. More specifically, I’ll be giving three lightning talks at three separate Cloud Camps. It’s the same talk each time, I’ll be giving a general intro to NoSQL and cloud databases.

Silicon Valley Cloud Camp in Santa Clara, CA.
Cloud Camp Santa Clara also in Santa Clara, CA.
Cloud Camp SF @ QCon in San Francisco, CA.

All of these events start at 6:30PM, the Lightning Talks start at 6:45PM, and I have no idea when I’m going to take the stage, but it promises to be good. If you’re in the area and would like to hang out, hit me up in the comments and we can arrange a time to talk.

Testing for Performance

You know that you should be testing your code. You even know that you should be testing your SQL. But why? We need to make sure that changes to our code are safe, prevent regressions, and that we catch edge cases.

But are you testing your code for performance?

Are you testing for performance? You can bet these people are.

Are you testing for performance? You can bet these people are.

Changes to code can make your code faster or slower, depending on indexing as well as user defined functions and built-in functions. Different computations can result in different in different execution plans. If changes to your code can cause drastic changes to your application performance, why aren’t you monitoring the performance of your code?

Test frameworks, like T-SQL Unit, make it possible to wrap the execution of your stored procedures in other processes. By taking advantage of these hooks it’s possible to time the execution of each procedure and record the results in a table (possibly even correlating each run to the appropriate version from source control). You can see how query performance changes over time.

Testing your code is important – you can prevent changes from causing both logical and performance problems.

Foursquare and MongoDB: What If

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they’re focusing on the technical problems with the current set up and Foursquare. They’re picking it apart and looking at what is right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak

Riak is a massively scaleable key/value data store. It’s based on Amazon’s Dynamo. If you don’t want to read the paper, that just means that it uses some magic to make sure that data is evenly spread throughout the database cluster and that adding a new node makes it very easy to rebalance the cluster.

What would have happened at Foursquare if they had been using Riak?

Riak still suffers from the same performance characteristics around disk access as MongoDB – once you have to page to disk, operations become slow and throughput dries up. This is a common problem in any software that needs to access disks – disks are slow, RAM is fast.

Riak, however, has an interesting distinction. It allocates keys inside the cluster using something called a consistent hash ring – this is just a convenient way to rapidly allocate ranges of keys to different nodes within a cluster. The ring itself isn’t interesting. What’s exciting is that the ring is divided into partitions (64 by default). Every node in the system claims an equal share of those 64 partitions. In addition, each partition is replicated so there are always three copies of a give key/value pair at any time. Because there are multiple copies of the data, it’s unlikely that any single node will fail or become unbalanced. In theory, if Foursquare had used Riak it is very unlikely that we’ll run into a problem were a single node becomes full.

How would this consistent hash ring magical design choice have helped? Adding a new node causes the cluster to redistribute data in the background. The new node will claim and equal amount of space in the cluster and the data will be redistributed from the other nodes in the background. The same thing happens when a node fails, by the way. Riak also only stores keys in memory, not all of the data. So it’s possible to reference an astronomical amount before running out of RAM.

There’s no need to worry about replica sets, re-sharding, or promoting a new node to master. Once a node joins a riak cluster, it takes over its share of the load. As long as you have the network throughput on your local network (which you probably do), then this operation can be fairly quick and painless.

Cassandra

Cassandra, like Riak, is based on Amazon’s Dynamo. It’s a massively distributed data store. I suspect that if Foursquare had used Cassandra, they would have run into similar problems.

Cassandra makes use of range partitioning to distribute data within the cluster. The Foursquare database was keyed off of the user name which, in turn, saw abnormal growth because some groups of users were more active than others. Names also tend to clump around certain letters, especially when you’re limited to a Latin character set. I know a lot more Daves that I know Zachariahs, and I have a lot of friends whose names start with the letter L. This distribution causes data within the cluster to be overly allocated to one node. This would, ultimately, lead to the same problems that happened at Foursquare with their MongoDB installation.

That being said, it’s possible to use a random partitioner for the data in Cassandra. The random partitioner makes it very easy to add additional nodes and distribute data across them. The random partitioner comes with a price. It makes it impossible to do quick range slice queries in Cassandra – you can no longer say “I want to see all of the data for January 3rd, 2010 through January 8th, 2010”. Instead, you would need to build up custom indexes to support your querying and build batch processes to load the indexes. The tradeoffs between the random partitioner and the order preserving partitioner are covered very well in Dominic Williams’s article Cassandra: RandomPartitioner vs OrderPreservingPartitioner.

Careful use of the random partitioner and supporting batch operations could have prevented the outage that Foursquare saw, but this would have lead to different design challenges, some of which may have been difficult to overcome without resorting to a great deal of custom code.

HBase/HDFS/Hadoop

HBase is a distributed column-oriented database built on top of HDFS. It is based on Google’s Bigtable database as described in “Bigtable: A Distributed Storage System for Structured Data”. As an implementation of Bigtable, HBase has a number of advantages over MongoDB – write ahead logging, rich ad hoc querying, and redundancy

HBase is not going to suffer from the same node redistribution problems that caused the Foursquare outage. When it comes time to add a new node to the cluster data will be migrated in much larger chunks, one data file at a time. This makes it much easier to add a new data node and redistribute data across the network.

Just like a relational database, HBase is designed so that all data doesn’t need to reside in memory for good performance. The internal data structures are built in a way that makes it very easy to find data and return it to the client. Keys are held in memory and two levels of caching make sure that frequently used data will be in memory when a client requests it.

HBase also has the advantage of using a write ahead log. In the event of a drive failure, it is possible to recover from a backup of your HBase database and play back the log to make sure data is correct and consistent.

If all of this sounds like HBase is designed to be a replacement for an RDBMS, you would be close. HBase is a massively distributed database, just like Bigtable. As a result, data needs to be logged because there is a chance that a hardware node will fail and will need to be recovered. Because HBase is a column-oriented database, we need to be careful not to treat it exactly like a relational database, but

A Relational Database

To be honest, Foursquare’s data load would be trivial in any relational database. SQL Server, Oracle, MySQL, and PostgreSQL can all handle orders of magnitude more data than the 132GB of data that Foursquare was storing at the time of the outage. This begs the question “How we could handle the constant write load?” Foursquare is a write-intensive application.

Typically, in the relational database world, when you need to scale read and write loads we add more disks. There is a finite amount of space in a server chassis and these local disks don’t provide the redundancy necessary for data security and performance; software RAID is also CPU intensive and slow. A better solution is to purchase a dedicated storage device, either a SAN, NAS, or DAS. All of these devices offer read/write caching and can be configured with in a variety of RAID levels for performance and redundancy.

RDBMSes are known quantities – they are easy to scale to certain points. Judging by the amount of data that Foursquare reports to have, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time. The downside to this approach is that an RDBMS is much costlier per TB of storage (up to ten times more expensive) than using MongoDB, but if your business is your data, then it’s important to keep the data safe.

Conclusion

It’s difficult to say if a different database solution would have prevented the Foursquare outage. But it is a good opportunity to highlight how different data storage systems would respond in the same situation.

Thoughts on Free Amazon Web Services

Last week, Amazon announced that we could all get some free AWS if we signed up for a new account. Just what do you get for signing up? Take a look.

What you get with the AWS Free Usage Bundle

The First Catch

First off, this is only free for the first 12 months. After that you’re going to have to pay as you go.

In a way, this is like Microsoft’s BizSpark, but in the clouds. You get to ride a long for free, for a while. After that while, you’re going to have to pony up some money to keep riding. Nobody said it was free forever.

The Second Catch

Let’s say you start out with your free AWS account and you build an application to help manage your yarn collection. You write some code and you’re happily trucking along. One day you decide to share the link with a few friends who are also really into knitting. The next thing you know, everyone is using your yarn tracking website. This is great, right? It is great… until the bill shows up.

The free AWS only lasts for 12 months. Or until you exceed your service levels. Whichever comes first. Unless you have definite plans to monetize your service, you will need to carefully monitor your application load.

What Do You Really Get?

So what are you really getting for free? We all know that nothing is free, so what makes Amazon’s deal noteworthy?

The Server

You get a virtualized server with 613MB of memory. That barely sounds like enough to power your microwave, but in the world of Linux servers, that may be more than enough for your website. Even if it isn’t enough horsepower for your production application, it’s enough to develop in a realistic environment and deploy to your first round of beta users.

The Load Balancer

Once you’ve got more than a few users, you might need to move beyond that tiny 613MB server. Or you might shard your application out. Or you could use it to help you manage your fault tolerance. There are a lot of reasons to use a load balancer.

Amazon has been thinking of you and you get 750 hours of their load balancing service free. Per month. 750 hours is 31.25 days. You can’t use all the free that you get. It’s just that free.

There is a limit on the amount of free traffic that the load balancer will handle, but that should encourage you to keep your apps lean and mean, right?

Storage

Free storage! 10GB of Amazon Elastic Block Storage sounds like 10GB of free something or another that I don’t know about. Reading the documentation doesn’t clear this up. Basically, you get free magical storage that you can format however you want and attach wherever you want in your Amazon virtual server farm. You can even take point in time snapshots of these storage devices so you can revert your storage to a known good state. That’s pretty cool, eh? Try getting your SAN administrator to let you do that.

More free storage! Five gigs of Amazon S3 is a lot of S3 storage. I use S3 to host large files that I don’t want to upload through WordPress – video, PDFs, and presentations. It’s a great way to add a lot of hosting to your free or cheap hosting account. Likewise, it’s a great way to add storage to your free AWS instances and remove some load from your virtual server. On the down side, it does look like you’re going to pay for the traffic that your readers consume. But, hey, you already counted on that, right?

Even more free storage! 1GB of SimpleDB storage! SimpleDB is a distributed database with a few limitations. Despite the limitations, it’s a sold platform for developing web-based/cloud applications. You can access the database from just about anywhere and it should be available as long as Amazon’s servers are up and running. And you had better believe that Amazon is going to stay up and running.

Bandwidth

Surprise! You get 30GB of free traffic a month. Well, 15GB up and 15GB down. But it’s sort of the same. I think. Maybe? The point is that you get some free bandwidth. Bandwidth can get pricey which makes any amount of free bandwidth a good thing.

Implications For Your Design

With a hard cap of 15GB on your traffic, you’ll want to make sure that you’re using images that are as compressed as possible, minified CSS and JavaScript, clean HTML source code, and lean protocols. You’ll want to be careful in development and make sure that you’re only pulling back the data that you need and nothing more, and that you make as few round trips to the server as possible. Of course, you were doing all of this before, right?

Putting it All Together

The free tier of AWS is a good introduction to working with AWS. It provides more than just a simple virtual machine, it provides an entire infrastructure to get you started. You can rapidly build out from a single to server to a large number of servers and still use most of these free services. If you don’t need something, you’re not going to pay for it. This is a lot cheaper of an option than buying your own hardware or trying to work within the confines of a hosting provider. This is your own virtual hardware to develop with as you please.

One more thing – Amazon may stop accepting new people into the magical free AWS tier at any moment. Act now, supplies are limited.

What I’m Reading 2010-10-22

Free Amazon Web Services for new customers. Amazon are giving away 12 months (from the date of your sign up) of a bunch of Amazon services. If you want to try to start up a business, now is the time to try!

IronRuby is alive! There was some conjecture about the life of IronRuby after Microsoft cut the team. One of the original developers is picking up where he left off.

Programming is for Stupid People… You know, I think he’s right!

SQL Server Transaction Marks are a great feature that very few people seem to know about. I’ve used these in the past for critical workloads and transaction demarcation. You should try it out, you’ll like it.

Using MySQL as NoSQL Just because you need something to be fast, doesn’t mean you have to learn something new.

ARel Two Point Ohhhhh Yaaaaaa Arel is the new Ruby on Rails/ActiveRecord ORM. It’s also interesting because it is an abstract syntax tree – meaning that it can be used to make arbitrary grammars for anything (including SQL). Aaron Patterson walks through the optimizations that were made to Arel 2.0 to bring it up to par with the old ActiveRecord implementation.

Hadoop World Follow Up

I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.

What Is Hadoop?

Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose – HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.

How Do You Install It?

Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:

Cloudera flavors

Homemade flavors

  • If you want to run Hadoop natively on your local machine, you can go through the single node setup from the Apache Hadoop documentation.
  • Hadoop The Definitive Guide also has info on how to get started.
  • Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.

What Login Security Model(s) Does It Have?

Good question! NoSQL databases are not renowned for their data security.

Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation fueled world. As a result, Hadoop has support for a Kerberos authentication (now meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS(it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.

In short: Kerberos guarantees that I’m Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.

N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.

When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?

Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.

Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in your OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are remarkably bad languages for loop and flow control.

If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in your cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with it – if we’re using SQL Server to store the data in tables for report, we would probably bcp the data into a table.

Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.

Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.

The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.

HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Big Table) and there are no complex transactions.

Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.

In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.

What Tools Are Provided to Manage Hadoop?

Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.

Right now, there are a few products on the market that support Hadoop:

  • Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.
  • Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program
  • Karmasphere Studio is a tool for writing, debugging, and watching MapReduce jobs.

What Is The State of Write Consistency?

As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS, you get this consistency for free just by using Hadoop.

As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.

If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).

Upcoming Presentations!

So, it’s only the one presentation, but it is still a presentation.

On October 26th, I’ll be presenting for the Application Development Virtual Chapter. If you weren’t able to attend the Columbus Code Camp, October 26th will be your lucky day! I’ll be revisiting my presentation, Refactoring SQL. Live Meeting will be the only way you can get a hold of this gem, so make sure you’re ready to rock and roll. The party starts at 12PM Eastern, so make sure you’re there on time.

Also, if you’re going to be in the San Francisco Bay area for the first week in November, look me up. I’ll be speaking at three cloud camps out there. Good times.

Columbus Code Camp Slides and Round Up

This weekend I attended the Columbus Code Camp. This was the first code camp to be held in Columbus, and I think it was a success.

There were two tracks of speakers talking about a variety of different topics – Clojure, automated testing, SQL, Ruby, and phone development. The local turn out was good and the sponsor turn out was great as well. Sponsors donated a lot of prizes, swag, and offered their support in a variety of ways. All in all, I would say it was a great event.

I gave a talk about Refactoring SQL. It was the first time I’d presented it and I was polishing the slides until 1:00AM the night before. Lucky for you guys, the slides are now up on SlideShare: Refactoring SQL

.

What I’m Reading – 2010-10-08

Cassandra: RandomPartitioner vs OrderPreservingPartitioner Data order is important in relational databases and it’s something that you need to be aware of with a non-relational database, too. Improperly ordered data can put a huge load on a few nodes in a cluster. This article goes over the trade-offs in Cassandra of using a random data order vs key ordered data.

liblfds Want to write your own NoSQL database in C? These (free) libraries should make it pretty easy to do. They’d probably work good for other projects. If nothing else, it’s a good read to see how some of the underlying data structures are implemented.

Riak Bitcask Capacity Planning Spreadsheet We have capacity planning tools for our relational database, why not have one for our non-relational databases?

Foursquare outage post mortem Foursquare uses the NoSQLs (specifically MongoDB). They had a big nasty outage last week. One of the interesting things this thread brings up is that NoSQL databases still require a good knowledge of their internals in high performance scenarios. You can’t abstract away the problems even if you abstract away how data is stored on disk.

Relational Data, Document Databases and Schema Design There are some things that are really different. Get your learn on.

Prime Number Shitting Bear

Syndicated Bloggers at CloudDBPedia

I’ve been quiet about the bloggers we’ve been adding over at CloudDBPedia, I’m going to fix that. Today I’m happy to announce that we are syndicating 11 RSS feeds.

  • 2 Blokes Marketing – Christian Hasker and Andy Grant should be familiar to fans of SQLServerPedia, they’ve been instrumental in SQLServerPedia’s growth and they bring an interesting marketing slant to the conversation.
  • Brent Ozar – Brent hasn’t been talking much about NoSQL recently, but when he does he comes to the table with the viewpoint of a SQL Server DBA.
  • Buck Woody – Buck Woody is a recent addition to the Azure team at Microsoft. He brings 20+ years of experience to the table with databases and software design.
  • Guy Harrison – Guy is the Director of Research and Development at Quest. He lives in Australia and creates some interesting software like Toad for Cloud Databases. Guy has also authored two books, one on Oracle and one covering MySQL.
  • Jeremiah Peschka – hooray for me! I like data. I’m interested in NoSQL and cloud solutions because they make some interesting changes to how we all think about information storage.
  • Kevin Kline – Kevin is another Database Expert at Quest. He is the author of SQL in a Nutshell and is an all around good guy.
  • Kyle Banker – Kyle is a developer at 10gen, makers of MongoDB. Kyle maintains the Ruby driver for MongoDB.
  • Ayende Rahien – Ayende is well known as a founding or contributing to a number of open source projects (NHibernate, Castle, Rhino Mocks, NHibernate Query Analyzer, Rhino Commons) and is a prolific blogger. In addition, he is also writing RavenDB – a document database implemented in .NET.
  • Randy Guck – Randy works at Quest Software and has worked on a variety of platforms.
  • Kristina Chodorow works for 10gen, makers of MongoDB. She works on the core MongoDB server as well as the Perl and PHP drivers and has recently published a book about MongoDB.
  • The Basho Blog is the collective blog of the makers of Riak. In addition to blogging about their own database, there are a lot of links to presentations that Basho team members are doing at user groups, conferences, and online.

I want to say thank you to the bloggers who have generously agreed to syndicate their blog at CloudDBPedia.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 230 items have been purified.