Category Hadoop

Querying Hive with Toad for Cloud Databases

I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance.

This 7 minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive.

Facebook Messaging – HBase Comes of Age

Update: I want to thank Ben Black, Todd Lipcon, and Kelley Reynolds for pointing out the inaccuracies in the original post. I’ve gone through the references they provided and made corrections.

Facebook, the data giant of the internet, recently unveiled a new messaging system. To be fair, I would normally ignore feature roll outs and marketing flimflam, but Facebook’s announcement is worthy of attention not only because of the underlying database but how the move was done.

Historically Speaking

Long ago, the engineers at Facebook developed a product named Cassandra to meet their needs for inbox search. This was a single purpose application, and a lot of Facebook’s data continued to live in MySQL clusters. Cassandra solves the needs of the original Facebook Inbox search very well.

Building a New Product

Whenever we solve a new problem, it’s a good idea to investigate all of the options available before settling on a solution. By their own admission, the developers at Facebook evaluated at several platforms (including Cassandra, MySQL, and Apache HBase) before deciding to use Apache HBase. What makes this decision interesting is not just the reasons that Apache HBase was chosen, but also the reasons that MySQL and Cassandra were not chosen.

Why Not Use MySQL?

It’s not MySQL can’t scale. MySQL can clearly scale – many high traffic, high data volume sites and applications use MySQL. The problem comes when you are scaling a relational database to a massive size – Facebook estimate that they will be storing well over 135 billion messages every month (15 billion person-to-person messages plus 120 billion chat messages). Even if these messages were limited in size to a single SMS (160 characters), that’s 21,600,000,000,000 bytes of data per month. Or, as we say where I come from, a buttload of data.

When you start storing that amount of data, strange things start to happen. Updates to indexes take a long time, data statistics get updated very rarely. With a use case like tracking email and messaging, prolific users of the messaging platform could cause massive data skew which would cause queries for other users to perform poorly.

Data sharding is a difficult problem to solve with an RDBMS at any scale, but when you move to the massive scales that Facebook is targeting, data sharding moves from a difficult problem to an incredibly difficult problem. Sharding algorithms needs to be carefully chosen to ensure that data is spread evenly across all of the database servers in production and the algorithm needs to be flexible enough to ensure that new servers can be easily added into the database without disrupting the existing set up.

As Todd Lipcon points out in the comments, one of the issues with scaling an RDBMS is the write pattern of B+-tree indexes. Writing data to a B+-tree index with an arbitrarily increasing integer key is fairly trivial – it’s effectively an append to the index. Things get trickier when the index is used to support various searches. Keying by email address or subject line will cause rows to be inserted in the middle of the index leading to all kinds of undesirable side effects like page splits and index fragmentation. Performing maintenance on large indexes is problematic in terms of time to complete, CPU time, and I/O issues.

Why Not Use Cassandra?

If MySQL can’t handle this kind of load, why not use Cassandra? Cassandra was purpose built to handle Facebook’s Inbox Search feature, after all.

Cassandra, unfortunately, has it’s own problems dealing with scale. One of the difficulties scaling Cassandra is a poorly designed partitioning scheme. It is safe to assume that the Facebook engineering team has enough experience with Cassandra to know how to effectively partition their data. If partitioning data isn’t a likely problem, what is?

Is replication the problem?

When a record is written to Cassandra using the data is immediately, and asynchronously, written to every other server that is receiving a replica of the data. Cassandra allows you to tune the replication settings varying from no write confirmation through full write confirmation. Different consistency levels are appropriate for different workloads, but they all carry different risks. For example, using a consistency level of ConsistencyLevel.ALL would require every replica to respond before the write returns to the client as successful. Needless to say, this isn’t a recommended best practice. Benjamin Black (blog | twitter) informs me that the recommended best practice is to use ConsistencyLevel.ONE or ConsistencyLevel.QUORUM with a replication factor of 3.

Of course, thinking about Cassandra’s implementation could be a non-issue because the Facebook implementation of Cassandra is proprietary and may have very few similarities to the current open source Apache Cassandra codebase.

Cassandra’s replication can be configured to meet the needs of a variety of workloads and deployment patterns and we can rule that out as the reason to not choose Cassandra. In this post from Facebook Engineering, the only concrete statement is “We found Cassandra’s eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.” Cassandra is designed to be constantly available and eventually consistent – reads will always succeed but they may not always return the same data – the inconsistencies are resolved through a read repair process. HDFS (the data store underlying HBase) is designed to be strongly consistent, even if that consistency causes availability problems.

Choosing HBase

Cassandra is a known quantity at Facebook. Cassandra was, after all, developed at Facebook. The operations teams have a great deal of experience with supporting and scaling Cassandra. Why make the switch?


HBase’s replication model (actually, it’s the HDFS replication model) differs from Cassandra’s replication through replication pipelining. Replication pipelining makes it possible to have guarantees of data consistency – we know that every node will have the same data once a write has completed. We can guarantee write order (which is important for knowing when a message arrives in your inbox).

Because of HDFS’s data replication strategy, it’s possible to gain some automatic fault tolerance. The load will be pipelined through multiple servers – if any single server fails, the query can be served by another HDFS node with the data. Likewise, data replication makes it easy to handle node failures. If a node fails, queries for that specific region of data can be routed to one of the remaining replica servers. A new server can be added and the data from the failed server can easily be replicated to the new server.

In the spirit of full disclosure, this replication is not a feature specific to HBase/HDFS – Cassandra is able to do this as well. Todd Lipcon’s comment includes some information on the differences between HDFS and Cassandra replication. It’s important to remember that HBase/HDFS and Cassandra use two different replication methodologies. Cassandra is an highly available, eventually consistent system. HBase/HDFS is strongly consistent system. Each is appropriate for a certain types of tasks.


One of the criticisms of many NoSQL databases is that they do not log data before it is written. This is, clearly, not the case. Both HBase and Cassandra use a write ahead logger to make sure that data can be safely written to disk in the log before it is persisted to disk. This allows either data store, much like a relational database, to recover from crashes without having to write to disk every time a new row is inserted.

Existing Expertise

The technology behind HBase – Hadoop and HDFS – is very well understood and has been used previously at Facebook. During the development of their data warehouse, Facebook opted to use Hive. Hive is a set of tools for data analytics built on top of Hadoop and HDFS. While Hive performs very well with analytic queries on largely static data, it does not perform well with rapidly changing data. This makes Hive poorly suited for the new Social Messaging feature at Facebook. Since Hive makes use of Hadoop and HDFS, these shared technologies are well understood by Facebook’s operations teams. As a result, the same technology that allows Facebook to scale their data will be the technology that allows Facebook to scale their Social Messaging feature. The operations team already understands many of the problems they will encounter.

Other Technologies

There are, of course, other technologies involved, but the most interesting part to me is the data store. Facebook’s decision to use HBase is a huge milestone. HBase has come of age and is being used by a prominent customer. Facebook developers are contributing their improvements back to the open source community – they’ve publicly said that they are running the open source version of Hadoop.

Open Sourcing Sawzall – What Does It Mean?

For Data Analytics or automotive modification, you will find no finer tool.

While perusing twitter, I saw that Google has open sourced Sawzall, one of their internal tools for data processing. WTF does this mean?

Sawzall, WTF?

Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall?

Sawzall is a procedural language for analyzing excessively large data sets. When I say “excessively large data sets”, think Google Voice logs, utility meter readings, or the network traffic logs for the Chicago Public Library. You could also think of anything where you’re going to be crunching a lot of data over the course of many hours on your monster Dell R910 SQL Server.

There’s a lengthy paper about how Sawzall works, but I’ll summarize it really quickly. If you really want to read up on all the internal Sawzall goodness, you can check it out on Google code – Interpreting the Data: Parallel Analysis with Sawzall.

Spell It Out for Me

At its most basic, Sawzall is a MapReduce engine, although the Google documentation goes to great pains to not use the word MapReduce, so maybe it’s not actually MapReduce. It smells oddly like MapReduce to me.

I’ll go into more depth on the ideas behind MapReduce in the future, but here’s the basics of MapReduce as far as Sawzall is concerned:

  1. Data is split into partitions.
  2. Each partition is filtered. (This is the Map.)
  3. The results of the filtering operation are used by an aggregation phase. (This is the Reduce.)
  4. The results of the aggregation are saved to a file.

It’s pretty simple. That simplicity makes it possible to massively parallelize the analysis of data. If you’re in the RDBMS world, think Vertica, SQL Server Parallel Data Warehouse, or Oracle Exadata. If you are already entrenched and in love with NoSQL, you already know all about MapReduce and probably think I’m an idiot for dumbing it down so much.

The upside to Sawzall’s approach is that rather than write a Map program and a Reduce program and a job driver and maybe some kind of intermediate aggregator, you just write a single program in the Sawzall language and compile it.

… And Then?

I don’t think anyone is sure, yet. One of the problems with internal tools is that they’re part of a larger stack. Sawzall is part of Google’s internal infrastructure. It may emit compiled code, but how do we go about making use of those compiled programs in our own applications? Your answer is better than mine, most likely.

Sawzall uses something called Protocol Buffers – PB is a cross language way to efficiently move objects and data around between programs. It looks like Twitter is already using Protocol Buffers for some of their data storage needs, so it might only be a matter of time before they adopt Sawzall – or before some blogger opines that they might adopt Sawzall ;) .

So far nobody has a working implementation of Sawzall running on top of any MapReduce implementations – Hadoop, for instance. At a cursory glance, it seems like Sawzall could be used in Hadoop Streaming jobs. In fact, Section 10 of the Sawzall paper seems to point out that Sawzall is a record by record analytical language – your aggregator needs to be smart enough to handled the filtered records.

Why Do I Need Another Language?

This is a damn good question. I don’t program as much as I used to, but I can reasonably write code in C#, Ruby, JavaScript, numerous SQL dialects, and Java. I can read and understand at least twice as many languages. What’s the point of another language?

One advantage of a special purpose language is that you don’t have to worry shoehorning domain specific functionality into existing language constructs. You’re free to write the language the way it needs to be written. You can achieve a wonderful brevity by baking features into the language. Custom languages let developers focus on the problems at hand and ignore implementation details.

What Now?

You could download the code from the Google Code repository, compile it, and start playing around with it. It should be pretty easy to get up and running on Linux systems. OS X developers should look at these instructions from a helpful Hacker News reader. Windows developers should install Linux on a VM or buy a Mac.

Outside of downloading and installing Sawzall yourself to see what the fuss is about, the key is to keep watching the sky and see what happens.

Hadoop World Follow Up

I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.

What Is Hadoop?

Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose – HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.

How Do You Install It?

Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:

Cloudera flavors

Homemade flavors

  • If you want to run Hadoop natively on your local machine, you can go through the single node setup from the Apache Hadoop documentation.
  • Hadoop The Definitive Guide also has info on how to get started.
  • Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.

What Login Security Model(s) Does It Have?

Good question! NoSQL databases are not renowned for their data security.

Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation fueled world. As a result, Hadoop has support for a Kerberos authentication (now meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS(it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.

In short: Kerberos guarantees that I’m Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.

N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.

When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?

Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.

Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in your OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are remarkably bad languages for loop and flow control.

If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in your cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with it – if we’re using SQL Server to store the data in tables for report, we would probably bcp the data into a table.

Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.

Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.

The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.

HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Big Table) and there are no complex transactions.

Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.

In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.

What Tools Are Provided to Manage Hadoop?

Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.

Right now, there are a few products on the market that support Hadoop:

  • Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.
  • Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program
  • Karmasphere Studio is a tool for writing, debugging, and watching MapReduce jobs.

What Is The State of Write Consistency?

As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS, you get this consistency for free just by using Hadoop.

As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.

If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 531 items have been purified.