Category NoSQL

Getting Faster Writes with Riak

While preparing for an upcoming presentation about Riak for the Columbus Ruby Brigade, I wrote a simple data loader. When I initially ran the load, it took about 4 minutes to load the data on the worst run. When you’re waiting to test your data load and write a presentation, 4 minutes is an eternity. Needless to say, I got frustrated pretty quickly with the speed of my data loader, so I hit up the Riak channel on IRC and started digging into the Ruby driver’s source code.

The Results

              user     system      total        real
defaults 63.660000   3.270000  66.930000 (166.475535)
dw => 1  50.940000   2.720000  53.660000 (128.470094)
dw => 0  52.350000   2.740000  55.090000 (120.151827)
n => 2   52.850000   2.790000  55.640000 (132.023310)

The Defaults

Our default load uses no customizations. Riak is going to write data to three nodes in the cluster (n = 3). Since we’re using the default configuration, we can safely assume that Riak will use quorum for write confirmation (w = n/2 + 1). Finally, we can also assume that the durable write value is to use a quorum, since that’s the default for riak-client.

Because we’re writing to n (3) nodes and we’re waiting for w (2) nodes to respond, writes were slower than I’d like. Thankfully, Riak makes it easy to tune how it will respond to writes.

Changing the N Value

The first change that we can do is change the N value (replication factor). The N value should have a huge improvement for my test machine – Riak is only on one of my hard drives. Even solid state drives can only write to one place at a time. When we create the bucket we can change the bucket’s properties and set the N value. note It’s important that you set bucket properties when you ‘create’ the bucket. Buckets are created when keys are added to them and they are deleted when the last key is deleted.

b1 = client.bucket('animals_dw1',
                   :keys => false)
b1.props = { :n_val => 1, :dw => 1 }

In this chunk of code we set the N value to 1 and set the durable writes to 1. This means that only 1 replica will have to commit the record to durable storage in order for the write to be considered a success.

On the bright side, this approach is considerably faster. Here’s the bummer: by setting the N value to 1, we’ve removed any hope of durability from our cluster – the data will never be replicated. Any server failure will result in data loss. For our testing purposes, it’s okay because we’re trying to see how fast we can make things, not how safe we can make them.

How much faster? Our run with all defaults enabled took 166 seconds. Only writing to 1 replica shaved 38 seconds off of our write time. The other thing that I changed was setting returnbody to false. By default, the Ruby Riak client will return the object that was saved. Turning this setting off should make things faster – less bytes are flying around the network.

Forget About Durability

What happens when we turn down durability? That’s the dw => 0 result in the table at the beginning of the article. We get an 8 second performance boost over our last load.

What did we change? We set both the dw and w parameters to 0. This means that our client has told Riak that we’re not going to wait for a response from any replicas before decided that a write has succeeded. This is a fire and forget write – we’re passing data as quickly as possible to the client and to hell with the consequences.

So, by eliminating any redundancy, ignoring the current record from the database, and refusing to acknowledge any reads from the server, we’re able to get a 46.3 second performance improvement over our default values. This is impressive, but it’s roughly akin to throwing our data at a bucket, not into the bucket.

What if I Care About My Data?

What if you care about your data? After all, we got our performance improvement from setting the number of replicas to 1 and turning off write acknowledgement. The fourth, and final run, that I performed took a look at what would happen if we kept the number of replicas at a quorum (an N value of 2) and ignored write responses. If we’re just streaming data into a database, we may not care if a record gets missed here and there. It turns out that this is only slightly slower than running with scissors. It takes 132 seconds to write the data; only 4 seconds slower than with durable writes set to 1 and still nearly 34.5 seconds faster than using the defaults.

The most recent version of this sample code can be found on github at https://github.com/peschkaj/riak_intro. Sample data was located through Infochimps.com. The data load uses the Taxobox data set.

A Technical Plan for 2011

The Last Ten Years

My career has been particularly interesting. I’ve been very fortunate to work with a variety of different languages, platforms, databases, frameworks, and people. I started off working with Perl on HP-UX. As I started automating more of my job, I added ASP.NET to the mix. Eventually I learned about databases, first with Oracle, then SQL Server, then with PostgreSQL, and finally back to SQL Server. Along the way I’ve held job a variety of different job titles – system administrator, system engineer, developer, consultant, architect, and database administrator.

I’ve worked with a lot of different systems, architectures, and design philosophies. The one thing that’s stuck with me is that there is no one size fits all answer. That extends beyond languages and design patterns – it goes right down to the way we’re storing data. One of the most interesting things going on right now is that it’s easier than ever to pick the right tools for the job.

The Next Twelve Months

Over the next twelves months, I’m going to be digging into hybrid database solutions. Some people call it polyglot persistence. You can call it what you like, but the fact remains that it is no longer necessary to store all of our data in a relational database. Frankly, I’m encouraging people to look into ways to store their data outside of a relational database. Not because RDBMSes are bad or wrong, but because there is a lot of data that doesn’t need to be in a relational database – session data, cached data, and flexible data.

Why Focus on Hybrid Data?

The idea behind hybrid data is that we use multiple databases instead of one database. Let’s say that we have an online store where we sell musical equipment. We want to store customer data in a relational database, but where should we store the rest of our information? Conventional thinking says that we should keep storing our data in a relational database. Sessions might be stored in memory somewhere and shopping carts might get stored in the database, but they’ll end up on faster, more expensive, solid state drives.

There are other ways to store data.

Let’s think about all of this for a minute. Why do we force our databases into existing paradigms? Why aren’t we thinking about new and interesting ways to store our data?

Sessions are a great place to start. Sure, we could use something like memcached, but why not examine Redis or App Fabric Cache? Both of these databases have support for strongly typed data. They both allow the data to be persisted to disk, if needed, and they allow for data to be expired over time. This is perfect for working with any kind of cached data – it stays in a format our applications need but we can expire it or save it as needed.

The flexibility to store our data the way that applications use it is important. Session data should be rapidly accessible. Other applications don’t need to read it. It doesn’t need to be reportable. It merely needs to be fast.

Shopping carts are different. Amazon’s own use cases and needs drove the development of Dynamo to be a durable, eventually consistent, distributed key-value store. Shopping carts are write heavy environments. It’s rare that users need to view everything that’s in a shopping car, but they need to be able to review it quickly when the time comes. Likewise, when the time comes to review a shopping cart, any delay or slowdown means there’s a chance the user will simply abandon the cart. Dynamo fills these requirements quite well.

Since Dynamo is only available inside of Amazon, how are we supposed to work with it ourselves? Riak is a clone of Dynamo that meets our need for a shopping cart. It’s a key/value database; it’s fault tolerant, and it’s fast.

Why not store a shopping cart in a relational database? It is, after all, a pretty simple collection of a user identifier, an item number, an item description, price, and quantity. Shopping carts are highly transient. Once an order has been placed, the shopping cart is cleared out and the data in the cart moves into the ordering system. Most shopping carts will be active for a very short period of time – a matter of minutes at most. Over their short lives shopping carts will almost entirely be written to and only read a few times. Instead of building complex sharding mechanisms to spread load out across a number of database servers, why not use a database designed to handle large load spread across a number of servers?

Where Does This Fit Into the Enterprise?

Enterprises should be adopting these technologies as fast as they can. Not because they are replacing the relational database, but because they free the relation database from things it’s bad at and leave it to perform tasks that it excels at. Relational databases are great for core business logic – they have a lot of baked in functionality like data integrity and validation. As we’ve already discussed, relational databases are not well suited to storing highly volatile data.

By moving volatile data into better suited types of database, enterprises can increase the capacity of their database systems, provide redundancy, and increase scalability by using off the shelf solutions. The trick, of course, lies in integrating them. And that is what I’m going to be playing around with this year.

Querying Hive with Toad for Cloud Databases

I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance.

This 7 minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive.

Querying HBase with Toad for Cloud Databases

We recently released a new version of Toad for Cloud Databases as an Eclipse plug-in. While this functionality has been in Toad for Cloud since the last release, this video shows Toad for Cloud running in Eclipse and demonstrates some basic querying against HBase using Toad for Cloud’s ability to translate between ANSI compliant SQL code and native database calls.

[media id=5 width=680 height=560]

If the video above doesn’t work for you, everything is available full size at NoSQLPedia – Querying HBase.

New Uses for NoSQL

We all know that you can use NoSQL databases to store data. And that’s cool, right? After all, NoSQL databases can be massively distributed, are redundant, and really, really fast. But some of the things that make NoSQL database really interesting aren’t just the redundancy, performance, or their ability to use all of those old servers in the closet. Under the covers, NoSQL databases are supported by complex code that makes these features possible – things like distributed file systems.

What’s a Brackup?

Brackup is a backup tool. There are a lot of backup tools on the market, what makes this one special?

First, it’s free.

Second, it’s open source; which means it’s always going to be free.

Third, it can chunk your files – files will be crammed into chunks for faster access and distributed across your backup servers. Did you know that opening a filehandle is one of the single most expensive things you can ever do in programming?

Fourth, it supports different backends.

It Can Backup to Riak

I’ve mentioned Riak a few times around here. Quick summary: Riak is a distributed key-value database.

So?

So, this means that when you take a backup, Brackup is going to split your data into different chunks. These chunks are going to be sent to the backup location. In this case, the backup location is going to be your Riak cluster. As Brackup goes along and does its work, it sends the chunks off to Riak.

Unlike sending your data to an FTP server or Amazon S3, it’s going to get magically replicated in the background by Riak. If you lose a backup server, it’s not a big deal because Riak will have replicated that data across multiple servers in the cluster. Backing up your backups just got a lot easier.

Why Is the NoSQL Part Important?

NoSQL can be used for different things. It’s not a just a potential replacement for an RDBMS (and the beginning of another nerd holy war). Depending on the data store and your purpose, you can use a NoSQL database for a lot of different things – most notably as a distributed file system. This saves time and money since you don’t have to buy a special purpose product, you can use what’s already there.

Open Sourcing Sawzall – What Does It Mean?

For Data Analytics or automotive modification, you will find no finer tool.

While perusing twitter, I saw that Google has open sourced Sawzall, one of their internal tools for data processing. WTF does this mean?

Sawzall, WTF?

Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall?

Sawzall is a procedural language for analyzing excessively large data sets. When I say “excessively large data sets”, think Google Voice logs, utility meter readings, or the network traffic logs for the Chicago Public Library. You could also think of anything where you’re going to be crunching a lot of data over the course of many hours on your monster Dell R910 SQL Server.

There’s a lengthy paper about how Sawzall works, but I’ll summarize it really quickly. If you really want to read up on all the internal Sawzall goodness, you can check it out on Google code – Interpreting the Data: Parallel Analysis with Sawzall.

Spell It Out for Me

At its most basic, Sawzall is a MapReduce engine, although the Google documentation goes to great pains to not use the word MapReduce, so maybe it’s not actually MapReduce. It smells oddly like MapReduce to me.

I’ll go into more depth on the ideas behind MapReduce in the future, but here’s the basics of MapReduce as far as Sawzall is concerned:

  1. Data is split into partitions.
  2. Each partition is filtered. (This is the Map.)
  3. The results of the filtering operation are used by an aggregation phase. (This is the Reduce.)
  4. The results of the aggregation are saved to a file.

It’s pretty simple. That simplicity makes it possible to massively parallelize the analysis of data. If you’re in the RDBMS world, think Vertica, SQL Server Parallel Data Warehouse, or Oracle Exadata. If you are already entrenched and in love with NoSQL, you already know all about MapReduce and probably think I’m an idiot for dumbing it down so much.

The upside to Sawzall’s approach is that rather than write a Map program and a Reduce program and a job driver and maybe some kind of intermediate aggregator, you just write a single program in the Sawzall language and compile it.

… And Then?

I don’t think anyone is sure, yet. One of the problems with internal tools is that they’re part of a larger stack. Sawzall is part of Google’s internal infrastructure. It may emit compiled code, but how do we go about making use of those compiled programs in our own applications? Your answer is better than mine, most likely.

Sawzall uses something called Protocol Buffers – PB is a cross language way to efficiently move objects and data around between programs. It looks like Twitter is already using Protocol Buffers for some of their data storage needs, so it might only be a matter of time before they adopt Sawzall – or before some blogger opines that they might adopt Sawzall ;) .

So far nobody has a working implementation of Sawzall running on top of any MapReduce implementations – Hadoop, for instance. At a cursory glance, it seems like Sawzall could be used in Hadoop Streaming jobs. In fact, Section 10 of the Sawzall paper seems to point out that Sawzall is a record by record analytical language – your aggregator needs to be smart enough to handled the filtered records.

Why Do I Need Another Language?

This is a damn good question. I don’t program as much as I used to, but I can reasonably write code in C#, Ruby, JavaScript, numerous SQL dialects, and Java. I can read and understand at least twice as many languages. What’s the point of another language?

One advantage of a special purpose language is that you don’t have to worry shoehorning domain specific functionality into existing language constructs. You’re free to write the language the way it needs to be written. You can achieve a wonderful brevity by baking features into the language. Custom languages let developers focus on the problems at hand and ignore implementation details.

What Now?

You could download the code from the Google Code repository, compile it, and start playing around with it. It should be pretty easy to get up and running on Linux systems. OS X developers should look at these instructions from a helpful Hacker News reader. Windows developers should install Linux on a VM or buy a Mac.

Outside of downloading and installing Sawzall yourself to see what the fuss is about, the key is to keep watching the sky and see what happens.

Upcoming Talks – Next Week

Next week I’ll be in the San Francisco Bay area. More specifically, I’ll be giving three lightning talks at three separate Cloud Camps. It’s the same talk each time, I’ll be giving a general intro to NoSQL and cloud databases.

Silicon Valley Cloud Camp in Santa Clara, CA.
Cloud Camp Santa Clara also in Santa Clara, CA.
Cloud Camp SF @ QCon in San Francisco, CA.

All of these events start at 6:30PM, the Lightning Talks start at 6:45PM, and I have no idea when I’m going to take the stage, but it promises to be good. If you’re in the area and would like to hang out, hit me up in the comments and we can arrange a time to talk.

Foursquare and MongoDB: What If

People have chimed in and talked about the Foursquare outage. The nice part about these discussions is that they’re focusing on the technical problems with the current set up and Foursquare. They’re picking it apart and looking at what is right, what went wrong, and what needs to be done differently in MongoDB to prevent problems like this in the future.

Let’s play a “what if” game. What if Foursquare wasn’t using MongoDB? What if they were using something else?

Riak

Riak is a massively scaleable key/value data store. It’s based on Amazon’s Dynamo. If you don’t want to read the paper, that just means that it uses some magic to make sure that data is evenly spread throughout the database cluster and that adding a new node makes it very easy to rebalance the cluster.

What would have happened at Foursquare if they had been using Riak?

Riak still suffers from the same performance characteristics around disk access as MongoDB – once you have to page to disk, operations become slow and throughput dries up. This is a common problem in any software that needs to access disks – disks are slow, RAM is fast.

Riak, however, has an interesting distinction. It allocates keys inside the cluster using something called a consistent hash ring – this is just a convenient way to rapidly allocate ranges of keys to different nodes within a cluster. The ring itself isn’t interesting. What’s exciting is that the ring is divided into partitions (64 by default). Every node in the system claims an equal share of those 64 partitions. In addition, each partition is replicated so there are always three copies of a give key/value pair at any time. Because there are multiple copies of the data, it’s unlikely that any single node will fail or become unbalanced. In theory, if Foursquare had used Riak it is very unlikely that we’ll run into a problem were a single node becomes full.

How would this consistent hash ring magical design choice have helped? Adding a new node causes the cluster to redistribute data in the background. The new node will claim and equal amount of space in the cluster and the data will be redistributed from the other nodes in the background. The same thing happens when a node fails, by the way. Riak also only stores keys in memory, not all of the data. So it’s possible to reference an astronomical amount before running out of RAM.

There’s no need to worry about replica sets, re-sharding, or promoting a new node to master. Once a node joins a riak cluster, it takes over its share of the load. As long as you have the network throughput on your local network (which you probably do), then this operation can be fairly quick and painless.

Cassandra

Cassandra, like Riak, is based on Amazon’s Dynamo. It’s a massively distributed data store. I suspect that if Foursquare had used Cassandra, they would have run into similar problems.

Cassandra makes use of range partitioning to distribute data within the cluster. The Foursquare database was keyed off of the user name which, in turn, saw abnormal growth because some groups of users were more active than others. Names also tend to clump around certain letters, especially when you’re limited to a Latin character set. I know a lot more Daves that I know Zachariahs, and I have a lot of friends whose names start with the letter L. This distribution causes data within the cluster to be overly allocated to one node. This would, ultimately, lead to the same problems that happened at Foursquare with their MongoDB installation.

That being said, it’s possible to use a random partitioner for the data in Cassandra. The random partitioner makes it very easy to add additional nodes and distribute data across them. The random partitioner comes with a price. It makes it impossible to do quick range slice queries in Cassandra – you can no longer say “I want to see all of the data for January 3rd, 2010 through January 8th, 2010”. Instead, you would need to build up custom indexes to support your querying and build batch processes to load the indexes. The tradeoffs between the random partitioner and the order preserving partitioner are covered very well in Dominic Williams’s article Cassandra: RandomPartitioner vs OrderPreservingPartitioner.

Careful use of the random partitioner and supporting batch operations could have prevented the outage that Foursquare saw, but this would have lead to different design challenges, some of which may have been difficult to overcome without resorting to a great deal of custom code.

HBase/HDFS/Hadoop

HBase is a distributed column-oriented database built on top of HDFS. It is based on Google’s Bigtable database as described in “Bigtable: A Distributed Storage System for Structured Data”. As an implementation of Bigtable, HBase has a number of advantages over MongoDB – write ahead logging, rich ad hoc querying, and redundancy

HBase is not going to suffer from the same node redistribution problems that caused the Foursquare outage. When it comes time to add a new node to the cluster data will be migrated in much larger chunks, one data file at a time. This makes it much easier to add a new data node and redistribute data across the network.

Just like a relational database, HBase is designed so that all data doesn’t need to reside in memory for good performance. The internal data structures are built in a way that makes it very easy to find data and return it to the client. Keys are held in memory and two levels of caching make sure that frequently used data will be in memory when a client requests it.

HBase also has the advantage of using a write ahead log. In the event of a drive failure, it is possible to recover from a backup of your HBase database and play back the log to make sure data is correct and consistent.

If all of this sounds like HBase is designed to be a replacement for an RDBMS, you would be close. HBase is a massively distributed database, just like Bigtable. As a result, data needs to be logged because there is a chance that a hardware node will fail and will need to be recovered. Because HBase is a column-oriented database, we need to be careful not to treat it exactly like a relational database, but

A Relational Database

To be honest, Foursquare’s data load would be trivial in any relational database. SQL Server, Oracle, MySQL, and PostgreSQL can all handle orders of magnitude more data than the 132GB of data that Foursquare was storing at the time of the outage. This begs the question “How we could handle the constant write load?” Foursquare is a write-intensive application.

Typically, in the relational database world, when you need to scale read and write loads we add more disks. There is a finite amount of space in a server chassis and these local disks don’t provide the redundancy necessary for data security and performance; software RAID is also CPU intensive and slow. A better solution is to purchase a dedicated storage device, either a SAN, NAS, or DAS. All of these devices offer read/write caching and can be configured with in a variety of RAID levels for performance and redundancy.

RDBMSes are known quantities – they are easy to scale to certain points. Judging by the amount of data that Foursquare reports to have, they aren’t likely to reach the point where an RDBMS can no longer scale for a very long time. The downside to this approach is that an RDBMS is much costlier per TB of storage (up to ten times more expensive) than using MongoDB, but if your business is your data, then it’s important to keep the data safe.

Conclusion

It’s difficult to say if a different database solution would have prevented the Foursquare outage. But it is a good opportunity to highlight how different data storage systems would respond in the same situation.

Hadoop World Follow Up

I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.

What Is Hadoop?

Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose – HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.

How Do You Install It?

Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:

Cloudera flavors

Homemade flavors

  • If you want to run Hadoop natively on your local machine, you can go through the single node setup from the Apache Hadoop documentation.
  • Hadoop The Definitive Guide also has info on how to get started.
  • Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.

What Login Security Model(s) Does It Have?

Good question! NoSQL databases are not renowned for their data security.

Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation fueled world. As a result, Hadoop has support for a Kerberos authentication (now meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS(it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.

In short: Kerberos guarantees that I’m Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.

N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.

When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?

Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.

Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in your OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are remarkably bad languages for loop and flow control.

If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in your cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with it – if we’re using SQL Server to store the data in tables for report, we would probably bcp the data into a table.

Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.

Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.

The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.

HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Big Table) and there are no complex transactions.

Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.

In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.

What Tools Are Provided to Manage Hadoop?

Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.

Right now, there are a few products on the market that support Hadoop:

  • Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.
  • Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program
  • Karmasphere Studio is a tool for writing, debugging, and watching MapReduce jobs.

What Is The State of Write Consistency?

As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS, you get this consistency for free just by using Hadoop.

As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.

If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).

I’m Going to Hadoop World

Sounds like I’m bragging, right? There is a free book involved. Or maybe you don’t care because it’s all NoSQLs and stuff and you’re a SQL Server DBA. And that’s where we differ.

When I first heard about NoSQL databases, I had the same reaction that a lot of people are having right now: disbelief and mockery. I remember making fun of MySQL when I first ran into it. It was such an odd database: it didn’t have foreign keys, joins didn’t work, it sometimes ate all of your data, and writes put locks on tables. It was no comparison for SQL Server, Oracle, or PostgreSQL. When I hit publish, this blog post is going to get dumped into a MySQL database on my server. I’ll probably hit up slashdot or read some blogs. That’s going to hit MySQL as well. These days, I try to keep a passing familiarity with MySQL because you never know when you’re going to need to use such a cheap, powerful, tool.

I see a filthy commie on the other side of this intertubes

This is what I think about NoSQL databases; especially Hadoop.

So, I’m going to Hadoop World and I’m going to learn as much as I can about Hadoop and the technology that surrounds it. The fun part is that I’m going to take some notes. And I’m going to share those notes.

That’s all well and good. As Senator Joe McCarthy used to say, “sharing is caring and caring is for commies.”

So, instead of just going and learning about what is interesting to me, I want to know what kind of questions DBAs have about this new-fangled Hadoop thing. I want to find the answers to the questions that people have about Hadoop, its place in the enterprise, and how it may or may not change the DBA’s job.

So, sound off in the comments. If you’ve got a question, type in the box and hit send. I’ll give you props when I post my write up of Hadoop World and I’ll do my best to find the answer.

I owe this entire idea to Andy Warren (blog | twitter). He mentioned that he was curious about Hadoop and that sparked this blog post. Here are his questions:

  • How do you install it?
  • What login security model(s) does it have
  • When does it make sense to use Hadoop vs SQL Server?
  • What tools are provided to manage it?

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 226 items have been purified.