November 2010
Mon Tue Wed Thu Fri Sat Sun
« Oct   Dec »
1234567
891011121314
15161718192021
22232425262728
2930  

Month November 2010

Querying Hive with Toad for Cloud Databases

I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance.

This 7 minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive.

Querying HBase with Toad for Cloud Databases

We recently released a new version of Toad for Cloud Databases as an Eclipse plug-in. While this functionality has been in Toad for Cloud since the last release, this video shows Toad for Cloud running in Eclipse and demonstrates some basic querying against HBase using Toad for Cloud’s ability to translate between ANSI compliant SQL code and native database calls.

[media id=5 width=680 height=560]

If the video above doesn’t work for you, everything is available full size at NoSQLPedia – Querying HBase.

What I’m Reading – 2010 Birthday Edition

It’s my birthday today, so I’m going to subject you to more things that I’ve been reading recently.

William Gibson says the future is right here, right now – I couldn’t agree with this more. Time gets faster because we’re constantly connected to the world.

The Humble Pencil, The Mighty Computer – I love using pencil and paper to work through things. In fact, I don’t understand people who can draw, sketch, or brainstorm straight into a piece of software. Pencil and paper is so automatic, natural, and fun.

How to Practice a Technology Talk – Just so you know

Index-Only Scans – SQL Server has had the ability to do index-only scans for a long time because of how it handles concurrency and isolation levels. PostgreSQL still doesn’t have this feature, in part due to differences in how PostgreSQL handles concurrency.

NoSQL: Don’t Take the Drug Unless You Have the Symptoms – I get paid to do this and I’ll be the first one to tell you that some kind of NoSQL/non-relational doodad may not meet your needs.

Writing Your Own Storage Engine for Memcached part 1 part 2 part 3. Just in case you wanted to know what goes into building a storage engine.

On Being Wrong

When I first started blogging, I was nervous because I might be wrong. Not just “wrong” like “Gee Ted, I don’t know if I agree with your opinion about using Irish children as forced labor,” but factually wrong. And if I was wrong on the internet, people would see it forever and ever. Well, Google would know about it forever and ever. A friend of mine advised me that I would certainly be wrong on the internet, just like I’ve been fantastically wrong in real life, and that the only thing you can do is get up, admit you were wrong, and keep going again.

I had this opportunity the other day after blogging about Cassandra and HBase. I made a lot of mistakes in that post. At first I couldn’t understand what I had said or done wrong, so I reached out to the people who took issue with my post and said “Hey, help me correct my mistakes, leave a comment.” Todd Lipcon left a comment on my blog post. Ben Black emailed me. I looked at their suggestions and fixed to my post.

Turns out I was still missing something. Ben looked up my contact info again and talked to me directly via instant messenger. He helped me figure out what I was missing and why before walking me through some core concepts.

This wasn’t about SQL Server; this was about HBase, Cassandra, and Facebook. I’m a newcomer in that community. I effectively know every little. Yes, I was wrong, but people took the time to help me understand, learn, and correct my mistakes. That’s what great community is about: helping each other.

Being right or wrong isn’t important. How you react to being right or wrong is.

SQL Saturday #61

SQL Saturday #61 is right around the corner in Washington, D.C. It’s really Reston, VA but nobody knows where that is, right?

Anyway, I’m going to be in attendance as well as speaking about SQL Server internals. This isn’t a deep dive into any single part of SQL Server, but it is a nice whirlwind tour of what’s going on inside SQL Server. Yes, it’s an early session but I promise you that if wake up in time I will keep things entertaining.

What if you don’t want to learn about SQL Server internals at 8:30 in the morning on a Saturday? You’re in luck! There are a variety of other things you could be learning about SQL Server. I’m not going to go over all of the sessions, you can do that for yourself. I will encourage you to head out to SQL Saturday #61 on December 4th. Hope to see you there!

Facebook Messaging – HBase Comes of Age

Update: I want to thank Ben Black, Todd Lipcon, and Kelley Reynolds for pointing out the inaccuracies in the original post. I’ve gone through the references they provided and made corrections.

Facebook, the data giant of the internet, recently unveiled a new messaging system. To be fair, I would normally ignore feature roll outs and marketing flimflam, but Facebook’s announcement is worthy of attention not only because of the underlying database but how the move was done.

Historically Speaking

Long ago, the engineers at Facebook developed a product named Cassandra to meet their needs for inbox search. This was a single purpose application, and a lot of Facebook’s data continued to live in MySQL clusters. Cassandra solves the needs of the original Facebook Inbox search very well.

Building a New Product

Whenever we solve a new problem, it’s a good idea to investigate all of the options available before settling on a solution. By their own admission, the developers at Facebook evaluated at several platforms (including Cassandra, MySQL, and Apache HBase) before deciding to use Apache HBase. What makes this decision interesting is not just the reasons that Apache HBase was chosen, but also the reasons that MySQL and Cassandra were not chosen.

Why Not Use MySQL?

It’s not MySQL can’t scale. MySQL can clearly scale – many high traffic, high data volume sites and applications use MySQL. The problem comes when you are scaling a relational database to a massive size – Facebook estimate that they will be storing well over 135 billion messages every month (15 billion person-to-person messages plus 120 billion chat messages). Even if these messages were limited in size to a single SMS (160 characters), that’s 21,600,000,000,000 bytes of data per month. Or, as we say where I come from, a buttload of data.

When you start storing that amount of data, strange things start to happen. Updates to indexes take a long time, data statistics get updated very rarely. With a use case like tracking email and messaging, prolific users of the messaging platform could cause massive data skew which would cause queries for other users to perform poorly.

Data sharding is a difficult problem to solve with an RDBMS at any scale, but when you move to the massive scales that Facebook is targeting, data sharding moves from a difficult problem to an incredibly difficult problem. Sharding algorithms needs to be carefully chosen to ensure that data is spread evenly across all of the database servers in production and the algorithm needs to be flexible enough to ensure that new servers can be easily added into the database without disrupting the existing set up.

As Todd Lipcon points out in the comments, one of the issues with scaling an RDBMS is the write pattern of B+-tree indexes. Writing data to a B+-tree index with an arbitrarily increasing integer key is fairly trivial – it’s effectively an append to the index. Things get trickier when the index is used to support various searches. Keying by email address or subject line will cause rows to be inserted in the middle of the index leading to all kinds of undesirable side effects like page splits and index fragmentation. Performing maintenance on large indexes is problematic in terms of time to complete, CPU time, and I/O issues.

Why Not Use Cassandra?

If MySQL can’t handle this kind of load, why not use Cassandra? Cassandra was purpose built to handle Facebook’s Inbox Search feature, after all.

Cassandra, unfortunately, has it’s own problems dealing with scale. One of the difficulties scaling Cassandra is a poorly designed partitioning scheme. It is safe to assume that the Facebook engineering team has enough experience with Cassandra to know how to effectively partition their data. If partitioning data isn’t a likely problem, what is?

Is replication the problem?

When a record is written to Cassandra using the data is immediately, and asynchronously, written to every other server that is receiving a replica of the data. Cassandra allows you to tune the replication settings varying from no write confirmation through full write confirmation. Different consistency levels are appropriate for different workloads, but they all carry different risks. For example, using a consistency level of ConsistencyLevel.ALL would require every replica to respond before the write returns to the client as successful. Needless to say, this isn’t a recommended best practice. Benjamin Black (blog | twitter) informs me that the recommended best practice is to use ConsistencyLevel.ONE or ConsistencyLevel.QUORUM with a replication factor of 3.

Of course, thinking about Cassandra’s implementation could be a non-issue because the Facebook implementation of Cassandra is proprietary and may have very few similarities to the current open source Apache Cassandra codebase.

Cassandra’s replication can be configured to meet the needs of a variety of workloads and deployment patterns and we can rule that out as the reason to not choose Cassandra. In this post from Facebook Engineering, the only concrete statement is “We found Cassandra’s eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.” Cassandra is designed to be constantly available and eventually consistent – reads will always succeed but they may not always return the same data – the inconsistencies are resolved through a read repair process. HDFS (the data store underlying HBase) is designed to be strongly consistent, even if that consistency causes availability problems.

Choosing HBase

Cassandra is a known quantity at Facebook. Cassandra was, after all, developed at Facebook. The operations teams have a great deal of experience with supporting and scaling Cassandra. Why make the switch?

Replication

HBase’s replication model (actually, it’s the HDFS replication model) differs from Cassandra’s replication through replication pipelining. Replication pipelining makes it possible to have guarantees of data consistency – we know that every node will have the same data once a write has completed. We can guarantee write order (which is important for knowing when a message arrives in your inbox).

Because of HDFS’s data replication strategy, it’s possible to gain some automatic fault tolerance. The load will be pipelined through multiple servers – if any single server fails, the query can be served by another HDFS node with the data. Likewise, data replication makes it easy to handle node failures. If a node fails, queries for that specific region of data can be routed to one of the remaining replica servers. A new server can be added and the data from the failed server can easily be replicated to the new server.

In the spirit of full disclosure, this replication is not a feature specific to HBase/HDFS – Cassandra is able to do this as well. Todd Lipcon’s comment includes some information on the differences between HDFS and Cassandra replication. It’s important to remember that HBase/HDFS and Cassandra use two different replication methodologies. Cassandra is an highly available, eventually consistent system. HBase/HDFS is strongly consistent system. Each is appropriate for a certain types of tasks.

Logging

One of the criticisms of many NoSQL databases is that they do not log data before it is written. This is, clearly, not the case. Both HBase and Cassandra use a write ahead logger to make sure that data can be safely written to disk in the log before it is persisted to disk. This allows either data store, much like a relational database, to recover from crashes without having to write to disk every time a new row is inserted.

Existing Expertise

The technology behind HBase – Hadoop and HDFS – is very well understood and has been used previously at Facebook. During the development of their data warehouse, Facebook opted to use Hive. Hive is a set of tools for data analytics built on top of Hadoop and HDFS. While Hive performs very well with analytic queries on largely static data, it does not perform well with rapidly changing data. This makes Hive poorly suited for the new Social Messaging feature at Facebook. Since Hive makes use of Hadoop and HDFS, these shared technologies are well understood by Facebook’s operations teams. As a result, the same technology that allows Facebook to scale their data will be the technology that allows Facebook to scale their Social Messaging feature. The operations team already understands many of the problems they will encounter.

Other Technologies

There are, of course, other technologies involved, but the most interesting part to me is the data store. Facebook’s decision to use HBase is a huge milestone. HBase has come of age and is being used by a prominent customer. Facebook developers are contributing their improvements back to the open source community – they’ve publicly said that they are running the open source version of Hadoop.

New Uses for NoSQL

We all know that you can use NoSQL databases to store data. And that’s cool, right? After all, NoSQL databases can be massively distributed, are redundant, and really, really fast. But some of the things that make NoSQL database really interesting aren’t just the redundancy, performance, or their ability to use all of those old servers in the closet. Under the covers, NoSQL databases are supported by complex code that makes these features possible – things like distributed file systems.

What’s a Brackup?

Brackup is a backup tool. There are a lot of backup tools on the market, what makes this one special?

First, it’s free.

Second, it’s open source; which means it’s always going to be free.

Third, it can chunk your files – files will be crammed into chunks for faster access and distributed across your backup servers. Did you know that opening a filehandle is one of the single most expensive things you can ever do in programming?

Fourth, it supports different backends.

It Can Backup to Riak

I’ve mentioned Riak a few times around here. Quick summary: Riak is a distributed key-value database.

So?

So, this means that when you take a backup, Brackup is going to split your data into different chunks. These chunks are going to be sent to the backup location. In this case, the backup location is going to be your Riak cluster. As Brackup goes along and does its work, it sends the chunks off to Riak.

Unlike sending your data to an FTP server or Amazon S3, it’s going to get magically replicated in the background by Riak. If you lose a backup server, it’s not a big deal because Riak will have replicated that data across multiple servers in the cluster. Backing up your backups just got a lot easier.

Why Is the NoSQL Part Important?

NoSQL can be used for different things. It’s not a just a potential replacement for an RDBMS (and the beginning of another nerd holy war). Depending on the data store and your purpose, you can use a NoSQL database for a lot of different things – most notably as a distributed file system. This saves time and money since you don’t have to buy a special purpose product, you can use what’s already there.

Comparing MongoDB and SQL Server Replication

MongoDB has replication built in. So does SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What’s the difference? What makes each MongoDB a unique and special snowflake?

I recently read a three part series on MongoDB repication (Replication Internals, Getting to Know Your Oplog, Bending the Oplog to Your Will) in an effort to better understand MongoDB’s replication compared to SQL Server’s replication.

Logging Sidebar

Before we get started, it’s important to distinguish between the oplog and MongoDB’s regular log. By default, MongoDB pipes its log to STDOUT… unless you supply the --logpath command line flag. Logging to STDOUT is fine for development, but you’ll want to make sure you log to a file for production use. The MongoDB log file is not like SQL Server’s log. It isn’t used for recovery playback. It’s an activity log. Sort of like the logs for your web server.

What’s The Same?

Both MongoDB and SQL Server store replicated data in a central repository. SQL Server stores transactions to be replicated in the distribution database. MongoDB stores replicated writes in the oplog collection. The most immediate difference between the two mechanisms is that SQL Server uses the transaction as the demarcation point while MongoDB uses the individual command as the demarcation point.

All of our transactions (MongoDB has transactions… they’re just only applied to a single command) are logged. That log is used to ship commands over to a subscriber. Both SQL Server and MongoDB support having multiple subscribers to a single database. In MongoDB, this is referred to as a replica set – every member of the set will receive all of commands from the master. MongoDB adds some additional features: any member of a replica set may be promoted to the master server if the original master server dies. This can be configured to happen automatically.

The Ouroboros

The Ouroboros is a mythical creature than devours its own tail. Like the Ouroboros, the MongoDB oplog devours its own tail. In ideal circumstances, this isn’t a problem. The oplog will happily write away. The replica servers will happily read away and, in general, keep up with the writing to the oplog.

The oplog file is a fixed size so, like the write ahead log in most RDBMSes, it will begin to eat itself again. This is fine… most of the time.

Unfortunately, if the replicas fall far enough behind, the oplog will overwrite the transactions that the replicas are reading. Yes, you read that correctly – your database will overwrite undistributed transactions. DBAs will most likely recoil in horror. Why is this bad? Well, under extreme circumstances you may have no integrity.

Let’s repeat that, just in case you missed it the first time:

There is no guarantee of replica integrity.

Now, before you put on your angry pants and look at SQL Server Books Online to prove me wrong, this is also entirely possible with transactional replication in SQL Server. It’s a little bit different, but the principle still applies. When you set up transactional replication in SQL Server, you also need to set up a retention period. If your replication is down for longer than X hours, SQL Server is going to tell you to cram it up your backside and rebuild your replication from scratch.

Falling Behind

Falling behind is easy to do when a server is under heavy load. But, since MongoDB avoids writing to disk to increase performance, that’s not a problem, right?

Theoretically yes. In reality that’s not always the case.

When servers are under a heavy load, a lot of weird things can happen. Heavy network traffic can result in TCP/IP offloading – the network card can offload work to the CPU. When you’re using commodity hardware with commodity storage, you might be using software RAID instead of hardware RAID to simulate one giant drive for data. Software RAID can be computationally expensive, especially if you encounter a situation where you start swapping to disk. Before you know it, you have a perfect storm of one off factors that have brought your shiny new server to its knees.

In the process, your oplog is happily writing away. The replica is falling further behind because you’re reading from your replica and writing to the master (that’s what we’re supposed to do, after all). Soon enough, your replicas are out of sync and you’ve lost data.

Falling Off a Cliff

Unfortunately, in this scenario, you might have problems recovering because the full resync also uses a circular oplog to determine where to start up replication again. The only way you could resolve this nightmare storm would be to shut down your forward facing application, kill incoming requests, and bring the database back online slowly and carefully.

Stopping I/O from incoming writes will make it easy for the replicas to catch up to the master and perform any shard reallocation that you need to split the load up more effectively.

Climbing Gear, Please

I’ve bitched a lot in this article about MongoDB’s replication. As a former DBA, it’s a scary model. But I’ve bitched a lot in the past about SQL Server’s transactional replication – logs can grow out of control if a subscriber falls behind or dies – but it happens with good reason. The SQL Sever dev team made the assumption that a replica should be consistent with the master. In order to keep a replica consistent, all of the undistributed commands need to be kept somewhere (in a log file) until all of the subscribers/replicas can be brought up to speed. This does result in a massive hit to your disk usage, but it also keeps your replicated databases in sync with the master.

Just like with MongoDB, there are times when a SQL Server subscriber may fall so far behind that you need to rebuild the replication. This is never an easy choice, no matter which platform you’re using, and it’s a decision that should not be taken lightly. MongoDB makes this choice a bit easier because MongoDB might very well eat its own oplog. Once that happens, you have no choice but to rebuild replication.

Replication is hard to administer and hard to get right. Be careful and proceed with caution, no matter what your platform.

At Least There is a Ladder

You can climb out of this hole and, realistically, it’s not that bad of a hole. In specific circumstances you may end up in a situation where you will have to take the front end application offline in order to resync your replicas. It’s not the best option, but at least there is a solution.

Every feature has a trade off. Relational databases trade integrity for performance (in this case) whereas MongoDB trades immediate performance for potential maintenance and recovery problems.

Further Reading

MongoDB

SQL Server

What I’m Reading 2010-11-05

A Bit of Troubleshooting

A client recently asked me for help with their SQL Server environment. It seems that replication was running slowly and was getting further and further behind – replication had been turned off during heavy data modification and was turned on after several days.

Protip: This is why it’s important to have a full checklist for everything that you do on a server.

Check Everyone’s Health

When you have a complicated system you want to take a look at everything, not just the symptoms of the problem. This happens in medicine, economics, and manufacturing. Why shouldn’t we do it in the datacenter?

The very first thing I did was take a look at the health of the publication server. That server was running well within normal parameters – there were no readily apparent disk I/O, memory, or CPU problems. Since the distributor lives on the publication, that was covered as well.

On a lark, I checked all of the other subscribers. They were also functioning normally. I did this to make sure that were weren’t seeing glaring performance problems on one subscriber that were really a symptom of a problem with the replication set up.

Everything was healthy… except one subscription.

The Problem Child

Having ruled out an unknown problem on the other servers, I took a look at the rest of the issues on the problem server. I found a few underlying issues and was quickly able to figure out that the poorly performing replication was only a symptom of the problem.

Oh I/O

When I started digging deeper and looked at the wait stats and I/O activity, I was in for a huge shock- there were queries that had been running for close to a day!

Digging deeper, there were two queries that were causing major performance problems. The first was a daily bulk data load. It read from the replicated tables, so if there was going to be heavy contention on those tables, this might be part of the problem. Luckily, the bulk load had been re-written long ago to use small batches so that the transaction log didn’t grow out of control. Rampant transaction log growth had been a huge problem when the server had tiny log drives – the longer running jobs were re-written using a WHILE loop to read blocks of data and produce smaller, explicit transactions. This design also makes it possible to stop and restart the job whenever you want.

I immediately killed the bulk load job and looked into the second query. This was the nightly index maintenance script. It had been happily chugging away for over 24 hours and was chewing through more disk that I thought was possible (probably because I was never awake at two in the morning to watch the job run). Figuring that bad indexes were a better option than thrashing disks, I killed the index defragment query and moved on to the next problem.

My Memory’s Not What It Used To Be

Turns out that the server was running low on memory. This server has two purposes – it’s both an ad hoc reporting server and runs regular reports. As a result, SQL Server Reporting Services was installed and the SQL Server had been configured with a max memory setting of 4GB out of the 8GB available. I dug deeper into the memory and I discovered that over half of SQL Server’s memory structure was being used to manage locking. The rest was going to plan cache and a few other internal structures, but at no point was memory being used as a cache for data. The server’s page life expectancy was effectively 0 – every read was going to disk.

My immediate recommendation was to double the RAM in the server and increase SQL Server’s max memory setting from 4GB to 12GB. As a longer term recommendation, I cautioned my client that they should invest in a new server since this reporting server was 4 years old and well past its expected lifespan.

Back to the I/O Again

As I was wrapping up, the other production servers started having I/O problems. This was right around the same time that business normally picks up for this client. On a lark I said, “Wouldn’t it be great if this was a hardware problem?”

Five minutes later we had great news: it was a hardware problem. One of the power supplies in the SAN had died. Although the SAN had four power supplies, losing a single one caused the SAN to power down the battery backed cache and perform all reads and writes straight from disk. This more than explained the strange I/O we had been seeing on the reporting server. A new power supply was immediately ordered from EMC and the problem was eventually solved.

Wrapping Up

Have a set of canned scripts ready to help you figure out what kind of performance problems you might have on your systems. I started with Glenn Berry’s diagnostic scripts and customized them over time to give me the information that I want to see. If I weren’t so lazy, I would probably make this into something that I could throw into Management Studio’s canned reports with pretty colors to tell me when there was a problem. I’ve also gotten used to scanning over the output and looking for potential problems. Learn which problems are really just symptoms of a bigger issue. It doesn’t do you any good to troubleshoot slow queries only to find out that the SAN is experiencing horrible performance issues.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.