Category syndication

Hadoop World Follow Up

I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.

What Is Hadoop?

Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a simple purpose – HDFS handles storage, MapReduce is a parallel job distribution system, HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.

How Do You Install It?

Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:

Cloudera flavors

Homemade flavors

  • If you want to run Hadoop natively on your local machine, you can go through the single node setup from the Apache Hadoop documentation.
  • Hadoop The Definitive Guide also has info on how to get started.
  • Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.

What Login Security Model(s) Does It Have?

Good question! NoSQL databases are not renowned for their data security.

Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation fueled world. As a result, Hadoop has support for a Kerberos authentication (now meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles the proper authentication, but it is still up to Hadoop and HDFS(it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.

In short: Kerberos guarantees that I’m Jeremiah Peschka and Hadoop+HDFS guarantee that I can muck with data.

N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.

When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?

Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.

Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limitations of parallelism in your OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). SQL dialects are remarkably bad languages for loop and flow control.

If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in your cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with it – if we’re using SQL Server to store the data in tables for report, we would probably bcp the data into a table.

Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to solve computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.

Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.

The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.

HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Big Table) and there are no complex transactions.

Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easily learned but also provides support for using MapReduce functionality in ad hoc queries.

In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.

What Tools Are Provided to Manage Hadoop?

Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own, internal, needs rather than build general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.

Right now, there are a few products on the market that support Hadoop:

  • Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.
  • Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program
  • Karmasphere Studio is a tool for writing, debugging, and watching MapReduce jobs.

What Is The State of Write Consistency?

As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS, you get this consistency for free just by using Hadoop.

As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.

If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).

Upcoming Presentations!

So, it’s only the one presentation, but it is still a presentation.

On October 26th, I’ll be presenting for the Application Development Virtual Chapter. If you weren’t able to attend the Columbus Code Camp, October 26th will be your lucky day! I’ll be revisiting my presentation, Refactoring SQL. Live Meeting will be the only way you can get a hold of this gem, so make sure you’re ready to rock and roll. The party starts at 12PM Eastern, so make sure you’re there on time.

Also, if you’re going to be in the San Francisco Bay area for the first week in November, look me up. I’ll be speaking at three cloud camps out there. Good times.

Lightning Talk Moderators Needed

In case you haven’t heard, we’re adding Lightning Talks to the PASS Community Summit this year. The idea of Lightning Talks is pretty simple:

  • Every speaker has 5 minutes
  • Slides are optional
  • No demos
  • When the 5 minutes are up, the speaker is done.

So, in order to pull this off, we need three volunteers from the audience. All you have to do is work an egg timer. Whenever a speaker starts talking, you start the egg timer. When they’re done, you cut them off, shoo them off the stage, and then introduce the next speaker. You can think of the job as being like an M.C. because that’s what it is (parachute pants will not be provided by PASS). One bonus is that you’ll get to hobnob with the speakers and/or embarrass when you introduce them. It’s all good fun until someone falls off the stage!

If you’d like to volunteer to moderate one of these sessions, let me know. Send an email to jeremiah.peschka@sqlpass.org. There are only three sessions, supplies are limited. Act now!

Update: I want to thank everyone who got back in touch with me about this, and there were a lot of you. We have our moderators selected – Matt Velic, Lance Harra, and Noel McKinney will be emceeing our Lightning Talks

I’m Going to Hadoop World

Sounds like I’m bragging, right? There is a free book involved. Or maybe you don’t care because it’s all NoSQLs and stuff and you’re a SQL Server DBA. And that’s where we differ.

When I first heard about NoSQL databases, I had the same reaction that a lot of people are having right now: disbelief and mockery. I remember making fun of MySQL when I first ran into it. It was such an odd database: it didn’t have foreign keys, joins didn’t work, it sometimes ate all of your data, and writes put locks on tables. It was no comparison for SQL Server, Oracle, or PostgreSQL. When I hit publish, this blog post is going to get dumped into a MySQL database on my server. I’ll probably hit up slashdot or read some blogs. That’s going to hit MySQL as well. These days, I try to keep a passing familiarity with MySQL because you never know when you’re going to need to use such a cheap, powerful, tool.

I see a filthy commie on the other side of this intertubes

This is what I think about NoSQL databases; especially Hadoop.

So, I’m going to Hadoop World and I’m going to learn as much as I can about Hadoop and the technology that surrounds it. The fun part is that I’m going to take some notes. And I’m going to share those notes.

That’s all well and good. As Senator Joe McCarthy used to say, “sharing is caring and caring is for commies.”

So, instead of just going and learning about what is interesting to me, I want to know what kind of questions DBAs have about this new-fangled Hadoop thing. I want to find the answers to the questions that people have about Hadoop, its place in the enterprise, and how it may or may not change the DBA’s job.

So, sound off in the comments. If you’ve got a question, type in the box and hit send. I’ll give you props when I post my write up of Hadoop World and I’ll do my best to find the answer.

I owe this entire idea to Andy Warren (blog | twitter). He mentioned that he was curious about Hadoop and that sparked this blog post. Here are his questions:

  • How do you install it?
  • What login security model(s) does it have
  • When does it make sense to use Hadoop vs SQL Server?
  • What tools are provided to manage it?

Upcoming Training – Columbus Code Camp

Good news, central Ohio residents! I’m going to be speaking at the Columbus Code Camp. I’ll be joining some amazing speakers and I’m really excited to be speaking locally again.

Original photo (http://www.flickr.com/photos/jetalone/121048480/), some rights reserved (http://creativecommons.org/licenses/by/2.0/deed.en)

Proof that you can refactor anything

The best part is, this is a brand new talk. That’s right, Columbus, you get the brand new sexy! I’m going to be talking about Refactoring SQL Applications. Here’s the abstract:

Refactoring SQL is not like refactoring application code. This talk will demonstrate proven SQL refactoring techniques that will help you identify where performance gains can be made, apply quick fixes, improve readability, and help you quickly locate places to make sweeping performance improvements. Jeremiah Peschka has years of hands on experience tuning SQL applications for performance, throughput, and concurrency.

This isn’t going to be an in-depth dissertation. The purpose of the presentation is to lay a solid foundation that developers can use to build their own refactoring tools and techniques. This is the capstone of the last two years of my professional life and I’m really excited that I get to attempt to wad up what I learned, run it through a grinder, and the share it with the world in the form of Delicious SQL Sausage. Jimmy Dean would be so proud of me. So, if you’re in or around Columbus on October 16th, come and check out my talk. You can get more info at http://columbuscodecamp.com

Talking Is Good, Where Can I Do More?

We can’t all just pick a street corner and start spouting off about whatever strikes our fancy. That’s not how things work, unless you’re a crazy street preacher. Very few people down at the local university will appreciate you getting on a soap box with a megaphone and ranting about normalization. Trust me on this one.

A few days ago, Jen McCown (blog | twitter) put up a blog post about Breaking Into SQL Showbiz. It’s a great post and Jen encouraged speakers to get started with their local user group. Which I wholeheartedly encourage, btw.

What happens next? Let’s say that you’ve given your first presentation at your local user group or for your co-workers and everything went well. Where do you go from there?

Other User Groups

Find other user groups in your area. You can find them through a few websites:

  • PASS – The Professional Association for SQL Server maintains a list and interactive map of SQL Server user groups.
  • INETA – INETA is a bunch of .NET focused user groups. They have an interactive map as well as a traditional location based search.
  • Meetup.com – Meetup.com is a great place to find people who are interested in talking about the same things that you are.

Conferences

When you’ve decided that it’s time to move beyond user group meetings, or you want a bigger thrill, where do you go? Conferences, of course. For a long time I had a problem finding about conferences. It’s not as much of a problem these days.

  • SpeakNET This is an aggregation of other user groups, code camps, and local/regional conferences that are looking for speakers.
  • SQLSaturday There are so many great local SQL Server events that it’s hard to keep track. SQL Saturday does that for me.
  • Community Megaphone This is an aggregator built by a Microsoft Developer Evangelist. It makes it easy to find events to attend or speak at. There’s even an RSS feed.
  • O’Reilly Media O’Reilly produce some mighty fine books. They also put on some mighty fine conferences. If you’re interested in mingling with the not-so-Microsoft crowd, this is a great place to find events to attend and speak at.

You can also find a list of conferences and speaking events on twitter http://twitter.com/peschkaj/conferences.

Three Things to Watch with NoSQL

Last week I painted a rosy picture of NoSQL databases. Before you deploy any kind of NoSQL database, you need to be aware of the potential pitfalls of NoSQL databases.

1) Architecture

NoSQL databases have different architectural concerns than traditional RDMBSes.

If you’re wokring with a BigTable or Dynamo database, your network backbone will need to be able to handle the increased traffic from replication and materializing MapReduce data. Dynamo databases are very chatty – reads and writes are only guaranteed when a specific number of servers respond to a query.

MapReduce queries run across many nodes in a cluster. One stage processing a MapReduce query is temporarily materializing intermediate results from each node. The intermediate results may then be shipped to another server for additional processing, aggregation, or just be combined into the final result set.

2) Querying Strategy

RDBMSes are able to perform a mix of batch and ad hoc querying. They produce reasonably fast results for doing online aggregation and analysis of data. It’s just as easy to write a query that returns a single row from a relational database as it is to write a query that aggregates data from several thousand rows.

Not all NoSQL databases are created equally. Hadoop and Cassandra are built to perform large scale analytical queries. Their implementation of MapReduce is designed to process batches of analytical results. Frequently these results are going to be pre-aggregated rather than delivered in real time. Riak and MongoDB, on the other hand, are designed to handle interactive querying.

3) Data Modeling

Data modeling is a tricky field of study in the world of relational databases. A huge number of books have been written on the subject and they deal with both the theoretical and practical aspects of modeling data. In short, while data modeling is tricky it is also a well understood field of study.

Data modeling in the non-relational world is an entirely new field. Many problems that have been solved in relational data modeling are new and require a different way of thinking. BigTable databases, while their structure may initially appear familiar, require different modeling techniques to get the most out of the data store. MongoDB’s document oriented approach makes a great deal of sense to developers, but it still poses a number of data modeling questions about how to link related objects and how deeply nested object graphs should be in the database. Key/Value stores, like Riak, have very little structure and give the application developer a huge amount of freedom to model data.

Summation & Further Reading

NoSQL Databases hold a lot of promise to help businesses rapidly respond to data growth. Whenever you’re planning to make a major infrastructure decision, you need to weigh the pros and cons before determining which solution is right for you. While NoSQL databases provide answers to a lot of questions, they also pose a different set of questions that require a different set of skills to answer.

NoSQL Glossary
MongoDB – Sharding with the Fishes
Hadoop for Batch Processing (from the Yahoo! Developer Network)
Schema Design in Riak
Schema Design for Riak (presentation slides)

Five Reasons to Use NoSQL

You’re not using a NoSQL database right now? I’m not all that surprised. Many IT shops are still evaluating moving from SQL Server 2000 to SQL Server 2005, much less these non-relational databases. A lot of people don’t even know what they would use a NoSQL database for. Does it replace the RDBMS? Work alongside it? Do something else?

1) Analytics

One reason to consider adding a NoSQL database to your corporate infrastructure is that many NoSQL databases are well suited to performing analytical queries. Developers can use the same querying languages to perform analytical queries that they’re using to perform atomic queries. Typically this will be some variation of a MapReduce query, but it’s also possible to query data using Pig or Hive. Don’t worry too much about these weird language terms, MapReduce is a fancy way of saying “SELECT and then GROUP BY” and doing it in a way that is entirely confusing to people who are used to SQL.

Many NoSQL systems boast phenomenal write performance. When you combine high write performance with batch processing it is easy to pre-aggregate data, summarize results, and still guarantee ad hoc query performance.

2) Scale

NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale by purchasing larger and faster servers and storage or by employing specialists to provide additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate queries are distributed by default. Scaling is as easy as racking a new server and executing a few commands to add the new server to the cluster (yeah, it really is that easy). Data will start flowing and you’ll back in business in no time.

3) Redundancy

In addition to rapid scaleability, NoSQL databases are also designed with redundancy in mind. These databases were designed and built at massive scales where the rarest hardware problems go from being freak events to eventualities. Hardware will fail. Rather than treat hardware failure as an exceptional event, NoSQL databases are designed to handle it. While hardware failure is still a serious concern, this concern is addressed at the architectural level of the database, rather than requiring developers, DBAs, and operations staff to build their own redundant solutions. Cassandra uses a number of heuristics to determine the likelihood of node failure. Riak takes a different approach and can survive network partitioning (when one or more nodes in a cluster become isolated) and repair itself.

4) Flexibility

What’s the use of a database if it’s not flexible? While the data modeling issues are completely different in NoSQL, there is a large amount of flexibility in how data is stored for performance.

Databases modeled like Bigtable and Cassandra provide flexibility around how data is stored on disk. It’s possible to create derived column families. In plain English: you can design your database to duplicate frequently accessed data for rapid query response. This is, of course, based on the assumption that writes and storage space are cheap.

Databases based on the Bigtable model also have another benefit – outside of key structure it’s possible to store a variety of disparate data in the same table. Structure is largely irrelevant. Relational databases have adopted features to solve similar problems (such as sparse columns in SQL Server), but they carry overhead. Storing wildly different columns in multiple rows of the same column family is so cheap as to be invisible in a NoSQL database.

Lastly, key-value stores provide an incredible level of flexibility. Data is arbitrarily stored as a value. Key-value databases make it possible to store images, word documents, strings, integers, and serialized objects within the same database. This requires more responsibility and creative thinking on the part of application developers and architects but it also lets the people designing the system build custom a completely custom solution that fills their needs.

5) Rapid Development

Let’s face facts: everyone wants their application to be faster, have more features, and they want it yesterday. NoSQL databases make it easy to change how data is stored or change the queries you’re running. Massive changes to data can be accomplished with simple refactoring and batch processing rather than complex migration scripts and outages and it’s even easier to take nodes in a cluster offline for changes and add them back into a cluster as the new master server – replication features will take care of syncing up data and propagating the new data design out to the other servers in a cluster.

T-SQL Tuesday – Indexes

arbitrary logo because a world without rules is a world with fun

When Michael J Swart asked me to take part in T-SQL Tuesday #10 – Indexes, I was incredibly flattered. Nobody’s ever asked me to do anything since a cop asked me to stop doing that one thing (speeding). I had to say yes. Here’s my contribution to T-SQL Tuesday #10.

An index is a horological lever-like something or other.

WTF is an index?

My 1967 Children’s World Book Encyclopedia doesn’t provide a definition for an index. That’s probably because it’s an encyclopedia and not a dictionary Thanks to the local gypsy garage sale going on in my neighborhood, I was able to acquire a dictionary from 1934 which defines an index as “a leverlike regulator for a hairspring.” Apparently this has something to do with clocks. If you try to verify this, you’ll notice that it’s a horological term. Apparently that has something to do with clocks and not adult entertainment.

Obviously neither my Children’s World Book Encyclopedia nor my Book of Learnin’ and Such was going to help me define an index.

In the spirit of true science, I realized that I would have to make something up. Or do my homework. One of the two.

SQL Server and a Drought of Options

SQL Server doesn’t have a wealth of indexing options available. In fact, there are fundamentally only two options – indexed and STFU. Sure, there are multiple indexing techniques, but this is my story.

At the core of SQL Server’s indexing strategy is the self-balancing binary search tree or, for those who don’t like to sound like pompous assfaces, the b-tree. B-trees are full of special magic. Well, not really. A b-tree is based on an algorithmic model that strives to keep the height of the search tree (the number of nodes between the root and the leaf nodes) as small as possible. This means that finding any single piece of information should be equally as cheap/expensive as finding any other piece of information.

That sounds great, right? Well, one of the fundamental indexing problems is that the data in the index has to be kept up to date as the data in the table changes. You can’t have data flying around willy nilly.

This is where we get to start sub-typing our b-trees.

Oh, So There ARE Options

Yes, I lied to you. Sort of. SQL Server has multiple types of indexes, just not multiple… types of index. There’s no way to control the indexing mechanism that SQL Server uses, you’re stuck with the pompous assfacery of self-balancing binary search trees (it’s not really a bad thing).

So, getting back to the topic at hand, in SQL Server we do have options. Two of them: clustered and non-clustered indexes.

Cluster This!

A clustered index is an ordered collection of nuts stored for winter and categorized appropriately… or data. One of the two. When we create a clustered index, all of the nuts/data is sorted and written to disk in the appropriate logical sort order. From this point forward, no physical order is guaranteed. Once the clustered index is created and the data is placed in physical order all bets are off. The data will be stored in logical order, but there is a chance that the data may not be physically in order due to the vagaries of disk access, deleting rows and/or tables and other assorted whatnot. (I think the horologists might have something to do with it.)

What goes in a clustered index? Everything. I told you the data was copied in. Didn’t I? Well, it is. Our clustered index is all of the data in the table and it’s in order based on the clustering key value.

The choice of the b-tree algorithm actually has some implications for how we choose the index columns for the clustered index. (I’m going to call these columns the clustering key. That’s what they are. Deal with it.) Because we’re using a B+ tree, instead of another data structure, to store and sort indexed data we have to take a few things into consideration. B+ trees do really well when every indexed value is unique. This is by design – it’s easier to look up any given row if you know that there is only one row that corresponds to that key. If there’s more than one, things get tricky and I’m not going to talk about that.

B+ trees have other characteristics that make them optimal for use in databases – it’s really easy to find any single element in the tree (even compared to other binary tree mechanisms) and they’re wonderfully optimized for read access when data is being read in sequence across ranges of data.

Searching Is Good

Sometimes you won’t know the key to reference a particular piece of data. Sometimes you’ll want to pull back records for everyone who lives in Tuscaloosa, AL. If you haven’t ordered your data by city and state, this is going to be really difficult to find without scanning the entire table. How do we get that data back?

More indexes!

In addition to clustered indexes, which define how the data should be ordered on disk, we also have non-clustered indexes. We can use these guys to help speed up our search. These are really just supplementary b-trees that point back to the clustered b-tree. You do have a clustered index, right?

That’s it?

No. Of course it can get more complicated than all of this, you can have composite indexes with multiple columns, non-unique indexes, indexes with computed columns, and even full-text indexes. Unfortunately the hookers and their lever actuated hairsprings are not going to let me keep talking.

In case you’re wondering what would happen to your database if you didn’t have any indexes, take a look at this:

i can't find any indexes!

Moving Away From Relational Storage – Introduction

Don’t. Fooled you, didn’t I?

If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it.

There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like re-writing your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to make use of large query result caching in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start new you should ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up, it’s how we learn.

Here are the questions I’ve started asking myself when I start a new project/feature that needs additional data storage:

  • Why are we storing this data?
  • How much data will I collect in a week? A month? A year?
  • How long will this data need to live?
  • How will this data be used?
  • How structured does this data need to be?
  • How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will I write this data? How often will I read my data?

Different types of databases have different use case profiles. The way that you’re using the data will make a big difference for how you’re going to store data. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you’re storing the data. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database, why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed you’re collecting data. The faster you need to collect data, the more you need to look at how you’re storing that data.

Every database engine employs different strategies to maximize I/O throughput. The problem here is that they all use different strategies based on their use case.

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in memory storage solution (like memcached or AppFabric or even riak’s riakkvcache_backend) to solve this problem. Disks are slow, memory is fast.

If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up the ability to write data but it does bring up some data life concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures.

Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats but there may be a variety of complications as a result of the general nature of SQL and the relational model. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs.

Considering how the structure of the data can make it very easy to make a decision on the data store that you’ll be using. If a lot of flexibility is required, it may even be possible that a Big Table derived database, such as Cassandra or HBase might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself about how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought back online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately.

The immediately availability of data after a write as well as the long term (fault tolerance) availability of data is an important aspect of choosing a data storage solution.

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprised you.

In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.

Further Reading

NoSQL Summer
NoSQL Heroku and You
What is Big Data?
Big Data Is Less About Size, And More About Freedom
NoSQL: The Dawn of Polyglot Persistence

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.