December 2010
Mon Tue Wed Thu Fri Sat Sun
« Nov   Jan »
 12345
6789101112
13141516171819
20212223242526
2728293031  

Month December 2010

Predictions for 2011

It's gonna be the future soon

Better watch out, future's here!

These are my predictions for 2011. I suspect that they will easily be as accurate as anything else you read this year. Maybe even more so.

Cloud computing will take over. The shift in cloud computing will be slow at first, but uptake will dramatically increase over the course of the year.

Sick of slow adoption of their technologies, Microsoft, Google, Amazon, and Salesforce.com will create a shaky alliance and secretly build underground factories to assemble an army of clouds. Realizing that the cloud is too ephemeral for many applications, Microsoft will devise the concept of the hybrid cloud and license it to its allies.

Oh sweet merciful god, the hybrid cloud is coming for us!

Boxbot needs you to join the fight.

Amazon has to pull out of the war because of an autonomous hegemonizing swarm arises within their japanese shipping center.

While Amazon’s boxbot swarm is easily defeated by rain or a thorough soaking with water the camouflage, cheapness, and near ubiquity of the swarm make it difficult to destroy. Amazon’s is unable to put down the rebellion and is slowly consumed by a pile of soggy cardboard boxes.

Everyone is doomed.


Clearly, I have no idea what I’m talking about. All of these images are copyright other people through the Creative Commons license. If you’d like to look at the source material, feel free.

Boxbot
Cross stitching
Hybrid Cloud 1
Hybrid Cloud 2

What You’re Missing About Stateless Computing

Once, long ago, men carved their knowledge into the walls of caves so that it would be available for all time. Unfortunately, their knowledge was tied to once place. Eventually, a forward thinking cave dweller thought about carving his knowledge into a clay tablet. He was savagely beaten to death and his family burned as witches. Eventually the other cave dwellers realized that it was probably a good idea to have a more portable way to store their knowledge and they too adopted this portable clay-based knowledge transfer system.

Fast forward to the tail end of 2010 and people are saying that Microsoft has got it wrong with the Azure VM role. People are already lambasting it as a laughable concept that’s needlessly complex to patch. I can’t argue with that, it is complex to patch, but there’s a reason for that complexity and it’s called stateless computing.

It’s like the switch from procedural/object-oriented programming to functional programming. When you first switch, you get pissed off that you can’t reassign variables and that functions can’t have side effects. You get used to that pretty quickly and start doing crazy things with tail recursion and other functional paradigms that ultimately save you memory. remove debugging headaches, and give you an incredible amount of computing stability.

With the last paragraph in mind, let’s look at Azure VMs again – we can’t patch the VM directly. It’s stateless. What does a stateless VM buy you?

  • It’s easy to spin up additional, identical VMs. There’s no worrying if some master image is the same: it is.
  • It’s easy to back out incompatible patches – just remove the differencing VHD.
  • There are no side effects because of errant garbage living on the C: drive.
  • Security – if a virus infects your VM, just reboot.
  • Complex, time consuming patches can be applied once and quickly moved into place.
  • There’s a load balancer in front of every Azure instance. Operations must be idempotent, even when executed against different instances.

Managing state isn’t a component of your operating system in Azure, it’s a component of the storage tier. New paradigms require new ways of thinking. Sometimes a new way of thinking seems broken, wrong, or foolish.

If you’re looking to customize your Azure deployment stack without sacrificing the flexibility of using Azure, then Azure VM roles are for you.

If you’re looking for a replacement for your current VM Ware installation, Microsoft’s Azure VM roles aren’t for you. But while you’re fiddling around with VM settings, I’m going to be playing Scrabble. 

The times they are a changin’

Last week, I sent an email to the PASS Board of Directors. It said, in short, that I was stepping down from my seat on the board. In fact, here’s the email:

A few months ago I made a huge change in my career and stepped out of my role as a production DBA and into a new career working with new databases. The more time I spend with these databases, the more I realize that they need an exciting, vibrant community like we have here in PASS. I want to help these communities grow and thrive, but there are only so many hours in the day. I have enjoyed the time that I’ve served on the PASS Board of Directors and I will continue to be involved in the community, but there are new communities that need what we have at PASS.

This has not been an easy decision to make, but I am stepping down from the PASS Board of Directors.

Just so you don’t think it sounds contrived, here it is again:

I love PASS and I love the SQL Server community. Being a part of PASS has given me a phenomenal sense of community. The SQL Server community has readily welcomed new members and new technologies and continues to grow the meaning of being a SQL Server professional. The community has embraced knowledge sharing in a way that many communities should envy. I keep learning so much every day from all of your blogs, presentations, books, and white papers.

And this is where things get difficult. I love what I’m doing with PASS, but I also love what I’m doing outside of SQL Server. There is a lot of knowledge there, but it’s silo’d and stuck in people’s heads. That knowledge needs to get out into the wild. One of the easiest ways to spread knowledge is to help it spread through communities. It’s difficult building a community from the ground up. Hell, maintaining an existing community is difficult. Doing both at the same time is impossible.

I asked myself how I could help when I started with PASS several years ago. I jumped in and started a chapter, then I took over a virtual chapter, and then I ran for the Board of Directors. Along the way, I met a ton of people who were kind, helpful, thoughtful, and more than ready to help out. As I slowly came to the realization that I couldn’t build a new community and grow an existing community at the same time, I also came to the realization that there were a lot of people ready to step up and help PASS grow.

Knowing that so many of you care and want to help made it a lot easier to come to this decision. I know that no matter what, there are a lot of people who will steer PASS and help it grow.

PASS showed me a wonderful community and I want to share that community with other people. I’m not running away and I’m not leaving, I’m just spreading the love around.

What I’m Reading – 2010-12-17 Edition

Secondary Indexes – How Would You Do It (In Riak)?

Every database has secondary indexes, right? Not quite. It turns out that some databases don’t support them. Secondary indexes are important because they make it possible to perform more, quick, queries on a given chunk of data. What if we want to add secondary indexes to a database, how would we go about doing it?

Looking at queries, there are a few basic ways that we actually query data:

  • Equality predicates
  • Inequality predicates
  • Multi-predicate queries

We’re going to be using Riak as our example database

Equality Predicates

The easiest type of query is an equality predicate. With an equality predicate, we’re just matching any given search term. Looking at an example, what if we want to find every user born on July 4, 1976?

function(value, keyData, arg){
  var data = Riak.mapValuesJson(value)[0];

  if (data.Birthdate == '1976-07-04') {
    return [ data ];
  }
}

With an RDBMS, we’d just create an index on the table to support our queries. But with a key-value store, we don’t have that ability. How do we do it?

To create an index in Riak, we’d create another bucket, users_by_birthdate. The keys for this bucket would be the value of the birthdate. We could use links (this will fall apart pretty rapidly above about 1000 links) or store a list of keys in the users bucket. To satisfy our MapReduce query we can use the users_by_birthdate bucket to get the IDs of our users for retrieval rather than scanning the entire bucket. Depending on the amount of data we have, being able to use equality predicates could reduce the number of reads we have to perform.

Multi-Predicate Equality Searches

I’m going to skip inequality predicates for a second and talk about multi-predicate equality searches. This is easy to accomplish. If we’re searching for everyone with the last name of ‘Smith’ who was born on July 4th, 1976 we might use the following MapReduce function:

function(value, keyData, arg){
  var data = Riak.mapValuesJson(value)[0];
  if (data.Birthdate == '1976-07-04' && data.LastName = 'Smith') {
    return [ data ];
  }
}

With a relational database, we could create a multi-column index. That isn’t possible here – with Riak we only have a bucket name (the index name), a key (the key), a value.

Two Equality Buckets

One option would be to use two equality buckets and then perform a bitwise comparison of the results, only keeping user keys that exist in both buckets. In this case, we would have two buckets users_by_birthdate and users_by_lastname. The problem is that we might have large data skew on one of these key-value combinations. So, if there are only 1,000 users born on July 4th, 1976 and 1,000,000 Smiths in our database, this could cause some inefficiencies in our queries when we go to merge the two result sets. Worse, there could be 1,000,000 users that match our search conditions in each bucket, but only 10 in common.

Using a bitwise combination of two buckets is, from a development standpoint, the easiest way to implement multi-predicate comparison. However, this approach creates problems with any significant volume of data.

MapReduce Over An Index Bucket

What if instead of using two or more buckets and performing a bitwise comparison of the keys that we find, instead we create a single index bucket. Inside that bucket, we’ll still use one value as the key for each key-value pair. This would look like the users_by_birthdate bucket – every key-value pair has the birthdate as the key.

Each key contains a list of values. Unlike our first example we’re going to store more than a simple list of keys. Instead we’re going to store a list of lists of tuples – a multi-dimensional data structure. If we’re using the birthdate, last name example, it might look like this:

1976-07-04    [
                [176, {last_name: Smith}],
                [195, {last_name: Harrison}]
              ]

What does this buy us? Well, for starters we don’t have to perform bitwise comparisons between two or more potentially massive lists of keys. Instead we can examine the MapReduce query that’s coming in and re-write it on the fly to give us a MapReduce job over our new index. The upside of this index is that it can perform double duty; we can search for users by birthdate or by birthdate and last name.

One of the things that we need to consider when we’re designing our key-value database is the queries that we’re going to be running, not the way that the data should be structured. When we’re structuring our data, it may make more sense to use a complex data structure, like this index bucket, instead of many single column indexes. Updates to the data will require fewer updates with this model. Only one write is needed when a user signs up, not two; one for the compound index as opposed to two for the simple indexes – one for users_by_birthdate and one for users_by_lastname.

Once we start updating data using an index bucket can reduce index maintenance. When we update a record we’ll need to update any index buckets. This could lead to a large number of data writes, tombstone writes, and compactions over time. Using a single index bucket with more complex structure in the index removes the need for a large number of writes. We only have to write to the original key-value pair as well as writing to the index bucket.

Inequality Predicates

Inequality predicates are much trickier than equality predicates. At first glance it seems like it would be easy to perform a MapReduce over the keys and determine which key-value pairs meet the search conditions. And we would be right: that is an easy solution when there are a small number of keys. What happens when there are a large number of keys? We need to read more data to get to the results we need. Digging further, what happens when very little of our data actually matches? The benefits of using this approach become very slim; we need to perform a large number of reads to retrieve very little data.

There’s a trick to implementing inequality predicates: in some cases it will be faster to retrieve all keys that match our inequality predicate – birthdate between 1976-07-01 and 1976-07-31. In other cases, a scan is going to be faster – users with a last name starting with ‘A’.

How do we figure out which approach is going to be faster? Let’s take a look at a concrete example using users and birthdates for our example. We already know how to find every user with a specific birthdate using an index; that’s a simple seek. How do we find all users whose birthday falls between a range of dates? That’s what we’ve been asking all along, right?

What if we maintain some simple information about our users_by_birthdate index? Let’s say we track the minimum value, the maximum value, and the number of keys in the bucket. Just from this information we can guess roughly how many keys will be in any given range. This is pretty easy with dates; we can assume that there’s a an even distribution of birthdates in any range of dates.

Things get trickier with names. How many distinct last names are there between Aaronson and Bartleby? What about between Xara and Yoder? Knowing the min and max last names and the number of keys won’t help us figure out the distribution; the data isn’t uniformly distributed. We need a way to make educated guesses about the data in a range. One way to do this is to create a histogram of our data. Histograms are data samplings that let us make better guesses about the number of specific values we can expect to find in a given range.

Relational databases accomplish this by maintaining separate statistics tables. We can accomplish something similar in Riak by creating a new bucket of statistics where the stats key is the high value of a range of keys. The value of the key-value pair would be data about the range of values – number of rows, number of keys, and the average number of distinct values per key. We want to collect statistics so we can easily determine when we need to seek a small number of keys and when we need to perform a MapReduce over the bucket.

So far I’ve proposed two range search approaches that work well with different use cases. Ideally we could create an indexing mechanism in Riak that lets us state an assumption about the distribution of our data in Riak. If we know there’s a constant distribution of keys, we can skip any kind of heuristics and just assume that for July, we’ll be reading 31 keys. Likewise, if we want to pull back every user whose last name is between Scalia and Smith, then we’ll need to MapReduce over the index bucket to get the data that we need.

Creating Indexes – Simple Indexes

Simple, single column indexes are easy to create. The syntax to create an index with SQL is pretty straightforward:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
    ( { column | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace ]
    [ WHERE predicate ]

With Riak, we won’t need that much information about the index. Working from the simplest use case possible, we only need

  • index name
  • bucket
  • value name to index
  • data storage mechanism

We need to uniquely name our index, otherwise we can’t create a new bucket for it. Likewise, we have to know which bucket we’re indexing. It also makes sense that we want to know which value we’ll be indexing.

We need to define the data storage mechanism for indexes so we are able to index the data. It’s for the same reason that we have to tell Riak Search how the data is stored; unless we know what we’re indexing, it’s impossible to index it.

Looking at riak_search, it should be possible to re-use the basic principles behind riak_search_kv_extractor and riak_search_kv_hook to define how indexes are created and how data is retrieved. When we create a full-text index with Riak search, we supply the datatype (JSON, XML, plain-text, or something else). This gives us an incredible amount of flexibility without having to define a schema for the documents. We’re only indexing the fields we want. This can be incredibly useful if we’re storing larger objects but we know we’ll only perform look ups on a few fields – indexing time and space is reduced drastically.

Re-using the full-text index model for secondary indexes makes it easy to maintain consistent indexing behavior throughout Riak – all of our indexes (full-text or otherwise) can be defined using the same commands.

Creating Indexes – Complex Indexes

Complex indexes shouldn’t be much more difficult to implement than simple indexes. We only need one additional piece of information to collect our index: additional fields to index. Ideally the additional fields would be a text representation of how we access the fields in a program – “parent.child.property”. Supporting XML could make this trickier, but it should be easy to use XQuery/XPath to determine which XML elements, attributes, and values should be indexed. The intelligence of knowing how to perform the query will come later; indexes don’t need to know how to query themselves.

Staying up to Date

Right now, this index will only be accurate once: when we create it. Once again, we can borrow from the Riak Search implementation and create hooks to make sure the index is kept up to date as data is written. Keeping our indexes in sync with the data will cause some overhead on the cluster, but it’s worth it to know that we can accurately retrieve data. Like any other database, we need to take care not to create too many indexes: every write to the bucket means we end up writing to each index.

Ideally the index maintenance hooks would be the last hooks to run in a sequence of pre-commit hooks. If there are a number of pre-commit hooks to validate data we need to make sure that the data is valid before it is written to an index. If our primary worry is data validity, it is possible to implement the index maintenance code as a post-commit hook in Erlang (rather than as a pre-commit hook in either Erlang or JavaScript).

The Trouble with Data Locality

One potential problem we can run into is data locality. Depending on how the data is distributed in the key space, we could end up writing key-value and index data on the same physical node. The problem is that we could double, or more, the physical I/O that we’re performing on those nodes. If an index is not on the same node as the data we might incur tremendous network overhead querying remote indexes before querying data on the local node. On a busy network, this increased bandwidth could have an adverse effect on operations.

Driving Toward Our Goal

We’re clearly heading toward something. Let’s take stock of what we’ve talked about so far:

There are two types of search predicate to support

  • Equality predicates – easy to index
  • Inequality predicates – trickier to index

There several ways to use indexes

  • A single index
  • Multiple indexes with a bitwise comparison
  • Complex indexes

We can read from indexes through

  • Key seeks
  • Key scans (MapReduce)
  • Key/value scans (MapReduce)

We can figure out what type of read to use by using data statistics

  • Uniformly distributed keys are easy
  • Randomly distributed keys are harder
  • A small number of seeks can be more efficient than a single MapReduce

Astute readers will have probably guess what we need: a query engine.

This Ain’t No Car, It’s a Key-Value Store!

Why do we need a query engine? Key-value stores are simple; data is stored in key-value pairs and retrieved by a simple key lookup. A query engine of some type just adds complexity, right? Well, without a query engine, we can’t use secondary indexes without explicitly writing queries to use them.

In a hypothetical Riak query engine, we could process a MapReduce query like this:

Parse map phases and extract search conditions

Examine existing indexes

  1. Do indexes exist?
  2. Examine index type – evaluate statistics if needed
  3. Evaluate data access path needed to retrieve data – this is where we determine if we can use one index, multiple indexes with a bitwise compare, or use a multi-column index
  4. If no indexes exist, does a full-text index exist?

Re-write MapReduce functions on-the-fly

  1. Early Map phases are added to query use indexes
  2. Early Map phase output is used for key selection in MapReduce functions

While Riak doesn’t need anything as complex as an RDBMS’s query engine, this is still a gross simplification of how a query engine would function – additional functionality is needed to perform costing estimates of seeks vs MapReduce scans, statistics will need analyzed, and the query engine will need to be aware of vast portions of Riak in order to make good guesses so it can re-write MapReduce queries on the fly.

Features, Features, Features

I left some topics out of this discussion intentionally. They don’t directly apply to implementing indexes, but they would be very helpful to have down the road.

  • Checking updated values to avoid needless tombstones. When a client updates a key with the same value it had before, there’s no reason to overwrite the stored value: nothing changed.
  • Saving query plans. We are re-writing MapReduce queries on the fly, why not save them off internally and opt to store them for re-use when the cluster comes online?
  • Missing indexes. Frequently used queries should be optimized, either with secondary indexes for faster querying or as standalone entities in Riak. THe easiest way to monitor for this is to store metadata about the query history.
  • Index use data. How often are our indexes being used? Do they meet our needs or should we replace them with different indexes?
  • Index only queries. If a MapReduce can be satisfied by the data in the index, don’t query the main bucket.
  • Running statistics. Riak is a long running process. Over time we will be able to collect a lot of data about Riak. Why not collect data about how our Riak is being used so we can tune our code and hardware to get better performance?

Summing it all Up

Indexing is a tricky thing to think about and an even trickier thing to get right. When I first started on this as a thought experiment, I thought that it was going to be easy. Assuming that indexes are simple led to a lot of dead ends, frustration, and restarts. This came about from talking to Mark Phillips (blog | twitter) and Dan Reverri (twitter) last week at the Basho offices about Riak, HTML, and the sound of farm machinery pulling itself apart.

Sound off in the comments.

TL;DR version – indexes are hard, let’s go shopping.

T-SQL Tuesday #13 – What the Business Says Is Not What the Business Wants

T-SQL Tuesday logo

Steve Jones created this month’s topic: What the business says is not what the business wants. Despite the inflammatory title, it’s really about interacting with business: What issues have you had in interacting with the business to get your job done?

I’m going to piss a lot of you off right now: it’s your fault.

Ready to get more angry? Shut up and listen. Stop. Shut up. Listen.

When you start having those gut reactions about terminology, proper technique, or anything else: shut up and listen. You don’t make cars or DVDs or coffee makers. The business users don’t write code. There’s a knowledge gap. The users are probably talking like that because they want you to take them seriously. STFU and listen.

When you’re talking to business users, turn off the thinking part of your brain that is already designing an elegant solution in BCNF that uses the latest .NET IFlangingSkrillAdapter and a combination of Silverlight, HTML5, and buttplugs to solve the problem. Stop it. Stop it right now.

Just listen to what the users are saying to you. Listen to how they’re saying it. Underneath that project they’re asking you to build, there’s a real problem that you need to solve. Sometimes, they aren’t even asking you to solve the real problem. Sometimes they’re asking you to solve a problem that wouldn’t exist if someone had listened three years ago.

When I started out in IT, I wanted to build awesome things. I wanted to change the world and make smart solutions and have everyone look at me and say “Oh, look at how clever he is!” Clever software isn’t going to solve your users’ problems. Time after time I had to go back and forth with the users. Every time we met, we’d both be more frustrated – things were going slower than expected and features didn’t work as designed. I’d stalk back to my desk after those meetings, grumbling the whole way about how the users didn’t know what they wanted and how they should let me design what they wanted.

You know what happens when you let computer geeks design something for humans to use? You end up with elegant, useless, software that doesn’t help anyone.

This really hit home a few years ago. A buddy and I were working at a client solving some of their needs. When he was talking to one of the users, he noticed that she seemed to be holding back. He stopped for the day and said, “Let’s meet again tomorrow at 9:00 and take a look at this fresh.” She agreed. The next day, my buddy comes into the office, grabs a fresh pad of paper and walks off to his meeting. He showed up three hours later demanding more paper. I gave him one of my many spare legal pads and kept typing away, working on my elegant code that would interface between six disparate systems. At the end of the day, my buddy had two legal pads of notes.

Think about that it: in 8 hours of listening to a user, my friend was able to fill two legal pads with factual notes. There was nothing speculative in there. It was all facts about processes, business rules, and the information that was moving around inside the company.

The next day, we talked about the underlying problems. At first we started talking about the systems we could build and the consulting money that was there and the commission we could make. We stopped. We looked at the papers again. The problem wasn’t that there was software to be written and rates to be raised. The problem was that there was software in the first place. Sure, the software had been written to fill some nominal business need, but it really existed to satisfy a developer’s visceral urge to try out something new. In solving their programming problems, our predecessors had removed the business needs and created their own elegant, but useless, software.

Listen to your users. They know what they want and they know what they need. They may not always have the words and sometimes you just need to listen.

Twelve Days of SQL – Day (2 – 1)

The Story So Far

Brent Ozar (blog | twitter) asked me to pick a favorite blog post for the year. Since I couldn’t pick anything I wrote (yes, I love myself that much), I had to pick one from the community. Since just about everyone in Brent’s crazy list of crazy blogs about SQL, I had to pick someone from the SQL Server community.

My Favorite Blog Post This Year

Earlier this year, Mladen Prajdić posted SQL Server – Undelete a Table and Restore a Single Table from Backup. I love this post for a couple of reasons. First, it’s completely crazy. Mladen had a strange idea and then ran with it. Rather than accept conventional thinking that it isn’t possible to restore a single table from a backup, Mladen opened up SSMS and started prodding at the inside of SQL Server. The second reason I love this post is because the explanation is clear and the code well documented. Mladen ran the code by me before he published the post. Normally, I can’t read other people’s code without a tremendous amount of time and energy. His code was clear enough to stand on its own.

Coming Up Next

Grant Fritchey (blog | twitter) is up next. Grant has been a huge inspiration to me – he’s humble, intelligent, and genuinely interested in helping out. When he’s not terrifying developers at a large insurance company, he is a scout leader, father, geek, and kilt connoisseur. Grant’s post drops tomorrow – December 10 – so be on the look out.

Default Values, Triggers, and You

A friend of mine sent me an email the other day asking about default values in SQL Server. I realized that I’ve had to think about this a few times over the years and I’ve been asked about it more than once, too.

Setup

We need a table first, right? We’ll also want a few sample rows in there.

CREATE TABLE Employees (
  emp_id INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  emp_name varchar(50) NULL);
GO

INSERT INTO Employees (emp_name) VALUES ('a');
INSERT INTO Employees (emp_name) VALUES ('b');
GO

SELECT * FROM Employees;

/*
emp_id      emp_name
----------- --------------------------------------------------
1           a
2           b
*/

How Defaults Work

So far we just have two rows in our two column table. It’s pretty boring. Let’s add a default value:

ALTER TABLE Employees
ADD last_modified DATETIME NULL CONSTRAINT Employees_last_modified
DEFAULT CURRENT_TIMESTAMP;
GO

We might as well add some new rows while we’re having fun with our employees, right?

INSERT INTO Employees (emp_name) VALUES ('c');
INSERT INTO Employees (emp_name) VALUES ('d');

What’s it look like now?

SELECT * FROM Employees;

/*
emp_id      emp_name                                           last_modified
----------- -------------------------------------------------- -----------------------
1           a                                                  NULL
2           b                                                  NULL
3           c                                                  2010-12-06 18:21:37.787
4           d                                                  2010-12-06 18:21:37.787
*/

Hold up. Employees 1 and 2 don’t have a last_modified value. Why not? Well, that’s because we’ve told SQL Server that our last_modified column can allow NULLs. They’re allowable in our table. If we wanted to automatically provide a default value when we added the constraint, we could do so by specifying the datatype as DATETIME NOT NULL. A best practice would be to add the column as a NULLable data type, add a value for all NULL rows, and set the column to NOT NULL.

If we do want to update a NULLable column and set it to the default value, we just issue an update using the DEFAULT keyword for the value. If that makes no sense, perhaps this example will help:

UPDATE Employees
SET last_modified = DEFAULT
WHERE last_modified IS NULL ;

What About Updates?

UPDATE Employees
SET emp_name = 'zzz'
WHERE emp_id = 3;

SELECT *
FROM Employees
WHERE emp_id = 3;

/*
emp_id      emp_name                                           last_modified
----------- -------------------------------------------------- -----------------------
3           zzz                                                2010-12-06 18:21:37.787
*/

As you can see, when we update employee 3, it doesn’t change the value of last_modified. That’s because the default value is only set on insert. We could specify DEFAULT in our UPDATE statement, but then we’d need to specify that every time we update the table. What can we do?

The Answer is Triggers

That’s right: triggers. If we want to track the modification timestamp of an object in the database, we need to use a trigger to keep things updated:

CREATE TRIGGER TR_Employees$AfterUpdate ON dbo.Employees
AFTER UPDATE
AS
BEGIN
  UPDATE  e
  SET     e.last_modified = CURRENT_TIMESTAMP
  FROM    dbo.Employees e
          JOIN inserted i ON e.emp_id = i.emp_id;
END
GO

UPDATE Employees
SET emp_name = 'asdf'
WHERE emp_id = 1;

SELECT * FROM Employees;

/*
emp_id      emp_name                                           last_modified
----------- -------------------------------------------------- -----------------------
1           asdf                                               2010-12-06 18:34:04.340
2           b                                                  NULL
3           zzz                                                2010-12-06 18:21:37.787
4           d                                                  2010-12-06 18:21:37.787
*/

And that, my friends, is how we keep a modification timestamp up to date.

What I’m Reading – 2010-12-03

Getting Started with Hive

This is a follow up to the last two posts I made about querying HBase and Hive. The set up for this was a bit trickier than I would have liked, so I’m documenting my entire process for three reasons.

  1. To remind myself for the next time I have to do this.
  2. To help someone else get started.
  3. In the hope that someone will know a better way and help me improve this.

Installing Hadoop and Hive

This was the easiest part of the installation process. I used the Cloudera distribution of Hadoop for everything in this stack. There are a few ways to go about getting the Cloudera Distribution of Hadoop (CDH for short) on your machine. The Hadoop Quick Start is a good place to get started.

Java

I’m pretty sure we’ve covered this before, but you need to have Java 1.6 installed in order to use any part of the Hadoop ecosystem.

Windows

If you’re using Windows, I recommend using the Hadoop VMWare image that Cloudera make available. If you want to run Hadoop natively on Windows, you need to set up Cygwin and creating a UNIX-like environment on your local machine. Setting. up Cygwin can be a bit tricky the first time you do it. Luckily, you’re not alone and documentation does exist: Setting up a Single-Node Hadoop “Cluster” on Windows XP.

Linux

Linux users have it the easiest out of any platform, in my opinion. Most of the documentation assumes that you’re running some UNIX-based operating system. If you’re willing to use the CDH version of Hadoop, you can tell your package manager to talk to the Cloudera repository. Cloudera have thoughtfully provided documentation for you in the installation guide.

You issue a few commands and you’ll have a working Hadoop set up on your machine in no time.

Other *NIX

This is the route that I took with OS X. You could also do the same on Linux if you want to use the stock distribution of Hadoop. The Hadoop Quick Start guide covers this in some detail, but here’s what I did:

wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2+737.tar.gz
wget http://archive.cloudera.com/cdh/3/hive-0.5.0+32.tar.gz
wget http://archive.cloudera.com/cdh/3/pig-0.7.0+9.tar.gz

I copied these three files into my /opt folder and uncompressed them. As you can see in the screenshot, I renamed the hadoop, hive, and pig folders to use a cdh_ prefix. This makes it easier for me to identify where the installation came from, just in case I need to upgrade or raise a stink with someone.

Aliasing Hadoop for Easy Upgrades

Configuration

Nothing is ever quite as easy as it seems at first, right? Now that we have our software installed, we need to supply some configuration information. Hadoop has three modes.

  1. Standalone In standalone mode – nothing runs as a service or daemon. This requires no configuration on our part, but it does make things trickier with Hive because only one Hive database connection can exist at a time in standalone mode.
  2. Pseudo-distributed This is what we’re aiming for. In pseudo-distributed mode everything is running on one machine as a service process. What we’re actually doing (sort of) is configuring our Hadoop installation to a be a one node cluster. It’s not reliable or failure proof (or a good idea), but it does let us develop on our workstations and laptops instead of buying a lot of servers for the house.
  3. Fully distributed This is a real live cluster running on many servers. Books are being written about this.

Hadoop is, by default, set up to run in standalone mode. Having a standalone Hadoop/HDFS installation is nice, but it doesn’t do what we need in the long-term.

There are Linux packages for CDH that will set up your server to run in pseudo-distributed mode. If you used one of those, you can skip ahead. Personally, I’d stick around just so you can read up on what you’re missing and so you make me feel better (I’ll know if you skip ahead).

core-site.xml

The core-site.xml file is used for machine specific configuration details. The default configuration is set up in core-default.xml. Settings in core-default.xml are overridden by core-site.xml, so there’s no reason to go messing around in there. If you want to check out the contents of the file, it’s located in build/classes/core-default.xml.

Once you’ve opened, or created your own core-site.xml, you want to put appropriate information in there. Mine looks like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system.  A URI whose
    scheme and authority determine the FileSystem implementation.  The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class.  The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>8</value>
    <description>The maximum number of tasks that will be run simultaneously by a
    a task tracker
    </description>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>

I’ll summarize these three settings here. A full explanation of the core-site.xml file can be found in the Hadoop wiki: http://hadoop.apache.org/common/docs/current/cluster_setup.html.

fs.default.name

This setting is telling Hadoop to use HDFS running on the local host for storage. This could also be a local file or something in Amazon S3.

mapred.tasktracker.tasks.maximum

We want to set this to a sane value because this is the maximum number of MapReduce tasks that Hadoop will fire up. If we set this value too high, Hadoop could theoretically start up a large number of tasks and overwhelm our CPU.

dfs.replication

Non-relational databases like Hadoop get their resiliency by replicating data throughout the database cluster. Since we’re running in pseudo-distributed mode (it’s a cluster with only one node) we want to set this to 1.

hadoop-env.sh

I changed a few variables in my hadoop-env.sh file based on settings I found in the Hadoop Wiki article Running Hadoop on OS X 10.5 64-bit). You can set these environment variables anywhere in your operating system, but since they’re Hadoop-specific, it’s probably best to set them in the hadoop-env.sh file. Based on my own experience, I did not change the HADOOP_CLASSPATH variable. Here’s what the first 15 lines of my hadoop-env.sh looks like:

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home/

# Extra Java CLASSPATH elements.  Optional.
# export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000

I’ve also added some additional configuration to my .profile that make working with Hadoop and Hive a lot easier:

export CLASSPATH=$CLASSPATH:/Library/PostgreSQL/9.0/share/java/postgresql-9.0-801.jdbc4.jar

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$HADOOP_HOME/bin:$PATH
export HIVE_HOME=/opt/hive
export PATH=$HIVE_HOME/bin:$PATH
export PIG_HOME=/opt/pig
export PIGDIR=$PIG_HOME
export PATH=$PIG_HOME/bin:$PATH
export HIVE_AUX_JARS_PATH=/Library/PostgreSQL/9.0/share/java/

# CLASSPATH variables don't seem to like symbolic links on OS X
# setting up the CLASSPATH here makes it easy to change things around
# in the future as I upgrade between Hadoop/Hive/HBase/Pig versions
hadoop_dir=$HADOOP_HOME;
hadoop_version=0.20.2+737;
pig_dir=$PIG_HOME;
pig_version=0.7.0+16;

export CLASSPATH=$CLASSPATH:${hadoop_dir}/hadoop-${hadoop_version}-core.jar:${hadoop_dir}/hadoop-${hadoop_version}-tools.jar:${hadoop_dir}/hadoop-${hadoop_version}-ant.jar:${hadoop_dir}/lib/commons-logging-1.0.4.jar:${pig_dir}/pig-${pig_version}.jar:${pig_dir}/pig-${pig_version}-core.jar

One Last Step

Before we can start anything running on Hadoop we have to get our namenode ready. The namenode is part of Hadoop that manages information for the distributed filesystem. A namenode knows where all of the data is stored across our distributed filesystem. Without a running namenode, we can’t find the files we’re storing.

Before we get started, we have to format our filesystem. This process creates empty directories to be used for the namenode’s data. Datanodes aren’t involved because they can be added and removed from the cluster dynamically.

hadoop namenode -format

You’ll see some output from this command and, once it’s all said and done, you’ll have a formatted namenode that’s ready to go.

Starting Hadoop

Once you have everything set up and installed, you should be able to start Hadoop by running start-all.sh from the command prompt. You should see some messages that tell you a bunch of nodes are being started with output going to a specific directory.

Starting Hadoop on the Command Prompt

Verifying That It’s Running

You can verify that everything is configured and running hadoop dfs -ls / on the command line. You should see a basic directory listing that looks something like this image.

Verifying that Hadoop is Running

What About Hive?

Astute readers will notice that we’ve only configured Hadoop and haven’t done anything to get Hive up and running. This is, unfortunately, a separate task. Hadoop and Hive are separate tools, but they’re part of the same tool chain.

Configuring the Metastore

Hive, like Hadoop, uses XML configuration files. But, since Hive is a database, it also needs a place to store metadata about the database – it’s called the metastore. By default, Hive uses what’s called an embedded metastore. The problem with the default configuration is that it only allows for one user or process at a time to connect to a Hive database. This isn’t that good for production databases. It’s also no good if you’re like me and forget which virtual desktop you left that query window open on. We’re going to fix this.

It’s possible to store the metastore’s information in a persistent external database. Hive comes configured to use the Derby database, however any JDBC-compliant database can be used by setting a few options. This is precisely what we’re going to do. There’s a delicious irony in storing one database’s configuration settings in another database. Since I’m a huge fan of PostgreSQL, that’s what we’re going to use to get started.

Creating the Metastore User

I don’t want Hive talking to PostgreSQL as me, or any other random user account, so we’re going to set up a new user.

CREATE USER hadoop WITH PASSWORD 'hadoop';
Creating the Metastore Database

Now that we have our user, let’s create the database:

CREATE DATABASE hadoop WITH OWNER = hadoop;
Database Driver

Of course, we also need a database driver. Up in the changes to my .profile I set up a HIVE_AUX_JARS_PATH variable. This is a : delimited path where we can tell Hive to look for extra functionality to load. This folder specifically contains the PostgreSQL JDBC driver. You can download the appropriate PostgreSQL JDBC driver from the PostgreSQL JDBC driver site. Appropriate documentation is also there. It’s pretty much as simple as copying the JAR file into an appropriate directory.

hive-site.xml

Now that we’ve created our metastore user and database as well as installed a JDBC driver, the last thing to do is to make some changes in hive-site.xml. The hive-site.xml overrides the settings in hive-default.xml – just like the Hadoop configuration we did earlier. One thing to keep in mind is that you can also override Hadoop configuration in this file as well. We’re not going to, but it’s good to know that it’s possible.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>hive.root.logger</name>
        <value>INFO,console</value>
    </property>
    <property>
        <name>org.jpox.autoCreateSchema</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
        <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://localhost:5432/hadoop</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hadoop</value>
    </property>
</configuration>

The javax.dbo.option.* settings are where we set up our metastore database connection. Everything else should be, largely, self-explanatory.

Creating Tables

Let’s go ahead and create a table in Hive:

CREATE TABLE leagues (
  league_id STRING,
  league_name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

This creates a table named leagues with two columns: league_id and league_name. The last two lines might look a bit funny to people familiar with SQL. They are specific to the Hive Query Language (HiveQL). These lines say that each row in the underlying data file will be stored as tab-delimited text.

Getting Data Into Hive

We’ve configured Hadoop and Hive and we’ve created a table. How do we get data into the table? That’s a trickier matter.

We’re going to be using sample data from Retrosheet to populate our Hive database. You should get started downloading and keep reading, it takes a bit of time.

Loading Data With Pig

Pig is a way to execute data flows written in a language called Pig Latin. In this use case, I’ve written very primitive data clean-up scripts using Pig Latin (it would have been just as easy, if not easier, to write them in Ruby or Perl).

The first section of this Pig Latin script registers the PiggyBank contributed functionality and creates convenient aliases for the functions. We use DEFINE to set up the aliases so we don’t have to type the fully qualified function name every time we want to use it. It’s a lot easier to type replace(state, 'Ohio', 'OH') than it is to type org.apache.pig.piggybank.evaluation.string.REPLACE(state, 'Ohio', 'OH').

REGISTER /opt/pig/contrib/piggybank/java/piggybank.jar;
-- Create a function alias because nobody wants to type in the full function
-- name over and over again.
DEFINE replace org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE substring org.apache.pig.piggybank.evaluation.string.SUBSTRING();
DEFINE s_split org.apache.pig.piggybank.evaluation.string.Split();
DEFINE reverse org.apache.pig.piggybank.evaluation.string.Reverse();

-- Read the first CSV file and dump it as a tab-separated file
league_codes = LOAD 'league-codes.csv'
USING PigStorage(',')
AS (league_id:chararray, league_name);

STORE league_codes
INTO 'league-codes'
USING PigStorage('\t');

names = LOAD 'retrosheet-CurrentNames.csv'
USING PigStorage(',')
AS (franchise_id:chararray,
    historical_franchise_id:chararray,
    league_id:chararray,
    division:chararray,
    location_name:chararray,
    nickname:chararray,
    alt_nickname:chararray,
    first_game_date:chararray,
    last_game_date:chararray,
    city:chararray,
    state:chararray);

-- We don't need to explicitly specify the PigStorage mechanism.
-- Pig will use tab separated files by default.
STORE names
INTO 'team_names';

-- It's our first transformation:
-- 1) Load the data from the text file.
-- 2) Remove double-quote characters from the data.
-- 3) Write the data to the filesystem.
ballparks = LOAD 'retrosheet-ballparkcodes.txt'
USING PigStorage(',')
AS (park_id:chararray,
    park_name:chararray,
    park_nickname:chararray,
    city:chararray,
    state:chararray,
    start_date:chararray,
    end_date:chararray,
    league_id:chararray,
    notes:chararray);

formatted_parks = foreach ballparks GENERATE
  park_id,
  park_name,
  park_nickname,
  city,
  state,
  start_date,
  end_date,
  league_id,
  replace(notes, '"', '');

STORE formatted_parks
INTO 'ballparks';

personnel = LOAD 'retrosheet-umpire.coach.ids.txt'
USING PigStorage(',')
AS (last_name:chararray,
    first_name:chararray,
    person_id:chararray,
    debut_date:chararray
    );

-- the debut date is the only weirdly formatted date in here, so we'll reformat
-- it to be sortable in YYYYMMDD format
--
-- I had originally intended to do something with this, but I never
-- got around to writing it. As it stands, it's a good way to show
-- how the string splitting function works (it returns an array)
-- and how that can be combined with flatten to turn it into a tuple
split_personnel = FOREACH personnel GENERATE
  last_name,
  first_name,
  person_id,
  flatten(s_split(debut_date, '/'))
AS (
    month:chararray,
    day:chararray,
    year:chararray);

-- These subsequent x_, y_, z_ stages load our personnel
-- relations into intermediate relations while we're doing
-- work on the data. This could have been done in a single
-- step, but it's much easier to read this way.
x_personnel = FOREACH split_personnel GENERATE
  last_name,
  first_name,
  person_id,
  reverse(CONCAT('000', month)) AS month,
  reverse(CONCAT('000', day)) AS day,
  year;

y_personnel = FOREACH x_personnel GENERATE
  last_name,
  first_name,
  person_id,
  reverse(substring($3, 0, 2)) AS month,
  reverse(substring($4, 0, 2)) AS day,
  year;

formatted_personnel = FOREACH y_personnel GENERATE
  last_name,
  first_name,
  person_id,
  CONCAT(CONCAT(year, month), day) AS debut;

STORE formatted_personnel
INTO 'personnel';

You might think that each of the STORE statements would produce a file named personnel or ballparks. Unfortunately, you’d be wrong. You see, Pig turns our Pig Latin script into a MapReduce job on the fly. The final output is written into the appropriately named folder in a part file. The part files have names like part-m-00000. This tells us that it’s part of a MapReduce job, specifically the map phase, and it’s the first part. If there were multiple map operations that were performed (if we had a full cluster), there would be multiple files with incrementing part numbers.

There is a second Pig Latin program that will load the standings, I’ve left that out of the article due to the sheer length of the code. However, it’s a relatively trivial script and should be easy enough to follow. The interesting thing to note is that the Pig Latin program to load the standings makes use of variable substitution. The driver script for this whole process, load_standings.sh, passes parameters to Pig on the command line and they are interpreted at run time.

For more information on Pig and Pig Latin, you should check out the Pig project page or the Pig Latin Cookbook.

Since we’ve formatted our data, how do we load it? It’s pretty simple to do with a Hive LOAD statement:

LOAD DATA LOCAL INPATH 'personnel/part-m-00000'
OVERWRITE INTO TABLE personnel ;

LOAD DATA LOCAL INPATH 'team_names/part-m-00000'
OVERWRITE INTO TABLE teams ;

LOAD DATA LOCAL INPATH 'ballparks/part-m-00000'
OVERWRITE INTO TABLE ballparks ;

LOAD DATA LOCAL INPATH 'league-codes/part-m-00000'
OVERWRITE INTO TABLE leagues ;

Because the data for the standings is separated out by year in the Retrosheet data set, we have to drive the insertion through a shell script:

echo "  loading Box Scores..."
for d in `ls -d stats/*`
  do
    year=${d:6:4}

    echo "    Loading box scores for $year"
    for f in `ls $d/part*`
    do
      `hive -e "LOAD DATA LOCAL INPATH '$f' INTO TABLE raw_standings PARTITION (year='$year'); "`;
    done;
  done;

This loops through the output folders for the standings one at a time and loads the data into Hive. The raw_standings uses partitioning to separate out the data by year. Similar to partitioning in a relational database, this lets us store data with the same partition key in the same physical file. This improves query performance when we’re querying on a limited range of data. In this case, we’re only partitioning by year, but it’s possible to partition on multiple columns for greater granularity.

To improve long term performance further, we’ve gone ahead and created a second table, rc_standings. The rc_standings table is partitioned by year, but it’s also using a ROW FORMAT setting to create an RCFile – a columnar files. The data in this table will be stored in a column-oriented layout. This makes it possible to skip reading from columns that aren’t used in our query. With a table as wide as the rc_standings table, that’s going to be beneficial for our query performance.

In order to get the data into the RCFile table, we have to copy it using a second driver in the shell script:

for d in `ls -d stats/*`
  do
    year=${d:6:4}
    # This only overwrites the $year partition in the standings table
    # We also make use of the REGEX Column Specification to avoid pulling back the
    #  partitioning column(s) in the query.
    #  See http://wiki.apache.org/hadoop/Hive/LanguageManual/Select#REGEX_Column_Specification
    #  for more about the REGEX Column Specification.
    hive -S -e "INSERT OVERWRITE TABLE rc_standings PARTITION (year=$year) SELECT \`(year)?+.+\` FROM raw_standings WHERE year = $year;"
  done;

Loading the standings took an incredible amount of time on my test system, despite using an SSD and having 8GB of memory available. I’m not sure why, and I would love to know what I did wrong, so feel free to sound off in the comments so we can make this process faster for people to get started.

Querying Hive

Once we have the data loaded, querying becomes possible. To connect to hive, open up a command prompt and type hive. You should see blank hive prompt waiting for your commands.

The Hive Prompt

Let’s take a look at the contents of a single table. Type select * from leagues; and hit enter. You should see something like the following:

A Simple Query

Let’s try something trickier and use a JOIN.

select l.league_name, t.franchise_id, t.first_game_date, t.last_game_date, t.nickname
from leagues l
join teams t on l.league_id = t.league_id;

Query Results - With a Join

So far, so good. These have both been very small queries. What if we want to look at some aggregated data. What then? Turns out that it’s just as easy to do as any other query.

select l.league_name, s.year, avg(s.home_score)
from rc_standings s
join leagues l on s.home_league_id = l.league_id
group by l.league_name, s.year;

Now We're Using GROUP BY

Once again, Hive is responding exactly as we would expect. That’s because HiveQL was designed to be as close to SQL as possible. The Hive engine is transforming our HiveQL into MapReduce jobs in the back end. Depending on the type of queries that we’re running, Our queries may execute on as few as one node in the cluster or as many as all nodes in the cluster.

While HiveSQL looks like SQL, some additions have been made that make sense for Hive – SORT BY guarantees that the order of results within a single reducer while ORDER BY guarantees the order of results in the total output. There is also a DISTRIBUTE BY command determines how our data will be distributed to the reducers that are handling the query.

Resources

The pig latin, shell script, and HiveSQL files used in this script can be found here, on Amazon S3.

The source data can be found here. Please note that you’ll need 7-zip to get to the source data. Which, by the way, was obtained free of charge and is copyrighted by Retrosheet. For more information, please contact www.retrosheet.org.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 230 items have been purified.