September 2010
Mon Tue Wed Thu Fri Sat Sun
« Aug   Oct »
 12345
6789101112
13141516171819
20212223242526
27282930  

Month September 2010

Cassandra Tutorial – Installing Cassandra

I was feeling a little bit crazy the other day and I installed Cassandra on my home computer. Twice. Just because. It wasn’t really a “just because” moment. I installed Cassandra twice so I could get the hang of it in case anyone asked me how they could go ahead and get started. The process was relatively painless, but there are some prerequisites that you need to have installed before you can get started. I thought that it would be good to create this short Cassandra installation tutorial to help you get started.

This video walks through everything you need to do get Cassandra installed and running on your computer. Apart from installing the OS. If you don’t have an OS, I don’t know how you’re reading this, probably witchcraft.

[media id=3 width=800 height=600]

What I’m Reading – 2010-09-24

Where Good Ideas Come From It’s more watching than reading, but whatever. There are words in there!

Storyselling 101 It’s a marketing piece, sure, but it applies to everything you do. The most compelling way to educate is to do it with a story. People identify with and engage in stories. Ever watch a good movie and end up on the edge of your seat for most of the movie? That’s engagement. Engage your readers with a story.

10 Hadoop-able Problems (a summary) WTF would you want to use Hadoop for? Here are ten problems you could use Hadoop to solve.

14 Ways To Have Fun Coding Ruby Even though it says it’s all about Ruby, these are great ways you can remind yourself that writing software is something that should be fun, not a chore. If you’ve always thought writing software was a chore, you should quit your job and sell shoes. Now.

Read Ruby A free book that covers Ruby 1.9. It’s still an early draft, but it’s a free learning resource that you can read and contribute to. The author is releasing the book under a Creative Commons license and the code is up on github so you can contribute back to the project.

Relational Database Design and the Optimizers I haven’t read this book yet, but I think it should be on everyone’s reading list.

Talking Is Good, Where Can I Do More?

We can’t all just pick a street corner and start spouting off about whatever strikes our fancy. That’s not how things work, unless you’re a crazy street preacher. Very few people down at the local university will appreciate you getting on a soap box with a megaphone and ranting about normalization. Trust me on this one.

A few days ago, Jen McCown (blog | twitter) put up a blog post about Breaking Into SQL Showbiz. It’s a great post and Jen encouraged speakers to get started with their local user group. Which I wholeheartedly encourage, btw.

What happens next? Let’s say that you’ve given your first presentation at your local user group or for your co-workers and everything went well. Where do you go from there?

Other User Groups

Find other user groups in your area. You can find them through a few websites:

  • PASS – The Professional Association for SQL Server maintains a list and interactive map of SQL Server user groups.
  • INETA – INETA is a bunch of .NET focused user groups. They have an interactive map as well as a traditional location based search.
  • Meetup.com – Meetup.com is a great place to find people who are interested in talking about the same things that you are.

Conferences

When you’ve decided that it’s time to move beyond user group meetings, or you want a bigger thrill, where do you go? Conferences, of course. For a long time I had a problem finding about conferences. It’s not as much of a problem these days.

  • SpeakNET This is an aggregation of other user groups, code camps, and local/regional conferences that are looking for speakers.
  • SQLSaturday There are so many great local SQL Server events that it’s hard to keep track. SQL Saturday does that for me.
  • Community Megaphone This is an aggregator built by a Microsoft Developer Evangelist. It makes it easy to find events to attend or speak at. There’s even an RSS feed.
  • O’Reilly Media O’Reilly produce some mighty fine books. They also put on some mighty fine conferences. If you’re interested in mingling with the not-so-Microsoft crowd, this is a great place to find events to attend and speak at.

You can also find a list of conferences and speaking events on twitter http://twitter.com/peschkaj/conferences.

Three Things to Watch with NoSQL

Last week I painted a rosy picture of NoSQL databases. Before you deploy any kind of NoSQL database, you need to be aware of the potential pitfalls of NoSQL databases.

1) Architecture

NoSQL databases have different architectural concerns than traditional RDMBSes.

If you’re wokring with a BigTable or Dynamo database, your network backbone will need to be able to handle the increased traffic from replication and materializing MapReduce data. Dynamo databases are very chatty – reads and writes are only guaranteed when a specific number of servers respond to a query.

MapReduce queries run across many nodes in a cluster. One stage processing a MapReduce query is temporarily materializing intermediate results from each node. The intermediate results may then be shipped to another server for additional processing, aggregation, or just be combined into the final result set.

2) Querying Strategy

RDBMSes are able to perform a mix of batch and ad hoc querying. They produce reasonably fast results for doing online aggregation and analysis of data. It’s just as easy to write a query that returns a single row from a relational database as it is to write a query that aggregates data from several thousand rows.

Not all NoSQL databases are created equally. Hadoop and Cassandra are built to perform large scale analytical queries. Their implementation of MapReduce is designed to process batches of analytical results. Frequently these results are going to be pre-aggregated rather than delivered in real time. Riak and MongoDB, on the other hand, are designed to handle interactive querying.

3) Data Modeling

Data modeling is a tricky field of study in the world of relational databases. A huge number of books have been written on the subject and they deal with both the theoretical and practical aspects of modeling data. In short, while data modeling is tricky it is also a well understood field of study.

Data modeling in the non-relational world is an entirely new field. Many problems that have been solved in relational data modeling are new and require a different way of thinking. BigTable databases, while their structure may initially appear familiar, require different modeling techniques to get the most out of the data store. MongoDB’s document oriented approach makes a great deal of sense to developers, but it still poses a number of data modeling questions about how to link related objects and how deeply nested object graphs should be in the database. Key/Value stores, like Riak, have very little structure and give the application developer a huge amount of freedom to model data.

Summation & Further Reading

NoSQL Databases hold a lot of promise to help businesses rapidly respond to data growth. Whenever you’re planning to make a major infrastructure decision, you need to weigh the pros and cons before determining which solution is right for you. While NoSQL databases provide answers to a lot of questions, they also pose a different set of questions that require a different set of skills to answer.

NoSQL Glossary
MongoDB – Sharding with the Fishes
Hadoop for Batch Processing (from the Yahoo! Developer Network)
Schema Design in Riak
Schema Design for Riak (presentation slides)

What I’m Reading 2010-09-17

Facebook Privacy Settings for Bigamists

The Texas Sharpshooter Fallacy Your brain plays tricks on you and looks for patterns everywhere. Don’t be that guy.

Cognitive Load A world with more heads up displays and instantly available information carries a lot of distraction. I don’t want more information constantly streaming into my brain. I want the right information instantly available, when I choose it.

Peak MHz Scale out, young man. Scale out.

Online Schema Change for MySQL Facebook have special problems – they have mountains of data. This document is a collection of internal documents describing how the engineers at facebook make changes to their massive databases.

Performance implications using NewID() in Random order

Is Cassandra to Blame For Digg v4s Technical Failures?

Lost World’s Fairs Beautiful HTML and CSS work that showcases the newer features in IE9 (also WebKit, Chrome, Safari, Opera, and FF). This will not be good for you if you have an older browser. You have been warned. The various people involved have blogged about the process, too. It’s quite interesting to read. Frank Chimero, Jason Santa Maria, Trent Walton, Naz Hamid, and Dave Rupert

Five Reasons to Use NoSQL

You’re not using a NoSQL database right now? I’m not all that surprised. Many IT shops are still evaluating moving from SQL Server 2000 to SQL Server 2005, much less these non-relational databases. A lot of people don’t even know what they would use a NoSQL database for. Does it replace the RDBMS? Work alongside it? Do something else?

1) Analytics

One reason to consider adding a NoSQL database to your corporate infrastructure is that many NoSQL databases are well suited to performing analytical queries. Developers can use the same querying languages to perform analytical queries that they’re using to perform atomic queries. Typically this will be some variation of a MapReduce query, but it’s also possible to query data using Pig or Hive. Don’t worry too much about these weird language terms, MapReduce is a fancy way of saying “SELECT and then GROUP BY” and doing it in a way that is entirely confusing to people who are used to SQL.

Many NoSQL systems boast phenomenal write performance. When you combine high write performance with batch processing it is easy to pre-aggregate data, summarize results, and still guarantee ad hoc query performance.

2) Scale

NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale by purchasing larger and faster servers and storage or by employing specialists to provide additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate queries are distributed by default. Scaling is as easy as racking a new server and executing a few commands to add the new server to the cluster (yeah, it really is that easy). Data will start flowing and you’ll back in business in no time.

3) Redundancy

In addition to rapid scaleability, NoSQL databases are also designed with redundancy in mind. These databases were designed and built at massive scales where the rarest hardware problems go from being freak events to eventualities. Hardware will fail. Rather than treat hardware failure as an exceptional event, NoSQL databases are designed to handle it. While hardware failure is still a serious concern, this concern is addressed at the architectural level of the database, rather than requiring developers, DBAs, and operations staff to build their own redundant solutions. Cassandra uses a number of heuristics to determine the likelihood of node failure. Riak takes a different approach and can survive network partitioning (when one or more nodes in a cluster become isolated) and repair itself.

4) Flexibility

What’s the use of a database if it’s not flexible? While the data modeling issues are completely different in NoSQL, there is a large amount of flexibility in how data is stored for performance.

Databases modeled like Bigtable and Cassandra provide flexibility around how data is stored on disk. It’s possible to create derived column families. In plain English: you can design your database to duplicate frequently accessed data for rapid query response. This is, of course, based on the assumption that writes and storage space are cheap.

Databases based on the Bigtable model also have another benefit – outside of key structure it’s possible to store a variety of disparate data in the same table. Structure is largely irrelevant. Relational databases have adopted features to solve similar problems (such as sparse columns in SQL Server), but they carry overhead. Storing wildly different columns in multiple rows of the same column family is so cheap as to be invisible in a NoSQL database.

Lastly, key-value stores provide an incredible level of flexibility. Data is arbitrarily stored as a value. Key-value databases make it possible to store images, word documents, strings, integers, and serialized objects within the same database. This requires more responsibility and creative thinking on the part of application developers and architects but it also lets the people designing the system build custom a completely custom solution that fills their needs.

5) Rapid Development

Let’s face facts: everyone wants their application to be faster, have more features, and they want it yesterday. NoSQL databases make it easy to change how data is stored or change the queries you’re running. Massive changes to data can be accomplished with simple refactoring and batch processing rather than complex migration scripts and outages and it’s even easier to take nodes in a cluster offline for changes and add them back into a cluster as the new master server – replication features will take care of syncing up data and propagating the new data design out to the other servers in a cluster.

T-SQL Tuesday – Indexes

arbitrary logo because a world without rules is a world with fun

When Michael J Swart asked me to take part in T-SQL Tuesday #10 – Indexes, I was incredibly flattered. Nobody’s ever asked me to do anything since a cop asked me to stop doing that one thing (speeding). I had to say yes. Here’s my contribution to T-SQL Tuesday #10.

An index is a horological lever-like something or other.

WTF is an index?

My 1967 Children’s World Book Encyclopedia doesn’t provide a definition for an index. That’s probably because it’s an encyclopedia and not a dictionary Thanks to the local gypsy garage sale going on in my neighborhood, I was able to acquire a dictionary from 1934 which defines an index as “a leverlike regulator for a hairspring.” Apparently this has something to do with clocks. If you try to verify this, you’ll notice that it’s a horological term. Apparently that has something to do with clocks and not adult entertainment.

Obviously neither my Children’s World Book Encyclopedia nor my Book of Learnin’ and Such was going to help me define an index.

In the spirit of true science, I realized that I would have to make something up. Or do my homework. One of the two.

SQL Server and a Drought of Options

SQL Server doesn’t have a wealth of indexing options available. In fact, there are fundamentally only two options – indexed and STFU. Sure, there are multiple indexing techniques, but this is my story.

At the core of SQL Server’s indexing strategy is the self-balancing binary search tree or, for those who don’t like to sound like pompous assfaces, the b-tree. B-trees are full of special magic. Well, not really. A b-tree is based on an algorithmic model that strives to keep the height of the search tree (the number of nodes between the root and the leaf nodes) as small as possible. This means that finding any single piece of information should be equally as cheap/expensive as finding any other piece of information.

That sounds great, right? Well, one of the fundamental indexing problems is that the data in the index has to be kept up to date as the data in the table changes. You can’t have data flying around willy nilly.

This is where we get to start sub-typing our b-trees.

Oh, So There ARE Options

Yes, I lied to you. Sort of. SQL Server has multiple types of indexes, just not multiple… types of index. There’s no way to control the indexing mechanism that SQL Server uses, you’re stuck with the pompous assfacery of self-balancing binary search trees (it’s not really a bad thing).

So, getting back to the topic at hand, in SQL Server we do have options. Two of them: clustered and non-clustered indexes.

Cluster This!

A clustered index is an ordered collection of nuts stored for winter and categorized appropriately… or data. One of the two. When we create a clustered index, all of the nuts/data is sorted and written to disk in the appropriate logical sort order. From this point forward, no physical order is guaranteed. Once the clustered index is created and the data is placed in physical order all bets are off. The data will be stored in logical order, but there is a chance that the data may not be physically in order due to the vagaries of disk access, deleting rows and/or tables and other assorted whatnot. (I think the horologists might have something to do with it.)

What goes in a clustered index? Everything. I told you the data was copied in. Didn’t I? Well, it is. Our clustered index is all of the data in the table and it’s in order based on the clustering key value.

The choice of the b-tree algorithm actually has some implications for how we choose the index columns for the clustered index. (I’m going to call these columns the clustering key. That’s what they are. Deal with it.) Because we’re using a B+ tree, instead of another data structure, to store and sort indexed data we have to take a few things into consideration. B+ trees do really well when every indexed value is unique. This is by design – it’s easier to look up any given row if you know that there is only one row that corresponds to that key. If there’s more than one, things get tricky and I’m not going to talk about that.

B+ trees have other characteristics that make them optimal for use in databases – it’s really easy to find any single element in the tree (even compared to other binary tree mechanisms) and they’re wonderfully optimized for read access when data is being read in sequence across ranges of data.

Searching Is Good

Sometimes you won’t know the key to reference a particular piece of data. Sometimes you’ll want to pull back records for everyone who lives in Tuscaloosa, AL. If you haven’t ordered your data by city and state, this is going to be really difficult to find without scanning the entire table. How do we get that data back?

More indexes!

In addition to clustered indexes, which define how the data should be ordered on disk, we also have non-clustered indexes. We can use these guys to help speed up our search. These are really just supplementary b-trees that point back to the clustered b-tree. You do have a clustered index, right?

That’s it?

No. Of course it can get more complicated than all of this, you can have composite indexes with multiple columns, non-unique indexes, indexes with computed columns, and even full-text indexes. Unfortunately the hookers and their lever actuated hairsprings are not going to let me keep talking.

In case you’re wondering what would happen to your database if you didn’t have any indexes, take a look at this:

i can't find any indexes!

What I’m Reading – 2010-09-10 Edition

ISO Time Formatting Standards Read it. Love it. Use it.

Reducing Bloat Without Locking Just in case you wanted to know all about ghost record clean up in PostgreSQL land.

Links and Link Walking Links are crazy awesome fun and make riak a wonderful thing to behold. BEHOLD THE LINKS!

Cassandra Data Model Cheat Sheet New terminology is confusing. Pictures are helpful.

Setting the record straight – “So, What is an MVP Anyway?” Awards like Microsoft’s MVP award are ways for companies to recognize the help that people selflessly give to the community at large. Like many other awards, it’s an acknowledgement of something you’ve done and continue to do without seeking any recognition or monetary reward. My good friend Jonathan shares his thoughts much better than I could.

A Brief History of Programming

Quotes on Car Design Ford’s chief designer sounds off.

Tiny Pony Stampede Scattered thoughts from a visitor to the post office.

Moving Away From Relational Storage – Introduction

Don’t. Fooled you, didn’t I?

If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it.

There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like re-writing your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to make use of large query result caching in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start new you should ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up, it’s how we learn.

Here are the questions I’ve started asking myself when I start a new project/feature that needs additional data storage:

  • Why are we storing this data?
  • How much data will I collect in a week? A month? A year?
  • How long will this data need to live?
  • How will this data be used?
  • How structured does this data need to be?
  • How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will I write this data? How often will I read my data?

Different types of databases have different use case profiles. The way that you’re using the data will make a big difference for how you’re going to store data. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you’re storing the data. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database, why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed you’re collecting data. The faster you need to collect data, the more you need to look at how you’re storing that data.

Every database engine employs different strategies to maximize I/O throughput. The problem here is that they all use different strategies based on their use case.

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in memory storage solution (like memcached or AppFabric or even riak’s riakkvcache_backend) to solve this problem. Disks are slow, memory is fast.

If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up the ability to write data but it does bring up some data life concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures.

Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats but there may be a variety of complications as a result of the general nature of SQL and the relational model. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs.

Considering how the structure of the data can make it very easy to make a decision on the data store that you’ll be using. If a lot of flexibility is required, it may even be possible that a Big Table derived database, such as Cassandra or HBase might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself about how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought back online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately.

The immediately availability of data after a write as well as the long term (fault tolerance) availability of data is an important aspect of choosing a data storage solution.

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprised you.

In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.

Further Reading

NoSQL Summer
NoSQL Heroku and You
What is Big Data?
Big Data Is Less About Size, And More About Freedom
NoSQL: The Dawn of Polyglot Persistence

…And This Is Why I Love Open Source Software

From the daily Riak Recap:

10) So check this out:

At 17:40 yesterday, solidsnack asked, “Does Riak support HTTPS?”

At 22:28, benblack made it work —>
https://gist.github.com/4413fafc14fc18696643

The repo for his branch is here —> http://github.com/b/riak/tree/https

A user requested SSL support. 4 hours and 40 minutes later, it was implemented and checked in to source control. It will probably be available to the public in a short period of time.

Happy Friday!

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 214 items have been purified.