Tag Database

Stir Trek: Thor Edition (the Aftermath)

I spent the weekend at Stir Trek: Thor Edition in Columbus, OH. While I had a blast speaking about databases, I had even more fun attending and learning.

Programming the Cloud with HTTP/REST

I knew about REST before I attended this talk and I’ve done a bit of REST programming (right before I decided to nerd out on data), but Mike Amundsen did a great job of convincing me why I should care about REST as a programming paradigm for web developers. I’m not going to go out and start writing code to build my own REST services, but I have a better grasp of how I can work with developers to build robust data driven services and applications and do some incredibly cool things in the process.

CSS 3: Something for Everyone

John Keyes delivered a presentation about the basics of CSS 3. I knew there were some great features of CSS 3, but I also knew that the spec is incredibly broad. I was a bit worried that John’s presentation was going to be a whirlwind tour with very little depth. Instead of a shallow romp through CSS 3, John delivered a solid presentation that worked through some core features of CSS 3 in a practical manner and built up on them to demonstrate new techniques with real world value. Except for maybe the demo that made a rectangle swirl into a circle, that was just cool.

John gets mad props for filling his slide deck with Jurassic Park references. I had some laughs while I got my learn on, and that’s a good thing. I’ve worked with John on presentations in the past and he’s become a phenomenal presenter, programmer, and web developer in the time that I’ve known him.

Real World jQuery

Matt Nowack had the difficult job of speaking right after lunch. He gave a great talk about jQuery 1.5 and 1.6. It turns out that Matt wrote the talk for jQuery 1.5 and did a great job of introducing some of the newest features of my favorite JavaScript library. jQuery 1.6 came out recently and it made parts of Matt’s talk irrelevant. Matt took it in stride and wrote new content earlier in the week and delivered a top notch presentation that was educational and entertaining. I overheard one attendee say that they were rushing off to Matt’s presentation because it was bound to be good. They were right and I was also right to pay them 15 bucks and a box of Milk Duds to yell it at the top of their lungs. Good job, Matt.

The Rest of My Time

I spent the rest of my time preparing and giving my talk The Database is Dead, Long Live the Database. If you attended, the resources page will always be online at http://brentozar.com/stir-trek-thor-edition. If you missed it and you wanted to see it, you’ll be able to catch it on June 28th at 8:00 PM Eastern through the PASS Application Development Virtual Chapter.

I was incredibly flattered when Matt Casto asked me to speak at this event, way back at Code Mash. I’m glad that he was clearly drunk and thought I would make a great speaker. You can’t imagine how happy I am that he accepted the bribes I sent his way, just in case he sobered up and didn’t remember asking me to speak. Luckily, most of that’s a lie. I did, however, have an awesome time in Columbus hanging out with old friends, hopefully making a few new ones, and only telling one STD joke during my presentation; nice try, guy in row three!

In summation: thanks Matt, thanks Stir Trek, and thanks Columbus!

Building Our Own Federated Database

We’ve already talked about The Promise and Failure of Federated Databases and Why Don’t We Have Federated Databases. At the end of the second post I concluded that the only real way to solve this problem is to build the federated database ourselves. Before you ask, “Does he really want us to roll our own database” take a deep breath and relax; nobody is going to be writing a database.

What is a Federated Database?

When I was looking at the definition for a federated database, it dawned on me that a federated database doesn’t have to be an actual RDBMS, or any other type of existing DBMS. A federated database is a meta-database management system (or so Wikipedia claims). Looking at the other major explanation of a federated database we end up with “A federated database architecture is described in which a collection of independent database systems are united into a loosely coupled federation in order to share and exchange information.”

I’ve already talked about the implications of the first definition – it’s leading us down the path of a monolithic master server that must be aware of the other servers in the federation. New servers won’t be a part of the federation until we make the federating server aware of them. The other road, a loose collection of independent servers, is beginning to gain ground as companies bring more databases online in their data centers. When I say “more databases” I’m not just talking about a larger number of databases from a vendor, I’m also talking about databases from different vendors. Many people are exploring this route right now, some of them have attached the name of polyglot persistence to this approach.

Why Do I Want a Federated Database?

There are a couple of reasons that you’d want to roll your own federated database. I touched on them in the first part of this series: you may want to query across databases, you might have legacy systems, you may have merged with another company, or you might be using the most appropriate database for the job. Whatever you’re doing, you probably have a number of databases and you need to stitch them together.

Where Do I Get Started?

There are a number of ways we could go about creating a federated database. A lot of the ways to build a federated database solution are incredibly complex and involve creating meta-data databases as well as devising ways to link the databases together in an easily query-able way. I’m going to propose something different. Instead of designing something on your own, use the technology you already have and that your programming platform already comes with and understands: TCP/IP.

Nearly every programming language is capable of talking to other programs over TCP/IP. Instead of creating custom databases and worrying about meta-data management and cross server querying, create common services that answer common questions. Break your monolithic application down into manageable services and write those services using the most appropriate technology. Over my career, I’ve found that very few users need the ability to run ad hoc reports over the entire corporate data set. If users don’t need to be able to interactively query the entirety of their corporate data set, what do they need?

Almost all users need a small set of reports and data. Even when we expand the definition of “users” to include applications, services, APIs, and protocols, most activities are incredibly limited. Our users are asking the same sets of questions: How many accounts receivable have aged more than 30 days? What do the sales figures for the New England region look like for the last three years? Even when users are adding data to our databases they’re still performing a limited set of actions like saving an entire order, signing up for a new account, or adding a new accordion to their shopping cart. The activities that users perform data are very limited.

Knowing that users only perform a few activities with our live data, we can safely make some assumptions about the type of data access people will need. Keeping that in mind, it’s a lot easier to see how we can build our own federated database: we’re not going to. We’re going to build our own system using what many people call polyglot persistence.

Designing for Polyglot Persistence

The idea behind polyglot persistence is that we keep our data in the best database for storing that particular kind of data. Achieving this goal is achievable, but if that were the end game, it wouldn’t be the most useful goal for the business – business users want to see reports and combine data across applications and business units.

Going one step beyond the basics of polyglot persistence, we want to add another layer – a caching/service layer. It’s in this layer that we can start to really add rich functionality to the data that the business needs. Instead of having to replicate data across multiple data sources, we can query two separate servers and combine the data together before we return it to the client.

We’ve been doing this for years – it’s nothing new. The only thing that is remotely new is storing our data in the most suitable database. Well, that and telling our caching/service layer to cache as much data as possible while writing in the background. If we keep most of our data in cache, we don’t have to worry as much about write performance in the back end. We can queue writes to make sure they commit during idle times, we can spread them across many servers, and we can write to many reporting databases at once to make sure that reports are up to date. By moving application and reporting logic into an application and reporting tier, we free the database to focus on the tasks that databases excel at: storing and retrieving data. Complex logic and strange data mucking can be handled in the application layer by simple (or highly specialized) algorithms.

Polyglot persistence becomes incredibly valuable when we build mechanisms to load data from all of our disparate line of business systems into a single enterprise data warehouse. Once we have all of our data in a single warehouse, we’re able to write queries across business boundaries. The enterprise warehouse doesn’t need to be in a single monolithic RDBMS server; it could use Microsoft SQL Server Parallel Data Warehouse, Oracle ExaData, Postges-XC, HBase, Cassandra, or any other database that is up to the task.

Wrapping it up

Polyglot persistence seems to be the best answer to building a federated database. It doesn’t provide any kind of automated meta-data management or support for distributing queries automatically across many servers. Instead, polyglot persistence makes it easier to build a robust system that answers the questions business users both want and need while remaining fast and flexible. Is it the be all end all solution? No. Is it a step in the right direction? Yes.

Why Don’t We Have Federated Databases?

Federated databases are a dream that have not materialized. The SQL/MED extension to the ANSI/ISO SQL specification is a step in the right direction. In addition, both SQL Server and Oracle have proprietary extensions that make it possible to query external data sources. If all of this technology is available today, why aren’t more people using it?

Why Don’t We Have a Federated Database?

If federated databases are such a powerful thing, why hasn’t anyone built one? Surely this is something that many businesses are clamoring for, or at least in need of.

There are a number of problems facing anyone attempting to implement a federated database. Frankly, federated databases are fraught with technical difficulties. Every database vendor supports a different subset of the ANSI/ISO SQL standard, different vendors use different data types and metadata, concurrency is a huge concern with dealing with remote resources, and technology is a moving target.

Different Dialects of SQL

Anyone who has attempted to port an application from one database engine to another knows about the pain involved in translating queries. Different vendors adhere to the SQL standard to varying degrees. This problem can be solved by creating wrappers to translate sub-queries between different querying languages, but it’s still a problem that exists. This problem could be partially solved in a federated database by limiting the database vendors to a small subset of the ANSI/ISO SQL standard, but this doesn’t solve the problem, it merely avoids it by limiting functionality.

Different Metadata

Different databases may have different metadata – different data types are used to represent the same data and different structures are used to describe data. To solve this problem it becomes necessary to create elaborate mappings between columns that represent the same data. There may be situations where such a mapping becomes computationally intensive or functionally impossible. SQL Server allows the creation of .NET data types with custom methods for data searching, access, and manipulation. PostgreSQL features several data types (notably tsvector, hstore, and arrays) that would be difficult to convert to data types in other databases.

Metadata differences don’t just stop at the data type level. It’s possible to model data in a number of different ways; the type of an address could be indicated using an integer key value that references a lookup table in one database or as a VARCHAR column with values constrained by the database. It’s even possible for something as simple as Unicode text encodings to cause problems: SQL Server uses the NVARCHAR data type for storing Unicode strings while other databases do not use a separate data type.

Concurrency

Concurrency, depending on your database, may be a concern. Managing concurrent operations within a single database is a difficult task, much less managing concurrency across multiple databases. Unfortunately, correct handling of concurrency across all components of a federated database is critical.

Many potential problems of a federated database can be solved through different trade offs. However, managing concurrency is a nearly impossible task. To properly and effectively manage concurrency across multiple databases is to ask the impossible. Not only would this require the federated database vendor be able to account for all possible concurrency issues in relational databases, but they would need to be able to account for potential concurrency issues in any database that integrates with the federated database.

The Moving Target

Even were a database vendor to take on this task, they’d be consistently aiming for a moving target. New features are added to relational databases all the time, and there are enough major players in the market to make it difficult for users to keep up to date, much less a federated database vendor. Once you factor in the wealth of other, non-relational databases, the idea of creating a federated database system to handle metadata mapping, concurrency control, and query language resolution trends toward impossible.

The State of The Industry

Where we stand now, there is almost no chance of any independent software vendor creating a true federated database. There is hope, but not from where you would expect it.

Enterprise data warehouses can fulfill much of the function of a federated database, but they still require complex ETL and data mapping to be truly useful. Adding additional information to an enterprise data warehouse can require extra work to prepare the data warehouse and ETL processes for the new data. Unfortunately, enterprise data warehouses require too much manual intervention to be a candidate for a federated database.

As we’ve discussed, SQL Server and Oracle provide ways to reference remote database servers. These methods have their own problems. SQL Server linked tables are prone to problems with some objects not being remoteable. When we’re querying a remote server, we need to make sure that the parts of our query going to the remote server are handled on the remote server. This is difficult to get right. On some occasions we might even see an entire remote table be streamed across the wire to be filtered on the originating server. This is something that we don’t want to see happening. For a federated database to be a tenable product there must be an easy way to offload queries to a remote table and a guarantee of adequate performance.

Properly remoting queries is incredibly complex. Assume, for a moment, that we have a report that queries data on the sales department’s database server and we also need to include data from human resources data. Our query might look something like this:

SELECT sp.first_name,
       sp.last_name,
       eh.employment_duration,
       ts.year,
       ts.total_sales_by_year,
       r.average_review_score
FROM   public.sales_person sp
       JOIN HRDB01.employee_info.public.employee_history eh
            ON sp.employee_id = eh.employee_id
       JOIN ( SELECT o.employee_id,
                     o.year,
                     SUM(o.total) total_sales_by_year
              FROM   public.orders AS o
              GROUP BY o.employee_id, o.year
            ) AS ts ON sp.employee_id = ts.employee_id
       JOIN ( SELECT r.year,
                     r.employee_id,
                     AVG(r.score) AS average_review_score
              FROM   HRDB01.employee_info.reviews.review r
              GROUP BY r.year, r.employee_id
       ) AS rv ON sp.employee_id = rv.employee_id
                  AND ts.year = rv.year;

Looking at this query we’re hitting two separate remote objects in one remote database. In order for this query to be effective, our query optimizer must b able to re-write the query in a way that lets it build an intelligent query for the two remote tables HRDB01.employee_info.public.employee_history and HRDB01.employee_info.reviews.review. But, in order to effectively re-write the local query to properly reference remote objects, we need to know everything about the remote objects – our calling server must be aware of as much metadata as possible so the remoted query can be re-written before being sent to the remote server. While this is doable, it puts additional load on the calling server. This server now has to maintain information about remote database objects. But that’s not all! If we want our queries to be truly optimal, our federating sever will need to be aware of how data types will behave on the remote servers and how the remote data types will interact with local data types.

Once you examine the intricacies of a federated database, it becomes obvious why the federated database, as a boxed product, is beyond our reach. It’s not that the task is impossible; on the contrary such a task is very possible. The difficulty lies in coordinating all of the information available and using it to deliver data quickly. There are enough moving and potentially unknown parts that it’s non-trivial to create heterogeneous systems capable of filling out the promise of federated databases. Faced with this situation, the only viable solution is to build your own solution that answers the needs of the business.

http://railstips.org/blog/archives/2011/01/27/data-modeling-in-performant-systems/

The Promise and Failure of Federated Data

One of the biggest problems facing businesses today is integrating data from multiple sources. The holy grail of data integration is called a federated database system. Basically, a federated database stores meta data about other databases and makes it easier to integrate them through a single interface. Many relational databases have features that support integrating with other relational databases through SQL Server’s linked servers or Oracle’s database links. One of the problems with these features is that they only allow relational databases to talk to other relational databases. As the volumes of data we collect every day increase, more and more of that data is being stored outside of relational databases in CSVs, spreadsheets, log files, PDFs, and plain text as well as in a variety of non-relational databases like MongoDB, HBase, Riak, and Cassandra.

If the only thing we’re looking for is access to meta data for external data, SQL Server will provide the [FileTable][8] data type in SQL Server 2011. Admittedly, FileTable isn’t an acceptable solution because it’s really intended to make it possible to reference files in the database that are being managed by external applications through the filesystem and Win32 APIs. Clearly, this doesn’t suit our need for querying external data.

It’s also possible to use ETL tools to move data into relational databases. One of the problems with ETL tools (SSIS, Pentaho Data Integration/Kettle, or Oracle Data Integrator) is that they are effectively batch operations. New data insertions will have to be triggered by some external event; the data isn’t available until it’s migrated into some master system.

Luckily, there’s an extension to the SQL Standard to help us: SQL/MED. MED stands for Management of External Data. This is a way to link up any external data source to a database server. It doesn’t have to be another relational database – there’s already a twitter foreign data wrapper library. Unfortunately, PostgreSQL is currently the only major database player on the market with any potential for an implementation for SQL/MED.

The implementation of SQL/MED just isn’t here, yet.

Another promising project is HadoopDB. HadoopDB is a project coming out of Yale University. The aim is to make it possible to run analytical workloads in parallel across many commodity RDBMS servers. One of the goals of HadoopDB is to excel in areas where parallel data warehouses simple do not perform well. Many of these situations are outline in the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. To summarize, parallel data warehouses provide near linear scaling up to several hundred nodes running on homogeneous hardware. Parallel data warehouse also operate under the assumption that failures are rare. Google and others have demonstrated that hardware failure is inevitable at scale. HadoopDB presents a phenomenal way to scale databases and integrate disparate technologies.

Despite its promise of scaling databases, HadoopDB still doesn’t solve the problems that we face when trying to build a federated database system. The truth is a depressing one – there is currently no solution for building federated databases that incorporate data from across the enterprise. Database vendors, DBAs, and more traditional corporate IT departments will tell you that this is a Good Thing™. I’ve mentioned before that you should choose the database that is best suited for the task at hand.

Where do we go from here? SQL/MED doesn’t meet its own promises – only one vendor is implementing the SQL/MED standard and that support is going to depend on third parties releasing drivers. HadoopDB isn’t a federated database so much as it is a way to avoid scaling a relational database into thousands of cheap nodes and paying millions of dollars in licensing fees for Teradata, Microsoft’s Parallel Data Warehouse, or Oracle’s Exadata. The unfortunate truth is that if we want a federated database we’re going to have to build it ourselves.

What sounds like a Sisyphean task isn’t as difficult as it sounds. If we’re collecting data in multiple databases, the problem is already almost solved. Some of those technologies are already here. LINQ lets us treat all data sources equally; we can query an array of objects as easily as we can query a database. ARel is a relational algebra for Ruby. While ARel is specifically focused on working with relational databases, it could be extended to work with many different data sources. Business intelligence vendor Jaspersoft recently announced support for a number of non-relational databases to complement their existing business intelligence products. Quest Software, makes Toad for Cloud Databases – a tool for querying both relational and non-relational databases.

A federated database may never materialize, but it’s already possible to build a hybrid database solution today.

References

PostgreSQL Extensions

PostgreSQL Extensions

Dimitri Fontaine has been working on adding extension packaging functionality to PostgreSQL for a while. His hard work has paid off and PostgreSQL 9.1 will have extensions.

How will extensions work? At a cursory glance, extensions are similar to libraries or DLLs in the programming world. Extensions are packaged units of code that can be installed using the CREATE EXTENSION command. This command will execute a script file that installs any of the extensions objects, including any C code that needs to be built or loaded. If this was the only thing that extensions provided, it would be a nice feature. Interestingly, there’s more.

PostgreSQL’s extensions may consist of any number of database level objects – tables, functions, views, etc. These objects are versioned together as part of a discrete package. While PostgreSQL won’t let you drop any database object that’s part of an extension, you could go ahead and break an extension using CREATE OR REPLACE FUNCTION. Those changes won’t be tracked as part of the extension.

Once our extension has been installed, it’s going to be managed as a versioned chunk. The best way to make changes it to upgrade the extension using code supplied by the extension developer. The extension code lets developers supply a version number, as well as some other metadata. To be fair, the extension code in PostgreSQL requires that extension developers supply a version number; extensions without a version number will fail on creation with some kind of helpful error message about an invalid version name. But, once we’ve created an extension with a version number we get a few niceties.

  1. We can tell which version of an extension in installed in every database on our server by looking in the pg_extension table of each database.
  2. We can perform updates by comparing the current version to update scripts. Admittedly, ALTER EXTENSION looks like it should handle this for us, but it’s still cool.

What else do we get from the new extensions?

Configuration! Extensions have to ways of setting up their configuration: control files and configuration tables. Extension developers can supply limited configuration through a control file. The control file lets an extension developer list things like the name, a description, and any other extensions that are needed for proper functionality.

In addition to the control file, extension specific configuration information through configuration tables. Not only can we supply additional functionality, but it’s possible to make our functionality flexible. Like all important data, it’s important to back up the configuration. To make sure that users’ configuration information is backed up, extension developers can use the pg_extension_config_dump function in their extension installation code to mark extension tables that should be a part of back ups; PostgreSQL will ignore extension tables when pg_dump is run. Interestingly, pg_extension_config_dump takes a second parameter that will be used as a search predicate when pg_dump is run. This can be used to only backup user supplied parameters or work through whatever convoluted scheme you’d like (parameters are only backed up on bank holidays).

We also get extra control about where extensions live. Have you ever installed software from a vendor and been a bit annoyed about where it ends up in your database? Both the CREATE EXTENSION and ALTER EXTENSION commands make it easy to relocate an extension to a new schema. DBAs can opt to install extensions into one schema and move them into another at a later date. We could install a vendor library and move it to a safe schema at another time, for example ALTER EXTENSION magic_vendor_security_audit_tools SET SCHEMA vendor_audit;. ALTER EXTENSION will even update the definition of tables that already exist.

I suspect that the extension mechanism in PostgreSQL 9.1 is going to have far reaching effects for PostgreSQL. The PostgreSQL development team are already moving procedural languages into extensions and bundling additional functionality (formerly living in the contrib folder) into extensions. This addition to PostgreSQL should also make it much easier for software vendors to supply additional functionality for PostgreSQL.

Database Restores – Where’s my Transaction Log Backup?

Developers! DBAs! Has this ever happened to you?

Surprise! It's a database migration error!

You’re chugging along on a Friday night getting ready for your weekend deployment. Your 2 liter of Shasta is ice cold, you have your all Rush mix tape, and you’re wearing tube socks with khakis. Things are looking up. You open up your deployment script. You’re confident because you’ve tested it in the QA environment and everything worked. You press F5 and lean back in your chair, confident that the script is going to fly through all of the changes. Suddenly, there’s an error and you’re choking in surprise on Shasta.

In an ideal world, you could pull out your trusty log backups and do a point in time restore, right? What if you’ve never taken a transaction log backup? What if you only have full database backups? Can you still recover from this situation? The answer, thankfully, is yes.

Let’s break something!

USE ftgu;
GO

-- at midnight, we took our initial back up
BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-1.bak'
GO

-- customer data from the business is inserted
-- more customer data is inserted

-- some kind of migration goes here

-- insert a bad value
INSERT INTO Bins (Shelf, Bin)
VALUES ('B', 9)
GO

SELECT GETDATE();

SELECT * FROM Bins WHERE Shelf = 'B' ORDER BY BinID DESC;
GO

-- wait for a bit
WAITFOR DELAY '00:01:00';
GO

-- do something dumb
DELETE p
FROM Products p
JOIN Bins b ON p.BinID = b.BinID
WHERE b.Shelf = 'B';

DELETE FROM Bins WHERE Shelf = 'B'
GO

SELECT GETDATE();
GO

We have a starting backup, no t-log backups, and we’ve gone and deleted some important data from the production database. How do we get it back? If we restored the database from our first backup we might lose a lot of data. Who knows when the last database backup was taken? Oh, midnight. So, in this case, we’d lose a day of data. Well, bugger. In a panic, we save the state of our broken database.

-- ack!
BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-2.bak';

And then we realize that we also need our transaction log:

-- ah crap, I need to back up my log to get point in time recovery!
BACKUP LOG ftgu
TO DISK = 'C:\ftgu-log-1.trn';
GO

Here’s the kicker – the transaction log has never been backed up. (In my experience, this is all too common.) This database has been running for a week or a year or three years without any kind of transaction log backups. We’re screwed right? I mean, wouldn’t we have to apply all of the transactions from the log to the very first full backup we have? No.

Let’s get started and restore our last good backup. We always have our backup with missing data, just in case we need it for some reason.

-- switch to master (need to make sure nobody else is using that database)
USE master;
GO

-- restore the last full backup with known good data
-- make sure to specify NORECOVERY so we can
-- apply our transaction log backup
RESTORE DATABASE ftgu
FROM DISK = 'C:\ftgu-1.bak'
WITH REPLACE, NORECOVERY;
GO

SQL Server is cunning and records the log sequence number (LSN) from the last full backup (technically it’s the start and end LSN from the last full backup). If we have a log backup that encompasses the relevant LSNs, we’re good to go. Since our transaction logs were never backed up before today, we’re safe.

We’re going to use something called

-- restore the log backup until right before we started
-- this is called "point in time recovery"
RESTORE LOG ftgu
FROM DISK = 'C:\ftgu-log-1.trn'
WITH STOPAT = '2011-02-13 10:03:55.653';

Even though we never took a transaction log backup before today, we’re able to take a backup and recover from what initially seemed like a bad situation.

Data Durability

A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods of writing data to a replay log on disk as well as caching writes in memory. Whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write ahead log is that we can always recover from a crash (so long as we have the log files, of course).

How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code will have to call the method to force the sync.

Setting a w-value for writes is something that was mentioned in Getting Faster Writes with Riak. Although, in that article we were decreasing durability to increase write performance. In Amazon Dynamo inspired systems writes are not considered complete until multiple clients have responded. The advantage is that durable replication is enforced at the database and clients have to elect to use less security for the data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time read/write random-access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – there’s a commit log (RDBMS people will know this as a write-ahead log). Essentially, when a write comes in from a client, the write is first written to the commit log (which is stored using HDFS), then it’s written in memory and when the in-memory structure fills up, that structure is flushed to the filesystem. Here’s something cunning: since the commit log is being written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down it’s easy enough to recover from – that region server’s commit log is split apart and distributed to other region servers which then take up the load of the failed region server.

There are plenty of HBase details that have been grossly oversimplified or blatantly ignored here for the sake of brevity. Additional details can be found in HBase Architecture 101 – Storage as well as this Advanced HBase presentation. As HBase is inspired by Google’s big table, additional information can be found in Chang et al. Bigtable: A distributed storage system for structured data and The Google File System.

Interestingly enough, there is a proposed feature for PostgreSQL 9.1 to add synchronous replication to PostgreSQL. Current replication in PostgreSQL is more like asynchronous database mirroring in SQL Server, or the default replica set write scenario with MongoDB. Synchronous replication makes it possible to ensure that data is being written to every node in the RDBMS cluster. Robert Haas discusses some of the pros and cons of replication in PostgreSQL in his post What Kind of Replication Do You Need?.

Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability is baked into Azure at the filesystem. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team have put together an excellent overview. Needless to say, Microsoft have implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow for data to be seamlessly geographically replicated.

Even SQL Azure, Microsoft’s cloud based relational database, takes advantage of this replication. In SQL Azure when a row is written in the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team are addressing single server durability concerns.

Why is single server durability important for any database? Aside from guaranteeing that data is correct in the instance of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.

References

Goals for 2011 – Early Update

It’s a bit early to be updating my goals for 2011, but I’m really excited about this one. Over the course of last week, I wrote an article about loading data into Riak. I had a brief conversation with Mark Phillips (blog | twitter) about adding some of the code to the Riak function contrib.

This is where a sane person would say “Yeah, sure Mark, do whatever you want with my code.” Instead I said something like “I’d be happy to share. How about I make a generic tool?” About 40 minutes later I had a working chunk of code. 30 minutes after that I had refactored the code into a driver and a library. I wrote up some documentation and sent everything off to be included in the main Riak function contrib repository. A couple of days and a documentation correction later and you can now see my first code contribution to the open source world on the internet: Importing YAML.

While I’m really excited about this, and it’s very important to me, there’s more to take away from this than just “Yay, I did something!” We’re all able to give something back to our community. In this case I took code I had written to perform benchmarks and extracted a useful piece of demonstration code from it. Share your knowledge with the world around you – it’s how we get smarter.

Secondary Indexes – How Would You Do It (In Riak)?

Every database has secondary indexes, right? Not quite. It turns out that some databases don’t support them. Secondary indexes are important because they make it possible to perform more, quick, queries on a given chunk of data. What if we want to add secondary indexes to a database, how would we go about doing it?

Looking at queries, there are a few basic ways that we actually query data:

  • Equality predicates
  • Inequality predicates
  • Multi-predicate queries

We’re going to be using Riak as our example database

Equality Predicates

The easiest type of query is an equality predicate. With an equality predicate, we’re just matching any given search term. Looking at an example, what if we want to find every user born on July 4, 1976?

function(value, keyData, arg){
  var data = Riak.mapValuesJson(value)[0];

  if (data.Birthdate == '1976-07-04') {
    return [ data ];
  }
}

With an RDBMS, we’d just create an index on the table to support our queries. But with a key-value store, we don’t have that ability. How do we do it?

To create an index in Riak, we’d create another bucket, users_by_birthdate. The keys for this bucket would be the value of the birthdate. We could use links (this will fall apart pretty rapidly above about 1000 links) or store a list of keys in the users bucket. To satisfy our MapReduce query we can use the users_by_birthdate bucket to get the IDs of our users for retrieval rather than scanning the entire bucket. Depending on the amount of data we have, being able to use equality predicates could reduce the number of reads we have to perform.

Multi-Predicate Equality Searches

I’m going to skip inequality predicates for a second and talk about multi-predicate equality searches. This is easy to accomplish. If we’re searching for everyone with the last name of ‘Smith’ who was born on July 4th, 1976 we might use the following MapReduce function:

function(value, keyData, arg){
  var data = Riak.mapValuesJson(value)[0];
  if (data.Birthdate == '1976-07-04' && data.LastName = 'Smith') {
    return [ data ];
  }
}

With a relational database, we could create a multi-column index. That isn’t possible here – with Riak we only have a bucket name (the index name), a key (the key), a value.

Two Equality Buckets

One option would be to use two equality buckets and then perform a bitwise comparison of the results, only keeping user keys that exist in both buckets. In this case, we would have two buckets users_by_birthdate and users_by_lastname. The problem is that we might have large data skew on one of these key-value combinations. So, if there are only 1,000 users born on July 4th, 1976 and 1,000,000 Smiths in our database, this could cause some inefficiencies in our queries when we go to merge the two result sets. Worse, there could be 1,000,000 users that match our search conditions in each bucket, but only 10 in common.

Using a bitwise combination of two buckets is, from a development standpoint, the easiest way to implement multi-predicate comparison. However, this approach creates problems with any significant volume of data.

MapReduce Over An Index Bucket

What if instead of using two or more buckets and performing a bitwise comparison of the keys that we find, instead we create a single index bucket. Inside that bucket, we’ll still use one value as the key for each key-value pair. This would look like the users_by_birthdate bucket – every key-value pair has the birthdate as the key.

Each key contains a list of values. Unlike our first example we’re going to store more than a simple list of keys. Instead we’re going to store a list of lists of tuples – a multi-dimensional data structure. If we’re using the birthdate, last name example, it might look like this:

1976-07-04    [
                [176, {last_name: Smith}],
                [195, {last_name: Harrison}]
              ]

What does this buy us? Well, for starters we don’t have to perform bitwise comparisons between two or more potentially massive lists of keys. Instead we can examine the MapReduce query that’s coming in and re-write it on the fly to give us a MapReduce job over our new index. The upside of this index is that it can perform double duty; we can search for users by birthdate or by birthdate and last name.

One of the things that we need to consider when we’re designing our key-value database is the queries that we’re going to be running, not the way that the data should be structured. When we’re structuring our data, it may make more sense to use a complex data structure, like this index bucket, instead of many single column indexes. Updates to the data will require fewer updates with this model. Only one write is needed when a user signs up, not two; one for the compound index as opposed to two for the simple indexes – one for users_by_birthdate and one for users_by_lastname.

Once we start updating data using an index bucket can reduce index maintenance. When we update a record we’ll need to update any index buckets. This could lead to a large number of data writes, tombstone writes, and compactions over time. Using a single index bucket with more complex structure in the index removes the need for a large number of writes. We only have to write to the original key-value pair as well as writing to the index bucket.

Inequality Predicates

Inequality predicates are much trickier than equality predicates. At first glance it seems like it would be easy to perform a MapReduce over the keys and determine which key-value pairs meet the search conditions. And we would be right: that is an easy solution when there are a small number of keys. What happens when there are a large number of keys? We need to read more data to get to the results we need. Digging further, what happens when very little of our data actually matches? The benefits of using this approach become very slim; we need to perform a large number of reads to retrieve very little data.

There’s a trick to implementing inequality predicates: in some cases it will be faster to retrieve all keys that match our inequality predicate – birthdate between 1976-07-01 and 1976-07-31. In other cases, a scan is going to be faster – users with a last name starting with ‘A’.

How do we figure out which approach is going to be faster? Let’s take a look at a concrete example using users and birthdates for our example. We already know how to find every user with a specific birthdate using an index; that’s a simple seek. How do we find all users whose birthday falls between a range of dates? That’s what we’ve been asking all along, right?

What if we maintain some simple information about our users_by_birthdate index? Let’s say we track the minimum value, the maximum value, and the number of keys in the bucket. Just from this information we can guess roughly how many keys will be in any given range. This is pretty easy with dates; we can assume that there’s a an even distribution of birthdates in any range of dates.

Things get trickier with names. How many distinct last names are there between Aaronson and Bartleby? What about between Xara and Yoder? Knowing the min and max last names and the number of keys won’t help us figure out the distribution; the data isn’t uniformly distributed. We need a way to make educated guesses about the data in a range. One way to do this is to create a histogram of our data. Histograms are data samplings that let us make better guesses about the number of specific values we can expect to find in a given range.

Relational databases accomplish this by maintaining separate statistics tables. We can accomplish something similar in Riak by creating a new bucket of statistics where the stats key is the high value of a range of keys. The value of the key-value pair would be data about the range of values – number of rows, number of keys, and the average number of distinct values per key. We want to collect statistics so we can easily determine when we need to seek a small number of keys and when we need to perform a MapReduce over the bucket.

So far I’ve proposed two range search approaches that work well with different use cases. Ideally we could create an indexing mechanism in Riak that lets us state an assumption about the distribution of our data in Riak. If we know there’s a constant distribution of keys, we can skip any kind of heuristics and just assume that for July, we’ll be reading 31 keys. Likewise, if we want to pull back every user whose last name is between Scalia and Smith, then we’ll need to MapReduce over the index bucket to get the data that we need.

Creating Indexes – Simple Indexes

Simple, single column indexes are easy to create. The syntax to create an index with SQL is pretty straightforward:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ name ] ON table [ USING method ]
    ( { column | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace ]
    [ WHERE predicate ]

With Riak, we won’t need that much information about the index. Working from the simplest use case possible, we only need

  • index name
  • bucket
  • value name to index
  • data storage mechanism

We need to uniquely name our index, otherwise we can’t create a new bucket for it. Likewise, we have to know which bucket we’re indexing. It also makes sense that we want to know which value we’ll be indexing.

We need to define the data storage mechanism for indexes so we are able to index the data. It’s for the same reason that we have to tell Riak Search how the data is stored; unless we know what we’re indexing, it’s impossible to index it.

Looking at riak_search, it should be possible to re-use the basic principles behind riak_search_kv_extractor and riak_search_kv_hook to define how indexes are created and how data is retrieved. When we create a full-text index with Riak search, we supply the datatype (JSON, XML, plain-text, or something else). This gives us an incredible amount of flexibility without having to define a schema for the documents. We’re only indexing the fields we want. This can be incredibly useful if we’re storing larger objects but we know we’ll only perform look ups on a few fields – indexing time and space is reduced drastically.

Re-using the full-text index model for secondary indexes makes it easy to maintain consistent indexing behavior throughout Riak – all of our indexes (full-text or otherwise) can be defined using the same commands.

Creating Indexes – Complex Indexes

Complex indexes shouldn’t be much more difficult to implement than simple indexes. We only need one additional piece of information to collect our index: additional fields to index. Ideally the additional fields would be a text representation of how we access the fields in a program – “parent.child.property”. Supporting XML could make this trickier, but it should be easy to use XQuery/XPath to determine which XML elements, attributes, and values should be indexed. The intelligence of knowing how to perform the query will come later; indexes don’t need to know how to query themselves.

Staying up to Date

Right now, this index will only be accurate once: when we create it. Once again, we can borrow from the Riak Search implementation and create hooks to make sure the index is kept up to date as data is written. Keeping our indexes in sync with the data will cause some overhead on the cluster, but it’s worth it to know that we can accurately retrieve data. Like any other database, we need to take care not to create too many indexes: every write to the bucket means we end up writing to each index.

Ideally the index maintenance hooks would be the last hooks to run in a sequence of pre-commit hooks. If there are a number of pre-commit hooks to validate data we need to make sure that the data is valid before it is written to an index. If our primary worry is data validity, it is possible to implement the index maintenance code as a post-commit hook in Erlang (rather than as a pre-commit hook in either Erlang or JavaScript).

The Trouble with Data Locality

One potential problem we can run into is data locality. Depending on how the data is distributed in the key space, we could end up writing key-value and index data on the same physical node. The problem is that we could double, or more, the physical I/O that we’re performing on those nodes. If an index is not on the same node as the data we might incur tremendous network overhead querying remote indexes before querying data on the local node. On a busy network, this increased bandwidth could have an adverse effect on operations.

Driving Toward Our Goal

We’re clearly heading toward something. Let’s take stock of what we’ve talked about so far:

There are two types of search predicate to support

  • Equality predicates – easy to index
  • Inequality predicates – trickier to index

There several ways to use indexes

  • A single index
  • Multiple indexes with a bitwise comparison
  • Complex indexes

We can read from indexes through

  • Key seeks
  • Key scans (MapReduce)
  • Key/value scans (MapReduce)

We can figure out what type of read to use by using data statistics

  • Uniformly distributed keys are easy
  • Randomly distributed keys are harder
  • A small number of seeks can be more efficient than a single MapReduce

Astute readers will have probably guess what we need: a query engine.

This Ain’t No Car, It’s a Key-Value Store!

Why do we need a query engine? Key-value stores are simple; data is stored in key-value pairs and retrieved by a simple key lookup. A query engine of some type just adds complexity, right? Well, without a query engine, we can’t use secondary indexes without explicitly writing queries to use them.

In a hypothetical Riak query engine, we could process a MapReduce query like this:

Parse map phases and extract search conditions

Examine existing indexes

  1. Do indexes exist?
  2. Examine index type – evaluate statistics if needed
  3. Evaluate data access path needed to retrieve data – this is where we determine if we can use one index, multiple indexes with a bitwise compare, or use a multi-column index
  4. If no indexes exist, does a full-text index exist?

Re-write MapReduce functions on-the-fly

  1. Early Map phases are added to query use indexes
  2. Early Map phase output is used for key selection in MapReduce functions

While Riak doesn’t need anything as complex as an RDBMS’s query engine, this is still a gross simplification of how a query engine would function – additional functionality is needed to perform costing estimates of seeks vs MapReduce scans, statistics will need analyzed, and the query engine will need to be aware of vast portions of Riak in order to make good guesses so it can re-write MapReduce queries on the fly.

Features, Features, Features

I left some topics out of this discussion intentionally. They don’t directly apply to implementing indexes, but they would be very helpful to have down the road.

  • Checking updated values to avoid needless tombstones. When a client updates a key with the same value it had before, there’s no reason to overwrite the stored value: nothing changed.
  • Saving query plans. We are re-writing MapReduce queries on the fly, why not save them off internally and opt to store them for re-use when the cluster comes online?
  • Missing indexes. Frequently used queries should be optimized, either with secondary indexes for faster querying or as standalone entities in Riak. THe easiest way to monitor for this is to store metadata about the query history.
  • Index use data. How often are our indexes being used? Do they meet our needs or should we replace them with different indexes?
  • Index only queries. If a MapReduce can be satisfied by the data in the index, don’t query the main bucket.
  • Running statistics. Riak is a long running process. Over time we will be able to collect a lot of data about Riak. Why not collect data about how our Riak is being used so we can tune our code and hardware to get better performance?

Summing it all Up

Indexing is a tricky thing to think about and an even trickier thing to get right. When I first started on this as a thought experiment, I thought that it was going to be easy. Assuming that indexes are simple led to a lot of dead ends, frustration, and restarts. This came about from talking to Mark Phillips (blog | twitter) and Dan Reverri (twitter) last week at the Basho offices about Riak, HTML, and the sound of farm machinery pulling itself apart.

Sound off in the comments.

TL;DR version – indexes are hard, let’s go shopping.

I’m Going to Hadoop World

Sounds like I’m bragging, right? There is a free book involved. Or maybe you don’t care because it’s all NoSQLs and stuff and you’re a SQL Server DBA. And that’s where we differ.

When I first heard about NoSQL databases, I had the same reaction that a lot of people are having right now: disbelief and mockery. I remember making fun of MySQL when I first ran into it. It was such an odd database: it didn’t have foreign keys, joins didn’t work, it sometimes ate all of your data, and writes put locks on tables. It was no comparison for SQL Server, Oracle, or PostgreSQL. When I hit publish, this blog post is going to get dumped into a MySQL database on my server. I’ll probably hit up slashdot or read some blogs. That’s going to hit MySQL as well. These days, I try to keep a passing familiarity with MySQL because you never know when you’re going to need to use such a cheap, powerful, tool.

I see a filthy commie on the other side of this intertubes

This is what I think about NoSQL databases; especially Hadoop.

So, I’m going to Hadoop World and I’m going to learn as much as I can about Hadoop and the technology that surrounds it. The fun part is that I’m going to take some notes. And I’m going to share those notes.

That’s all well and good. As Senator Joe McCarthy used to say, “sharing is caring and caring is for commies.”

So, instead of just going and learning about what is interesting to me, I want to know what kind of questions DBAs have about this new-fangled Hadoop thing. I want to find the answers to the questions that people have about Hadoop, its place in the enterprise, and how it may or may not change the DBA’s job.

So, sound off in the comments. If you’ve got a question, type in the box and hit send. I’ll give you props when I post my write up of Hadoop World and I’ll do my best to find the answer.

I owe this entire idea to Andy Warren (blog | twitter). He mentioned that he was curious about Hadoop and that sparked this blog post. Here are his questions:

  • How do you install it?
  • What login security model(s) does it have
  • When does it make sense to use Hadoop vs SQL Server?
  • What tools are provided to manage it?

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.