Tag Database

Three Things to Watch with NoSQL

Last week I painted a rosy picture of NoSQL databases. Before you deploy any kind of NoSQL database, you need to be aware of the potential pitfalls of NoSQL databases.

1) Architecture

NoSQL databases have different architectural concerns than traditional RDMBSes.

If you’re wokring with a BigTable or Dynamo database, your network backbone will need to be able to handle the increased traffic from replication and materializing MapReduce data. Dynamo databases are very chatty – reads and writes are only guaranteed when a specific number of servers respond to a query.

MapReduce queries run across many nodes in a cluster. One stage processing a MapReduce query is temporarily materializing intermediate results from each node. The intermediate results may then be shipped to another server for additional processing, aggregation, or just be combined into the final result set.

2) Querying Strategy

RDBMSes are able to perform a mix of batch and ad hoc querying. They produce reasonably fast results for doing online aggregation and analysis of data. It’s just as easy to write a query that returns a single row from a relational database as it is to write a query that aggregates data from several thousand rows.

Not all NoSQL databases are created equally. Hadoop and Cassandra are built to perform large scale analytical queries. Their implementation of MapReduce is designed to process batches of analytical results. Frequently these results are going to be pre-aggregated rather than delivered in real time. Riak and MongoDB, on the other hand, are designed to handle interactive querying.

3) Data Modeling

Data modeling is a tricky field of study in the world of relational databases. A huge number of books have been written on the subject and they deal with both the theoretical and practical aspects of modeling data. In short, while data modeling is tricky it is also a well understood field of study.

Data modeling in the non-relational world is an entirely new field. Many problems that have been solved in relational data modeling are new and require a different way of thinking. BigTable databases, while their structure may initially appear familiar, require different modeling techniques to get the most out of the data store. MongoDB’s document oriented approach makes a great deal of sense to developers, but it still poses a number of data modeling questions about how to link related objects and how deeply nested object graphs should be in the database. Key/Value stores, like Riak, have very little structure and give the application developer a huge amount of freedom to model data.

Summation & Further Reading

NoSQL Databases hold a lot of promise to help businesses rapidly respond to data growth. Whenever you’re planning to make a major infrastructure decision, you need to weigh the pros and cons before determining which solution is right for you. While NoSQL databases provide answers to a lot of questions, they also pose a different set of questions that require a different set of skills to answer.

NoSQL Glossary
MongoDB – Sharding with the Fishes
Hadoop for Batch Processing (from the Yahoo! Developer Network)
Schema Design in Riak
Schema Design for Riak (presentation slides)

Five Reasons to Use NoSQL

You’re not using a NoSQL database right now? I’m not all that surprised. Many IT shops are still evaluating moving from SQL Server 2000 to SQL Server 2005, much less these non-relational databases. A lot of people don’t even know what they would use a NoSQL database for. Does it replace the RDBMS? Work alongside it? Do something else?

1) Analytics

One reason to consider adding a NoSQL database to your corporate infrastructure is that many NoSQL databases are well suited to performing analytical queries. Developers can use the same querying languages to perform analytical queries that they’re using to perform atomic queries. Typically this will be some variation of a MapReduce query, but it’s also possible to query data using Pig or Hive. Don’t worry too much about these weird language terms, MapReduce is a fancy way of saying “SELECT and then GROUP BY” and doing it in a way that is entirely confusing to people who are used to SQL.

Many NoSQL systems boast phenomenal write performance. When you combine high write performance with batch processing it is easy to pre-aggregate data, summarize results, and still guarantee ad hoc query performance.

2) Scale

NoSQL databases are designed to scale; it’s one of the primary reasons that people choose a NoSQL database. Typically, with a relational database like SQL Server or Oracle, you scale by purchasing larger and faster servers and storage or by employing specialists to provide additional tuning. Unlike relational databases, NoSQL databases are designed to easily scale out as they grow. Data is partitioned and balanced across multiple nodes in a cluster, and aggregate queries are distributed by default. Scaling is as easy as racking a new server and executing a few commands to add the new server to the cluster (yeah, it really is that easy). Data will start flowing and you’ll back in business in no time.

3) Redundancy

In addition to rapid scaleability, NoSQL databases are also designed with redundancy in mind. These databases were designed and built at massive scales where the rarest hardware problems go from being freak events to eventualities. Hardware will fail. Rather than treat hardware failure as an exceptional event, NoSQL databases are designed to handle it. While hardware failure is still a serious concern, this concern is addressed at the architectural level of the database, rather than requiring developers, DBAs, and operations staff to build their own redundant solutions. Cassandra uses a number of heuristics to determine the likelihood of node failure. Riak takes a different approach and can survive network partitioning (when one or more nodes in a cluster become isolated) and repair itself.

4) Flexibility

What’s the use of a database if it’s not flexible? While the data modeling issues are completely different in NoSQL, there is a large amount of flexibility in how data is stored for performance.

Databases modeled like Bigtable and Cassandra provide flexibility around how data is stored on disk. It’s possible to create derived column families. In plain English: you can design your database to duplicate frequently accessed data for rapid query response. This is, of course, based on the assumption that writes and storage space are cheap.

Databases based on the Bigtable model also have another benefit – outside of key structure it’s possible to store a variety of disparate data in the same table. Structure is largely irrelevant. Relational databases have adopted features to solve similar problems (such as sparse columns in SQL Server), but they carry overhead. Storing wildly different columns in multiple rows of the same column family is so cheap as to be invisible in a NoSQL database.

Lastly, key-value stores provide an incredible level of flexibility. Data is arbitrarily stored as a value. Key-value databases make it possible to store images, word documents, strings, integers, and serialized objects within the same database. This requires more responsibility and creative thinking on the part of application developers and architects but it also lets the people designing the system build custom a completely custom solution that fills their needs.

5) Rapid Development

Let’s face facts: everyone wants their application to be faster, have more features, and they want it yesterday. NoSQL databases make it easy to change how data is stored or change the queries you’re running. Massive changes to data can be accomplished with simple refactoring and batch processing rather than complex migration scripts and outages and it’s even easier to take nodes in a cluster offline for changes and add them back into a cluster as the new master server – replication features will take care of syncing up data and propagating the new data design out to the other servers in a cluster.

Moving Away From Relational Storage – Introduction

Don’t. Fooled you, didn’t I?

If you’re already using a relational database, keep using it. If it’s scaling just fine with your hardware and workload, keep using it. If you aren’t running into any complexity problems, keep using it.

There’s no reason to change the way you’re storing your data just because you read an article about how BrandNewStartup.com was able to increase uptime and throughput eleventy-four percent by utilizing a new key-value storage solution. That’s like re-writing your core product every time you read an article about how Ruby has excellent metaprogramming features or how Python’s use of significant whitespace can lead to more readable code. These are reasons to choose a language for a new product, but they are not reasons to change the language of an existing project.

If you’re not supposed to move away from relational storage, what is this all about?

When Should I Switch?

When you add new features to an application, or when you rewrite a feature, take a look at what you need. When we needed to make use of large query result caching in a web application, we started using Velocity, which later became AppFabric. There’s some overhead serializing and deserializing objects into the storage mechanism, but could you imagine trying to dump that data to the database? The write overhead would be tremendous! We looked around at products that would work well with our existing application infrastructure, ASP.NET and SQL Server, and chose something that would play well in that garden, Windows Server AppFabric.

Before you start new you should ask yourself questions about what you’re trying to build. Question your assumptions about how the new feature or product will work. Question your assumptions about the existing infrastructure. Make sure that you aren’t shoehorning existing technology into a solution because you are familiar with it. Just because something feels familiar and safe doesn’t mean that it’s the best solution for a problem.

What Questions Should I Ask?

This is the trickiest part. It’s the part that I’ve struggled with and gotten wrong on more than one occasion. It’s okay to screw up, it’s how we learn.

Here are the questions I’ve started asking myself when I start a new project/feature that needs additional data storage:

  • Why are we storing this data?
  • How much data will I collect in a week? A month? A year?
  • How long will this data need to live?
  • How will this data be used?
  • How structured does this data need to be?
  • How available does this data need to be?

Why Are We Storing This Data?

You need to understand how you’ll be using the data before you can figure out how you want to store it. Will you be doing ad hoc reporting? Will the data be aggregated and consumed by other applications? How often will I write this data? How often will I read my data?

Different types of databases have different use case profiles. The way that you’re using the data will make a big difference for how you’re going to store data. You don’t want to store session state in a relational database – you’ll spend a lot of time writing transient data to disk. Likewise, you don’t want to store financial transactions in an in-memory cache.

How Much Data Will I Collect?

The volume of data that you’re working with will influence the way that you’re storing the data. Terabytes, and even petabytes, of data require different storage techniques and management strategies in a relational database, why would it be different anywhere else? Handling huge quantities of data often requires splitting the load across multiple servers or purchasing a SAN. Either way, you’ll want to consider long term budget and how the availability of those budget dollars might change over time. Just as important as the long term capacity of your data is the speed you’re collecting data. The faster you need to collect data, the more you need to look at how you’re storing that data.

Every database engine employs different strategies to maximize I/O throughput. The problem here is that they all use different strategies based on their use case.

How Long Will This Data Need to Live?

The lifespan of data is incredibly important. Short term data need never touch disk – it can live in memory. If this is something like session state data, it’s possible to use a 100% in memory storage solution (like memcached or AppFabric or even riak’s riakkvcache_backend) to solve this problem. Disks are slow, memory is fast.

If the data needs to live longer than a few seconds, then it’s time to consider how long it will really need to stick around. You need to look at different forms of persistence and how the strengths and weaknesses of those systems play into your long term choices. Some next-generation data stores will store data in memory and persist to disk in the background. This speeds up the ability to write data but it does bring up some data life concerns: what happens if the power fails? Other data stores use a write-ahead log, like relational databases, to make sure that the data is safe.

How Will This Data Be Used?

The way that end users are going to use our data is important for our decision making. Some data stores (such as CouchDB) do not allow ad hoc querying. Others (graph databases) make it possible to easily navigate deeply nested and complex data structures.

Relational databases are phenomenal general purpose data stores. They make it possible to store data in a variety of formats but there may be a variety of complications as a result of the general nature of SQL and the relational model. Massive volumes of data stored for statistical analysis have different storage and indexing requirements than data that needs to be instantly available from a variety of locations for atomic reads and writes.

How Structured Does This Data Need To Be?

Data structure can be important for a variety of reasons. Hierarchical data stores make it possible to traverse deeply nested category trees – think of how species are classified. Likewise, graph databases make it possible to navigate through complex relationships, much like the relationships that can be found in strongly object-oriented designs.

Considering how the structure of the data can make it very easy to make a decision on the data store that you’ll be using. If a lot of flexibility is required, it may even be possible that a Big Table derived database, such as Cassandra or HBase might meet your storage needs.

How Available Does This Data Need To Be?

One of the benefits of a relational database is that once a transaction commits, the data is immediately available to everyone querying that database (ignoring things like replication and log shipping). When you start working with distributed data stores, the data may not be immediately available. You need to ask yourself about how available the data needs to be. In some data stores, the delay may be only a few milliseconds. In others, it may be longer. Network latency and hardware utilization play a large role in the latency of data replication.

The other side of the availability question is “How fault tolerant does this data need to be?” With a relational database, unless you’ve implemented a clustering solution, the data is dependent on a single server functioning appropriately. If that single server goes down, business could stop until a replacement is brought back online. Some of the distributed data stores are massively fault tolerant – Riak is designed to be tolerant of hardware failure. If nodes fail, data will be replicated to new nodes, or even to new clusters, automatically. Replacement servers can be brought online and they will begin receiving new data immediately.

The immediately availability of data after a write as well as the long term (fault tolerance) availability of data is an important aspect of choosing a data storage solution.

What’s Next?

Ask yourself these questions the next time that you’re implementing new functionality or an entirely new application. The answers to your questions might very well surprised you.

In the next few weeks, I’ll be putting more thoughts together and showing examples of realistic scenarios of moving away from a relational database and utilizing a NoSQL solution.

Further Reading

NoSQL Summer
NoSQL Heroku and You
What is Big Data?
Big Data Is Less About Size, And More About Freedom
NoSQL: The Dawn of Polyglot Persistence

Finding SQL Agent Jobs Owned By The Wrong Person

Ever have someone leave the company only to find out that they own critical database processes… because that process failed? It hasn’t happened to me, yet, but a recent Active Directory outage this weekend got me thinking about it.

MongoDB – Basic Querying

I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB.

[media id=2 width=558 height=410]

Download the sample code. A QuickTime/H.264/iPhone formatted video is also available – download it now.

The Future of Databases

The Story So Far

I’ve been in love with data storage since I first opened up SQL*Plus and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up with different ways to solve the problems what we’re facing in business. The problem is that as the field of computer science advances, and businesses increase in complexity, the ways that we store data must become more complex as well. Exponentially scaling storage complexity isn’t something that I like to think about, but it’s a distinct possibility.

General purpose OLTP databases are not the future of data storage and retrieval. They are a single piece in the puzzle. We’ve been working with OLTP systems for well over 20 years. OLAP is a newer entry, bringing specialized analytical tricks (which are counter intuitive to the way relational data is stored) to the masses. Hell, there are a number of general purpose analytical storage engines on the market. These general purpose analytical databases integrate well with existing databases and provide a complement to the transactional specialization of OLTP systems.

That’s the key, OLTP databases are purpose built transactional databases. They’re optimized for write operations because way back in the dark ages it was far more expensive to write data to disk than it was to read from disk. Data couldn’t be cached in memory because memory was scarce. Architectural decisions were made. The way that we design our databases is specifically designed to work within this structure. A well designed, normalized, database has minimal duplication of data. In OLTP systems this also serves to minimize the number of writes to disk when a common piece of data needs to be changed. I can remember when I was a kid and the United States Postal Service changed from using three letter state abbreviations to two letter abbreviations. I have to wonder what kind of difficulties this caused for many databases…

In the 40 years since E.F. Codd’s paper was published, the programming languages that we use have changed considerably. In 1970, COBOL was still relatively new. 1971 saw the the introduction of C, 1975 brought us MS Basic. 1979, 1980, and 1983 saw Ada, Smalltalk-80, Objective-C, and C++ ushering in a wave of object oriented languages. Suddenly programmers weren’t working on singular data points, they were working with a object that contained a collection of properties. The first ANSI SQL standard was codified in 1986. 1990 gave us Windows 3 and the desktop PC became more than a blinking cursor. The web exploded in 1996, 2001, and continues to explode again in a frenzy of drop shadows, bevels, mirror effects, and Flash.

Throughout the history of computing, we’ve been primarily working with tuples of data – attributes mapped to values; rows to you and I. This model holds up well when we’re working with a entity composed of a single tuple. What happens, though, when the entity becomes more complex? The model to retrieve and modify the entity becomes more complex as well. We can’t issue a simple update statement anymore, we have to go through more complex operations to make sure that the data is kept up to date.

Examples Should Make Things Clearer

Let’s take a look at something simple: my phone bill.

In the beginning…

Long ago, a phone bill was probably stored in a relatively simple format:

  • Account Number
  • Name
  • Address
  • Past Due Amount
  • Current Amount Due
  • Due Date

This was simple and it worked. Detailed records would be kept on printed pieces of paper in a big, smelly, damp basement where they could successfully grow mold and other assorted fungi. Whenever a customer had a dispute, a clerk would have to visit the records room and pull up the customer’s information. This was a manual process that probably involved a lot of letter writing, cursing, and typewriter ribbon.

Eventually, this simple bill format would prove to be unreliable (P.S. I’m totally making this up just to illustrate a point, but I’m guessing it went something like this). In our example, there’s no way to tell when a customer paid or which customer was billed.

After some tinkering…

After a few more iterations, you probably end up with a way of storing a customer’s information and bills that looks something like this:

This is a lot more complicated from both a design perspective and an implementation perspective. One of the things that makes this implementation more difficult is that there are a number of intermediate tables to work with and these tables can become hotspots for reads as well as writes.

When you look at that design, be honest with yourself and answer this question:

How often will you view a single service history or general charge row?

Think about your answer. The fact is, you probably won’t read any of those rows on its own. You might update one if a change comes in from an external source, but otherwise all of the charges, history, etc on any given phone bill will always be read as a unit. In this particular instance, we’re always consuming a bill’s entire graph) at once. Reading a bill into memory is an onerous prospect, not to mention that summarizing phone bills in this system is a read intensive operation.

Fixing the glitch

There are a lot of ways these problems could be worked around in a traditional OLTP database. However, that’s not the point. The point is that there are problems that require actual workarounds. OLTP databases work well for many use cases, but in this case an OLTP database becomes a problem because of the high cost of reading vs writing. (Why should we read-optimize a system that was designed to be write-optimized when writes will probably account for only 10% of our activity, maybe less?)

I’ve hinted at how we fix the glitch at the beginning of this article – we look for a specialized database. In our case, we can use something called a document database. The advantage of a document database is that we’re storing an organized collection of values in the database. This collection of values is similar to a traditional tabular database – we have groups of similar data stored in named collections. The distinction comes in how the data is accessed.

When we’re saving a phone bill, we don’t have to worry about calling multiple stored procedures or a single complex procedure. There’s no need to create complex mappings between a database and our code. We create an object or object graph in the application code and save it. The software that we use to connect to our document database knows how to properly translate our fancy objects into data stored on a disk somewhere.

This solution has several upsides:

  • Related data is stored in close proximity on disk
  • Documents do not require strict structure
  • Documents may change properties without requiring complex changes to physical schema

Physical Proximity

My data is close together, so what?

In a traditional OLTP database, your data may be scattered across one or multiple disk drives. Physical drive platters will have to spin to locate the data on different parts of your storage medium. Drive read/write arms will have to move around in coordination with the spinning platters. The more complex your query, the more complex the dance your physical hardware will have to do; a simple high school slow dance turns into a tango.

In a document database, all of our data is stored together in a single record. When we want to read our bill, we just have to start reading at the beginning of the bill record and stop when we come to the end. There’s no need to seek around on the disk.

You might be worried that all of your data won’t be close together on disk. And you’d be right. However, many databases (including MongoDB) allow for the creation of secondary indexes to speed up data retrieval. The biggest question you need to ask yourself is “How will the applications be accessing the data?” In many applications we’re only acting on a single object. Even when our application isn’t acting on a single object, we can pre-aggregate the data for faster reporting and retrieval. When our application only works on a single object at a time, a document database provides distinct advantages – every time we need an object, we’ll be pulling back all of the data we need in a single read operation.

Strict Structure

Databases typically require data to be rigidly structured. A table has a fixed set of columns. The datatypes, precision, and nullability can vary from column to column, but every row will have the same layout. Trying to store wildly diverse and variable data in a fixed storage medium is difficult.

Thankfully, document databases are well-suited to storing semi-structured data – since our data is a collection of attributes, it’s very easy to add or remove new attributes and change querying strategies rapidly and in response to different data structure. Better yet, document databases let us be ignorant of how the data is stored. If we want to find all bills where the account holder’s last name is ‘Smith’ and they live in Virginia but the bill doesn’t have any county tax charges, it is very easy compared to constructing the query in a typical SQL database.

Using MongoDB our query might look like:

db.bills.find( { last_name : 'Smith' },
               { state : 'Virginia' },
               { charges : { type : 'county tax',
                             $exists : false } } )

Compared to similar SQL:

SELECT  b.*
FROM    bills b
        JOIN accounts a ON b.account_id = a.id
        LEFT JOIN charges c ON b.id = c.bill_id
                               AND c.type = 'county tax'
WHERE   a.last_name = 'Smith'
        AND a.state = 'Virginia'
HAVING  COUNT(c.id) = 0

And right about now, every DBA that reads this blog is going to be shaking with rage and yelling “But that SQL is perfectly clear, I don’t know how you can expect me to understand all of those curly brackets!” I don’t expect you to understand those curly brackets. Nor do I expect developers to understand SQL. The easiest way for us to develop is to use our natural paradigm. That’s why developers write code in C#, PHP, or Ruby and DBAs do their work in some dialect of SQL. MongoDB alleviates this because all the developers are doing is constructing a list of keys and values that must be matched before a document can be returned.

Changing the Schema

Changing the schema of an OLTP database can be an onerous task. You have to wait for, or schedule, down time. Modifications have to take place. Of course, the schema modifications need to take into account any actions (like triggers or replication) that may occur in the background. This alone can require significant skill and internal database engine knowledge to write. It’s not something that application developers should be expected to know. Why do I mention application developers? 99 times out of 100, they’re the ones who are working on the database, not a dedicated DBA.

Many newer, non-traditional, databases make it incredibly easy to change the schema – just start writing the new attribute. The database itself takes care of the changes and will take that into account during querying. When a query is issued for a new attribute, records without that attribute will be ignored (just like a column with a NULL value in a traditional database).

What about Analytics?

I don’t know a lot about analytical databases, in part because they require a different skill set than the one I’ve developed. I do know a few things about them, though.

Analytical databases are currently encumbered by some of the same problems as OLTP databases – data is stored in tables made up of rows and columns. Sure, these are called dimensions/facts and attributes, but the premise is the same – it’s a row-based data store.

Row-based data stores pose particular problems for analytic databases. Analytic databases throw most of the rules about normalization in the garbage and instead duplicate data willy nilly. Without joins, it’s very quick to query and aggregate data. But the problem still remains that there is a large quantity of repeated data being stored on disk.

Columnar databases attempt to solve this problem by compressing columns with similar values and using some kind of magical method to link up columnar values with their respective rows. Sounds complicated, right? Well, it probably is. Let’s say you have a table with 10,000,000,000 rows and the CalendarYear column is a CHAR(4). If there are only 25 different values for CalendarYear in the database, would you rather store 40,000,000,000 bytes of data or 100 bytes of data? I know which makes more sense to me.

Interestingly enough, there are two approaches being taken to solving this problem. The first is by creating single-purpose columnar databases. There are several vendors providing dedicated columnar databases. Other database developers are looking for ways to leverage their existing database engines and create hybrid row and columnar databases.

Looking Into the Future

There are a lot of interesting developments going on in the database world. Many of them seem to be happening outside of the big vendor, traditional database space. Most of this work is being done to solve a particular business need. These aren’t the traditional row-based OLTP systems that we’re all familiar with from the last 30 years of database development. These are new, special purpose, databases. It’s best to think of them like a sports car or even a race car – they get around the track very quickly, but they would be a poor choice for getting your groceries.

The next time you start a new project or plan a new server deployment, think about what functionality you need. Is it necessary to have full transactional support? Do you need a row-based store? How will you use the data?

Installing PostgreSQL on Mac OS X

This is a pretty simple process, but one that I thought I would document because I ran into a few gotchas along the way.

I originally installed OS X using the one click installer from EnterpriseDB. Unfortunately, the installer hung while attempting to finish the installation process and they only thing to do was to roll back the install. I attempted to build from macports, but that proved to be a huge pain in my ass and reminded me a little bit too much of using Linux, so I scrapped that idea as well. I started browsing the EnterpriseDB installation notes and came across an unattended install. Since my installation failed trying to launch the final interactive stage of the installation, I said to myself “Self, I bet we should try this.”

  1. Download PostgreSQL from EnterpriseDB http://www.enterprisedb.com/products/pgdownload.do#osx
  2. Mount the OS X disk image file
  3. In the Terminal, change to the directory where the image file was mounted:
    cd /Volumes/PostgreSQL\ 8.4.2-1/
  4. Start the unattended installation
    sudo ./postgresql-8.4.2-1-osx.app/Contents/MacOS/installbuilder.sh
    --mode unattended
  5. Once the installation has finished, add the following line to your .profile:
    source /Library/PostgreSQL/8.4/pg_env.sh
    If you’ve never created your .profile you can do the following in a terminal window:

    1. touch ~/.profile
      This will create the file if it doesn’t exist and will only change the last modification date if it does.
    2. open /Applications/TextEdit.app .profile
      This opens your .profile in the TextEdit application
  6. Save and close your .profile
  7. Create a new database user.
    createuser [YOUR NAME HERE] --pwprompt --username=postgres
    Alt text for the image, e.g. “The Mona Lisa”

    This could be anyone...

  8. When you’re prompted for the “Password:”, use “postgres”. This is the default password for the postgres superuser. Because you’ve just created another superuser, you’ll never have to touch this account again, but you should probably look up some fantastic documentation and learn how to change the password.

    I'm a real live boy!

There you have it: how to set up and install PostgreSQL on OS X. If you want to continue forward and install a Ruby library to interact with PostgreSQL, just type
sudo gem install postgres-pr to install the pure Ruby version PostgreSQL driver. If you want to install one of the C-based drivers, you’ll need to mess around with macports.

Note: EnterpriseDB make and distribute a binary version of PostgreSQL for a variety of platforms. In addition, they provide paid support and also offer a variety of enterprise features. They have in no way paid me for this post.


Related Posts:

PostgreSQL Tutorial – Creating Your First Database
PostgreSQL Tutorial – Creating the Pagila Schema
PostgreSQL Tutorial – Referring to Other Tables
PostgreSQL Tutorial – Inserting Data

Automating DBA Tasks

A member of my user group recently asked me about the best practices for monitoring and automating database maintenance.

Automate Everything

If I do something once, I automate it. I learned this valuable lesson at last year’s PASS Summit during a panel discussion. Any time I find an error situation, I try to determine which conditions might have caused that situation and I create a repeatable script that I can put on every server to check for a problem and email me.

Which Parts of Everything Should I Automate?

First off, you should check out this excellent article by Jonathan Kehayias about How to Automate Common DBA Tasks. It covers almost everything that we do on a day to day basis to make sure we have early warnings about any problems. I understand that there a products out there that might cover a lot of this functionality. But, if you’re stuck waiting for a purchase order, these scripts can hold you off for a while.

There are some WMI scripts in Jonathan’s article. I’m not using those yet. Why? I’d like to be using PowerShell instead. With the push to pick up PowerShell for the entire admin/IT pro space, I would prefer to migrate things to PowerShell than to add new features into VBScript and WMI applications.

Magical Backups

The final part of this cavalcade of T-SQL is my backup solution. This is no substitute for a commercial backup system, but it’s a start.

The BackupControl Table

First up we have BackupControl table:

CREATE TABLE [dbo].[BackupControl](
  
[id] [int] IDENTITY(1,1) NOT NULL,
  
[DatabaseName] [sysname] NOT NULL,
  
[BackupRoot] [nvarchar](255) NULL,
  
[BackupInterval] [tinyint] NULL,
 
CONSTRAINT [PK_BackupControl] PRIMARY KEY CLUSTERED 
(
  
[id] ASC
)WITH (PAD_INDEX  = OFFSTATISTICS_NORECOMPUTE  = OFFIGNORE_DUP_KEY = OFFALLOW_ROW_LOCKS  = ON
ALLOW_PAGE_LOCKS  = ONON [PRIMARY]
ON [PRIMARY] ;

This is a pretty simple table. There’s an arbitrary primary key so that we don’t keep referencing a lovely heap table. In addition, we have the name of the database, the root path where we want to do our backups, and then the interval for backups, in minutes.

The Jobs

I’ve scheduled a job to run nightly at a specific time, we’ll say 1:00 AM. This job takes nightly backups, removes backup history, and then deletes old backup files.

Backing up Databases

This is a simple step, but I’ve gone and made it complicated by using a lot of code. The reason for using a lot of code is that sometimes it’s necessary to backup databases A, B, and C to the G: drive and then backup databases D, R, and Q to the P: drive. You’ll also notice in this script that I use a maintenance database which has been cunningly named Maintenance. That’s so that I can hold all of my routine scripts in a single database, rather than having to copy them to every database on the server.

DECLARE @current_time AS NVARCHAR(50) ;
DECLARE @backup_root AS NVARCHAR(50) ;
DECLARE @sql AS NVARCHAR(MAX) ;
DECLARE @newline AS NVARCHAR(2) ;
DECLARE @s AS NVARCHAR(1) ;
DECLARE @now AS DATETIME ;

SET @current_time '' ;
SET @backup_root 'H:\BACKUPS\OLTP\DEFAULT'
SET @sql '' ;
SET @newline NCHAR(13) + NCHAR(10) ;
SET @s '''' ;
SET @now GETDATE() ;

SET @current_time CAST(DATEPART(yyyy@nowAS NVARCHAR(4)) + N'.'
  
RIGHT(N'00' CAST(DATEPART(mm@nowAS NVARCHAR(2)), 2) + N'.'
  
RIGHT(N'00' CAST(DATEPART(dd@nowAS NVARCHAR(2)), 2) + N'.'
  
RIGHT(N'00' CAST(DATEPART(hh@nowAS NVARCHAR(2)), 2) + N'.'
  
RIGHT(N'00' CAST(DATEPART(mi@nowAS NVARCHAR(2)), 2) ;
              
SELECT  @sql @sql N'BACKUP DATABASE ' DatabaseName N' TO DISK = ' @s
        
COALESCE(BackupRoot@backup_root DatabaseName N'\')
        + 
DatabaseName N'.' @current_time N'.bak' @s
        
N' WITH FORMAT ;' @newline
FROM    Maintenance.dbo.BackupControl ;

EXEC sp_executesql @sql ;

After this runs, I run a stored procedure to clean up the backup history – more info about why this is important over here, on Brent Ozar’s blog. The code is in a step that is set to run in msdb:

DECLARE @oldest_date AS DATETIME;
SET @oldest_date DATEADD(mm, -1GETDATE());

EXEC sp_delete_backuphistory @oldest_date;

And, finally, I run a little chunk of PowerShell to delete only backup files:

Get-Item "H:\BACKUPS\OLTP\DEFAULT" | Get-ChildItem | Get-ChildItem |? {$_.CreationTime -lt (Get-Date).AddDays(-2)} | Del

What About My Transaction Logs?

I thought you might ask that… Okay, you probably didn’t because you know I’m using rhetorical devices.

Just like our lovely script to take daily backups, we have a script to take regular transaction log backups. The cool part is that we’re going to tell it to run on a schedule and we don’t have to worry about whether or not the timings sync up perfectly. How can we do that?

At the start of the script, we grab the current time into a local variable. Then we go ahead and pull a list of all of the databases from our BackupControl table that meet the following conditions:

  • The BackupInterval isn’t NULL
  • The modulus of the BackupInterval and the current time is 0 (if you want to run the backup every 7 mintues, it will get run at 12:07, 12:14, 12:21, etc)
  • The database recovery mode is not set to simple.

The code to do this is remarkably similar to the full backup script above:

DECLARE @current_time AS NVARCHAR(50) ;
DECLARE @newline AS NVARCHAR(2NCHAR(13) + NCHAR(10) ;
DECLARE @now AS DATETIME GETDATE() ;
DECLARE @sql AS NVARCHAR(MAX'' ;
DECLARE @s AS NVARCHAR(1'''';

SET @current_time CAST(DATEPART(yyyy@nowAS NVARCHAR(4)) + N'.'
  
RIGHT(N'00' CAST(DATEPART(mm@nowAS NVARCHAR(2)), 2) + N'.'
  
RIGHT(N'00' CAST(DATEPART(dd@nowAS NVARCHAR(2)), 2) ;

SELECT  @sql +N'BACKUP LOG ' DatabaseName N' TO DISK = ' @s BackupRoot
        
DatabaseName N'.' @current_time N'.log.bak' @s @newline
FROM    dbo.BackupControl AS bc
WHERE   BackupInterval IS NOT NULL
        AND (
DATEPART(mi@now) % BackupInterval0
        
AND DATABASEPROPERTYEX(DatabaseName'Recovery') <> 'SIMPLE' ;

EXEC sp_executesql @sql ;

What Else Should I Automate?

Automate everything. If you’ve done it once, automate it. If you don’t want to deal with a problem, figure out how to detect it before it’s a problem and automate that. Through the Dynamic Management Views and built-in functions, it’s possible to collect a wealth of information about SQL Server that you can use to detect your pain points. Turn those pain points around and you’ve got a solid monitoring solution.

This site is protected with Urban Giraffe's plugin 'HTML Purified' and Edward Z. Yang's Powered by HTML Purifier. 401 items have been purified.