<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Facility9 &#187; MongoDB</title>
	<atom:link href="http://facility9.com/category/database/mongodb/feed/" rel="self" type="application/rss+xml" />
	<link>http://facility9.com</link>
	<description>Jeremiah Peschka - professional something or other</description>
	<lastBuildDate>Fri, 06 Jan 2012 15:00:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Three Mistakes I Made With MongoDB</title>
		<link>http://facility9.com/2011/01/three-mistakes-i-made-with-mongodb/</link>
		<comments>http://facility9.com/2011/01/three-mistakes-i-made-with-mongodb/#comments</comments>
		<pubDate>Thu, 06 Jan 2011 14:00:49 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[nosql_syndication]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=2154</guid>
		<description><![CDATA[When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I started progressing through working with MongoDB, I started running into more and more problems. They weren&#8217;t problems with MongoDB, they were problems&#8230;]]></description>
			<content:encoded><![CDATA[<p>When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I started progressing through working with MongoDB, I started running into more and more problems. They weren&#8217;t problems with MongoDB, they were problems with the way I was thinking about MongoDB.</p>
<h3>I Know How Stuff Works</h3>
<p>The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it&#8217;s a good thing. It gets tricky when you start to switch around your frame of reference.</p>
<p>As a former DBA and alleged <a href="http://www.quest.com/newsroom/Jeremiah-Peschka.aspx">database expert</a>, I&#8217;m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it&#8217;s a heck of a lot like an RDBMS (sorry guys, that&#8217;s just the truth of it). It has collections that look and act a lot like tables. There are indexes, there&#8217;s a query engine, there&#8217;s even replication. There are even more features and functionality that map really well between MongoDB and relational databases. It&#8217;s close enough that it&#8217;s painless to make a switch. The paradigms don&#8217;t match up exactly, but there are similarities.</p>
<p>Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy was I ever wrong. I&#8217;ve had to stop thinking that I know anything about MongoDB and look up the answers to questions, rather than assume I know anything. It&#8217;s been frustrating, but it&#8217;s also been educational. </p>
<h3>I Know How Data is Modeled</h3>
<p>It&#8217;s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers. Once you get used to normalization, it&#8217;s easy to fall into that pattern. </p>
<p>Document databases, like MongoDB, don&#8217;t work to their full potential when you normalize your data. Document databases, in my experience, do the opposite; they work best when the data is stored as a document. Document data is similar to the way data is used in the application: child data is stored with a parent. An order&#8217;s line items are just a child collection of the order; there is no <code>OrderHeader</code> and <code>OrderDetails</code> table. Years of working within the same set of rules made it easy to slip into old habits.</p>
<p>Let&#8217;s say I have a number of user created documents in a <code>Documents</code> collection. Documents have an author as well as a set of tags created by the author. In a relational database, we&#8217;d have something like:</p>
<pre><code>CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(30) NOT NULL,
    password VARCHAR(50) NOT NULL
);

CREATE TABLE documents (
    document_id INT PRIMARY KEY,
    title VARCHAR(30) NOT NULL,
    body VARCHAR(MAX) NOT NULL,
    user_id INT NOT NULL REFERENCES users(user_id)
);

CREATE TABLE tags (
   tag_id INT PRIMARY KEY,
   name VARCHAR(30) NOT NULL
);

CREATE TABLE document_tags (
   document_id INT NOT NULL,
   tag_id INT NOT NULL
);
</code></pre>
<p>For a lot of things, this makes perfect sense. With MongoDB, we&#8217;d have a <code>documents</code> collection and a <code>users</code> collection. We might have a separate <code>tags</code> collection as well, but that would be used as an inverted index for searching. The <code>users</code> collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn&#8217;t store only a pointer to the appropriate record in the <code>users</code> collection. The appropriate thing to do would be to cache the user data in the document itself as well as store a pointer to the <code>users</code> collection. The same holds true for the document&#8217;s tags &#8211; why create a join construct between the two collections when we can store all of the tag data we need in the appropriate <code>document</code>? If we decide that we need to find documents by their <code>tags</code> we have two choices:</p>
<ol>
<li>Create an index on the <code>document.tags</code> property</li>
<li>Create a <code>tags</code> inverted index.</li>
</ol>
<blockquote>
<p>If you&#8217;re interested in indexes, the <a href="http://www.mongodb.org/display/DOCS/Indexes">MongoDB documentation</a> on the subject is a great place to start. There is a specific kind of index called a <a href="http://www.mongodb.org/display/DOCS/Multikeys">multikey</a> that allows you to index arrays of values. Inverted indexes are interesting and I covered them while talking about <a href="http://facility9.com/2010/12/16/secondary-indexes-how-would-you-do-it">building secondary indexes in Riak</a>.</p>
</blockquote>
<p>With a document database we&#8217;re trying to minimize the number of reads that we&#8217;re performing at any given time. A document should be a logical construct of whatever application entity we&#8217;re saving. A <code>document</code> would be a record of the document at the time it was saved &#8211; there will be cached information from the <code>user</code>, the document and its associated metadata, and a list of <code>tags</code>. </p>
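<p>As a rough sketch of that shape, the documents below are plain Python dictionaries rather than anything read from MongoDB. The field names (<code>author</code>, <code>user_id</code>, <code>tags</code>) and the <code>build_tag_index</code> helper are assumptions made up for illustration; the helper shows one way a separate <code>tags</code> inverted index could be derived from the embedded tag lists.</p>

```python
from collections import defaultdict

# Hypothetical documents: author details are cached on the document next to a
# pointer (user_id) back to the users collection, and tags are embedded.
documents = [
    {
        "_id": 1,
        "title": "Three Mistakes",
        "author": {"user_id": 42, "username": "jeremiah"},
        "tags": ["mongodb", "nosql"],
    },
    {
        "_id": 2,
        "title": "Replication",
        "author": {"user_id": 42, "username": "jeremiah"},
        "tags": ["mongodb", "replication"],
    },
]


def build_tag_index(docs):
    """Build an inverted index mapping each tag to the ids of documents using it."""
    index = defaultdict(list)
    for doc in docs:
        for tag in doc["tags"]:
            index[tag].append(doc["_id"])
    return dict(index)


tag_index = build_tag_index(documents)
print(tag_index["mongodb"])  # ids of every document tagged "mongodb"
```

<p>Reading one document gives the title, the cached author name, and the tags in a single fetch; the inverted index only matters when searching by tag.</p>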
<p>Of course, since we&#8217;re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don&#8217;t like it, feel free to comment about it.</p>
<h3>I Can Just Drop This In</h3>
<p>Despite appearances and claims to the contrary, MongoDB is not a drop in replacement for an RDBMS. A relational database provides a phenomenal number of features for free &#8211; indexing, declarative referential integrity, transaction support, multi-version concurrency control, and multi-statement transactions to name a few. These features come with a price. Likewise, MongoDB provides a different set of features and they also have a price. </p>
<p>Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used <a href="https://github.com/jnunemaker/mongomapper">MongoMapper</a> to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easy. Using an O/R-M made it feel like I was using a relational database. It made things trickier, especially when I ran into some of the blurry places I&#8217;ve already talked about. In hindsight, I should have used the stock drivers, built my own abstractions, and then replaced them when it became necessary. I don&#8217;t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself for the first time and then replace it when you&#8217;re building more plumbing than functionality.</p>
<h3>Would I Use MongoDB Again?</h3>
<p>Sure, if the project needed it. I don&#8217;t believe that MongoDB is a drop in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that way does a disservice to both MongoDB and the RDBMS: they both have their own strengths and weaknesses.</p>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2011/01/three-mistakes-i-made-with-mongodb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparing MongoDB and SQL Server Replication</title>
		<link>http://facility9.com/2010/11/comparing-mongodb-and-sql-server-replication/</link>
		<comments>http://facility9.com/2010/11/comparing-mongodb-and-sql-server-replication/#comments</comments>
		<pubDate>Tue, 09 Nov 2010 13:00:58 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[nosql_syndication]]></category>
		<category><![CDATA[syndication]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[SQL Server]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=1946</guid>
		<description><![CDATA[MongoDB has replication built in. So does SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What&#8217;s the difference? What makes MongoDB a unique and special snowflake? I recently read a three part series on MongoDB replication (Replication Internals, Getting to Know Your Oplog, Bending the Oplog to Your Will) in an effort to better understand&#8230;]]></description>
			<content:encoded><![CDATA[<p>MongoDB has replication built in. So does SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What&#8217;s the difference? What makes MongoDB a unique and special snowflake?</p>
<p>I recently read a three part series on MongoDB replication (<a href="http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/">Replication Internals</a>, <a href="http://www.snailinaturtleneck.com/blog/2010/10/14/getting-to-know-your-oplog/">Getting to Know Your Oplog</a>, <a href="http://www.snailinaturtleneck.com/blog/2010/10/27/bending-the-oplog-to-your-will/">Bending the Oplog to Your Will</a>) in an effort to better understand MongoDB&#8217;s replication compared to SQL Server&#8217;s replication.</p>
<h3 id="logging_sidebar">Logging Sidebar</h3>
<p>Before we get started, it&#8217;s important to distinguish between the <code>oplog</code> and MongoDB&#8217;s regular log. By default, MongoDB pipes its log to <code>STDOUT</code>&#8230; unless you supply the <code>--logpath</code> command line flag. Logging to <code>STDOUT</code> is fine for development, but you&#8217;ll want to make sure you log to a file for production use. The MongoDB log file is not like SQL Server&#8217;s log. It isn&#8217;t used for recovery playback. It&#8217;s an activity log. Sort of like the logs for your web server.</p>
<h3 id="what8217s_the_same">What&#8217;s The Same?</h3>
<p>Both MongoDB and SQL Server store replicated data in a central repository. SQL Server stores transactions to be replicated in the <code>distribution</code> database. MongoDB stores replicated writes in the <code>oplog</code> collection. The most immediate difference between the two mechanisms is that SQL Server uses the transaction as the demarcation point while MongoDB uses the individual command as the demarcation point.</p>
<p>All of our transactions (MongoDB has transactions&#8230; they&#8217;re just only applied to a single command) are logged. That log is used to ship commands over to a subscriber. Both SQL Server and MongoDB support having multiple subscribers to a single database. In MongoDB, this is referred to as a <a href="http://www.mongodb.org/display/DOCS/Replica+Sets">replica set</a> &#8211; every member of the set will receive all of the commands from the master. MongoDB adds some additional features: any member of a replica set may be promoted to the master server if the original master server dies. This can be configured to happen automatically. </p>
<h3 id="the_ouroboros">The Ouroboros</h3>
<p>The <a href="http://en.wikipedia.org/wiki/Ouroboros">Ouroboros</a> is a mythical creature that devours its own tail. Like the Ouroboros, the MongoDB oplog devours its own tail. In ideal circumstances, this isn&#8217;t a problem. The oplog will happily write away. The replica servers will happily read away and, in general, keep up with the writing to the oplog. </p>
<p>The oplog file is a fixed size so, like the write ahead log in most RDBMSes, it will begin to eat itself again. This is fine&#8230; most of the time.</p>
<p>Unfortunately, if the replicas fall far enough behind, the oplog will overwrite the transactions that the replicas are reading. Yes, you read that correctly &#8211; your database will overwrite undistributed transactions. DBAs will most likely recoil in horror. Why is this bad? Well, under extreme circumstances you may have no integrity. </p>
<p>Let&#8217;s repeat that, just in case you missed it the first time:</p>
<p><strong>There is no guarantee of replica integrity.</strong></p>
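<p>To make the tail-eating concrete, here is a toy model of a fixed-size log. It is only a sketch: <code>ToyOplog</code>, its sequence numbers, and the capacity of three entries are invented for illustration and say nothing about how MongoDB actually stores its oplog.</p>

```python
from collections import deque


class ToyOplog:
    """A fixed-size log that discards its oldest entries once it is full."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest entry when full.
        self.entries = deque(maxlen=capacity)
        self.next_seq = 0  # sequence number the next write will receive

    def write(self, command):
        self.entries.append((self.next_seq, command))
        self.next_seq += 1

    def read_from(self, seq):
        """Return entries from seq onward, or None if seq was already overwritten."""
        oldest = self.entries[0][0] if self.entries else self.next_seq
        if oldest > seq:
            return None  # the replica is too stale to catch up from the log
        return [entry for entry in self.entries if entry[0] >= seq]


oplog = ToyOplog(capacity=3)
for i in range(5):
    oplog.write(f"insert {i}")

caught_up = oplog.read_from(3)  # a replica that has already applied writes 0 to 2
stale = oplog.read_from(0)      # a replica that has not applied anything yet
print(caught_up, stale)
```

<p>The replica that kept up gets the remaining entries back; the one that fell behind gets nothing useful, because the entries it needs have already been overwritten.</p>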
<p>Now, before you put on your angry pants and look at SQL Server Books Online to prove me wrong, this is also entirely possible with transactional replication in SQL Server. It&#8217;s a little bit different, but the principle still applies. When you set up transactional replication in SQL Server, you also need to set up a retention period. If your replication is down for longer than X hours, SQL Server is going to tell you to cram it up your backside and rebuild your replication from scratch.</p>
<h3 id="falling_behind">Falling Behind</h3>
<p>Falling behind is easy to do when a server is under heavy load. But, since MongoDB avoids writing to disk to increase performance, that&#8217;s not a problem, right?</p>
<p>Theoretically yes. In reality that&#8217;s not always the case.</p>
<p>When servers are under a heavy load, a lot of weird things can happen. Heavy network traffic can result in TCP/IP offloading &#8211; the network card can offload work to the CPU. When you&#8217;re using commodity hardware with commodity storage, you might be using software RAID instead of hardware RAID to simulate one giant drive for data. Software RAID can be computationally expensive, especially if you encounter a situation where you start swapping to disk. Before you know it, you have a perfect storm of one off factors that have brought your shiny new server to its knees.</p>
<p>In the process, your oplog is happily writing away. The replica is falling further behind because you&#8217;re reading from your replica and writing to the master (that&#8217;s what we&#8217;re supposed to do, after all). Soon enough, your replicas are out of sync and you&#8217;ve lost data. </p>
<h3 id="falling_off_a_cliff">Falling Off a Cliff</h3>
<p>Unfortunately, in this scenario, you might have problems recovering because the full resync also uses a circular oplog to determine where to start up replication again. The only way you could resolve this nightmare storm would be to shut down your forward facing application, kill incoming requests, and bring the database back online slowly and carefully.</p>
<p>Stopping I/O from incoming writes will make it easy for the replicas to catch up to the master and perform any shard reallocation that you need to split the load up more effectively.</p>
<h3 id="climbing_gear_please">Climbing Gear, Please</h3>
<p>I&#8217;ve bitched a lot in this article about MongoDB&#8217;s replication. As a former DBA, it&#8217;s a scary model. But I&#8217;ve bitched a lot in the past about SQL Server&#8217;s transactional replication &#8211; logs can grow out of control if a subscriber falls behind or dies &#8211; but it happens with good reason. The SQL Server dev team made the assumption that a replica should be consistent with the master. In order to keep a replica consistent, all of the undistributed commands need to be kept somewhere (in a log file) until all of the subscribers/replicas can be brought up to speed. This does result in a massive hit to your disk usage, but it also keeps your replicated databases in sync with the master.</p>
<p>Just like with MongoDB, there are times when a SQL Server subscriber may fall so far behind that you need to rebuild the replication. This is never an easy choice, no matter which platform you&#8217;re using, and it&#8217;s a decision that should not be taken lightly. MongoDB makes this choice a bit easier because MongoDB might very well eat its own <code>oplog</code>. Once that happens, you have no choice but to rebuild replication.</p>
<p>Replication is hard to administer and hard to get right. Be careful and proceed with caution, no matter what your platform.</p>
<h3 id="at_least_there_is_a_ladder">At Least There is a Ladder</h3>
<p>You can climb out of this hole and, realistically, it&#8217;s not that bad of a hole. In specific circumstances you may end up in a situation where you will have to take the front end application offline in order to resync your replicas. It&#8217;s not the best option, but at least there is a solution.</p>
<p>Every feature has a trade off. Relational databases give up some performance to preserve integrity (in this case), whereas MongoDB gains immediate performance at the cost of potential maintenance and recovery problems. </p>
<h3 id="further_reading">Further Reading</h3>
<p><strong>MongoDB</strong></p>
<ul>
<li><a href="http://www.snailinaturtleneck.com/blog/2010/10/12/replication-internals/">Replication Internals</a> </li>
<li><a href="http://www.snailinaturtleneck.com/blog/2010/10/14/getting-to-know-your-oplog/">Getting to Know Your Oplog</a></li>
<li><a href="http://www.snailinaturtleneck.com/blog/2010/10/27/bending-the-oplog-to-your-will/">Bending the Oplog to Your Will</a></li>
<li><a href="http://www.snailinaturtleneck.com/blog/2010/09/17/choose-your-own-adventure-mongodb-crash-recovery-edition/">Choose Your Own Adventure MongoDB Crash Recovery Edition</a></li>
</ul>
<p><strong>SQL Server</strong></p>
<ul>
<li><a href="http://www.sqlskills.com/BLOGS/PAUL/post/In-defense-of-transactional-replication-as-an-HA-technology.aspx">In defense of transactional replication as an HA technology</a></li>
<li><a href="http://download.microsoft.com/download/d/9/4/d948f981-926e-40fa-a026-5bfcf076d9b9/ReplicationAndDBM.docx">SQL Server Replication: Providing High Availability Using Database Mirroring</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2010/11/comparing-mongodb-and-sql-server-replication/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>MongoDB &#8211; Now With Sharding</title>
		<link>http://facility9.com/2010/08/mongodb-now-with-sharding/</link>
		<comments>http://facility9.com/2010/08/mongodb-now-with-sharding/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 14:00:33 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[nosql_syndication]]></category>
		<category><![CDATA[tfc_syndication]]></category>
		<category><![CDATA[HADR]]></category>
		<category><![CDATA[Replication]]></category>
		<category><![CDATA[Sharding]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=1758</guid>
		<description><![CDATA[Version 1.6 of MongoDB was released last week while I was at devLink. What makes this an important point release? Two things: sharding and replica sets. Sharding is a way of partitioning data. Data will be sharded by a key &#8211; this could be based on zip code, customer number, SKU, or any other aspect&#8230;]]></description>
			<content:encoded><![CDATA[<p>Version 1.6 of MongoDB was <a href="http://blog.mongodb.org/post/908172564/mongodb-1-6-released">released last week</a> while I was at <a href="http://devlink.net">devLink</a>. What makes this an important point release? Two things: sharding and replica sets.</p>
<h3 id="sharding"><img src="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/08/mongodb_now_with_sharing-shards1.png" alt="" title="shards of data, original from http://www.flickr.com/photos/psyberartist/2549306315/" width="640" height="150" class="alignnone wp-image-1759" /></h3>
<p>Sharding is a way of partitioning data. Data will be sharded by a key &#8211; this could be based on zip code, customer number, SKU, or any other aspect of the data. Sharding data in MongoDB occurs on a collection by collection basis. If you want related data to be on the same server you&#8217;ll need to pick a sharding key that is present in every collection. However, this design decision makes sense because some collections may grow faster than others. The point of sharding is, in part, to spread load across multiple servers.</p>
<p>What makes this important? Previously, sharding has been difficult to set up and administer and required custom application code to be written and maintained. More important, though, sharding gives MongoDB the ability to scale across multiple servers with minimal effort.</p>
<p>What would happen if a single node in a set of sharded servers got very busy? Well, MongoDB would detect that one of the nodes is growing faster than the others and it would start balancing the load across the other servers. This might seem like it would violate my earlier statement about how we set MongoDB up to use a sharding key that we define. Here&#8217;s the catch: MongoDB only uses that sharding key when we set things up and when there are no problems. If things start getting busy, it will make changes to the sharding key. Those changes get reported throughout the entire cluster of servers and everyone knows where their data is, although nobody outside of the cluster really needs to care.</p>
<h3 id="replica_sets"><a href="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/08/mongdb_now_with_sharding-replica_sets.png"><img src="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/08/mongdb_now_with_sharding-replica_sets.png" alt="Replica Sets" title="Replica Sets" width="640" height="150" class="alignnone wp-image-1768" /></a></h3>
<p>Replica sets are a new and improved way to perform data replication in MongoDB. Basically, we set up replication in a cluster of servers. If any single server fails, another server in the replica set will pick up the load. Once we&#8217;re able to get the dead server back up and running, the replica set will automatically start up a recovery process and our users will never know that there was an outage.</p>
<p>There can be only one master server at any given time, so this protects us from master server failures. Through the magic of network heartbeats, we can be aware of all of the servers in the replica set. Interestingly, the master server is determined by a priority setting that is assigned to each server. This way, we could use older hardware to serve as a backup (or read-only server) to the master and use faster machines in the replica set to take over from the master in the event of any kind of hardware failure.</p>
<h3 id="how_it_works">How It Works</h3>
<div id="attachment_1764" class="wp-caption alignnone" style="width: 582px"><a href="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/08/sharding.png"><img src="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/08/sharding.png" alt="MongoDB Sharding Diagram" title="MongoDB Sharding Diagram" width="572" height="333" class="alignnone wp-image-1764" /></a><p class="wp-caption-text">MongoDB Sharding Diagram</p></div>
<p>Basically, here&#8217;s what happens (if you want more details, please see the <a href="http://www.mongodb.org/display/DOCS/Sharding+Introduction">Sharding Introduction</a>):</p>
<ol>
<li>The <code>mongos</code> server is a router that makes our complicated MongoDB set up look like a single server to the application.</li>
<li>The <code>mongod</code> config servers maintain the shards. They know where data is stored and will attempt to balance the shards if any single node gets out of whack.</li>
<li>Replica sets provide localized redundancy for each shard key.</li>
</ol>
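<p>The routing idea in the list above can be sketched in a few lines. This is a simplified model, assuming a collection sharded on zip code and split into three key ranges; the names <code>CHUNK_BOUNDARIES</code>, <code>CHUNK_OWNERS</code>, <code>route</code>, and <code>move_chunk</code> are hypothetical and are not part of any MongoDB API.</p>

```python
from bisect import bisect_right

# Hypothetical chunk layout for a collection sharded on zip code.
# Each boundary is the first key of the next chunk; chunk i lives on CHUNK_OWNERS[i].
CHUNK_BOUNDARIES = ["30000", "60000"]
CHUNK_OWNERS = ["shard-a", "shard-b", "shard-c"]


def route(shard_key):
    """Return the shard holding the chunk whose key range contains shard_key."""
    return CHUNK_OWNERS[bisect_right(CHUNK_BOUNDARIES, shard_key)]


def move_chunk(chunk_number, new_owner):
    """Rebalance by handing one chunk to another shard; callers keep using route()."""
    CHUNK_OWNERS[chunk_number] = new_owner


print(route("10001"), route("45000"), route("97201"))
```

<p>The application only ever calls the router. When a chunk moves, the lookup table changes and the caller&#8217;s code does not, which is the point of putting a router in front of the shards.</p>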
<h3 id="gotchas">Gotchas</h3>
<p>There are a few things to be aware of when you&#8217;re considering sharding with MongoDB:</p>
<ol>
<li>If a configuration server goes down, you can no longer reallocate data if any shards become write hot spots. This meta-data must be writeable for data to be repartitioned. You can still read and write data, but load will not be distributed.</li>
<li>Choose sharding keys wisely. An overly broad sharding key will do you no good: all data can end up on one node and you will be unable to split the data onto multiple nodes.</li>
<li>Some queries will use multiple shards &#8211; make sure you understand data distribution, querying patterns, and potential sharding keys.</li>
</ol>
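<p>One rough way to judge a candidate sharding key is to ask how much of the data shares its single most common value, on the assumption that documents with the same key value always stay together. The helper and the sample data below are hypothetical and only illustrate the second point in the list.</p>

```python
from collections import Counter


def largest_unsplittable_share(shard_key_values):
    """Return the fraction of documents that share the single most common key value.

    Assuming documents with the same shard key value cannot be separated, this
    fraction is a lower bound on the share of data the busiest node must hold.
    """
    counts = Counter(shard_key_values)
    return max(counts.values()) / len(shard_key_values)


# Hypothetical data: 1,000 documents keyed by country versus by customer number.
by_country = ["US"] * 900 + ["CA"] * 100
by_customer = [f"cust-{n}" for n in range(1000)]

print(largest_unsplittable_share(by_country))
print(largest_unsplittable_share(by_customer))
```

<p>A key like country leaves most of the data stuck together no matter how many servers are added, while a key like customer number leaves plenty of room to split.</p>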
<h3>Photo Credits</h3>
<p><a href='http://www.flickr.com/photos/psyberartist/2549306315/' target='_blank'>glass litter</a> by psyberartist &#8211; Creative Commons Licensed<br />
<a href='http://www.flickr.com/photos/kevenlaw/2487291985/' target='_blank'>I thought I saw a puddy cat&#8230;</a> by Keven Law &#8211; Creative Commons Licensed</p>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2010/08/mongodb-now-with-sharding/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>MongoDB &#8211; Basic Querying</title>
		<link>http://facility9.com/2010/07/mongodb-basic-querying/</link>
		<comments>http://facility9.com/2010/07/mongodb-basic-querying/#comments</comments>
		<pubDate>Thu, 29 Jul 2010 13:00:01 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[nosql_syndication]]></category>
		<category><![CDATA[tfc_syndication]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=1715</guid>
		<description><![CDATA[I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB. Download the sample code. A QuickTime/H.264/iPhone formatted video is also available &#8211; download it now.]]></description>
			<content:encoded><![CDATA[<p>I put together a little video tutorial showing you how to accomplish some basic querying with MongoDB.</p>
<p>Download the <a href='http://facility9presentations.s3.amazonaws.com/MongoDB%20-%20Basic%20Queries.js' target='_blank'>sample code</a>. A QuickTime/H.264/iPhone formatted video is also available &#8211; <a href='http://facility9presentations.s3.amazonaws.com/MongoDB%20-%20Basic%20Queries.mov' target='_blank'>download it now</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2010/07/mongodb-basic-querying/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
<enclosure url="http://facility9presentations.s3.amazonaws.com/MongoDB%20-%20Basic%20Queries.mov" length="12158829" type="video/quicktime" />
		</item>
		<item>
		<title>The Future of Databases</title>
		<link>http://facility9.com/2010/07/the-future-of-databases/</link>
		<comments>http://facility9.com/2010/07/the-future-of-databases/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 13:00:57 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[nosql_syndication]]></category>
		<category><![CDATA[syndication]]></category>
		<category><![CDATA[tfc_syndication]]></category>
		<category><![CDATA[olap]]></category>
		<category><![CDATA[oltp]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=1652</guid>
		<description><![CDATA[The Story So Far I&#8217;ve been in love with data storage since I first opened up SQL*Plus and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up&#8230;]]></description>
			<content:encoded><![CDATA[<h3 id="the_story_so_far">The Story So Far</h3>
<p>I&#8217;ve been in love with data storage since I first opened up <a href="http://en.wikipedia.org/wiki/SQL*Plus">SQL*Plus</a> and issued a select statement. The different ways to store data are fascinating. Even within a traditional OLTP database, there are a variety of design patterns that we can use to facilitate data access. We keep coming up with different ways to solve the problems that we&#8217;re facing in business. The problem is that as the field of computer science advances, and businesses increase in complexity, the ways that we store data must become more complex as well. Exponentially scaling storage complexity isn&#8217;t something that I like to think about, but it&#8217;s a distinct possibility.</p>
<p>General purpose OLTP databases are not the future of data storage and retrieval. They are a single piece in the puzzle. We&#8217;ve been working with OLTP systems for well over 20 years. OLAP is a newer entry, bringing specialized analytical tricks (which are counter intuitive to the way relational data is stored) to the masses. Hell, there are a number of <a href="http://www.microsoft.com/sqlserver/2008/en/us/analysis-services.aspx">general</a> <a href="http://www.oracle.com/technology/products/bi/olap/index.html">purpose</a> <a href="http://www.greenplum.com/">analytical</a> storage engines on the market. These general purpose analytical databases integrate well with existing databases and provide a complement to the transactional specialization of OLTP systems.</p>
<p>That&#8217;s the key, OLTP databases are purpose built <em>transactional</em> databases. They&#8217;re optimized for write operations because way back in the <a href="http://portal.acm.org/citation.cfm?id=362685">dark ages</a> it was far more expensive to write data to disk than it was to read from disk. Data couldn&#8217;t be cached in memory because memory was scarce. Architectural decisions were made. The way that we <a href="http://en.wikipedia.org/wiki/Database_normalization">design our databases</a> is specifically designed to work within this structure. A well designed, normalized, database has minimal duplication of data. In OLTP systems this also serves to minimize the number of writes to disk when a common piece of data needs to be changed. I can remember when I was a kid and the United States Postal Service changed from using three letter state abbreviations to two letter abbreviations. I have to wonder what kind of difficulties this caused for many databases&#8230; </p>
<p>In the 40 years since <a href="http://portal.acm.org/citation.cfm?id=362685">E.F. Codd&#8217;s paper</a> was published, the programming languages that we use have changed considerably. In 1970, COBOL was still relatively new. 1972 saw the introduction of C, and 1975 brought us MS Basic. 1979, 1980, and 1983 saw Ada, Smalltalk-80, Objective-C, and C++ ushering in a wave of object oriented languages. Suddenly programmers weren&#8217;t working on singular data points, they were working with an object that contained a collection of properties. The first ANSI SQL standard was codified in 1986. 1990 gave us Windows 3 and the desktop PC became more than a blinking cursor. The web exploded in 1996, again in 2001, and continues to explode in a frenzy of drop shadows, bevels, mirror effects, and Flash.</p>
<p>Throughout the history of computing, we&#8217;ve been primarily working with tuples of data &#8211; attributes mapped to values; rows to you and me. This model holds up well when we&#8217;re working with an entity composed of a single tuple. What happens, though, when the entity becomes more complex? The model to retrieve and modify the entity becomes more complex as well. We can&#8217;t issue a simple update statement anymore; we have to go through more complex operations to make sure that the data is kept up to date.</p>
<h3 id="examples_should_make_things_clearer">Examples Should Make Things Clearer</h3>
<p>Let&#8217;s take a look at something simple: my phone bill.</p>
<h4 id="in_the_beginning8230">In the beginning&#8230;</h4>
<p>Long ago, a phone bill was probably stored in a relatively simple format:</p>
<ul>
<li>Account Number</li>
<li>Name</li>
<li>Address</li>
<li>Past Due Amount</li>
<li>Current Amount Due</li>
<li>Due Date</li>
</ul>
<p>This was simple and it worked. Detailed records would be kept on printed pieces of paper in a big, smelly, damp basement where they could successfully grow mold and other assorted fungi. Whenever a customer had a dispute, a clerk would have to visit the records room and pull up the customer&#8217;s information. This was a manual process that probably involved a lot of letter writing, cursing, and typewriter ribbon.</p>
<p>Eventually, this simple bill format would prove to be unreliable (P.S. I&#8217;m totally making this up just to illustrate a point, but I&#8217;m guessing it went something like this). In our example, there&#8217;s no way to tell when a customer paid or which customer was billed. </p>
<h4 id="after_some_tinkering8230">After some tinkering&#8230;</h4>
<p>After a few more iterations, you probably end up with a way of storing a customer&#8217;s information and bills that looks something like this:</p>
<p>
<a href="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/07/The-New-Bill.png"><img src="http://d1kpgdt94igfig.cloudfront.net/wp-content/uploads/2010/07/The-New-Bill.png" alt="" title="A more complex bill hang-2-column" width="528" height="335" class="alignnone wp-image-1653" /></a></p>
<p>This is a lot more complicated from both a design perspective and an implementation perspective. One of the things that makes this implementation more difficult is that there are a number of intermediate tables to work with and these tables can become hotspots for reads as well as writes.</p>
<p>When you look at that design, be honest with yourself and answer this question:</p>
<blockquote>
<p>How often will you view a single service history or general charge row?</p>
</blockquote>
<p>Think about your answer. The fact is, you probably won&#8217;t read any of those rows on its own. You might update one if a change comes in from an external source, but otherwise all of the charges, history, etc. on any given phone bill will always be read as a unit. In this particular instance, we&#8217;re always consuming a bill&#8217;s entire <a href="http://en.wikipedia.org/wiki/Graph_(data_structure)">graph</a> at once. Reading a bill into memory is an onerous prospect, not to mention that summarizing phone bills in this system is a read intensive operation.</p>
<h3 id="fixing_the_glitch">Fixing the glitch</h3>
<p>There are a lot of ways these problems could be worked around in a traditional OLTP database. However, that&#8217;s not the point. The point is that there are problems that require actual workarounds. OLTP databases work well for many use cases, but in this case an OLTP database becomes a problem because of the high cost of reading vs writing. (Why should we read-optimize a system that was designed to be write-optimized when writes will probably account for only 10% of our activity, maybe less?)</p>
<p>I&#8217;ve hinted at how we fix the glitch at the beginning of this article &#8211; we look for a specialized database. In our case, we can use something called a <a href="http://en.wikipedia.org/wiki/Document-oriented_database">document database</a>. The advantage of a document database is that we&#8217;re storing an organized collection of values in the database. This collection of values is similar to a traditional tabular database &#8211; we have groups of similar data stored in named collections. The distinction comes in how the data is accessed. </p>
<p>When we&#8217;re saving a phone bill, we don&#8217;t have to worry about calling multiple stored procedures or a single complex procedure. There&#8217;s no need to create complex mappings between a database and our code. We create an object or object graph in the application code and save it. The software that we use to connect to our document database knows how to properly translate our fancy objects into data stored on a disk somewhere.</p>
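<p>As a rough sketch of what &#8220;create an object graph and save it&#8221; can look like, the JavaScript below builds a bill as one nested object and persists it in a single step. The field names (<code>account</code>, <code>charges</code>, <code>service_history</code>) and the <code>saveBill</code> and <code>loadBill</code> helpers are hypothetical and exist only for illustration; the in-memory <code>Map</code> stands in for whatever a real driver would talk to.</p>

```javascript
// Hypothetical bill document: the account, charges, and service history
// travel together as one nested object instead of rows in several tables.
const bill = {
  _id: 'bill-1001',
  account: { number: 'A-42', last_name: 'Smith', state: 'Virginia' },
  due_date: '2010-08-01',
  charges: [
    { type: 'monthly service', amount: 39.99 },
    { type: 'state tax', amount: 2.1 }
  ],
  service_history: [
    { date: '2010-07-03', description: 'Voicemail added' }
  ]
};

// Stand-in for a document store: one key, one serialized document.
const store = new Map();

function saveBill(doc) {
  // One write persists the whole graph; no per-table mapping code.
  store.set(doc._id, JSON.stringify(doc));
}

function loadBill(id) {
  // One read brings back the whole graph.
  return JSON.parse(store.get(id));
}

saveBill(bill);
const loaded = loadBill('bill-1001');
const total = loaded.charges.reduce((sum, c) => sum + c.amount, 0);
console.log(loaded.account.last_name, total.toFixed(2));
```

<p>The point of the sketch is only that the application deals with one object going in and one object coming out; how a given database lays that object out on disk is up to the database.</p>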
<p>This solution has several upsides:</p>
<ul>
<li>Related data is stored in close proximity on disk</li>
<li>Documents do not require strict structure</li>
<li>Documents may change properties without requiring complex changes to physical schema</li>
</ul>
<h4 id="physical_proximity">Physical Proximity</h4>
<p>My data is close together, so what?</p>
<p>In a traditional OLTP database, your data may be scattered across one or multiple disk drives. Physical drive platters will have to spin to locate the data on different parts of your storage medium. Drive read/write arms will have to move around in coordination with the spinning platters. The more complex your query, the more complex the dance your physical hardware will have to do; a simple high school slow dance turns into a tango.</p>
<p>In a document database, all of our data is stored together in a single record. When we want to read our bill, we just have to start reading at the beginning of the bill record and stop when we come to the end. There&#8217;s no need to seek around on the disk.</p>
<p>You might be worried that all of your data won&#8217;t be close together on disk. And you&#8217;d be right. However, many databases (including MongoDB) allow for the creation of secondary indexes to speed up data retrieval. The biggest question you need to ask yourself is &#8220;How will the applications be accessing the data?&#8221; In many applications we&#8217;re only acting on a single object. Even when our application isn&#8217;t acting on a single object, we can pre-aggregate the data for faster reporting and retrieval. When our application only works on a single object at a time, a document database provides distinct advantages &#8211; every time we need an object, we&#8217;ll be pulling back all of the data we need in a single read operation.</p>
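<p>Pre-aggregation can be sketched in a few lines. The example below is an assumption about how one might do it, not a feature of any particular database: a hypothetical <code>recordBill</code> function updates a running summary per account on every write, so a report becomes a single lookup.</p>

```javascript
// Hypothetical pre-aggregation: keep a running summary per account that is
// updated on every write, so reports never rescan individual bills.
const summaries = new Map();

function recordBill(accountNumber, amountDue) {
  const current = summaries.get(accountNumber) || { bills: 0, totalDue: 0 };
  summaries.set(accountNumber, {
    bills: current.bills + 1,
    totalDue: current.totalDue + amountDue
  });
}

recordBill('A-42', 40);
recordBill('A-42', 45);
recordBill('A-77', 30);

// A report is now a single lookup instead of a scan over every bill.
const report = summaries.get('A-42');
console.log(report.bills, report.totalDue);
```

<p>The trade is a little extra work on each write in exchange for cheap reads, which suits a workload where reads dominate.</p>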
<h4 id="strict_structure">Strict Structure</h4>
<p>Databases typically require data to be rigidly structured. A table has a fixed set of columns. The datatypes, precision, and nullability can vary from column to column, but every row will have the same layout. Trying to store wildly diverse and variable data in a fixed storage medium is difficult. </p>
<p>Thankfully, document databases are well-suited to storing semi-structured data &#8211; since our data is a collection of attributes, it&#8217;s very easy to add or remove attributes and to change querying strategies rapidly in response to changing data structures. Better yet, document databases let us be ignorant of how the data is stored. If we want to find all bills where the account holder&#8217;s last name is &#8216;Smith&#8217; and they live in Virginia but the bill doesn&#8217;t have any county tax charges, it is very easy compared to constructing the query in a typical SQL database.</p>
<p>Using <a href="http://mongodb.org">MongoDB</a> our query might look like:</p>
<pre>db.bills.find( { last_name : 'Smith',
                 state : 'Virginia',
                 // $ne on an array field matches only when no element has this type
                 'charges.type' : { $ne : 'county tax' } } )
</pre>
<p>Compared to similar SQL:</p>
<pre>SELECT  b.*
FROM    bills b
        JOIN accounts a ON b.account_id = a.id
        LEFT JOIN charges c ON b.id = c.bill_id
                               AND c.type = 'county tax'
WHERE   a.last_name = 'Smith'
        AND a.state = 'Virginia'
        AND c.id IS NULL -- keep only bills with no county tax charge
</pre>
<p>And right about now, every DBA that reads this blog is going to be shaking with rage and yelling &#8220;But that SQL is perfectly clear, I don&#8217;t know how you can expect me to understand all of those curly brackets!&#8221; I don&#8217;t expect you to understand those curly brackets. Nor do I expect developers to understand SQL. The easiest way for us to develop is to use our natural paradigm. That&#8217;s why developers write code in C#, PHP, or Ruby and DBAs do their work in some dialect of SQL. MongoDB alleviates this because all the developers are doing is constructing a list of keys and values that must be matched before a document can be returned.</p>
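<p>To make &#8220;a list of keys and values that must be matched&#8221; concrete, here is a minimal sketch in plain JavaScript. It is not how MongoDB evaluates a query; the <code>matches</code> and <code>hasChargeOfType</code> functions and the sample bills are made up for illustration, and the only idea borrowed from the article is that a document is returned when every key in the query lines up.</p>

```javascript
// A tiny in-memory matcher, NOT MongoDB's implementation: every key in the
// query must equal the same key in the document.
function matches(doc, query) {
  return Object.keys(query).every((key) => doc[key] === query[key]);
}

// Hypothetical helper for the "no county tax charge" condition.
function hasChargeOfType(doc, type) {
  return (doc.charges || []).some((charge) => charge.type === type);
}

const bills = [
  { id: 1, last_name: 'Smith', state: 'Virginia',
    charges: [{ type: 'state tax' }] },
  { id: 2, last_name: 'Smith', state: 'Virginia',
    charges: [{ type: 'county tax' }] },
  { id: 3, last_name: 'Jones', state: 'Virginia', charges: [] }
];

const found = bills
  .filter((bill) => matches(bill, { last_name: 'Smith', state: 'Virginia' }))
  .filter((bill) => !hasChargeOfType(bill, 'county tax'))
  .map((bill) => bill.id);

console.log(found); // only bill 1 satisfies all three conditions
```

<p>The query is just data: a developer builds it with the same hashes and arrays used everywhere else in the application.</p>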
<h4 id="changing_the_schema">Changing the Schema</h4>
<p>Changing the schema of an OLTP database can be an onerous task. You have to wait for, or schedule, down time. Modifications have to take place. Of course, the schema modifications need to take into account any actions (like triggers or replication) that may occur in the background. This alone can require significant skill and internal database engine knowledge to write. It&#8217;s not something that application developers should be expected to know. Why do I mention application developers? 99 times out of 100, they&#8217;re the ones who are working on the database, not a dedicated DBA.</p>
<p>Many newer, non-traditional, databases make it incredibly easy to change the schema &#8211; just start writing the new attribute. The database itself takes care of the changes and will take that into account during querying. When a query is issued for a new attribute, records without that attribute will be ignored (just like a column with a NULL value in a traditional database).</p>
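<p>The behaviour described above can be imitated with ordinary objects. In this sketch the <code>paperless_billing</code> attribute and the sample documents are hypothetical; the filter simply shows that documents written before the attribute existed fall out of an equality match without any schema change.</p>

```javascript
// Two generations of the same hypothetical document; only the newer ones
// carry the paperless_billing attribute.
const accountDocs = [
  { account: 'A-1', last_name: 'Smith' },
  { account: 'A-2', last_name: 'Jones', paperless_billing: true },
  { account: 'A-3', last_name: 'Nguyen', paperless_billing: false }
];

// Equality on the new attribute: documents without it simply do not match,
// much as a NULL column fails an equality test in SQL.
const paperless = accountDocs
  .filter((doc) => doc.paperless_billing === true)
  .map((doc) => doc.account);

// Documents written before the attribute existed can still be found
// explicitly when a backfill is needed.
const missingAttribute = accountDocs
  .filter((doc) => !('paperless_billing' in doc))
  .map((doc) => doc.account);

console.log(paperless, missingAttribute);
```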
<h3 id="what_about_analytics">What about Analytics?</h3>
<p>I don&#8217;t know a lot about analytical databases, in part because they require a different skill set than the one I&#8217;ve developed. I do know a few things about them, though.</p>
<p>Analytical databases are currently encumbered by some of the same problems as OLTP databases &#8211; data is stored in tables made up of rows and columns. Sure, these are called dimensions/facts and attributes, but the premise is the same &#8211; it&#8217;s a row-based data store.</p>
<p>Row-based data stores pose particular problems for analytic databases. Analytic databases throw most of the rules about normalization in the garbage and instead duplicate data willy nilly. Without joins, it&#8217;s very quick to query and aggregate data. But the problem still remains that there is a large quantity of repeated data being stored on disk.</p>
<p><a href="http://wiki.toadforcloud.com/index.php/Survey_distributed_databases#Columnar_Databases">Columnar databases</a> attempt to solve this problem by compressing columns with similar values and using some kind of magical method to link up columnar values with their respective rows. Sounds complicated, right? Well, it probably is. Let&#8217;s say you have a table with 10,000,000,000 rows and the CalendarYear column is a CHAR(4). If there are only 25 different values for CalendarYear in the database, would you rather store 40,000,000,000 bytes of data or 100 bytes of data? I know which makes more sense to me.</p>
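<p>One common way to get that kind of saving is dictionary encoding, sketched below. This is an illustration under my own assumptions rather than a description of any specific columnar engine: the <code>dictionaryEncode</code> function and the sample years are invented. Note that the 100 bytes covers only the dictionary of distinct values; each row still needs a small code pointing into it, although those codes are much smaller than the original strings and tend to compress well.</p>

```javascript
// Dictionary-encode a hypothetical CalendarYear column: each distinct
// CHAR(4) value is stored once, and every row keeps a small integer code.
function dictionaryEncode(values) {
  const dictionary = [];
  const codeByValue = new Map();
  const codes = values.map((value) => {
    if (!codeByValue.has(value)) {
      codeByValue.set(value, dictionary.length);
      dictionary.push(value);
    }
    return codeByValue.get(value);
  });
  return { dictionary, codes };
}

const years = ['2008', '2009', '2009', '2010', '2008', '2010', '2010'];
const encoded = dictionaryEncode(years);

// Decoding restores the original column exactly.
const decoded = encoded.codes.map((code) => encoded.dictionary[code]);

// Sizes from the article's example: 10,000,000,000 rows, 25 distinct years.
const rawBytes = 10000000000 * 4;   // every row stores CHAR(4)
const dictionaryBytes = 25 * 4;     // each distinct year stored once
console.log(encoded.dictionary, encoded.codes, rawBytes, dictionaryBytes);
```

<p>With only 25 distinct values, each per-row code fits in 5 bits, so even before further compression the column shrinks to a fraction of its raw size.</p>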
<p>Interestingly enough, there are two approaches being taken to solving this problem. The first is by creating single-purpose columnar databases. There are <a href="http://www.sybase.com/products/datawarehousing/sybaseiq">several</a> <a href="http://infinidb.org/">vendors</a> providing dedicated columnar databases. Other database developers are looking for ways to leverage their <a href="http://blog.tapoueh.org/char10.html#sec10">existing database engines</a> and create hybrid row and columnar databases.</p>
<h3 id="looking_into_the_future">Looking Into the Future</h3>
<p>There are a lot of interesting developments going on in the database world. Many of them seem to be happening outside of the big vendor, traditional database space. Most of this work is being done to solve a particular business need. These aren&#8217;t the traditional row-based OLTP systems that we&#8217;re all familiar with from the last 30 years of database development. These are new, special purpose, databases. It&#8217;s best to think of them like a sports car or even a race car &#8211; they get around the track very quickly, but they would be a poor choice for getting your groceries.</p>
<p>The next time you start a new project or plan a new server deployment, think about what functionality you need. Is it necessary to have full transactional support? Do you need a row-based store? How will you use the data?</p>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2010/07/the-future-of-databases/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>T-SQL Tuesday &#8211; &#8220;Did he just say that?&#8221; edition</title>
		<link>http://facility9.com/2010/06/t-sql-tuesday-did-he-just-say-that-edition/</link>
		<comments>http://facility9.com/2010/06/t-sql-tuesday-did-he-just-say-that-edition/#comments</comments>
		<pubDate>Tue, 15 Jun 2010 12:00:46 +0000</pubDate>
		<dc:creator>Jeremiah Peschka</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[MongoDB]]></category>
		<category><![CDATA[pg_syndication]]></category>
		<category><![CDATA[PostgreSQL]]></category>
		<category><![CDATA[SQL Server]]></category>
		<category><![CDATA[syndication]]></category>
		<category><![CDATA[AppDev]]></category>

		<guid isPermaLink="false">http://facility9.com/?p=1621</guid>
		<description><![CDATA[As in, I didn&#8217;t participate in the most recent T-SQL Tuesday about my favorite feature in SQL 2008 R2. Want to know my favorite 2008R2 features? PostgreSQL 9.0 and MongoDB. PostgreSQL and MongoDB are rapidly advancing features that solve my daily problems. I&#8217;m an OLTP guy. Honestly, I don&#8217;t care about the latest reporting ding&#8230;]]></description>
			<content:encoded><![CDATA[<p>As in, I didn&#8217;t participate in the most recent T-SQL Tuesday about my <a href='http://sqlchicken.com/2010/06/t-sql-tuesday-007-summertime-in-the-sql/' target='_blank'>favorite feature in SQL 2008 R2</a>.</p>
<p>Want to know my favorite 2008R2 features? <a href='http://www.postgresql.org/about/news.1210' target='_blank'>PostgreSQL 9.0</a> and <a href='http://mongodb.com/' target='_blank'>MongoDB</a>.</p>
<p>PostgreSQL and MongoDB are rapidly advancing features that solve my daily problems. I&#8217;m an OLTP guy. Honestly, I don&#8217;t care about the latest reporting ding dongs and doo dads. I already know <a href='http://perl.org' target='powershell is perl'>PowerShell</a>. I manage 6 production servers and we&#8217;re unlikely to grow, so these MDM and Utility Control Points don&#8217;t make me giddy with excitement. </p>
<p>I solve problems using SQL.</p>
<p>You know what makes me happy? <a href='http://www.postgresql.org/docs/8.4/interactive/tutorial-window.html' target='_blank'>Support for window functions</a> or better yet, <a href='http://www.depesz.com/index.php/2010/02/17/waiting-for-9-0-extended-frames-for-window-functions/' target='_blank'>improved support for window functions</a>. What about <a href='http://www.pgcon.org/2010/schedule/events/201.en.html' target='_blank'>exclusion constraints</a>? Or column level triggers so I don&#8217;t have to use branching logic in my triggers? Yes, I use triggers. Or any other of <a href='http://www.postgresonline.com/journal/index.php?/archives/164-What-is-new-in-PostgreSQL-9.0.html#extended'>these features</a>.</p>
<p>What about MongoDB? I&#8217;ve just started playing with it, but it solves a lot of the problems I face at work. Not just in the day job, but in side projects as well. I&#8217;ve bitched about O/R-Ms before, but one of the biggest problems that users of O/R-Ms (developers) face is that the ideal way to model data for object-oriented programming bears no resemblance to the ideal way to store relational data. A recent article about <a href='http://gigaom.com/2010/06/08/how-zynga-survived-farmville/' target='_blank'>scaling Farmville</a> hints at this &#8211; the developers of Farmville managed to scale by storing everything in a key-value store (memcached) before persisting to a relational data store later. Digg does <a href='http://about.digg.com/blog/looking-future-cassandra' target='_blank'>something</a> <a href='http://about.digg.com/node/564' target='_blank'>similar</a> with Cassandra. It&#8217;s not like these guys are idiots; the blog posts from Digg show that they know their stuff.</p>
<p>MongoDB lets me solve these problems. I can control the frequency of syncs to disk (just as I can in PostgreSQL) to improve raw write performance to memory. I only have to worry about storing data the way my application expects to see the data &#8211; arrays and hashes &#8211; without worrying about building many-to-many join tables. </p>
<p>What about DMVs and data integrity and a write-ahead log and indexes? MongoDB has <a href='http://www.mongodb.org/display/DOCS/Monitoring+and+Diagnostics' target='_blank'>instrumentation</a> and  <a href='http://www.mongodb.org/display/DOCS/Indexes' target='_blank'>indexes</a>. Yeah, you <a href='http://www.mongodb.org/display/DOCS/Durability+and+Repair' target='_blank'>sacrifice some durability</a> but many applications don&#8217;t need that. Hell, Amazon has even designed their systems to account for the <a href='http://www.infoq.com/news/2009/01/EventuallyConsistent,' target='_blank'>potential of failure</a>. </p>
<p>When I start my next application, I&#8217;m going to look long and hard at the platform I&#8217;m building on. There&#8217;s a good chance it&#8217;s not going to be a relational database and a really good chance it&#8217;s not going to be using SQL Server. It&#8217;s not because I have a problem with SQL Server or even RDBMSes, but because there are other choices that give me the flexibility I need to solve the problems I&#8217;m facing.</p>
<p>This is nothing against SQL Server 2008 R2; it&#8217;s a great step forward in the direction that SQL Server seems to be going. Sometimes I wonder if SQL Server and I are on the same road.</p>
]]></content:encoded>
			<wfw:commentRss>http://facility9.com/2010/06/t-sql-tuesday-did-he-just-say-that-edition/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

