I should have written this right when I got back from Hadoop World, instead of a week or so later, but things don’t always happen the way you plan. Before I left to go to Hadoop World (and points in between), I put up a blog post asking for questions about Hadoop. You guys responded with some good questions and I think I owe you answers.
What Is Hadoop?
Hadoop isn’t a simple database; it’s a bunch of different technologies built on top of the Hadoop common utilities, MapReduce, and HDFS (Hadoop Distributed File System). Each of these products serves a single purpose: HDFS handles storage, MapReduce is a parallel job distribution system, and HBase is a distributed database with support for structured tables. You can find out more on the Apache Hadoop page. I’m bailing on this question because 1) it wasn’t asked and 2) I don’t think it’s fair or appropriate for me to regurgitate these answers.
How Do You Install It?
Installing Hadoop is not quite as easy as installing Cassandra. There are two flavors to choose from:
Cloudera flavors
- If you’re running Linux (which is the easiest way to do this), just follow the instructions to use Cloudera’s Hadoop repositories.
- If you don’t have a Linux distribution handy, you can download a VM from Cloudera (yeah, it’s that easy).
- If you really want to run Cloudera’s Hadoop on Windows, you will need to install Cygwin and create a Linux-like environment. Check out Hadoop on Windows with Eclipse or Running Hadoop on Windows.
Homemade flavors
- If you want to run Hadoop natively on your local machine, you can go through the single node setup from the Apache Hadoop documentation.
- Hadoop: The Definitive Guide also has info on how to get started.
- Windows users, keep in mind that Hadoop is not supported on Windows in production. If you want to try it, you’re completely on your own and in uncharted waters.
What Login Security Model(s) Does It Have?
Good question! NoSQL databases are not renowned for their data security.
Back in the beginning, at the dawn of Hadoopery, it was assumed that everyone accessing the system would be a trusted individual operating inside a secured environment. Clearly this happy ideal won’t fly in a modern, litigation-fueled world. As a result, Hadoop now has support for Kerberos authentication (“now” meaning all versions of Hadoop newer than version 0.22 for the Apache Hadoop distribution, Cloudera’s CDH3 distribution, or the 0.20.S Yahoo! Distribution of Hadoop). The Kerberos piece handles authentication, proving that a user is who they claim to be, but it is still up to Hadoop and HDFS (it’s a distributed file system, remember?) to make sure that an authenticated user is authorized to mess around with a file.
In short: Kerberos guarantees that I’m Jeremiah Peschka, and Hadoop+HDFS decide whether I’m allowed to muck with the data.
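Here’s a rough sketch of that split in Python. It assumes the identity has already been verified (that’s the Kerberos part, not shown) and only models the authorization check, using the POSIX-style owner/group/other permission bits that HDFS applies to files. The `User` and `FileStatus` classes and the `is_authorized` function are names I made up for illustration; they are not Hadoop APIs.

```python
from dataclasses import dataclass

READ, WRITE = 4, 2  # POSIX-style permission bits


@dataclass(frozen=True)
class User:
    """Hypothetical stand-in for an identity that Kerberos has already verified."""
    name: str
    groups: frozenset


@dataclass(frozen=True)
class FileStatus:
    """Hypothetical stand-in for a file's owner, group, and mode (e.g. 0o640)."""
    owner: str
    group: str
    mode: int


def is_authorized(user, status, wanted):
    """Authorization only: pick the owner, group, or other bits and test them."""
    if user.name == status.owner:
        bits = (status.mode >> 6) & 7
    elif status.group in user.groups:
        bits = (status.mode >> 3) & 7
    else:
        bits = status.mode & 7
    return (bits & wanted) == wanted


report = FileStatus(owner="jeremiah", group="analysts", mode=0o640)
jeremiah = User("jeremiah", frozenset({"analysts"}))
analyst = User("pat", frozenset({"analysts"}))
stranger = User("sam", frozenset())

print(is_authorized(jeremiah, report, WRITE))  # owner has rw-
print(is_authorized(analyst, report, WRITE))   # group has r-- only
print(is_authorized(stranger, report, READ))   # other has ---
```

All three users are authenticated in this sketch; only the first is allowed to write the file.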
N.B. As of 2010-10-19 (when I’m writing this), Hadoop 0.22 is not available for general use. If you want to run Hadoop 0.22, you’ll have to build from the source control repository yourself. Good luck and godspeed.
When Does It Make Sense To Use Hadoop Instead of SQL Server, Oracle, or DB2?
Or any RDBMS for that matter. Personally, I think an example makes this easier to understand.
Let’s say that we have 1 terabyte of data (the average Hadoop cluster is reported as being 114.5TB in size) and we need to process this data nightly – create reports, analyze trends, detect anomalous data, etc. If we are using an RDBMS, we’d have to batch this data up (to avoid transaction log problems). We would also need to deal with the limits on parallelism in our OS/RDBMS combination, as well as I/O subsystem limitations (we can only read so much data at one time). On top of that, SQL dialects are remarkably bad languages for loops and flow control.
If we were using Hadoop for our batch operations, we’d write a MapReduce program (think of it like a query, for now). This MapReduce program would be distributed across all of the nodes in our cluster and then run in parallel using local storage and resources. So instead of hitting a single SAN across 8 or 16 parallel execution threads, we might have 20 commodity servers all processing 1/20 of the data simultaneously. Each server is going to process roughly 50GB. The results will be combined once the job is done and then we’re free to do whatever we want with them – if we’re using SQL Server to store the data in tables for reporting, we would probably bcp the data into a table.
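To make the split-process-combine idea concrete, here’s a toy simulation in plain Python. This is not Hadoop code: the “nodes” are just threads on one machine, the sales data is made up, and `map_partition` folds the map step and a local combine into one function to keep things short. Every function name here is my own.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_nodes):
    """Deal records out round-robin, one partition per simulated node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_partition(partition):
    """Map (plus local combine): total the sales per store in one node's slice."""
    totals = Counter()
    for store, amount in partition:
        totals[store] += amount
    return totals


def reduce_totals(partials):
    """Reduce: merge the per-node partial totals into one result."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds the counts together
    return combined


def run_job(records, num_nodes=20):
    partitions = split_into_partitions(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_totals(partials)


# Made-up data: 200 (store, amount) records spread across 20 simulated nodes.
sales = [("store_a", 10), ("store_b", 5), ("store_a", 7), ("store_c", 1)] * 50
totals = run_job(sales, num_nodes=20)
print(dict(totals))
```

The important property is that each partition is processed without looking at any other partition, which is what lets a real cluster run the map step on 20 machines at once.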
Another benefit of Hadoop is that the MapReduce functionality is implemented in Java, C, or another imperative programming language. This makes it easy to perform computationally intensive operations in MapReduce programs. There are a large number of programming problems that cannot be easily solved in a SQL dialect; SQL is designed for retrieving data and performing relatively simple transformations on it, not for complex programming tasks.
Hadoop (Hadoop common + HDFS + MapReduce) is great for batch processing. If you need to consume massive amounts of data, it’s the right tool for the job. Hadoop is being used in production for point of sale analysis, fraud detection, machine learning, risk analysis, and fault/failure prediction.
The other consideration is cost: Hadoop is deployed on commodity servers using local disks and no software licensing fees (Linux is free, Hadoop is free). The same thing with an RDBMS is going to involve buying an expensive SAN, a high end server with a lot of RAM, and paying licensing fees to Microsoft, Oracle, or IBM as well as any other vendors involved. In the end, it can cost 10 times less to store 1TB of data in Hadoop than in an RDBMS.
HBase provides the random realtime read/write that you might be used to from the OLTP world. The difference, however, is that this is still NoSQL data access. Tables are large and irregular (much like Google’s Bigtable) and there are no complex transactions.
Hive is a data warehouse toolset that sits on top of Hadoop. It supports many of the features that you’re used to seeing in SQL, including a SQL-ish query language that should be easy to learn, and it also supports using MapReduce functionality in ad hoc queries.
In short, Hadoop and SQL Server solve different sets of problems. If you need transactional support, complex rule validation and data integrity, use an RDBMS. If you need to process data in parallel, perform batch and ad hoc analysis, or perform computationally expensive transformations then you should look into Hadoop.
What Tools Are Provided to Manage Hadoop?
Unfortunately, there are not a lot of management tools on the market for Hadoop – the only tools I found were supplied by Cloudera. APIs are available to develop your own management tools. From the sound of discussions that I overheard, I think a lot of developers are going down this route – they’re developing monitoring tools that meet their own internal needs rather than building general purpose tools that they could sell to third parties. As the core product improves, I’m sure that more vendors will be stepping up to the plate to provide additional tools and support for Hadoop.
Right now, there are a few products on the market that support Hadoop:
- Quest’s Toad for Cloud makes it easy to query data stored using HBase and Hive.
- Quest’s OraOop is an Oracle database driver for Sqoop – you can think of Sqoop as Hadoop’s equivalent of SQL Server’s bcp program.
- Karmasphere Studio is a tool for writing, debugging, and watching MapReduce jobs.
What Is The State of Write Consistency?
As I mentioned earlier, there is no support for ACID transactions. On the most basic level, Hadoop processes data in bulk from files stored in HDFS. When writes fail, they’re retried until the minimum guaranteed number of replicas is written. This is a built-in part of HDFS; you get this consistency for free just by using Hadoop.
As far as eventual consistency is concerned, Hadoop uses an interesting method to ensure that data is quickly and effectively written to disk. Basically, HDFS finds a place on disk to write the data. Once that write has completed, the data is forwarded to another node in the cluster. If that node fails to write the data, a new node is picked and the data is written until the minimum number of replicas have had data written to them. There are additional routines in place that will attempt to spread the data throughout the data center.
If this seems like an overly simple explanation of how HDFS and Hadoop write data, you’re right. The specifics of write consistency can be found in the HDFS Architecture Guide (which is nowhere near as scary as it sounds).
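For the curious, the retry-until-replicated behavior described above can be sketched in a few lines of Python. This is a toy model of my simplified explanation, not the real HDFS write pipeline: `write_block`, the node names, and the `can_write` callback are all hypothetical, and the real thing also worries about rack placement, acknowledgements, and a lot more.

```python
def write_block(block, nodes, min_replicas, can_write):
    """Write `block` to one node after another until `min_replicas` copies exist.

    `can_write(node, block)` is a hypothetical stand-in for a node accepting
    the write; it returns False when the node fails.
    """
    replicas = []
    candidates = list(nodes)
    while len(replicas) < min_replicas:
        if not candidates:
            raise RuntimeError("not enough healthy nodes to reach min_replicas")
        node = candidates.pop(0)
        if can_write(node, block):
            replicas.append(node)
        # On failure we simply move on and pick the next candidate node.
    return replicas


# Made-up cluster where node2 is having a bad day.
failing = {"node2"}
cluster = ["node1", "node2", "node3", "node4"]
placed = write_block(b"some data", cluster, 3, lambda node, _: node not in failing)
print(placed)  # ['node1', 'node3', 'node4']
```

The write to node2 fails, so a new node is picked and the block still ends up with three replicas.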
Comments
Jeremiah – I’m a fan of the new outside-the-box non-SQL Server material. Keep it up! I know you’ll be busy the week of the summit, but hopefully I will have a chance to say hey.
Glad to hear that you’re enjoying the new stuff! I should be easy to find at the Summit, look me up if you aren’t too busy. We can grab lunch or dinner or something.
Likewise, I found this post interesting. It’s good to expand your horizons and I think a lot of ‘SQL people’ have irrational fears about non-SQL solutions that a little more knowledge and awareness could clear up.
Thank you for a very informative article. You mention that RDBMS’s are best suited for “transactional support, complex rule validation and data integrity”, while Hadoop is much better for data processing (to lump all your specifics into one term).
Our application utilizes a high-volume, high-transaction RDBMS; we depend on the capabilities already listed from an industrial-strength RDBMS. However, as the amount of data grows, we are looking for tools to help us continue to provide fast analytics and reporting. Hadoop sounds like it could be a solution for us. However, like I said, we are not going to move the back-end to Hadoop. Is there any method by which we could populate a Hadoop system from our RDBMS on a regular basis?
I did some searching and found some products and solutions for a one-time move from an RDBMS to Hadoop, assuming Hadoop would now be the system of record, but nothing that would allow you to use Hadoop as a regularly-scheduled updating data warehouse companion to the RDBMS.
Any insight into this scenario?
Thanks.
One solution would be to create ETL jobs to pull recently modified data from your RDBMS and push that data into whichever Hadoop datastore you end up using. There is a product called Sqoop that will move data in between an RDBMS and Hadoop. Sqoop can be used as a continuous ETL tool. There are also other vendors who have ETL tools (Pentaho have ETL tools that work with Hadoop). Once you have a solid ETL process, you should be able to push data into Hadoop whenever you want.
The problem is really a problem of designing the data motion process well. I hope that helps. If it doesn’t, hit me up again and we can talk more.
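As a rough illustration of the “pull recently modified data” step, here’s a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for your RDBMS, and the `orders` table, its `modified_at` column, and the `extract_modified_since` function are all made up for the example. Loading the extracted batch into Hadoop is not shown.

```python
import sqlite3


def extract_modified_since(conn, last_seen):
    """Return rows changed after `last_seen`, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, amount, modified_at FROM orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,),
    ).fetchall()
    # If nothing changed, keep the old watermark for the next run.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


# Stand-in source database with a hypothetical `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, modified_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 9.99, "2010-10-18T23:00:00"),
        (2, 4.50, "2010-10-19T01:00:00"),
        (3, 12.00, "2010-10-19T02:30:00"),
    ],
)

# Nightly run: only rows modified since the previous run's watermark come back.
batch, watermark = extract_modified_since(conn, "2010-10-19T00:00:00")
print(batch)
print(watermark)
```

Each scheduled run stores the returned watermark and passes it in next time, so only new or changed rows move to the Hadoop side.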
I think it’s a good idea to contain large, non-structured (or at least non-relational) datasets outside of SQL Server’s MDF/LDF file system. This also goes for vertical key-value datasets that reach hundreds of millions or billions of records in size. There comes a point where attempting to optimize non-relational data within an RDBMS becomes a lost cause. In lieu of Microsoft developing their own key-value database engine (similar in concept to their existing Text Search, FileStream, or Analysis Services engines), I think leveraging a 3rd party platform like Cassandra makes practical sense. It helps that Microsoft is now developing and releasing their own Hadoop distributions.
Dotcloud is a ‘cloud solution provider’ that offers Hadoop now (among other things). https://www.dotcloud.com/
(I don’t work for dotcloud, just discovered and started working with it recently).
I got interested in Hadoop after I heard in the keynote at SQL PASS Summit that Microsoft is now involved. Your article has made things clear for me and has given me a starting point from which I know I have to dive deeper to gain more knowledge. Thanks for putting this article together; I hope to get more insight like this from you.
I’ve been a mainframe programmer for close to 10 years and have very good knowledge of designing and maintaining legacy applications. I have very little knowledge of Java, C, or any other OO programming language, but I have done a decent amount of shell scripting. I have some experience doing data migration from mainframe datasets to open systems. Of late I’ve been thinking of pursuing a career in Big Data technologies. It would be great if someone could advise me on how to get started and move along this track.
I recommend hitting up local user groups and finding a good community to provide support and advice. You can find a lot of good info on meetup.com