This is a follow-up to the last two posts I made about querying HBase and Hive. The setup for this was a bit trickier than I would have liked, so I’m documenting my entire process for three reasons: to remind myself for the next time I have to do this, to help someone else get started, and in the hope that someone will know a better way and help me improve this.
I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance. This 7-minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive. https://www.youtube.com/watch?v=iWzvBWd9K-M
It’s my birthday today, so I’m going to subject you to more things that I’ve been reading recently. William Gibson says the future is right here, right now – I couldn’t agree with this more. Time gets faster because we’re constantly connected to the world. The Humble Pencil, The Mighty Computer – I love using pencil and paper to work through things. In fact, I don’t understand people who can draw, sketch, or brainstorm straight into a piece of software.
When I first started blogging, I was nervous because I might be wrong. Not just “wrong” like “Gee Ted, I don’t know if I agree with your opinion about using Irish children as forced labor,” but factually wrong. And if I was wrong on the internet, people would see it forever and ever. Well, Google would know about it forever and ever. A friend of mine advised me that I would certainly be wrong on the internet, just like I’ve been fantastically wrong in real life, and that the only thing you can do is get up, admit you were wrong, and keep going again.
Update: I want to thank Ben Black, Todd Lipcon, and Kelley Reynolds for pointing out the inaccuracies in the original post. I’ve gone through the references they provided and made corrections. Facebook, the data giant of the internet, recently unveiled a new messaging system. To be fair, I would normally ignore feature rollouts and marketing flimflam, but Facebook’s announcement is worthy of attention not only because of the underlying database but also because of how the move was done.
Understanding checkpoint_completion_target Ever wonder what’s REALLY going on under the hood of my favorite database? I do. Hubert Depesz is writing a series of articles about PostgreSQL configuration parameters. This one goes into incredible detail about how PostgreSQL handles checkpointing and keeps the database performing at high speeds. Riak SmartMachine Benchmark: The Technical Details How much data can you push for not a lot of money? It turns out that the answer is “A lot of data.
We all know that you can use NoSQL databases to store data. And that’s cool, right? After all, NoSQL databases can be massively distributed, are redundant, and really, really fast. But some of the things that make NoSQL databases really interesting aren’t just the redundancy, performance, or their ability to use all of those old servers in the closet. Under the covers, NoSQL databases are supported by complex code that makes these features possible – things like distributed file systems.
MongoDB has replication built in. So do SQL Server, Oracle, DB2, PostgreSQL, and MySQL. What’s the difference? What makes MongoDB a unique and special snowflake? I recently read a three-part series on MongoDB replication (Replication Internals, Getting to Know Your Oplog, Bending the Oplog to Your Will) in an effort to better understand MongoDB’s replication compared to SQL Server’s replication.
Logging Sidebar Before we get started, it’s important to distinguish between the oplog and MongoDB’s regular log.
A client recently asked me for help with their SQL Server environment. It seems that replication was running slowly and was getting further and further behind – replication had been turned off during heavy data modification and was turned on after several days.
Protip: This is why it’s important to have a full checklist for everything that you do on a server.
Check Everyone’s Health When you have a complicated system you want to take a look at everything, not just the symptoms of the problem.
While perusing Twitter, I saw that Google has open sourced Sawzall, one of their internal tools for data processing. WTF does this mean?
Sawzall, WTF? For Data Analytics or automotive modification, you will find no finer tool. Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall? Sawzall is a procedural language for analyzing excessively large data sets.