NoSQL Summer Reading List

For those of you who aren’t as much into reading up on different types of databases, there’s an interesting summer reading list going on right now over at A NoSQL Summer. Unfortunately, I’m not lucky enough to live in a town with a NoSQL Summer group (not that I know of, at least) and I’ve had too much on my plate to start one up. But I still wanted to read all of the papers. What’s a poor guy to do?

Instead of clicking through a bunch of web pages and downloading the PDFs by hand, I decided to automate the process with a tiny program. I turned to my favorite rapid-fire language, Ruby, and fired off a quick script to parse the web pages and grab the content I was looking for.


#!/usr/bin/ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'net/http'

# path to the target directory, you'll probably want to change this...
# unless your account is named 'jeremiah'
base_folder = "/Users/jeremiah/Desktop/NoSQL"

# open up the list of papers
doc = open('http://nosqlsummer.org/papers') { |f| Hpricot(f) }

# find all of the links to each paper and loop through them
doc.search("//div[@class='o-papers on']/a").each do |link|
  # ignore the closing tags. 
  # there's probably a better way to do this, 
  # but I wrote this in 15 minutes at 11:30 at night
  next unless link.is_a? Hpricot::Elem

  paper_doc = open("http://nosqlsummer.org/#{link.attributes['href']}") { |f| Hpricot(f) }
  
  # get the necessary elements to build our document name for saving
  difficulty = paper_doc.at("h4[@class*='difficulty']")['class'][-1,1]
  title = (paper_doc/"div[@class='o-paper on']/h1").inner_text 
  download_link = paper_doc.at("a[@class='download']")['href']
  
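  begin
    # try to save
    puts "Attempting to download #{title} from #{download_link}..."
    # fetch first so a failed download doesn't leave an empty file behind
    pdf_data = open(download_link).read
    # the block form closes the file even if the write fails
    File.open("#{base_folder}/#{difficulty}_#{title}.pdf", "wb") { |f| f.write(pdf_data) }
  rescue StandardError => e
    # StandardError, not Exception, so Ctrl-C still stops the script
    puts "  *** v^v^v^ error ^v^v^v *** #{e.message}"
  end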
end

This script very neatly downloads everything to the directory of your choosing (change base_folder to point at it first). It also thoughtfully names the files with their difficulty rating as the first character so you can sort them ASCII-betically and make a halfway decent list to help you learn your way into NoSQL nerdery.
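One thing the naming scheme doesn't guard against is a paper title containing characters that are awkward in file names, such as a slash or a colon. Here's a minimal sketch of how the name could be built more defensively. The paper_filename helper and the sample difficulty values are my own illustration, not part of the script above, and the list of characters to strip is an assumption.

```ruby
# Hypothetical helper, not part of the original script.
# Builds "<difficulty>_<title>.pdf" with awkward characters removed.
def paper_filename(difficulty, title)
  # Swap characters that are illegal or awkward in file names for a space,
  # then trim the ends and collapse runs of whitespace.
  safe_title = title.gsub(%r{[/\\:*?"<>|]}, " ").strip.gsub(/\s+/, " ")
  "#{difficulty}_#{safe_title}.pdf"
end

# The difficulty values here are made up for illustration.
names = [
  paper_filename("3", "The Graph Traversal Pattern"),
  paper_filename("1", "Key/Value: Stores")
]

# A plain sort puts the lowest difficulty first because the rating
# is the first character of every name.
puts names.sort
```

In the script, this would replace the string interpolation passed to the file open call, with base_folder joined in front of it.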

There’s only one problem. One of the papers, the graph traversal paper, won’t download for some reason. The ACM server returns an HTTP access denied error code. To get around this you can either download it with your browser, or you can go ahead and use the copy that I’ve provided – The Graph Traversal Pattern.
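If you'd rather see exactly what the server said than a generic error banner, open-uri raises OpenURI::HTTPError for refused requests, and the status is available on the error's io object. The sketch below is one way to surface it. The fetch_pdf and http_status helpers are hypothetical names of mine, and it uses URI.open, which assumes Ruby 2.5 or newer (the script above uses the older bare open form).

```ruby
require 'open-uri'

# Hypothetical helper: turns an OpenURI::HTTPError into "code message",
# e.g. the status line the server sent back with its refusal.
def http_status(error)
  error.io.status.join(" ")
end

# Hypothetical helper: downloads url to path.
# Returns :ok on success, or the HTTP status string if the server refused.
def fetch_pdf(url, path)
  data = URI.open(url, "rb", &:read)
  File.open(path, "wb") { |f| f.write(data) }
  :ok
rescue OpenURI::HTTPError => e
  http_status(e)
end

# Usage (commented out so nothing hits the network when this file loads):
#   result = fetch_pdf(download_link, "/tmp/paper.pdf")
#   puts "Skipped: server said #{result}" unless result == :ok
```

This only reports the refusal. It doesn't get around it, so the browser download is still the fix for that one paper.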

Enjoy!
