For those of you who aren’t that into reading up on different types of databases, there’s an interesting summer reading list going on right now over at A NoSQL Summer. Unfortunately, I’m not lucky enough to live in a town with a NoSQL Summer group (not that I know of, at least) and I’ve had too much on my plate to start one up. But I still wanted to read all of the papers. What’s a poor guy to do?
Instead of navigating a bunch of web pages and downloading PDFs by hand, I decided to automate the process with a tiny program. I turned to my favorite rapid-fire language, Ruby, and fired off a quick script to parse the web pages and get me the content that I was looking for.
#!/usr/bin/ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'

# path to the target directory, you'll probably want to change this...
# unless your account is named 'jeremiah'
base_folder = "/Users/jeremiah/Desktop/NoSQL"

# open up the list of papers
doc = open('http://nosqlsummer.org/papers') { |f| Hpricot(f) }

# find all of the links to each paper and loop through them
doc.search("//div[@class='o-papers on']/a").each do |link|
  # ignore the closing tags.
  # there's probably a better way to do this,
  # but I wrote this in 15 minutes at 11:30 at night
  next unless link.is_a? Hpricot::Elem

  paper_doc = open("http://nosqlsummer.org/#{link.attributes['href']}") { |f| Hpricot(f) }

  # get the necessary elements to build our document name for saving
  # (the difficulty is the last character of the h4's class attribute)
  difficulty = paper_doc.at("h4[@class*='difficulty']")['class'][-1,1]
  title = (paper_doc/"div[@class='o-paper on']/h1").inner_text
  download_link = paper_doc.at("a[@class='download']")['href']

  begin
    # try to save; fetch first so a failed download doesn't leave an empty file
    puts "Attempting to download #{title} from #{download_link}..."
    pdf = open(download_link).read
    File.open("#{base_folder}/#{difficulty}_#{title}.pdf", "wb") { |out| out.write(pdf) }
  rescue StandardError => e
    # report the failure and move on to the next paper
    puts " *** v^v^v^ error ^v^v^v *** #{e.message}"
  end
end
This script very neatly downloads everything to the directory of your choosing (change the directory name). It also thoughtfully names the files with their difficulty rating as the first character so you can sort them ASCII-betically and make a halfway decent list to help you learn your way into NoSQL nerdery.
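To make the naming scheme concrete, here is a minimal sketch of how the file names could be built. The paper_filename helper is hypothetical (it is not part of the script above), the difficulty values and two of the titles are made up, and the character replacement is an extra safety step I'm assuming would be useful for titles containing slashes or colons.

```ruby
# Hypothetical helper: builds a "<difficulty>_<title>.pdf" file name,
# replacing characters that are awkward in file names and tidying whitespace.
def paper_filename(difficulty, title)
  safe_title = title.strip.gsub(%r{[/\\:*?"<>|]}, '-').gsub(/\s+/, ' ')
  "#{difficulty}_#{safe_title}.pdf"
end

# Made-up difficulty ratings and (mostly) made-up titles, for illustration only.
names = [
  paper_filename('3', 'The Graph Traversal Pattern'),
  paper_filename('1', 'Example: A Made-Up Title'),
  paper_filename('2', "  Another/Made-Up   Title\n")
]

# Because the difficulty comes first, a plain sort groups papers by rating.
puts names.sort
```

Since the rating is the first character, any tool that sorts by name (a file browser, ls, or Array#sort as above) lists the easier papers first.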
There’s only one problem. One of the papers, the graph traversal paper, won’t download for some reason. The ACM server returns an HTTP access denied error code. To get around this you can either download it with your browser, or you can go ahead and use the copy that I’ve provided – The Graph Traversal Pattern.
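If you'd rather have the script tell you exactly which papers to grab by hand, one option is to collect the failures instead of only printing an error banner. This is a rough sketch, not what the script above does: download_all, the fetcher argument, and the example.com URLs are all hypothetical, and the fake fetcher simply simulates a server that refuses one request.

```ruby
# Hypothetical sketch: `fetcher` is any callable that takes a URL and returns
# the file's contents, raising an error when the download fails.
# Returns a list describing every paper that could not be fetched.
def download_all(papers, fetcher)
  failed = []
  papers.each do |paper|
    begin
      data = fetcher.call(paper[:url])
      # hand the downloaded bytes to the caller (e.g. to write them to disk)
      yield paper, data if block_given?
    rescue StandardError => e
      failed << { :title => paper[:title], :url => paper[:url], :error => e.message }
    end
  end
  failed
end

# A stand-in fetcher that pretends one server denies access (no network used).
fake_fetcher = lambda do |url|
  raise "access denied" if url.include?('denied')
  "%PDF-fake"
end

papers = [
  { :title => 'Paper A', :url => 'http://example.com/a.pdf' },
  { :title => 'Paper B', :url => 'http://example.com/denied.pdf' }
]

saved = []
failed = download_all(papers, fake_fetcher) { |paper, _data| saved << paper[:title] }
failed.each { |f| puts "Download by hand: #{f[:title]} (#{f[:error]})" }
```

Passing the fetcher in as an argument keeps the loop easy to try out without hitting the network; in the real script it would wrap whatever call actually downloads the PDF.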
Enjoy!
Comments
Thanks! Worked like a charm.
Glad to hear that it worked well for someone other than me. It was a lot of fun to write, glad you got some use out of it.
Thanks! Worked for me too