How to generate a static version of a website using nanoc and nokogiri
More than a year has passed since we organized EuRuKo 2009 in Barcelona, so we thought the time had come to shut down the Ruby on Rails web applications we used for its website (a custom registration app and a simplelog).
We wanted to maintain an archive for them, but it had to be so simple that it required zero maintenance. So we thought, hey, let's just make a static HTML version of the whole site.
Static, but not too static
I considered several techniques, like activating full page caching on the Rails applications and grabbing the generated files, or using wget in recursive mode to spider the site, but none of them satisfied me. Either one would do more or less what we wanted, but we would end up with hundreds of static HTML files that would be a nightmare to postprocess and adapt for the final static version.
So after thinking about it for a while, I reached for my Ruby toolbox, picked a few tools and built a custom quick-and-dirty script standing on the shoulders of giants like nanoc and nokogiri.
What is nanoc?
"nanoc is a tool that runs on your local computer and compiles documents written in formats such as Markdown, Textile, Haml… into a static web site consisting of simple HTML files, ready for uploading to any web server".
Yeah, that's what it is. You have a folder with the content of each section, and a folder with the layouts you want to apply, using erb, just like you would do on a Ruby on Rails site. nanoc runs through all the content files and applies the layout to them, generating an output folder with all your HTML ready to upload to the server.
You're also able to define custom rules and use helpers and filters for processing the content. It's a really simple and awesome tool, perfect for this task.
So, I chose nanoc. Adapting the layout from our Rails apps was a piece of cake. I had almost the same layout with its <%= yield %> line in place, and I also copied the assets (CSS, JavaScript and images). Now I just needed to define the contents of each section. But we had hundreds of them!
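To show what that <%= yield %> line does, here is a minimal sketch that uses Ruby's standard ERB library instead of nanoc itself. The layout markup, the render_with_layout method and the page title are made up for the illustration; nanoc does the equivalent work for you when it applies a layout to an item.

```ruby
require "erb"

# A made-up layout, similar in spirit to a Rails or nanoc ERB layout.
LAYOUT = <<~HTML
  <html>
    <head><title><%= title %></title></head>
    <body>
      <div id="content">
        <%= yield %>
      </div>
    </body>
  </html>
HTML

# Hypothetical helper: renders the layout and drops the block's return
# value wherever the layout says <%= yield %>.
def render_with_layout(layout, title)
  ERB.new(layout).result(binding)
end

page = render_with_layout(LAYOUT, "EuRuKo 2009") { "<p>Welcome!</p>" }
puts page
```

The section content goes in as the block, and the layout decides where it lands.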
We need a spider robot here
Going manually through all of the sections on the original site and copy-pasting their contents was certainly not an option. I'd rather sit on a slowly rotating swordfish.
This was a job for one of those efficient and cold-blooded spider robots. Here was the plan: have a spider robot visit every section of the original site and grab its title and the contents of the DIV with id="content". Then tell nanoc to create an item for it, and put the grabbed content in the nanoc item along with its title attribute. Once that was done for the whole site, we would just run "nanoc compile" and the whole static site would be generated. Easy, isn't it?
Sitemaps for lazy spiders
The poor little spider robot looked at me and said "hey, it's summer here, I'm lazy, I could certainly find my way through your website but... know what? We spider robots are quite tired of doing all this over and over again! I too would rather sit on a slowly rotating swordfish than spider all of your links, only to find that some of them had to be excluded. Why don't you give me a little sitemap and I'll know exactly what you want me to visit?"
So I said "hey, excuse me, you're right, I wouldn't want to waste your precious robot time, let's build this sitemap you want". And I emailed Fernando Guillén, who had built the original app, and I said to him, "hey, can you build a sitemap of the site for this little grumpy spider?". And in a couple of minutes he had it ready.
Enter nokogiri
So, if you're still reading this, at this point of the story the plan is to read a sitemap XML file with all the URLs we want to grab and, for each one of them, visit it and scrape its title and the contents of a given DIV. What could we use for this? Right! Nokogiri!
"Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors."
Nokogiri can process XML files, as well as HTML files. It can also process a file on a server, or a local file. Perfect for the job.
The script
So I wrote this quick-and-dirty script. ("hey, stop calling me quick and dirty!" -- said the spider. But that's what it is, a quick, dirty and hairy spider robot.)
To create a new spider you need to pass it these three parameters:
- sitemap, which can be a local XML file or a remote XML file on a server
- root_url, which is a string indicating the original root URL
- extract_id, which is a string with the id of the DOM element whose contents should be grabbed
Then you just call its create_nanoc_items method, and the spider will go berserk, process the XML and create all the sections inside your nanoc site.
Helpers and filters
nanoc lets you use helpers, just like you're used to in a Rails application. I used two of nanoc's built-in helpers, one to include partials and another to be able to use link_to, and defined a couple of my own.
You can also use filters to process your content. I defined a custom filter to clean the grabbed HTML a bit, hiding an annoying div and rewriting some links.
Rack attack!
So I finally ran the spider, and it fetched the site and prepared all those nanoc items for me. Thanks! Then I just ran "nanoc compile" to generate the final site, and the generated static site was there, waiting peacefully in its output folder. Time to upload it to the new server!
It would have been too easy to just upload it to a normal web server, so we thought it would be hackier to put it on Heroku. We love it and it's as simple as it can get. Yes, you can have static sites on Heroku too, as long as you define some basic rack configuration for its Thin web server.
I then remembered that Raúl Murciano had given a great talk (with a pair of videos available) about Rack at Conferencia Rails 2009, so I emailed him for help, and he soon came up with this config.ru file. I threw it in, along with the gem manifest file that Heroku asks for, pushed it to the server and... boom! There it was, alive and kicking, but completely static: the new EuRuKo 2009 archive.
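Raúl's file is not reproduced here, so this is not his config.ru, just a minimal sketch of the idea behind serving a static folder through Rack. A Rack application is any object with a call method that takes the request environment and returns a status, headers and body. The StaticSite class name, the MIME table and the "output" folder are assumptions for the illustration, and it skips details such as URL decoding and caching headers.

```ruby
# Sketch of a tiny Rack application that serves files from a folder.
class StaticSite
  MIME_TYPES = {
    ".html" => "text/html",
    ".css"  => "text/css",
    ".js"   => "application/javascript",
    ".png"  => "image/png",
    ".jpg"  => "image/jpeg",
    ".gif"  => "image/gif"
  }.freeze

  def initialize(root)
    @root = File.expand_path(root)
  end

  # Rack entry point: env["PATH_INFO"] holds the requested path.
  def call(env)
    path = File.expand_path(File.join(@root, env["PATH_INFO"].to_s))
    # Directory requests are answered with the index.html inside them.
    path = File.join(path, "index.html") if File.directory?(path)

    # Refuse anything that resolves outside the site folder.
    inside_root = path.start_with?(@root + File::SEPARATOR)
    return not_found unless inside_root && File.file?(path)

    type = MIME_TYPES.fetch(File.extname(path), "application/octet-stream")
    [200, { "content-type" => type }, [File.binread(path)]]
  end

  private

  def not_found
    [404, { "content-type" => "text/plain" }, ["Not found"]]
  end
end

# In a config.ru file, the application would be started with:
#   run StaticSite.new("output")
```

Rack also ships ready-made middleware for static files, which is probably the shorter route in a real config.ru. The hand-written class is here only to show what happens on each request.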