Greg Hewgill (ghewgill) wrote,

stack overflow questions in ebook format

In my quest to create useful reference ebooks for the Kindle, I've created ebooks of the top-rated programming questions on Stack Overflow for each of the top 20 tags.

Stack Overflow ebooks

These files are created from the monthly Stack Overflow Creative Commons data dump. I've got a combination of Python, Java, and XSLT scripts that process the raw XML database dumps into something usable. Then the Amazon kindlegen program creates the ebook file (in Mobipocket format).

Last year I got a new computer with a fast CPU and lots of memory and disk space. While working on the XML processor, I realised that I was doing a lot of work seeking around a big XML file (it's over 4 GB) and collecting questions and answers together. This was taking quite some time because of the sheer size of the files. Since I am running a 64-bit OS (FreeBSD 8 amd64), I memory-mapped the entire 4 GB XML file and then didn't have to think about seeking anymore. Letting the OS manage the caching is a much better approach, and the improved performance really shows.
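For anyone curious what the memory mapping looks like, here is a minimal Python sketch. It is not the code from my scripts: the helper names (open_mapped, index_rows, read_row) are made up for illustration, and it assumes each record in the dump starts with "<row " and sits on its own line.

```python
import mmap

def open_mapped(path):
    """Map the whole file read-only; the OS pages it in on demand."""
    with open(path, "rb") as f:
        # Length 0 means "map the entire file". The mapping stays valid
        # after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def index_rows(mm, marker=b"<row "):
    """Return the byte offset of every occurrence of marker (hypothetical helper)."""
    offsets = []
    pos = mm.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = mm.find(marker, pos + 1)
    return offsets

def read_row(mm, offset):
    """Slice one row out of the mapping, up to the end of its line."""
    end = mm.find(b"\n", offset)
    if end == -1:
        end = len(mm)
    return mm[offset:end]
```

Once the offsets are known, jumping to any question or answer is just a slice of the mapping, with no explicit seek or read calls. Mapping a file over 4 GB in one piece like this needs a 64-bit process.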

The preprocessing step (that needs to run once per data dump) creates all the HTML files for each question and its set of answers. I was originally storing all the files in one directory, but a million files in a single directory wasn't working very well. I ended up splitting the question number into groups of three digits, so 1234567.html is actually stored in 123/456/7.html. This step takes about two hours to run.
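The directory splitting can be sketched in a few lines of Python. The function name question_path is hypothetical, and the handling of ids shorter than four digits is my assumption here, since the only case described above is the seven-digit one.

```python
def question_path(question_id):
    """Split the id into groups of three digits, one directory level per group."""
    digits = str(question_id)
    groups = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # The last group becomes the file name; earlier groups are directories.
    return "/".join(groups) + ".html"
```

With this scheme question_path(1234567) gives 123/456/7.html, and no directory ever holds more than about a thousand subdirectories plus a thousand files.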

Creating each ebook file is then a single XSLT transformation (taking about a minute), plus the kindlegen step which can take several minutes depending on the number of questions. The performance of kindlegen isn't very impressive and appears to be O(n²) in the number of pages.

The source for all this is available on GitHub.