Stack Overflow ebooks
These files are created from the monthly Stack Overflow Creative Commons data dump. I've got a combination of Python, Java, and XSLT scripts that process the raw XML database dumps into something usable. Then the Amazon kindlegen program creates the ebook file (in Mobipocket format).
Last year I got a new computer with a fast CPU and lots of memory and disk space. While working on the XML processor, I realised that I was doing a lot of work seeking around a big XML file (it's over 4 GB) to collect questions and their answers together. This was taking quite some time because of the sheer size of the file. Since I am running a 64-bit OS (FreeBSD 8 amd64), I memory-mapped the entire 4 GB XML file and then didn't have to think about seeking anymore. Letting the OS manage the caching is a much better approach, and the improved performance really shows.
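As a rough illustration of the memory-mapping idea (not the actual script), here is a minimal Python sketch. The function name find_row is hypothetical, and the sketch assumes the dump stores one row element per line. Once the file is mapped, locating a record is just a byte search over the mapping, with no explicit seek or read calls.

```python
import mmap

def find_row(path, needle):
    """Return the full line containing `needle` (bytes), or None if absent.

    Assumption: the file holds one record per line, so the enclosing
    line is the whole record.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            if pos == -1:
                return None
            # rfind returns -1 when there is no earlier newline, giving 0.
            line_start = mm.rfind(b"\n", 0, pos) + 1
            line_end = mm.find(b"\n", pos)
            if line_end == -1:
                line_end = len(mm)
            # Slicing copies just this line out of the mapping as bytes.
            return mm[line_start:line_end]
```

Mapping a file larger than 4 GB like this needs a 64-bit process, which is why the 64-bit OS matters here.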
The preprocessing step (which needs to run once per data dump) creates an HTML file for each question and its set of answers. I was originally storing all the files in one directory, but a million files in a single directory wasn't working very well. I ended up splitting the question number into groups of three digits, so 1234567.html is actually stored in 123/456/7.html. This step takes about two hours to run.
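The directory scheme can be sketched in a few lines of Python. The function name question_path is hypothetical, and how question numbers whose length is an exact multiple of three are handled is an assumption; only the 1234567 case comes from the description above.

```python
def question_path(question_id, group=3):
    """Split a question number into groups of `group` digits, left to right.

    All groups but the last become directories; the last group names the file.
    """
    digits = str(question_id)
    parts = [digits[i:i + group] for i in range(0, len(digits), group)]
    return "/".join(parts) + ".html"

print(question_path(1234567))  # 123/456/7.html
```

With three digits per level, no directory ends up with more than a thousand subdirectories or a thousand files named this way.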
Creating each ebook file is then a single XSLT transformation (taking about a minute), plus the kindlegen step, which can take several minutes depending on the number of questions. The performance of kindlegen isn't very impressive and appears to be O(n²) in the number of pages.
The source for all this is available on GitHub.