Inspired by Git's content-indexed repository storage format, I've designed a new backup program that solves a few problems that have been bugging me. First, let me rewind a bit.
I still vividly remember the first time I accidentally overwrote an important source file. I was about 14 years old, writing an Apple ][ word processor for visually impaired people in 6502 assembly language. The program was too big to edit as a single source file, so I split it into two halves, and one day overwrote my only copy of the second half with the first half! I managed to mostly recover it by using a disassembler, but I learned an important lesson that day. It didn't properly sink in though, because I learned that very same lesson no fewer than three more times after that (maybe I'll tell those stories some day too).
I've tried many different backup solutions over the years. I've owned at least three different types of tape drives (for which I may still have tapes, but not the drives), stacks of burned CDs, several external hard drives, etc. More recently, I've tried rsync, rdiff-backup, duplicity, and my own implementation of a program that mirrors files to Amazon S3.
Each of these solutions, while sophisticated and efficient in their own ways, has one important disadvantage for me: The same file that happens to reside on more than one computer gets stored more than once in the backups. If I move stuff around, between computers or even between directories, this causes a fair bit of noise in the backup volume. The set of bits is the same, why store them more than once?
My current idea is to store file data based on the SHA-1 hash of the file contents. There is an area called the "blob storage" whose only purpose is to store piles of bits named after their hash value. For example, the file named 0dc6832636f48e24a1423d02175e6023d76810bd might contain a valuable source file. The other storage area contains associations between filenames and blobs, for example:
With this arrangement, if I moved the hello.* files from src/hello.* to src/hello/hello.*, the blobs would not need to change, only the references. Obviously this is of greater benefit for larger files. And, if I have the same files on a different computer, and I backed up that other computer to the same storage area, the blobs would still remain the same and the other computer's references file would point to them too. A couple of auxiliary commands are needed to (a) check the integrity of the whole storage area, to ensure that all referenced blobs exist, and (b) to clean out blobs that are no longer referenced.
I'm sure this isn't a new idea, for example I know that corporate email systems will generally store only one copy of an attachment in their data store, no matter how many recipients there are. When the last recipient deletes the message, only then is the attachment data deleted. This is just that same idea applied to backups.
There is room for improvement, too. Currently I compress all the data before it's put in the blob storage, and I'm going to add an encryption option. A significant improvement might come from the ability to create a new type of blob, which doesn't contain the whole file but only the differences from some other blob. Not sure whether I've got all the details sorted for that yet though. Come to think of it, it might be possible to use the librsync algorithm in combination with an Amazon EC2 instance to efficiently compute the differences.
I have started to implement this idea in a program called shaback, which could be either represent "SHAred BACKup", or "SHA-1 BACKup". It is implemented in Python, depends on my s3lib library, and uses Amazon S3 as the storage backend. I've published what I've got so far on github - please clone and hack on it if you feel so inclined.