Recently I've been playing with genealogical data. I came across a sheet of paper from my mom that lists my direct ancestors going back five or so generations. I downloaded GenealogyJ and entered all 50 or so names.
GenealogyJ saves files in GEDCOM format, which is a common file format for exchange of family history data. It's a structured text file format, which made it relatively easy to throw together a program to convert it to XML (which is much easier to work with).
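To give a flavour of what that conversion involves, here is a minimal sketch (not my actual converter). It assumes each GEDCOM line has the form `LEVEL [@ID@] TAG [VALUE]` and uses the level number to decide nesting. The `gedcom_to_xml` name, the `id`/`ref`/`value` attribute names, and the sample record are all made up for illustration, and it ignores details such as CONC/CONT continuation lines and character encodings.

```python
import re
import xml.etree.ElementTree as ET

# LEVEL, optional @XREF@, TAG, optional VALUE
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")

# Made-up sample record, for illustration only.
SAMPLE = """\
0 HEAD
0 @I1@ INDI
1 NAME Jane /Example/
1 BIRT
2 DATE 1900
1 FAMC @F1@
0 TRLR
"""

def gedcom_to_xml(text):
    """Turn GEDCOM text into a nested ElementTree, one element per line."""
    root = ET.Element("gedcom")
    stack = [root]  # stack[n + 1] is the currently open element at level n
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue  # skip blank or malformed lines
        level, xref, tag, value = match.groups()
        del stack[int(level) + 1:]  # close everything at this level or deeper
        elem = ET.SubElement(stack[-1], tag)
        if xref:
            elem.set("id", xref.strip("@"))
        if value:
            if value.startswith("@") and value.endswith("@"):
                elem.set("ref", value.strip("@"))  # pointer to another record
            else:
                elem.set("value", value)
        stack.append(elem)
    return root

if __name__ == "__main__":
    print(ET.tostring(gedcom_to_xml(SAMPLE), encoding="unicode"))
```

Putting values in attributes rather than element text is just one choice; it keeps the output easy to query from XSLT with expressions like `INDI/BIRT/DATE/@value`.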
Once I had the data in XML format, I wrote some XSLT to generate a pile of web pages that describe each individual. For an example, see the page for Thomas Hewgill (1852-1955). Each page has links to parents, siblings, spouse(s), and children, as well as any biographical information on the individual.
My grandfather has put a ton of work into collecting family history data for his side of the family. Initially, this data was collected on individual handwritten sheets in a big binder. At some point, one of my relatives transcribed it into an Excel spreadsheet, and it was eventually imported into Brother's Keeper. Brother's Keeper can export to GEDCOM format, and my dad happened to have a copy of that data sitting around. It appears to have been last updated in 1997, so he's going to see if he can get a more recent copy.
Anyway, the new file I have has approximately the following stats:
- People: 2229
- Surnames: 572
- Alive: 1343
- Dead: 428
- Unknown or probably dead: 458
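Counts like these can be produced with a few lines of Python. The sketch below is only an illustration: the record layout (dicts with `surname`, `birth`, and `death` keys), the `classify` and `summarize` names, and the 100-year cutoff for "probably dead" are all assumptions, not necessarily what I used.

```python
from collections import Counter

ASSUMED_MAX_AGE = 100  # hypothetical cutoff for "probably dead"

def classify(person, current_year):
    """Bucket one record as dead, alive, or unknown/probably dead."""
    if person.get("death") is not None:
        return "dead"
    birth = person.get("birth")
    if birth is None or current_year - birth > ASSUMED_MAX_AGE:
        return "unknown or probably dead"
    return "alive"

def summarize(people, current_year):
    """Count people, distinct surnames, and each status bucket."""
    counts = Counter(classify(p, current_year) for p in people)
    counts["people"] = len(people)
    counts["surnames"] = len({p["surname"] for p in people})
    return counts
```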
The oldest record in the file is a Daniel Hewgill (1751-1824). This file covers only my dad's side of the family, so I will have to merge in the data I have for my mom's side. Also, the GEDCOM file didn't appear to have any information on individuals beyond birth year, marriage year, death year, and family relationships. The data file in Brother's Keeper format might have more detailed information. I'll have to investigate that and, if so, figure out how to extract it.
My web presentation for this is currently very basic, and I expect to improve it over time. One idea I have is to create a big virtual family tree image, and then write a viewer that works sort of like Google Maps, where you can scroll around and zoom in and out.
I also wrote a program that can identify the name of the specific family relationship between two people anywhere in the tree. I used the algorithm described here, which was pretty easy to implement. With that, I can quickly find out that some person X is my fourth cousin once removed or whatever.
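For the curious, one common way to do this is to find the nearest common ancestor and count generations on each side. The sketch below takes that approach; it is not necessarily the same as the algorithm I linked to, and the `parents` mapping (person to list of parents) and the function names are invented for the example. It uses gender-neutral labels and does not distinguish half-siblings or relations by marriage.

```python
from collections import deque

ORDINAL_WORDS = ["first", "second", "third", "fourth", "fifth",
                 "sixth", "seventh", "eighth", "ninth", "tenth"]

def ordinal(n):
    """Spell out small ordinals; fall back to digits for larger ones."""
    if 1 <= n <= len(ORDINAL_WORDS):
        return ORDINAL_WORDS[n - 1]
    suffix = "th" if 10 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def ancestor_depths(person, parents):
    """Map person and each known ancestor to generations above person."""
    depths = {person: 0}
    queue = deque([person])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in depths:
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths

def direct_line(base, n):
    """n=1 -> base, n=2 -> grand<base>, n=3 -> great-grand<base>, ..."""
    return base if n == 1 else "great-" * (n - 2) + "grand" + base

def relationship(me, other, parents):
    """Describe what `other` is to `me`, e.g. 'first cousin once removed'."""
    mine = ancestor_depths(me, parents)
    theirs = ancestor_depths(other, parents)
    common = set(mine) & set(theirs)
    if not common:
        return "no known relation"
    nearest = min(common, key=lambda p: mine[p] + theirs[p])
    up, down = mine[nearest], theirs[nearest]  # generations on each side
    if up == 0 and down == 0:
        return "self"
    if up == 0:
        return direct_line("child", down)
    if down == 0:
        return direct_line("parent", up)
    if up == 1 and down == 1:
        return "sibling"
    if down == 1:
        return "great-" * (up - 2) + "aunt/uncle"
    if up == 1:
        return "great-" * (down - 2) + "niece/nephew"
    removed = abs(up - down)
    suffix = {0: "", 1: " once removed", 2: " twice removed"}.get(
        removed, f" {removed} times removed")
    return f"{ordinal(min(up, down) - 1)} cousin{suffix}"
```

For example, with `parents = {"me": ["dad"], "dad": ["grandpa"], "uncle": ["grandpa"], "cousin": ["uncle"]}`, calling `relationship("me", "cousin", parents)` gives `"first cousin"`.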
At some point I would like to make the data I've got available on the web, at least for those people who are no longer living. Unfortunately, excluding data for living people will make the rest of the data extremely fragmented. If I do include living people, I'm not sure what the right level of detail is. Even names alone are a potential problem, because "mother's maiden name" is used so often as an identity authenticator. Any thoughts?
Finally, if anybody would like to get a copy of the code I'm working on here, let me know. It's in Python and XSLT.