Greg Hewgill (ghewgill) wrote,

the abomination of mediawiki templates

The dict command line program is something I use pretty much daily. I always have a command prompt window open, and I frequently (perhaps compulsively) look up words with which I'm unfamiliar. dict uses the standard Dictionary Server Protocol to connect to a dictionary server and retrieve definitions. The default server has quite a number of different dictionaries available, of varying utility.
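For anyone who hasn't looked at the protocol itself, here is a rough Python sketch of the exchange behind a lookup, as I understand it from RFC 2229: the client sends a DEFINE command, and the server answers with a 150 status line, one 151 block per definition (each ended by a line containing only "."), and a final 250. The function names and the sample reply are made up for illustration, and this is not how the real dict client is written.

```python
import re

def define_command(database, word):
    """Build a DEFINE request line; the word is quoted so it may contain spaces."""
    return 'DEFINE {} "{}"\r\n'.format(database, word)

def parse_define_reply(raw):
    """Return (database, text) pairs from the text of a DEFINE reply."""
    lines = raw.split("\r\n")
    results = []
    i = 0
    while i < len(lines):
        # A definition header looks like: 151 "word" database "description"
        header = re.match(r'151 (?:"[^"]*"|\S+) (\S+)', lines[i])
        i += 1
        if not header:
            continue  # status lines such as 150 and 250 are skipped here
        body = []
        while i < len(lines) and lines[i] != ".":
            line = lines[i]
            # A body line that starts with "." is sent with the dot doubled.
            body.append(line[1:] if line.startswith("..") else line)
            i += 1
        i += 1  # step over the terminating "."
        results.append((header.group(1), "\n".join(body)))
    return results

# Hand-written sample reply, not captured from a real server.
SAMPLE_REPLY = (
    '150 1 definitions retrieved\r\n'
    '151 "danke" en-brief "English Wiktionary"\r\n'
    'thank you\r\n'
    '.\r\n'
    '250 ok\r\n'
)
```

A real client would also have to read the server's greeting and handle the "no match" case, both of which this sketch ignores.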

Enter Wiktionary. Wiktionary is a sister project to the better-known Wikipedia. Like Wikipedia, anybody can contribute and edit definitions for any word. The incredible thing about Wiktionary is that it is fully multilingual: it aims to provide definitions for every word in every language, written in every other language. So, you can look up definitions in English for thank you, danke, спасибо, or even ありがとう. No matter what your preferred language, ideally you will be able to use Wiktionary to look up anything found in print anywhere and find a definition in your chosen language. This is an enormously ambitious goal and I look forward to seeing it grow.

I want to use dict to look up words in Wiktionary. As nice as a web browser is, often I just want to use a simple command line program to do a quick lookup without all the extra fluff. So, the first thing I did was implement a DICT protocol server in Python. Next, I downloaded the entire English Wiktionary from Wikimedia Downloads, which gives me the raw wiki markup for each entry in one big XML file. Then I wrote a quick program to extract the entries from the XML and do some simple formatting of the Wiktionary pages. This is where things started to get complicated.
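The extraction step can be sketched roughly as follows. This is not the actual program, just a minimal illustration using the standard library's streaming XML parser, and it assumes the dump is a series of page elements that each contain a title and a text element. The names iter_entries and local_name are invented for the example.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def iter_entries(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title = None
    text = None
    # iterparse streams the file, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and text is not None:
                yield title, text
            title = text = None
            elem.clear()  # drop the finished page's children to save memory

# Tiny hand-written stand-in for the real dump file.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>danke</title>
    <revision><text># [[thank you]]</text></revision>
  </page>
</mediawiki>"""

entries = list(iter_entries(io.BytesIO(SAMPLE_DUMP)))
```

With a real dump, the BytesIO object would be replaced by the opened dump file.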

The MediaWiki software offers simple templates, which allow page authors to include common text and markup in articles. Wikipedia makes limited use of templates, but as I've discovered, Wiktionary uses them extensively and in decidedly nontrivial ways. For example, have a look at the template documentation for indicating English noun plural forms. It seems reasonably easy to use, but behind the scenes the source for the en-noun template is a nearly impenetrable forest of curly braces, wiki markup, HTML, and XML-like tags and comments. My program attempts to parse this.

The MediaWiki template language is an example of a domain-specific language. However, as languages go, it is not terribly well specified or documented. The Wikimedia Help:Template page seems to be the best documentation I can find, and it's chock full of contrived examples and pathological cases without even bothering to present a clear grammar. The lack of a grammar and the Byzantine expansion rules make this template language somewhat challenging to parse. This is a prime example of Greenspun's Tenth Rule, which states "Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

I'm not sure where to go from here. I've implemented what one would consider a normal recursive descent parser for the MediaWiki template language, but my implementation doesn't quite match up with the given examples in the weird corner cases. It seems to me that the only way to make a parser that works in the same way is to follow the vague instructions in the Template mechanism documentation. This means implementing an ad hoc parser, and it might even be necessary to borrow the implementation from MediaWiki itself. I really hadn't anticipated taking this that far.
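To make "a normal recursive descent parser" concrete, here is a minimal sketch of the idea in Python. It is not my actual implementation. It assumes a simplified grammar in which a template call is "{{", a name, any number of "|"-separated arguments, and "}}", with arguments allowed to contain further calls. The Template and Parser names are invented for the example. It deliberately ignores triple-brace parameters, parser functions, comments, and nowiki sections, so it is exactly the kind of tidy parser that disagrees with MediaWiki in the corner cases.

```python
from dataclasses import dataclass, field

@dataclass
class Template:
    name: str
    args: list = field(default_factory=list)  # each argument is a list of nodes

class Parser:
    def __init__(self, src):
        self.src = src
        self.pos = 0

    def parse(self):
        """Parse the whole input into a list of str and Template nodes."""
        return self.parse_nodes(stop=())

    def parse_nodes(self, stop):
        """Collect nodes until the input ends or a `stop` token comes next."""
        nodes, buf = [], []
        while self.pos < len(self.src):
            if any(self.src.startswith(tok, self.pos) for tok in stop):
                break
            if self.src.startswith("{{", self.pos):
                if buf:
                    nodes.append("".join(buf))
                    buf = []
                nodes.append(self.parse_template())
            elif self.src.startswith("[[", self.pos):
                # Copy a wiki link through whole so its "|" is not
                # mistaken for a template argument separator.
                end = self.src.find("]]", self.pos)
                end = len(self.src) if end < 0 else end + 2
                buf.append(self.src[self.pos:end])
                self.pos = end
            else:
                buf.append(self.src[self.pos])
                self.pos += 1
        if buf:
            nodes.append("".join(buf))
        return nodes

    def parse_template(self):
        self.pos += 2  # consume "{{"
        parts = [self.parse_nodes(stop=("|", "}}"))]
        while self.src.startswith("|", self.pos):
            self.pos += 1  # consume "|"
            parts.append(self.parse_nodes(stop=("|", "}}")))
        if self.src.startswith("}}", self.pos):
            self.pos += 2  # consume "}}"; an unclosed call just runs to the end
        name = "".join(n for n in parts[0] if isinstance(n, str)).strip()
        return Template(name, parts[1:])

tree = Parser("a {{en-noun|es|pl2={{x}}}} b").parse()
```

Here tree is a list of three nodes: the text "a ", a Template named en-noun whose second argument contains a nested Template named x, and the text " b". Feeding it something like {{{1}}} shows the problem: the sketch happily reads it as a template named "{1" followed by a stray brace, which is not what MediaWiki does.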

An appealing option is to try to avoid evaluating templates at all, and offer only the definitions of words without all the extra etymology and inflection information. I had already planned to offer several different views of the database: raw wiki markup; full formatted page; normal view without translations; and a brief view with just definitions. Perhaps I should start from the brief view and work my way up.
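A brief view along these lines might look something like the sketch below. It assumes the usual Wiktionary convention that definitions are numbered list lines starting with "#", while lines starting with "#:" or "#*" are usage examples and quotations. Templates are simply deleted rather than expanded, and links are reduced to their display text. The function names and the sample entry are made up for illustration.

```python
import re

TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")               # an innermost {{...}} call
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")   # [[target|label]] or [[target]]
DEFINITION = re.compile(r"#+(?![#:*])(.*)")            # "# text", but not "#:" or "#*"

def strip_markup(line):
    """Delete templates (innermost first), flatten links, drop quote markup."""
    previous = None
    while previous != line:
        previous = line
        line = TEMPLATE.sub("", line)
    line = LINK.sub(r"\1", line)
    line = re.sub(r"'{2,}", "", line)  # bold and italic markers
    return " ".join(line.split())      # collapse leftover whitespace

def brief_definitions(wikitext):
    """Return the plain-text definition lines of one entry."""
    definitions = []
    for line in wikitext.splitlines():
        match = DEFINITION.match(line)
        if match:
            text = strip_markup(match.group(1))
            if text:
                definitions.append(text)
    return definitions

# Hand-written sample entry, not copied from Wiktionary.
SAMPLE_ENTRY = """==English==
===Noun===
{{en-noun}}
# {{lb|en|informal}} An expression of [[gratitude]].
#: ''She gave a quick '''thank you'''.''
# A [[word|term]] of thanks.
"""

definitions = brief_definitions(SAMPLE_ENTRY)
```

The obvious cost is that anything a template would have contributed to a definition line, such as the "informal" label in the sample, silently disappears.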

If you've read this far and want to try what I've got in its current state, try: dict -h word. The available databases (-d option) are: en-raw, en-full, en, en-brief. Not all databases may have complete (or any) info at any given time.