I was chatting with Phil the other day and he was working on some statistical analysis tools for the Twitter API. I was reminded of a little project that had popped into my head a while ago: how hard would it be to identify the language in which a Twitter status update is written? With only 140 characters, some of which are going to be a URL or something, there won't be much info there. Is it possible?
The Twitter API provides a method to get the 20 most recent updates from the general public on Twitter. You can see this public timeline in your browser, or fetch it as XML or JSON or whatever. I took one look at that page and was immediately struck by:
- the wide range of languages represented (I imagine this varies by time of day)
- the high frequency of spelling errors, both active (forced by the 140-character limit) and passive (because people can't or won't spell properly)
- the wide range of "words" used that simply aren't in any dictionary (mmmmm, XD, arrr, dat) (well, "arrr" is permissible because it's International Talk Like A Pirate Day today, of course)
- the low information content of most of the crap people post on Twitter
- the (attempted) spam (who would actually read or click on your gratuitous message about affiliate marketing?)
I was disheartened by what I saw and ready to give up on the project. Here's a typical gem:
ne1 der.....2 b frndz wid me..
But you know what? Regardless of whether any of this could be considered correct, literate, crap, spam, or what have you, it's what people are actually writing. It's (mostly) human communication, even if it has a low information content. Doesn't it deserve analysis anyway? Real-world problems are almost never the easy ones.
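To get a feel for how little there is to work with, here's a rough sketch in Python of one way you might tidy up a status update before trying to guess its language. None of this comes from the Twitter API or from anything described above: the names `normalize_status` and `letter_count` are made up for illustration, and the cleanup rules (drop URLs, squeeze runs of three or more repeated characters down to two) are just my assumptions about what counts as noise.

```python
import re

# Assumption: anything starting with http:// or https:// is a URL we can drop.
URL_RE = re.compile(r"https?://\S+")
# Any single character repeated three or more times in a row.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def normalize_status(text):
    """Strip URLs, squeeze long character runs to two, and collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)  # "mmmmm" -> "mm", "....." -> ".."
    return " ".join(text.split())


def letter_count(text):
    """Count the alphabetic characters left over as raw material for a guess."""
    return sum(ch.isalpha() for ch in text)


sample = "ne1 der.....2 b frndz wid me.."
cleaned = normalize_status(sample)
print(cleaned)                # ne1 der..2 b frndz wid me..
print(letter_count(cleaned))  # 16
```

Sixteen letters is not a lot to go on, which is pretty much the whole problem in a nutshell.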