Greg Hewgill (ghewgill) wrote,

the general public on twitter

I was chatting with Phil the other day and he was working on some statistical analysis tools for the Twitter API. I was reminded of a little project that had popped into my head a while ago: How hard would it be to identify the language in which a Twitter status update is written? With only 140 characters, some of which are going to be taken up by a URL or something, there won't be much information to go on. Is it possible?

The Twitter API provides a method to get the most recent 20 updates from the general public on Twitter. You can see this public timeline in your browser, or fetch it as XML or JSON or whatever. I took one look at that page and was immediately struck by:

  • the wide range of languages represented (I imagine this varies by time of day)
  • the high frequency of spelling errors, both active (because of the 140-character limit) and passive (because people can't or won't spell properly)
  • the wide range of "words" used that simply aren't in any dictionary (mmmmm, XD, arrr, dat) (well, "arrr" is permissible because it's International Talk Like A Pirate Day today, of course)
  • the low information content of most of the crap people post on Twitter
  • the (attempted) spam (who would actually read or click on your gratuitous message about affiliate marketing?)

I was disheartened by what I saw, ready to give up on the project. Here's a typical gem:

ne1 der.....2 b frndz wid me..

But you know what? Regardless of whether any of this could be considered correct, literate, crap, spam, or what have you, it's what people are actually writing. It's (mostly) human communication, even if it has a low information content. Doesn't it deserve analysis anyway? Real world problems are almost never the easy ones.
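
For a sense of what such an analysis might look like, one common starting point for short-text language identification is to compare character trigram counts against a small profile per language. The sketch below is only an illustration of that general idea, not the method used in this project: the sample sentences, the three language codes, and the names `trigrams` and `guess_language` are all made up for the example, and real profiles would need far more training text.

```python
import re
from collections import Counter

# Toy training text per language. These sentences are invented for this
# example; a real profile would be built from much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is what people are writing today",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esto es lo que la gente escribe hoy",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist was die leute heute schreiben",
}

URL_RE = re.compile(r"https?://\S+")


def trigrams(text):
    """Count character trigrams after dropping URLs and collapsing whitespace."""
    text = URL_RE.sub(" ", text.lower())
    text = " " + " ".join(text.split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(status):
    """Return the language whose profile overlaps most with the status."""
    grams = trigrams(status)

    def score(profile):
        total = sum(profile.values())
        # Weight each trigram in the status by its relative frequency
        # in the language profile; unseen trigrams contribute zero.
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(PROFILES, key=lambda lang: score(PROFILES[lang]))


print(guess_language("this is what the people are writing http://example.com/x"))
```

Working on character trigrams rather than dictionary words is what gives an approach like this some chance against misspellings and invented "words", although a status as mangled as the one quoted above would still be a hard case.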

