Saturday, February 9, 2008

I ate a skunk for lunch

Faithful xkcd readers will know about the sudden change in Google results caused by the Dangers comic. The number of hits for "died in a blogging accident" has since risen from 2 to over 32,000.

The results were acquired simply by entering the corresponding Google query. This has one disadvantage: you have to know in advance which type of accident you're looking for.

I wrote a little Perl script to overcome this. It Googles for "died in a" "accident" and parses the first 1000 results (unfortunately, Google refuses to give more than that). The number of occurrences of “died in a _ accident”, with exactly one word in place of the _, is counted, and the results are charted using the Google Chart API.
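The counting step can be sketched roughly as follows. This is a minimal Python illustration, not the actual accident.pl: the function name `count_fillers` and the sample snippets are made up, and the real script works on text fetched from Google rather than on a hard-coded list.

```python
import re
from collections import Counter


def count_fillers(texts, before, after):
    """Count the single words found between `before` and `after`.

    Hypothetical helper for illustration; not taken from accident.pl.
    """
    # (\w+) captures exactly one word; \s+ tolerates any run of whitespace.
    pattern = re.compile(
        r"\b" + re.escape(before) + r"\s+(\w+)\s+" + re.escape(after) + r"\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in texts:
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


# Made-up snippets standing in for search-result text.
snippets = [
    "He died in a blogging accident last year.",
    "She died in a car accident.",
    "They died in a Car accident on the highway.",
    "He died in a tragic skiing accident.",  # two words in the gap: skipped
]

counts = count_fillers(snippets, "died in a", "accident")
for word, n in counts.most_common():
    print(word, n)
```

Lower-casing the captured word means that “Car” and “car” are counted together.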

So now we can see which accidents are really most common:

It appears that blogging is far more dangerous than I always thought!

The number of words we want in the results can be modified (e.g. “give me all results consisting of one, two or three words”). The words do not have to be sandwiched between two known phrases either: they can also come before or after a single phrase, as in “_ is an idiot” (which obviously requires that we specify a fixed number of words to take). There is also an ignore list to get rid of meaningless matches like “he” and “bush”.
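One way these options might fit together is sketched below in Python. It is an assumption-laden illustration rather than the real accident.pl: `count_phrases` and its parameters `min_words`, `max_words` and `ignore` are invented names, and the sample texts are made up. When only one surrounding phrase is given, it makes most sense to pass the same value for `min_words` and `max_words`.

```python
import re
from collections import Counter


def count_phrases(texts, before="", after="", min_words=1, max_words=1, ignore=()):
    """Count word groups next to `before` and/or `after` (hypothetical helper)."""
    if not before and not after:
        raise ValueError("give at least one of 'before' and 'after'")
    if min_words < 1 or max_words < min_words:
        raise ValueError("need 1 <= min_words <= max_words")

    # One word, followed by (min_words - 1) to (max_words - 1) further words.
    words = r"\w+(?:\s+\w+){%d,%d}" % (min_words - 1, max_words - 1)
    pattern = r"\b(" + words + r")\b"
    if before:
        pattern = r"\b" + re.escape(before) + r"\s+" + pattern
    if after:
        pattern = pattern + r"\s+" + re.escape(after) + r"\b"
    regex = re.compile(pattern, re.IGNORECASE)

    ignored = {word.lower() for word in ignore}
    counts = Counter()
    for text in texts:
        for match in regex.finditer(text):
            # Normalise case and whitespace so equal phrases are merged.
            phrase = " ".join(match.group(1).lower().split())
            if phrase not in ignored:
                counts[phrase] += 1
    return counts


# Words before a single phrase, with an ignore list.
idiots = count_phrases(
    ["He is an idiot.", "Bush is an idiot, they say.", "My neighbour is an idiot."],
    after="is an idiot",
    ignore={"he", "bush"},
)

# One or two words sandwiched between two phrases.
accidents = count_phrases(
    ["He died in a tragic skiing accident.", "She died in a car accident."],
    before="died in a",
    after="accident",
    min_words=1,
    max_words=2,
)
print(idiots, accidents)
```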

The script can also tell us what people eat:

Fair enough. But:

A rat? A skunk? A hippopotamus?!

Unfortunately, the script does not work as well as I had hoped. The results often do contain both parts of the query, but in different parts of the page. These hits get in the way of the useful results, and because we're limited to 1000 results we cannot dig any deeper to find them.

The script is full of known and unknown bugs, not least because it's my first nontrivial Perl script, but I put it online for your enjoyment anyway. It is called accident.pl and requires Perl (obviously), WWW::Mechanize and URI. Run it without arguments to get a brief help text. Let me know what results you come up with!

5 comments:

m-m said...

Haha, you have way too much time on your hands ;)

Anonymous said...
This comment has been removed by a blog administrator.
Anonymous said...
This comment has been removed by a blog administrator.
Mark IJbema said...

Two details: try using "died in a * accident", and use the Google API instead of parsing the results.

Nice work :)

Thomas ten Cate said...

Blimey, the * actually works! :) It matches only one word, but that's okay.

I couldn't use the Google API because they have stopped giving out keys for their SOAP API, and the AJAX API is only for embedding results in your own webpage, as far as I could tell -- i.e. would require me to parse anyway.

But I haven't actually tried, so I may be mistaken.