Friday, February 15, 2008

Fun with aspell word lists

The free spell checking program GNU aspell has support for many languages, including my native language, Dutch. Its dictionaries are freely available as well, which means that we should be able to extract a word list. And complete word lists for a language are simply fun to play with.

Unfortunately, aspell does not keep its dictionary as a plain text word list; it uses a compiled format. But there is a way to dump it:

aspell dump master

This will print the entire word list for your default language. You can specify the language used with -l:

aspell -l nl dump master

The argument to -l is the ISO 639 language code (see man aspell for details). The argument master tells aspell to use the systemwide dictionary, not your personal wordlist. The dictionary must be installed on your system; on Ubuntu the Dutch language package is called aspell-nl.

When we run aspell dump master for Dutch we get something unexpected:

blaat/MWPG
bloeit/KU
bloot/G
blootte
blote/N

There are strange tags attached to the end of many words. These are affix flags, and each one stands for a set of variations of that word. (Although there is an English affix file, no affix tags are printed if we dump an English dictionary.) We can expand the affix tags into all possible variations by sending them through aspell expand:

aspell -l nl dump master | aspell -l nl expand

That's better:

blaat geblaat blaatten blaatten blaatte blaten
bloeit opbloeit uitbloeit
bloot gebloot
blootte
blote bloten

If we now pipe this through tr we get all variations on separate lines as well. Thus the final command to get a word list for any aspell-supported language becomes:

aspell -l nl dump master | aspell -l nl expand | tr ' ' '\n'

(Note that this breaks for words that originally contained spaces. The Dutch word list does not have these, though.)
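The same flattening can be sketched in a few lines of Python, which also makes it easy to drop the duplicates that expand produces (blaatten appears twice above). This is only an illustration: the function name flatten_expanded is made up for this example, and the sample input is the expand output shown earlier rather than a live call to aspell.

```python
def flatten_expanded(lines):
    """Split each line of `aspell expand` output into separate words.

    Duplicates are dropped while the first-seen order is kept.
    Like the tr version, this splits entries that contain spaces.
    """
    seen = set()
    words = []
    for line in lines:
        for word in line.split():
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


# Sample input: the expand output shown above.
sample = [
    "blaat geblaat blaatten blaatten blaatte blaten",
    "bloeit opbloeit uitbloeit",
    "bloot gebloot",
    "blootte",
    "blote bloten",
]
print("\n".join(flatten_expanded(sample)))
```

To run it on real data, the output of the aspell pipeline could be piped into a script that passes sys.stdin to flatten_expanded instead of the sample list.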

Then I got interested in how these affixes work. Take blaat (to bleat) for example; it is followed by M, W, P and G. Looking in the affix definition file /usr/lib/aspell/nl_affix.dat there are some lines that define the meaning of these characters:

SFX M N 13
SFX M 0 ben b
SFX M 0 den d
...
SFX M 0 ten t
SFX M z zen z

SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]

SFX P N 34
SFX P ad den aad
SFX P af fen aaf
...
SFX P at ten aat
...

PFX G Y 1
PFX G 0 ge .

So what does this all mean? SFX and PFX stand for suffix (ending) and prefix (beginning), and the letter that follows is the flag that appears after the slash in the dump. The first line of each block gives the number of rule lines in the rest of the block; the Y or N before that number indicates whether the suffix may be combined with prefixes, or vice versa.
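To make the layout concrete, here is a rough Python sketch that reads one such block into a small data structure. The names AffixRule and parse_affix_block are invented for this example, and it assumes the simple layout of the excerpt above: a header line followed by rule lines with exactly five whitespace-separated fields. A real affix file contains other kinds of lines that this does not handle.

```python
from dataclasses import dataclass


@dataclass
class AffixRule:
    kind: str       # "SFX" or "PFX"
    flag: str       # the letter used after the slash, e.g. "W"
    strip: str      # text to remove from the word ("" when the file says 0)
    add: str        # text to attach
    condition: str  # pattern the word must match, e.g. "[kfstp]" or "."


def parse_affix_block(text):
    """Parse one SFX/PFX block into (cross_product, rules).

    Assumes a header line followed by five-field rule lines,
    as in the excerpt from nl_affix.dat.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    _kind, _flag, cross, count = lines[0]
    rules = [
        AffixRule(kind, flag, "" if strip == "0" else strip, add, cond)
        for kind, flag, strip, add, cond in lines[1:]
    ]
    # The header promises a rule count; check it against what we found.
    if len(rules) != int(count):
        raise ValueError(f"header promises {count} rules, found {len(rules)}")
    return cross == "Y", rules


block = """
PFX G Y 1
PFX G 0 ge .
"""
cross_product, rules = parse_affix_block(block)
print(cross_product, rules)
```

The count check is also why the abbreviated M and P blocks from the excerpt would be rejected: their headers promise 13 and 34 rules, but only a few are listed.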

The line SFX M 0 ten t simply creates the plural past form blaatten, by appending -ten if the original word ends in t. We also see suffixes for other cases in which consonant doubling is required.

The next suffix SFX W is more interesting. This one takes care of various conjugations, including the past tense and the past participle (‘voltooid deelwoord’). We see that -te is appended when the word ends in k, f, s, t, p or ch, and -de otherwise. Any Dutch person will immediately recognize this as the dreaded ‘kofschipregel’ that is the cause of so many spelling errors. In this case it gives rise to the word forms blaatte and blaatten.

The suffix SFX P at ten aat takes care of the infinitive. Note that the double a has been replaced by a single one; the pronunciation remains identical. (Dutch works in mysterious ways…) This replacement is done by the third field on the line, which indicates the text to strip off the end of the word; so far, it has been 0, which means to strip off nothing. (The SFX M z rule is the exception; as far as I can tell, it is not used anywhere in the aspell dictionary.) We take blaat, which matches -aat, so we strip off -at and stick -ten in its place, resulting in blaten.

Finally, we have a prefix rule PFX G, which sticks ge- before anything, leading to the form geblaat (bleating, as in “the bleating of the sheep”).
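Putting the pieces together, the walk-through above can be replayed with a small Python sketch. It is not how aspell itself is implemented; the names RULES, apply_rule and expand are invented here, and the rule table holds only the lines quoted in the excerpt (the real M and P blocks are longer). It also ignores the Y/N field, so prefixes and suffixes are never combined with each other.

```python
import re

# (kind, flag, strip, add, condition), copied from the excerpt above.
# The real M and P blocks contain more rules than are listed here.
RULES = [
    ("SFX", "M", "", "ten", "t"),
    ("SFX", "W", "", "t", "[^t]"),
    ("SFX", "W", "", "te", "[kfstp]"),
    ("SFX", "W", "", "ten", "[kfstp]"),
    ("SFX", "W", "", "te", "ch"),
    ("SFX", "W", "", "ten", "ch"),
    ("SFX", "W", "", "de", "[^kfstp]"),
    ("SFX", "W", "", "den", "[^kfstp]"),
    ("SFX", "P", "at", "ten", "aat"),
    ("PFX", "G", "", "ge", "."),
]


def apply_rule(word, kind, strip, add, condition):
    """Return the derived form, or None if the rule does not apply."""
    if kind == "SFX":
        # The condition must match at the end of the word.
        if not re.search(f"(?:{condition})$", word) or not word.endswith(strip):
            return None
        return word[: len(word) - len(strip)] + add
    # Prefix: the condition must match at the start of the word.
    if not re.match(condition, word) or not word.startswith(strip):
        return None
    return add + word[len(strip):]


def expand(entry):
    """Expand a dump line such as 'blaat/MWPG' into its word forms."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for flag in flags:
        for kind, rule_flag, strip, add, condition in RULES:
            if rule_flag == flag:
                derived = apply_rule(word, kind, strip, add, condition)
                if derived is not None:
                    forms.append(derived)
    return forms


print(expand("blaat/MWPG"))
```

For blaat/MWPG this yields the same six forms that aspell expand printed, including the duplicate blaatten (once from M, once from W), although in a different order.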

The file nl_affix.dat also contains a list of general replace rules (for example, replacing g by ch and vice versa) and specific ones (kado by cadeau). These rules are used when suggesting possible corrections for a misspelled word.

So where was I? Oh yeah, building a word list. To play with. For fun.

10 comments:

Anonymous said...

Thank you for the post. It enabled me to use aspell dictionaries with the Java spell checker JOrtho and describe it here:

https://sourceforge.net/forum/message.php?msg_id=6828072

Regards,
Dimitry Polivaev

Thomas ten Cate said...

Cool! Nice to see that this post is actually useful to somebody :)

everthonVS said...

Cool! I used it to generate a word list for a cellphone app called TextTwist, with this script:

aspell --clean-affixes --clean-words -l pt_BR dump master | cut -d / -f 1 | grep -v '[A-Z]\+' | awk '{if(length>=2 && length<=5)print $0}' > words.txt

(TextTwist for Motorola A1200: http://www.motorolafans.com/joom/index.php/for/showthread.php/0,t=22118/t,22118/)

Thomas ten Cate said...

Your link does not work for me. But when using this in publicly released software, be advised that the word list may not be public domain. Please remember to check the license on that.

James D said...

Thanks. This is really useful. I'm now some way toward generating a list of two-letter words for playing Scrabble in Welsh. Now to work out how to get the computer to count the seven digraphs (ch, dd, ff, ng, ll, rh, th) as single letters...

Eroen said...

Thanks, you enabled me to troll my roommate in online Scrabble through the use of (poorly written) Python scripts.

nalin4linux77 said...

Thanks Dear friend god bless you

Avenger said...

To get words of desired min-max length you can do something simpler than using awk:

aspell dump master | cut -f1 -d/ | egrep "^.{4,8}\$"

Actually I wanted all 3-char words free of any accent in the pt_BR language, so I did:

aspell -l pt_BR dump master | cut -f1 -d/ | egrep "^[a-z]{3}\$"

and voila. :)

Roger said...

Thanks Thomas! This was really helpful for me (I needed a way to load an English dictionary onto my Android phone that I got in China, and did it between aspell, a text file, and an app to add it to the user dictionary).

Before, I kept on seeing the /q etc suffixes on everything and didn't know how to convert them into a workable file.

survivant said...

Hello. I found your blog while looking for information about exporting the word list from hunspell/aspell and ispell. What I was trying to do was convert the file american-huge (ispell) to aspell, to be able to dump it from aspell on the command line, but I wasn't able to find the info.

Do you know where I can get the file for aspell or better, how can I do it myself ?