The GPL spell checking program aspell has support for many languages, including my native language, Dutch. Since it's open source, so are the dictionaries, which means that we should be able to extract a word list. And complete word lists for a language are simply fun to play with.
Unfortunately aspell is too advanced to use a plain text word list. But there is a way to dump it:
aspell dump master
This will print the entire word list for your default language. You can specify the language used with -l:
aspell -l nl dump master
The argument to -l is the ISO 639 language code (see man aspell for details). The argument master tells aspell to use the system-wide dictionary, not your personal word list. The dictionary must be installed on your system; on Ubuntu the Dutch language package is called aspell-nl.
When we run aspell dump master for Dutch we get something unexpected:
blaat/MWPG
bloeit/KU
bloot/G
blootte
blote/N
There are strange tags attached to the end of many words. These are affix flags, and they represent variations of that word. (Although there is an English affix file, no affix tags are printed if we dump an English dictionary.) We can expand the affix tags into all possible variations by sending them through aspell expand:
aspell -l nl dump master | aspell -l nl expand
That's better:
blaat geblaat blaatten blaatten blaatte blaten
bloeit opbloeit uitbloeit
bloot gebloot
blootte
blote bloten
If we now pipe this through tr we get all variations on separate lines as well. Thus the final command to get a word list for any aspell-supported language becomes:
aspell -l nl dump master | aspell -l nl expand | tr ' ' '\n'
(Note that this breaks for words that originally contained spaces. The Dutch word list does not have these, though.)
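The expanded output can contain duplicates (blaatten shows up twice in the sample above), so a sort and deduplication step may be worth adding. Here is a minimal sketch of that idea. It runs on a copy of the sample output rather than on a live aspell call, and the function name to_wordlist is made up for illustration:

```shell
#!/bin/sh
# Turn "aspell expand" style output (several forms per line) into a
# sorted word list with one unique word per line.
# to_wordlist is a hypothetical helper name, not part of aspell.
to_wordlist() {
    tr ' ' '\n' |        # one word per line
        grep -v '^$' |   # drop empty lines left by stray spaces
        LC_ALL=C sort -u # byte-order sort, duplicates removed
}

# Sample taken from the expanded Dutch output shown above.
printf '%s\n' \
    'blaat geblaat blaatten blaatten blaatte blaten' \
    'bloeit opbloeit uitbloeit' |
    to_wordlist
```

With aspell installed, the same function should be usable as the last stage of the real pipeline in place of the plain tr call.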
Then I got interested in how these affixes work. Take blaat (to bleat) for example; it is followed by M, W, P and G. The affix definition file /usr/lib/aspell/nl_affix.dat contains some lines that define the meaning of these characters:
SFX M N 13
SFX M 0 ben b
SFX M 0 den d
...
SFX M 0 ten t
SFX M z zen z
SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
SFX P N 34
SFX P ad den aad
SFX P af fen aaf
...
SFX P at ten aat
...
PFX G Y 1
PFX G 0 ge .
So what does this all mean? SFX and PFX stand for suffix (ending) and prefix (beginning). The first line of each block gives the number of lines in the rest of the block; the Y or N before the number indicates whether the suffix may be combined with prefixes or vice versa.
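To get a quick overview of which flags an affix file defines, the block headers can be picked out by their shape: four fields, the last one a number. Below is a rough sketch with awk, fed with a fragment of the listing above rather than the real file. The function name list_affix_blocks is hypothetical, and the file layout is assumed to be exactly as shown in this post:

```shell
#!/bin/sh
# Print one summary line per affix block header.
# Header lines look like "SFX M N 13": type, flag, combine flag, count.
# Rule lines have five fields, so they are skipped.
list_affix_blocks() {
    awk '($1 == "SFX" || $1 == "PFX") && NF == 4 && $4 ~ /^[0-9]+$/ {
        kind = ($1 == "SFX") ? "suffix" : "prefix"
        printf "%s %s: %d rules, combine=%s\n", kind, $2, $4, $3
    }'
}

# Fragment of the listing above (not the complete file).
list_affix_blocks <<'EOF'
SFX M N 13
SFX M 0 ben b
SFX M z zen z
SFX W N 7
SFX W 0 t [^t]
PFX G Y 1
PFX G 0 ge .
EOF
```

On a system with the Dutch dictionary installed, redirecting /usr/lib/aspell/nl_affix.dat into the function should list every flag the file defines, provided the file contains no other four-field lines of this shape.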
The line SFX M 0 ten t simply creates the plural past form blaatten, by appending -ten if the original word ends in t. We also see suffixes for other cases in which consonant doubling is required.
The next suffix SFX W is more interesting. This one takes care of various conjugations, including the past tense and the past participle (‘voltooid deelwoord’). We see that -te is appended when the word ends in k, f, s, t or p (or in ch), and -de otherwise. Any Dutch person will immediately recognize this as the dreaded ‘kofschipregel’ that is the cause of so many spelling errors. In this case it gives rise to the word forms blaatte and blaatten.
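The choice between -te and -de is easy to express as a shell case statement. This is only a simplified sketch of the rule as the SFX W block encodes it: the helper name past_singular is hypothetical, hoor is just an example stem that I did not take from the dictionary dump, and irregular verbs are ignored completely:

```shell
#!/bin/sh
# Simplified sketch of the -te/-de choice encoded by SFX W.
# past_singular is a hypothetical helper, not something aspell provides.
past_singular() {
    case "$1" in
        *[kfstp] | *ch) printf '%ste\n' "$1" ;; # ends in k, f, s, t, p or ch
        *)              printf '%sde\n' "$1" ;; # any other ending
    esac
}

past_singular blaat # prints blaatte
past_singular hoor  # prints hoorde
```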
The suffix SFX P at ten aat takes care of the infinitive. Note that the double a has been replaced by a single one; the pronunciation remains identical. (Dutch works in mysterious ways…) This replacement is done by the third field on the line, which indicates the text to strip off the end of the word; so far, it has been 0, which means to strip off nothing. (The SFX M z is the exception; as far as I can tell, it is not used anywhere in the aspell dictionary.) We take blaat, which matches -aat, so we strip off -at and stick -ten in its place, resulting in blaten.
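Read as a text substitution, a suffix rule is close to a sed command that only fires on words matching the condition. Here is a minimal sketch of that reading. The helper name apply_sfx and the way the condition is turned into a regular expression are my own assumptions, not a description of how aspell implements it:

```shell
#!/bin/sh
# apply_sfx STRIP ADD CONDITION WORD
# Prints the derived form if WORD ends in CONDITION, and nothing otherwise.
# A STRIP of 0 means "strip nothing", as in the affix file.
# apply_sfx is a hypothetical helper, not part of aspell.
apply_sfx() {
    sfx_strip=$1
    sfx_add=$2
    sfx_cond=$3
    if [ "$sfx_strip" = 0 ]; then
        sfx_strip=
    fi
    # The address /COND$/ selects matching words; s/STRIP$/ADD/ rewrites the end.
    printf '%s\n' "$4" |
        sed -n "/${sfx_cond}\$/ s/${sfx_strip}\$/${sfx_add}/p"
}

apply_sfx at ten aat blaat         # prints blaten
apply_sfx 0 ten t blaat            # prints blaatten
apply_sfx 0 de '[^kfstp]' blaat    # prints nothing: the condition fails
```

The sketch would break on rules containing a slash or other characters that are special to sed; none of the rules quoted here do.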
Finally, we have a prefix rule PFX G, which sticks ge- before anything, leading to the form geblaat (bleating, as in “the bleating of the sheep”).
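Putting the pieces together, a few lines of awk can apply a set of rules, prefixes included, to a single word and its flags. This is a sketch under my own assumptions: expand_word is a hypothetical name, only the rules quoted in this post are fed in, and the Y/N combine flag is ignored entirely, so it is not a replacement for aspell expand:

```shell
#!/bin/sh
# expand_word WORD FLAGS: apply every matching rule read from stdin.
# Rules use the five-field layout of the affix file; block headers (four
# fields) are skipped. expand_word is a hypothetical helper name.
expand_word() {
    awk -v word="$1" -v flags="$2" '
        BEGIN { out = word }
        NF == 5 && index(flags, $2) {
            strip = ($3 == "0") ? "" : $3
            if ($1 == "SFX" && word ~ ($5 "$"))
                out = out " " substr(word, 1, length(word) - length(strip)) $4
            else if ($1 == "PFX" && word ~ ("^" $5))
                out = out " " $4 substr(word, length(strip) + 1)
        }
        END { print out }
    '
}

expand_word blaat MWPG <<'EOF'
SFX M 0 ten t
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
SFX P at ten aat
PFX G Y 1
PFX G 0 ge .
EOF
```

For blaat this yields the same six forms that aspell expand printed earlier, including blaatten twice (once from M, once from W), although in a different order.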
The file nl_affix.dat also contains a list of general replace rules (for example, replacing g by ch and vice versa) and specific ones (kado by cadeau). These rules are used when suggesting possible corrections for a misspelled word.
So where was I? Oh yeah, building a word list. To play with. For fun.
12 comments:
Thank you for the post. It enabled me to use aspell dictionaries with the Java spell checker JOrtho, which I describe here:
https://sourceforge.net/forum/message.php?msg_id=6828072
Regards,
Dimitry Polivaev
Cool! Nice to see that this post is actually useful to somebody :)
Cool! I used it to generate a word list for a cellphone app called TextTwist, with this script:
aspell --clean-affixes --clean-words -l pt_BR dump master | cut -d / -f 1 | grep -v '[A-Z]\+' | awk '{if(length>=2 && length<=5)print $0}' > words.txt
(TextTwist for Motorola A1200: http://www.motorolafans.com/joom/index.php/for/showthread.php/0,t=22118/t,22118/)
Your link does not work for me. But when using this in publicly released software, be advised that the word list may not be public domain. Please remember to check the license on that.
Thanks. This is really useful. I'm now some way toward generating a list of two-letter words for playing Scrabble in Welsh. Now to work out how to get the computer to count the seven digraphs (ch, dd, ff, ng, ll, rh, th) as single letters...
Thanks, you enabled me to troll my roommate in online Scrabble through the use of (poorly written) Python scripts.
Thanks Dear friend god bless you
To get words of desired min-max length you can do something simpler than using awk:
aspell dump master | cut -f1 -d/ | egrep "^.{4,8}\$"
Actually I wanted all 3-char words free of any accent in the pt_BR language, so I did:
aspell -l pt_BR dump master | cut -f1 -d/ | egrep "^[a-z]{3}\$"
and voila. :)
Thanks Thomas! This was really helpful for me (I needed a way to load an english dictionary onto my android phone that I got in China, and did it between aspell, a text file, and an app to add it to the user dictionary)
Before, I kept on seeing the /q etc suffixes on everything and didn't know how to convert them into a workable file.
Hello. I found your blog while looking for information about exporting the word list from hunspell/aspell and ispell. What I was trying to do was convert the file american-huge (ispell) to aspell, so that I could dump it with aspell from the command line, but I wasn't able to find out how.
Do you know where I can get the file for aspell or, better, how I can do it myself?
Thank you very much! I'm learning languages with your idea...
Have a nice day
The ortho dictionary must be zlib compressed
# aspell -l pt_BR dump master | aspell -l pt_BR expand | tr ' ' '\n' | pigz -9zc > /usr/share/freeplane/resources/ortho/dictionary_pt.ortho