The GPL spell checking program aspell has support for many languages, including my native language, Dutch. Since it's open source, so are the dictionaries, which means that we should be able to extract a word list. And complete word lists for a language are simply fun to play with.
Unfortunately aspell is too advanced to use a plain text word list. But there is a way to dump it:
aspell dump master
This will print the entire word list for your default language. You can specify the language used with
aspell -l nl dump master
The argument to
-l is the ISO 639 language code (see
man aspell for details). The argument
master tells aspell to use the systemwide dictionary, not your personal wordlist. The dictionary must be installed on your system; on Ubuntu the Dutch language package is called
When we run
aspell dump master for Dutch we get something unexpected:
blaat/MWPG
bloeit/KU
bloot/G
blootte
blote/N
There are strange tags attached to the end of many words. These are affix flags, and they represent variations of that word. (Although there is an English affix file, no affix tags are printed if we dump an English dictionary.) We can expand the affix tags into all possible variations by piping the dump through aspell's expand command:
aspell -l nl dump master | aspell -l nl expand
blaat geblaat blaatten blaatten blaatte blaten
bloeit opbloeit uitbloeit
bloot gebloot
blootte
blote bloten
If we now pipe this through
tr we get all variations on separate lines as well. Thus the final command to get a word list for any aspell-supported language becomes:
aspell -l nl dump master | aspell -l nl expand | tr ' ' '\n'
(Note that this breaks for words that originally contained spaces. The Dutch word list does not have these, though.)
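The expanded output can also list the same form more than once (blaatten appears twice in the sample above), so a word list built this way may need de-duplicating. A minimal Python sketch, assuming the output of the expand step has been captured as text; the function name unique_words and the sample string are my own, for illustration only:

```python
def unique_words(expanded_text):
    """Split expanded output into words, dropping repeats but keeping first-seen order."""
    seen = set()
    words = []
    # str.split() with no argument splits on spaces and newlines alike,
    # so it plays the role of the tr step as well.
    for word in expanded_text.split():
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


sample = "blaat geblaat blaatten blaatten blaatte blaten\nbloeit opbloeit uitbloeit\n"
for word in unique_words(sample):
    print(word)
```

Like the tr version, this splits on every space, so it has the same limitation for entries that contain spaces.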
Then I got interested in how these affixes work. Take blaat (to bleat), for example; it is followed by the flags MWPG. In the affix definition file /usr/lib/aspell/nl_affix.dat, some lines define the meaning of these characters:
SFX M N 13
SFX M 0 ben b
SFX M 0 den d
...
SFX M 0 ten t
SFX M z zen z
SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
SFX P N 34
SFX P ad den aad
SFX P af fen aaf
...
SFX P at ten aat
...
PFX G Y 1
PFX G 0 ge .
So what does this all mean?
SFX and PFX stand for suffix (ending) and prefix (beginning). The first line of each block gives the number of lines in the rest of the block; the Y or N before the number indicates whether the suffix may be combined with prefixes or vice versa.
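To make the block layout concrete, here is a rough Python sketch that reads such blocks into a dictionary. It assumes only what is described above (a header line with flag, Y/N and rule count, followed by that many rule lines); a real affix file contains other kinds of lines too, which this sketch simply skips. The function name parse_affix_blocks and the dictionary keys are my own invention.

```python
def parse_affix_blocks(text):
    """Group SFX/PFX lines into blocks keyed by their flag character."""
    blocks = {}
    current = None
    remaining = 0  # rule lines still expected for the current block
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("SFX", "PFX"):
            continue  # not an affix line; ignored in this sketch
        if remaining == 0:
            # Header line: kind, flag, Y/N, number of rule lines that follow.
            kind, flag, combinable, count = fields[:4]
            current = {"kind": kind, "combinable": combinable == "Y", "rules": []}
            blocks[flag] = current
            remaining = int(count)
        else:
            # Rule line: kind, flag, text to strip, text to add, condition.
            strip, add, condition = fields[2:5]
            current["rules"].append(("" if strip == "0" else strip, add, condition))
            remaining -= 1
    return blocks


SAMPLE = """\
SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
PFX G Y 1
PFX G 0 ge .
"""

for flag, block in parse_affix_blocks(SAMPLE).items():
    print(flag, block["kind"], block["combinable"], len(block["rules"]))
# prints:
# W SFX False 7
# G PFX True 1
```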
SFX M 0 ten t simply creates the plural past form
blaatten, by appending
-ten if the original word ends in
t. We also see suffixes for other cases in which consonant doubling is required.
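As a small illustration of how such a rule is applied, the Python sketch below hard-codes the three SFX M rules visible in the excerpt (the ones hidden behind the "..." are left out) and appends the ending of the first rule whose condition matches. The names M_RULES and apply_m are hypothetical, and this is only my reading of the rule format, not aspell's actual implementation.

```python
# (text to append, required word ending), taken from the SFX M lines shown above.
M_RULES = [("ben", "b"), ("den", "d"), ("ten", "t")]


def apply_m(word):
    """Return the form built by the first matching rule, or None if no rule applies."""
    for ending, condition in M_RULES:
        if word.endswith(condition):
            return word + ending
    return None


print(apply_m("blaat"))  # blaatten
print(apply_m("bloei"))  # None: none of the three listed conditions match
```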
The next suffix, SFX W, is more interesting. This one takes care of various conjugations, including the past tense and the past participle (‘voltooid deelwoord’). We see that -te is appended when the word ends in k, f, s, t, p or ch, and -de otherwise. Any Dutch person will immediately recognize this as the dreaded ‘kofschipregel’ that is the cause of so many spelling errors. In this case it gives rise to the word forms blaatte and blaatten.
SFX P at ten aat takes care of the infinitive. Note that the double a has been replaced by a single one; the pronunciation remains identical. (Dutch works in mysterious ways…) This replacement is done by the third field on the line, which indicates the text to strip off the end of the word; so far, it has been 0, which means to strip off nothing. (The SFX M z rule is the exception; as far as I can tell, it is not used anywhere in the aspell dictionary.) We take blaat, which matches -aat, so we strip off -at and stick -ten in its place, resulting in blaten.
Finally, we have a prefix rule
PFX G, which sticks
ge- before anything, leading to the form
geblaat (bleating, as in “the bleating of the sheep”).
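Putting the pieces together, the following Python sketch expands blaat/MWPG using only the rules quoted in the excerpt. It is a simplified model under my own assumptions: the condition is treated as a regular expression anchored at the end of the word (suffixes) or at the start (prefixes), the Y/N combination flag is ignored, and the rules elided with "..." are missing. The names RULES and expand_entry are made up for this example.

```python
import re

# (kind, flag, strip, add, condition), copied from the affix excerpt above.
RULES = [
    ("SFX", "M", "0", "ten", "t"),
    ("SFX", "W", "0", "t", "[^t]"),
    ("SFX", "W", "0", "te", "[kfstp]"),
    ("SFX", "W", "0", "ten", "[kfstp]"),
    ("SFX", "W", "0", "te", "ch"),
    ("SFX", "W", "0", "ten", "ch"),
    ("SFX", "W", "0", "de", "[^kfstp]"),
    ("SFX", "W", "0", "den", "[^kfstp]"),
    ("SFX", "P", "at", "ten", "aat"),
    ("PFX", "G", "0", "ge", "."),
]


def expand_entry(entry):
    """Expand a 'word/FLAGS' entry into the word plus every form the rules produce."""
    word, _, flags = entry.partition("/")
    forms = [word]
    for kind, flag, strip, add, condition in RULES:
        if flag not in flags:
            continue
        cut = "" if strip == "0" else strip  # "0" means: strip nothing
        if kind == "SFX":
            if re.search(condition + "$", word) and word.endswith(cut):
                forms.append(word[: len(word) - len(cut)] + add)
        else:
            if re.match(condition, word) and word.startswith(cut):
                forms.append(add + word[len(cut):])
    return forms


print(sorted(set(expand_entry("blaat/MWPG"))))
# ['blaat', 'blaatte', 'blaatten', 'blaten', 'geblaat']
```

The set of forms matches what the expand step printed for blaat earlier, although the order and the duplicate blaatten (produced by both M and W) are not reproduced here.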
nl_affix.dat also contains a list of general replace rules (for example, replacing
ch and vice versa) and specific ones (
cadeau). These rules are used when suggesting possible corrections for a misspelled word.
So where was I? Oh yeah, building a word list. To play with. For fun.