The GPL spell checking program aspell has support for many languages, including my native language, Dutch. Since it's open source, so are the dictionaries, which means that we should be able to extract a word list. And complete word lists for a language are simply fun to play with.
Unfortunately aspell is too advanced to use a plain text word list. But there is a way to dump it:
aspell dump master
This will print the entire word list for your default language. You can specify the language used with -l:
aspell -l nl dump master
The argument to -l is the ISO 639 language code (see man aspell for details). The argument master tells aspell to use the systemwide dictionary, not your personal wordlist. The dictionary must be installed on your system; on Ubuntu the Dutch language package is called aspell-nl.
When we run aspell dump master for Dutch we get something unexpected:
blaat/MWPG
bloeit/KU
bloot/G
blootte
blote/N
There are strange tags attached to the end of many words. These are affix flags, and each one stands for a set of variations of that word. (Although there is an English affix file, no affix tags are printed if we dump an English dictionary.) We can expand the affix tags into all possible variations by sending them through aspell expand:
aspell -l nl dump master | aspell -l nl expand
That's better:
blaat geblaat blaatten blaatten blaatte blaten
bloeit opbloeit uitbloeit
bloot gebloot
blootte
blote bloten
If we now pipe this through tr we get all variations on separate lines as well. Thus the final command to get a word list for any aspell-supported language becomes:
aspell -l nl dump master | aspell -l nl expand | tr ' ' '\n'
(Note that this breaks for words that originally contained spaces. The Dutch word list does not have these, though.)
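The same splitting can also be done in a few lines of Python, which makes it easy to drop the duplicates that expand produces (blaatten appears twice in the output above). This is only a sketch: the function name split_expanded is made up for this example, and it shares the caveat about words that contain spaces.

```python
def split_expanded(lines):
    """Yield every word form once, in the order it is first seen."""
    seen = set()
    for line in lines:
        # Each line of expanded output holds one base word followed by
        # its variations, separated by whitespace.
        for word in line.split():
            if word not in seen:
                seen.add(word)
                yield word


# Sample lines copied from the expanded Dutch output shown above.
sample = [
    "blaat geblaat blaatten blaatten blaatte blaten",
    "bloeit opbloeit uitbloeit",
]
for word in split_expanded(sample):
    print(word)
```

Passing sys.stdin instead of sample would let the script sit at the end of the pipeline in place of tr.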
Then I got interested in how these affixes work. Take blaat (to bleat) for example; it is followed by M, W, P and G. The affix definition file /usr/lib/aspell/nl_affix.dat contains some lines that define the meaning of these characters:
SFX M N 13
SFX M 0 ben b
SFX M 0 den d
...
SFX M 0 ten t
SFX M z zen z
SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
SFX P N 34
SFX P ad den aad
SFX P af fen aaf
...
SFX P at ten aat
...
PFX G Y 1
PFX G 0 ge .
So what does this all mean? SFX and PFX stand for suffix (ending) and prefix (beginning). The first line of each block gives the number of lines in the rest of the block; the Y or N before the number indicates whether the suffix may be combined with prefixes or vice versa.
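To make the block layout concrete, here is a rough Python sketch that reads such lines into a dictionary keyed by flag letter. It assumes the fields are separated by whitespace exactly as in the excerpt, and it ignores every line that does not start with SFX or PFX. The names Rule and parse_affix_lines are invented for this example and are not part of aspell.

```python
from collections import namedtuple

# One affix rule; kind is "SFX" or "PFX", strip is "" when the file says 0.
Rule = namedtuple("Rule", "kind flag strip add condition")


def parse_affix_lines(lines):
    """Return (rules, combinable), both keyed by the flag letter."""
    rules = {}
    combinable = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] not in ("SFX", "PFX"):
            continue  # skip blank lines, "..." and anything else
        if len(fields) == 4:
            # Header line: kind, flag, Y/N, number of rules that follow.
            _kind, flag, cross, _count = fields
            combinable[flag] = cross == "Y"
            rules.setdefault(flag, [])
        elif len(fields) >= 5:
            kind, flag, strip, add, condition = fields[:5]
            strip = "" if strip == "0" else strip
            rules.setdefault(flag, []).append(
                Rule(kind, flag, strip, add, condition))
    return rules, combinable


# A few lines taken from the excerpt above (the W block is cut short here).
excerpt = """\
SFX W N 7
SFX W 0 t [^t]
SFX W 0 te [kfstp]
PFX G Y 1
PFX G 0 ge .
""".splitlines()

rules, combinable = parse_affix_lines(excerpt)
print(combinable)
print(rules["G"])
```

The sketch does not check the count in the header against the number of rules it actually finds; a stricter parser probably should.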
The line SFX M 0 ten t simply creates the plural past form blaatten, by appending -ten if the original word ends in t. We also see suffixes for other cases in which consonant doubling is required.
The next suffix SFX W is more interesting. This one takes care of various conjugations, including the past tense and the past participle (‘voltooid deelwoord’). We see that -te is appended when the word ends in k, f, s, t, p or ch, and -de otherwise. Any Dutch person will immediately recognize this as the dreaded ‘kofschipregel’ that is the cause of so many spelling errors. In this case it gives rise to the word forms blaatte and blaatten.
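The W block can be tried out in Python by treating the last field of each rule as a pattern that the end of the word must match. That reading is an assumption based on the excerpt, not something taken from the aspell source, and apply_w is a name made up for this sketch.

```python
import re

# The seven SFX W rules from the excerpt, as (add, condition) pairs.
# All of them strip nothing, so the strip field is left out here.
W_RULES = [
    ("t", "[^t]"),
    ("te", "[kfstp]"),
    ("ten", "[kfstp]"),
    ("te", "ch"),
    ("ten", "ch"),
    ("de", "[^kfstp]"),
    ("den", "[^kfstp]"),
]


def apply_w(word):
    """Return the forms produced by every W rule whose condition matches."""
    forms = []
    for add, condition in W_RULES:
        # Assumption: the condition describes the end of the word.
        if re.search("(?:" + condition + ")$", word):
            forms.append(word + add)
    return forms


print(apply_w("blaat"))  # ['blaatte', 'blaatten']
```

Because blaat ends in t, only the two [kfstp] rules fire, which is the kofschip rule in action.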
The suffix SFX P at ten aat takes care of the infinitive. Note that the double a has been replaced by a single one; the pronunciation remains identical. (Dutch works in mysterious ways…) This replacement is done by the third field on the line, which indicates the text to strip off the end of the word; so far, it has been 0, which means to strip off nothing. (The SFX M z is the exception; as far as I can tell, it is not used anywhere in the aspell dictionary.) We take blaat, which matches -aat, so we strip off -at and stick -ten in its place, resulting in blaten.
Finally, we have a prefix rule PFX G, which sticks ge- before anything, leading to the form geblaat (bleating, as in “the bleating of the sheep”).
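Putting the four flags together, the expansion of blaat/MWPG can be replayed in Python. This is a sketch under a few assumptions: it uses only the rule lines quoted in the excerpt (the ones elided with "..." are left out, not reconstructed), it treats the condition field as a regular expression, and it never combines a prefix with a suffix, which is enough here because the M, W and P blocks are all marked N. The function name expand is just borrowed from the aspell subcommand; this is not how aspell itself is implemented.

```python
import re

# Rule lines quoted in the excerpt, without the block headers.
EXCERPT = """\
SFX M 0 ben b
SFX M 0 den d
SFX M 0 ten t
SFX M z zen z
SFX W 0 t [^t]
SFX W 0 te [kfstp]
SFX W 0 ten [kfstp]
SFX W 0 te ch
SFX W 0 ten ch
SFX W 0 de [^kfstp]
SFX W 0 den [^kfstp]
SFX P ad den aad
SFX P af fen aaf
SFX P at ten aat
PFX G 0 ge .
"""


def expand(word, flags):
    """Return the base word plus every form the given flags produce."""
    forms = [word]
    for line in EXCERPT.splitlines():
        kind, flag, strip, add, condition = line.split()
        if flag not in flags:
            continue
        strip = "" if strip == "0" else strip  # 0 means: strip nothing
        if kind == "SFX":
            # Suffix: condition and strip both apply to the end of the word.
            if re.search("(?:" + condition + ")$", word) and word.endswith(strip):
                forms.append(word[:len(word) - len(strip)] + add)
        else:
            # Prefix: condition and strip both apply to the start of the word.
            if re.match(condition, word) and word.startswith(strip):
                forms.append(add + word[len(strip):])
    return forms


print(sorted(set(expand("blaat", "MWPG"))))
# ['blaat', 'blaatte', 'blaatten', 'blaten', 'geblaat']
```

The result contains the same five distinct forms that aspell expand printed for blaat, although not in the same order.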
The file nl_affix.dat also contains a list of general replace rules (for example, replacing g by ch and vice versa) and specific ones (kado by cadeau). These rules are used when suggesting possible corrections for a misspelled word.
So where was I? Oh yeah, building a word list. To play with. For fun.