dev.languagetool.org

Improving Spell Checking

You can improve the LanguageTool spell checker without touching the dictionary. Add your words to one of these files:

After making changes to any of the files listed above, you need to restart LanguageTool for the change to become active. The rest of this page explains how to modify the internal spell checking dictionary.

Hunspell

LanguageTool supports spell checking with Hunspell via BridJ (thanks to hunspell-java). Unfortunately, Hunspell is very slow at generating suggestions.

Morfologik

Another speller, based on finite-state automata (FSA), was included because of the speed problems with Hunspell. The dictionary for this speller is built in a way similar to the one described in Developing a tagger dictionary. All it requires is a list of valid words in a language. For example, the Polish word list of 3.5 million words becomes a file of less than 1 MB in the FSA format.

For Users

LanguageTool’s stand-alone version comes with a tool to build a binary dictionary. As input, it needs a plain text list of words (one word per line) and the .info file that’s already part of LanguageTool. You can call the tool like this to write the output to /tmp/output.dict:

java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i /path/to/dictionary.txt -info org/languagetool/resource/en/hunspell/en_US.info -o /tmp/output.dict
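The builder takes a plain text list with one word per line. The following Python sketch shows one way to clean such a list before building: it trims whitespace, drops empty lines, and removes duplicates. The function names are hypothetical, the builder may not need this step at all, and the UTF-8 output encoding is an assumption that should match the encoding given in your .info file.

```python
from typing import Iterable, List


def normalize_word_list(lines: Iterable[str]) -> List[str]:
    """Return unique, non-empty words, stripped of whitespace and sorted."""
    words = {line.strip() for line in lines}
    words.discard("")  # blank lines carry no word
    return sorted(words)


def write_word_list(words: Iterable[str], path: str) -> None:
    """Write one word per line. UTF-8 is an assumption, see the note above."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")
```

The file written by write_word_list can then be passed to the builder as the input word list.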

To export an existing dictionary as plain text you can use this command:

java -cp languagetool.jar org.languagetool.tools.DictionaryExporter -i org/languagetool/resource/en/hunspell/en_GB.dict -info org/languagetool/resource/en/hunspell/en_GB.info -o /tmp/out.txt

For some dictionaries this prints a plain list of words; for others, the result might look like this:

Abe+I
Abel+J
Abelard+F
Abelson+E
Aberconwy+E

The character after the + indicates a frequency class. A marks the least frequent words, Z the most frequent ones. Z might not actually be used, so in a given dictionary the most frequent words may be marked with a lower letter such as X.
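If you want to post-process the exported file, the lines can be split into word and frequency class. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes that the separator is the last + on the line and that the class is a single letter from A to Z. Lines without such a suffix are treated as plain words.

```python
from typing import Optional, Tuple


def parse_export_line(line: str) -> Tuple[str, Optional[str]]:
    """Split an exported line into (word, frequency_class).

    Assumes the class is one letter A-Z after the last '+'.
    Lines without that suffix are returned as (line, None).
    """
    line = line.rstrip("\n")
    word, sep, suffix = line.rpartition("+")
    if sep and word and len(suffix) == 1 and "A" <= suffix <= "Z":
        return word, suffix
    return line, None


def frequency_rank(freq_class: Optional[str]) -> int:
    """Map A..Z to 0..25. Words without a class get -1 (an assumption)."""
    if freq_class is None:
        return -1
    return ord(freq_class) - ord("A")
```

With the sample above, parse_export_line("Abel+J") yields the word Abel with class J, which ranks higher (more frequent) than Abelard with class F.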

For Developers

To create a Morfologik dictionary under Linux, you can use create_dict.sh. The script assumes you have a Hunspell dictionary (.dic and .aff) in ISO-8859-1 encoding. If your encoding is different, or your target encoding (as specified in the .info file) is not UTF-8, you need to adapt the script. For example, to build the American English dictionary, call it like this from the top-level LanguageTool directory:

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/create_dict.sh en US

It assumes the .dic and .aff files of the Hunspell dictionary are placed in languagetool-language-modules/<LANG>/src/main/resources/org/languagetool/resource/<LANG>/hunspell/. It will write its resulting .dict into your system’s temp directory.

The script expands the Hunspell word list to a list of inflected forms and tries to do all the work automatically.

Configuring the dictionary

The dictionary can be further configured using the .info file. Currently, the following properties are supported:

An example of replacement pairs (just like REP in hunspell):

fsa.dict.speller.replacement-pairs=rz ż, ż rz, ch h, h ch, ę en, en ę

The speller will then suggest “brzuch” if you mistype “bżuch”, or even “bżuh”.
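To make the idea concrete, here is a small Python sketch of how replacement pairs can turn a misspelling into a dictionary word. It is not the Morfologik implementation: the function names are hypothetical, the reading of each pair as "typed text, then replacement" is an assumption, and the search simply tries up to two replacements and keeps the results found in a word set.

```python
from typing import List, Set, Tuple


def parse_pairs(value: str) -> List[Tuple[str, str]]:
    """Parse 'rz ż, ż rz' into [('rz', 'ż'), ('ż', 'rz')]."""
    pairs = []
    for item in value.split(","):
        parts = item.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def suggest(word: str, pairs: List[Tuple[str, str]],
            dictionary: Set[str], max_edits: int = 2) -> List[str]:
    """Apply up to max_edits replacements and return dictionary hits."""
    seen = {word}
    frontier = [word]
    hits = []
    for _ in range(max_edits):
        next_frontier = []
        for current in frontier:
            for typed, replacement in pairs:
                start = current.find(typed)
                while start != -1:
                    candidate = (current[:start] + replacement
                                 + current[start + len(typed):])
                    if candidate not in seen:
                        seen.add(candidate)
                        next_frontier.append(candidate)
                        if candidate in dictionary:
                            hits.append(candidate)
                    start = current.find(typed, start + 1)
        frontier = next_frontier
    return hits


pairs = parse_pairs("rz ż, ż rz, ch h, h ch, ę en, en ę")
print(suggest("bżuch", pairs, {"brzuch"}))
print(suggest("bżuh", pairs, {"brzuch"}))
```

In this sketch, "bżuch" needs one replacement (ż to rz) and "bżuh" needs two (ż to rz, h to ch) to reach "brzuch".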

An example of equivalent chars (this is the same as MAP in hunspell):

fsa.dict.speller.equivalent-chars=x ź, l ł, u ó, ó u

If you write “wex”, the speller will be able to suggest “weź”.
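Equivalent chars work on single characters. The following Python sketch illustrates the idea by trying, at every position, the typed character and each of its configured equivalents. It is only an illustration, not how the speller itself searches its automaton: the function names are hypothetical, and this brute-force approach grows exponentially with the number of replaceable characters, so it only suits short words.

```python
from itertools import product
from typing import Dict, List, Set


def parse_equivalents(value: str) -> Dict[str, List[str]]:
    """Parse 'x ź, l ł' into {'x': ['ź'], 'l': ['ł']}."""
    table: Dict[str, List[str]] = {}
    for item in value.split(","):
        parts = item.split()
        if len(parts) == 2:
            table.setdefault(parts[0], []).append(parts[1])
    return table


def equivalent_variants(word: str, table: Dict[str, List[str]],
                        dictionary: Set[str]) -> List[str]:
    """Return dictionary words reachable by swapping equivalent chars."""
    # For every position: the typed character plus its equivalents.
    choices = [[ch] + table.get(ch, []) for ch in word]
    variants = ("".join(combo) for combo in product(*choices))
    return [v for v in variants if v != word and v in dictionary]


table = parse_equivalents("x ź, l ł, u ó, ó u")
print(equivalent_variants("wex", table, {"weź", "wół"}))
```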

Including frequency data

Sorting suggestions by how frequently the words are used makes the speller behave the way users expect: the most likely correction comes first.

To include the frequency data, you can run our dictionary builder like this:

java -cp languagetool.jar org.languagetool.tools.SpellDictionaryBuilder -i dictionary.dump -info org/languagetool/resource/en/hunspell/en_US.info -freq en_us_wordlist.xml -o /tmp/output.dict

The frequency data files can be found here. In addition, the flag fsa.dict.frequency-included must be set in the .info file. How to produce the dictionary dump from the current version of the binary dictionary is described on the page Developing a tagger dictionary.

Note that this approach is of limited value for languages with compounds (like German): only words listed in the dictionary will get their occurrence count, but other words are accepted as compounds and will thus not have an occurrence count.
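The effect of frequency data on suggestion order, and the limitation for words without a count, can be sketched in Python. This is an illustration only: the function name is hypothetical, and it assumes each known word carries a frequency class letter from A (least frequent) to Z (most frequent), as in the exported dictionary format. Words without a class, such as accepted compounds, are placed last here, which is an assumption about a sensible fallback and not necessarily what LanguageTool does.

```python
from typing import Dict, List


def rank_suggestions(candidates: List[str],
                     freq_classes: Dict[str, str]) -> List[str]:
    """Order candidates from most to least frequent.

    Words missing from freq_classes sort after all known words.
    Ties keep their input order because sorted() is stable.
    """
    def sort_key(word: str) -> int:
        freq_class = freq_classes.get(word)
        if freq_class is None:
            return 1  # larger than any known word's key (at most 0)
        return -(ord(freq_class) - ord("A"))

    return sorted(candidates, key=sort_key)


classes = {"Abe": "I", "Abel": "J", "Abelard": "F"}
print(rank_suggestions(["Abe", "Abel", "Abelard", "Abex"], classes))
```

Here "Abex" stands in for a word that is accepted without an occurrence count, so it ends up after the words that have one.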

Current limitations of MorfologikSpeller

There is a script to convert Hunspell dictionaries to finite word lists, but it is painfully slow. Alternatively, one could convert Hunspell dictionaries to the hfst format, but we cannot (at least for now) convert those to our FSA format.