LanguageTool can make use of large n-gram data sets to detect errors with words that are often confused, like their and there. The n-gram data set is huge and thus not part of the LT download. To make use of it, you have two choices:
To use the data locally:
Use the ngrams-xx-2015* files for LanguageTool <= 6.5 and the ngrams-xx-2024* files for LanguageTool >= 6.6.
The data directory is named en, de, fr, or es, depending on the language. The path you need to set in the next step is the directory that the en etc. directory is in, not that directory itself.
On the command line, use the --languagemodel option pointing to the ngram-index directory.
For the server, use the --languageModel option. Alternatively, you can start with the --config file option. This properties file needs to have a languageModel=... entry pointing to the ngram-index directory. Using a properties file also gives you access to some advanced configuration options. Call java -jar languagetool-server.jar to get a list of all options.

An n-gram is a contiguous sequence of n items from a text, like a girl
(2-gram) or a tall girl (3-gram). Once you have a large collection of these n-grams with their number of occurrences, you can use it to detect errors in texts. For example, in "This is there last chance to escape.", LanguageTool will look at the context of there, considering up to three words:
This is there, is there last, there last chance
The probabilities of these n-grams are then compared to the probabilities of:
This is their, is their last, their last chance
If the probability of the n-grams with their is higher than that of those with there, LanguageTool assumes there’s an error in the input sentence.
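
The comparison described above can be sketched in a few lines of Java. This is only an illustration of the idea, not LanguageTool's actual implementation: the class and method names are hypothetical, the counts are invented, and a product of add-one-smoothed counts stands in for the real probability calculation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class NgramConfusionSketch {

    /** Returns all 3-word windows of tokens that include the word at position pos. */
    static List<String> trigramsAround(List<String> tokens, int pos) {
        List<String> result = new ArrayList<>();
        for (int start = pos - 2; start <= pos; start++) {
            if (start >= 0 && start + 3 <= tokens.size()) {
                result.add(String.join(" ", tokens.subList(start, start + 3)));
            }
        }
        return result;
    }

    /** Product of add-one-smoothed counts; a simplified stand-in for a probability. */
    static double score(List<String> ngrams, Map<String, Long> counts) {
        double score = 1.0;
        for (String ngram : ngrams) {
            score *= counts.getOrDefault(ngram, 0L) + 1;
        }
        return score;
    }

    /** Returns the alternative if its context scores higher, otherwise the original word. */
    static String preferred(List<String> tokens, int pos, String alternative,
                            Map<String, Long> counts) {
        List<String> replaced = new ArrayList<>(tokens);
        replaced.set(pos, alternative);
        double original = score(trigramsAround(tokens, pos), counts);
        double alt = score(trigramsAround(replaced, pos), counts);
        return alt > original ? alternative : tokens.get(pos);
    }

    public static void main(String[] args) {
        // Invented counts for illustration only; real counts come from the n-gram index.
        Map<String, Long> counts = Map.of(
            "This is their", 1200L, "is their last", 900L, "their last chance", 1500L,
            "This is there", 300L, "is there last", 2L, "there last chance", 1L);
        List<String> tokens = List.of("This", "is", "there", "last", "chance", "to", "escape");
        System.out.println(preferred(tokens, 2, "their", counts)); // prints "their"
    }
}
```

With these made-up counts, the three n-grams containing their score far higher than those containing there, so the sketch prefers their for the example sentence.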
We use this data set from Google, which is very similar to what Google uses for its n-gram viewer.
Here you can see the confusion pairs we support so far:
How to add word pairs is documented at Adding ngram Data Rules.
Technical background information can be found at Finding errors using Big Data.
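
For the local server setup, a common mistake is to point languageModel at the en directory itself instead of at the directory that contains it. The following is a minimal sketch of a helper that checks this and writes a properties file for the --config option. The class name, method names, and the paths /data/ngrams and server.properties are assumptions made for illustration; only the languageModel key comes from the documentation above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class NgramConfigWriter {

    /** True if ngramDir directly contains a sub-directory named after the language code. */
    static boolean isNgramIndexDir(Path ngramDir, String langCode) {
        return Files.isDirectory(ngramDir.resolve(langCode));
    }

    /** Writes a properties file whose languageModel entry points to ngramDir. */
    static void writeConfig(Path ngramDir, String langCode, Path configFile) throws IOException {
        if (!isNgramIndexDir(ngramDir, langCode)) {
            throw new IllegalArgumentException(ngramDir + " does not contain a '" + langCode
                + "' directory; expected the directory that the language directory is in");
        }
        Properties props = new Properties();
        props.setProperty("languageModel", ngramDir.toAbsolutePath().toString());
        try (OutputStream out = Files.newOutputStream(configFile)) {
            props.store(out, "LanguageTool server configuration");
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical defaults; pass the real locations as arguments.
        Path ngramDir = Path.of(args.length > 0 ? args[0] : "/data/ngrams");
        Path configFile = Path.of(args.length > 1 ? args[1] : "server.properties");
        writeConfig(ngramDir, "en", configFile);
        System.out.println("Wrote " + configFile);
    }
}
```

The resulting file can then be passed to the server, for example with java -jar languagetool-server.jar --config server.properties.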
To use ngrams via the Java API, use JLanguageTool.activateLanguageModelRules(File).
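
A minimal sketch of that call might look as follows. It assumes the LanguageTool core and English modules are on the classpath and that the n-gram data is in /data/ngrams (a hypothetical path, which must be the directory containing en). Details other than activateLanguageModelRules(File) may differ between LanguageTool versions.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.languagetool.JLanguageTool;
import org.languagetool.Languages;
import org.languagetool.rules.RuleMatch;

public class NgramApiExample {
    public static void main(String[] args) throws IOException {
        JLanguageTool lt = new JLanguageTool(Languages.getLanguageForShortCode("en-US"));
        // Hypothetical path: the directory that contains the "en" directory.
        lt.activateLanguageModelRules(new File("/data/ngrams"));
        List<RuleMatch> matches = lt.check("This is there last chance to escape.");
        for (RuleMatch match : matches) {
            // Print the character range, the message, and the suggested replacements.
            System.out.println(match.getFromPos() + "-" + match.getToPos() + ": "
                + match.getMessage() + " " + match.getSuggestedReplacements());
        }
    }
}
```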