LanguageTool can make use of large n-gram data sets to detect errors with words that are often confused, like their and there. The n-gram data set is huge and thus not part of the LT download. To make use of it, you have two choices:
To use the data locally:
es, depending on the language. The path you need to set in the next step is the directory that the
en etc. directory is in, not that directory itself.
--languagemodel option pointing to the ngram-index directory.
--languageModel option. Alternatively, you can start with the
--config file option. This properties file needs to have a
languageModel=... entry pointing to the ngram-index directory. Using a properties file also gives you access to some advanced configuration options. Call
java -jar languagetool-server.jar to get a list of all options.
An n-gram is a contiguous sequence of n items from a text, like "a girl" (2-gram) or "a tall girl" (3-gram). Once you have a large number of these n-grams together with their occurrence counts, you can use them to detect errors in texts. For example, in "This is there last chance to escape.", LanguageTool will look at the context of "there", considering up to three words:
This is there,
is there last,
there last chance
The probabilities of these n-grams are then compared to the probabilities of:
This is their,
is their last,
their last chance
If the probability of the n-grams with "their" is higher than that of the n-grams with "there", LanguageTool assumes there's an error in the input sentence.
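The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LanguageTool's actual implementation: the occurrence counts, the corpus size, and the function names (context_trigrams, score, suggest) are all made up for this example, and the real rule may combine and threshold the probabilities differently.

```python
from math import prod

# Toy occurrence counts, invented for illustration only.
# Real counts would be looked up in the n-gram index.
TOY_COUNTS = {
    ("this", "is", "there"): 20,
    ("is", "there", "last"): 2,
    ("there", "last", "chance"): 1,
    ("this", "is", "their"): 150,
    ("is", "their", "last"): 300,
    ("their", "last", "chance"): 900,
}
TOTAL = 1_000_000  # assumed total number of 3-grams in the toy corpus


def context_trigrams(tokens, index):
    """Return every 3-gram in `tokens` that contains the token at `index`."""
    grams = []
    for start in range(index - 2, index + 1):
        if start >= 0 and start + 3 <= len(tokens):
            grams.append(tuple(tokens[start:start + 3]))
    return grams


def score(tokens, index):
    """Multiply the estimated probabilities of the 3-grams around `index`.

    One is added to every count so that an unseen 3-gram does not
    force the whole product to zero.
    """
    return prod(
        (TOY_COUNTS.get(gram, 0) + 1) / TOTAL
        for gram in context_trigrams(tokens, index)
    )


def suggest(tokens, index, alternative):
    """Return `alternative` if it fits the context better, else None."""
    replaced = tokens[:index] + [alternative] + tokens[index + 1:]
    if score(replaced, index) > score(tokens, index):
        return alternative
    return None


sentence = "this is there last chance to escape".split()
print(context_trigrams(sentence, 2))
print(suggest(sentence, 2, "their"))
```

With these toy counts, the three 3-grams around "there" are the ones listed above, and suggest proposes "their" because its context is far more frequent.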
We use this data set from Google, which is very similar to what Google uses for its n-gram viewer.
Here you can see the confusion pairs we support so far:
How to add word pairs is documented at Adding ngram Data Rules.
Technical background information can be found in Finding errors using Big Data.
To use ngrams via the Java API, use