Adding ngram Data Rules

This page describes how rule developers can make use of the n-gram data to detect errors. As an introduction and for set-up, please see Finding errors using n-gram data first.

To add word pairs, simply add them to the resource/<languageCode>/confusion_sets.txt file of your language. If such a file doesn't exist for your language yet, ask the developers to add one. Note, however, that this approach requires massive amounts of ngram data, which are easily available only for English, German, French, Spanish, Italian, Russian, Chinese, and Hebrew. Here's an example line from that file:

word1; word2; 10     # comment

In this example, word1 and word2 are two similar words that can easily be confused. The order of the two words doesn't matter to LT, but we recommend sorting them alphabetically. 10 is a factor that determines how strongly the word actually used in the text is preferred. If you use 1 here, the other word will be suggested as a correction even when it's only slightly more probable than the word from the text; in many cases, this leads to false alarms. To avoid those, use a larger factor, typically between 10 and 10000000. To find a good compromise between detecting many errors and not creating too many false alarms, use the evaluation described in the next section.
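The role of the factor can be sketched as follows. This is an illustrative model only, not LT's actual implementation: `should_suggest` and its probability inputs are hypothetical, standing in for the ngram probabilities LT looks up for each word in context.

```python
def should_suggest(p_text_word, p_other_word, factor):
    """Suggest the other word only if it is at least `factor` times more
    probable in this context than the word actually used in the text.
    (Illustrative sketch only -- not LanguageTool's real implementation.)"""
    return p_other_word > factor * p_text_word

# With factor 1, a marginally more probable alternative already triggers
# a suggestion -- a recipe for false alarms:
print(should_suggest(0.00010, 0.00011, factor=1))    # True
# With factor 10, the alternative must be ten times more probable:
print(should_suggest(0.00010, 0.00011, factor=10))   # False
print(should_suggest(0.00010, 0.00200, factor=10))   # True
```

Raising the factor suppresses borderline suggestions, which reduces false alarms but also misses some real errors, i.e. it trades recall for precision.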

Finding a factor

Run the evaluation tool from your IDE. It will print the options you need to provide: the token pair, the language code, the ngram directory, and an evaluation file with example sentences that contain the words in their correct context. It will show output like this:

Factor:   10 - 5 false positives, 11 false negatives
Summary:  p=0.997, r=0.994, 2000, 3grams, 2015-09-16

Factor:   100 - 2 false positives, 22 false negatives
Summary:  p=0.999, r=0.989, 2000, 3grams, 2015-09-16

Factor:   1000 - 2 false positives, 42 false negatives
Summary:  p=0.999, r=0.979, 2000, 3grams, 2015-09-16

p= is the precision value, i.e. the probability that a correct use of a word is detected as correct. This value needs to be close to 1, as the difference between 1 and this value indicates how many false alarms the rule will cause. r= is the recall, i.e. the probability that an incorrect use of a word is detected as incorrect. The closer it is to 1, the better, but high precision should be preferred over high recall. You can now choose the factor that offers a good compromise between precision and recall and add it as the third value for this pair of words in confusion_sets.txt.
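Given the definitions above, p and r can be recomputed from the false positive and false negative counts. Here is a minimal sketch, assuming the 2000 in the Summary line is the number of evaluation sentences per word (an assumption, but one consistent with the factor-100 line: (2000 − 2) / 2000 = 0.999 and (2000 − 22) / 2000 = 0.989):

```python
def precision(false_positives, correct_uses=2000):
    # Per the definition above: share of correct uses that are
    # (correctly) not flagged as errors.
    return (correct_uses - false_positives) / correct_uses

def recall(false_negatives, incorrect_uses=2000):
    # Share of incorrect uses that are (correctly) flagged as errors.
    return (incorrect_uses - false_negatives) / incorrect_uses

# Reproduces the factor-100 and factor-1000 lines of the example output:
print(precision(2))   # 0.999
print(recall(22))     # 0.989
print(recall(42))     # 0.979
```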

It can also happen that even a very high factor doesn't yield good precision. As a rule of thumb, the precision should be at least 0.995 for very common words and 0.99 for other words. If that's not achievable with any factor, or if the recall is very low (< 0.5), this pair of words may simply not be a good fit for ngram-based error detection.
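This rule of thumb can be written down as a small check. The helper `is_good_candidate` is hypothetical and just encodes the thresholds stated in this section:

```python
def is_good_candidate(p, r, very_common):
    """Rule of thumb: precision of at least 0.995 for very common words
    (0.99 otherwise), and a recall of at least 0.5.
    (Hypothetical helper, not part of LanguageTool.)"""
    min_precision = 0.995 if very_common else 0.99
    return p >= min_precision and r >= 0.5

print(is_good_candidate(0.999, 0.979, very_common=True))   # True
print(is_good_candidate(0.993, 0.850, very_common=True))   # False: precision too low
print(is_good_candidate(0.993, 0.850, very_common=False))  # True: 0.99 is enough here
print(is_good_candidate(0.999, 0.400, very_common=False))  # False: recall below 0.5
```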

Of course, the result of the evaluation depends on what you use as input: the input itself may contain errors. That's why the tool will also print the cases that are probably false alarms. If they aren't actually false alarms, you should probably clean up the input. For English, we use a combination of Wikipedia and Tatoeba as input. The less often the affected words appear in Wikipedia and Tatoeba, the less meaningful the evaluation output becomes. For English, we try to use at least 1000 example sentences for each word.

Known Limitations