How to avoid mistakes when writing rules

There are many ways to make mistakes in grammar rules. In XML-formatted rule files, the following mistakes recur:

Hasty generalization

Hasty generalization makes a rule produce false positives, which reduces its precision. It’s advisable to use the rule editor.

How to fix:

Bad suggestions

Sometimes, instead of correct forms, suggestions contain only messages that explain the kind of error.
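
As a minimal sketch of the difference (the wording and the corrected phrase are invented for illustration), the first message below only describes the error, while the second also offers the correct form inside a suggestion element:

<message>The word "bed" is probably a typo here.</message>

<message>Did you mean <suggestion>bad English</suggestion>?</message>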

How to fix:

Errors in regular expressions

A common problem is failing to use parentheses () to group alternatives in regular expressions that contain spaces. A related problem concerns POS tags: used as a regular expression, A would match any POS tag that contains A. But when used in an exception, you want to exclude exactly A. This is the wrong way:

<exception postag_regexp="yes" postag="!A"/>

The correct form:

<exception postag="A"/>

This way the POS tag is matched as a whole string, which is what you actually want. Regular expressions offer only limited ways of expressing negation (via character sets such as [^A]), but placing the POS tag inside an exception lets you negate it. In normal tokens, you can use negate_pos="yes" as a negation operator, as here:

<token negate_pos="yes" postag="A"/>

How to fix:

Badly encoded exceptions

Exceptions in rules remain untested unless they are accompanied by an example. Without one, you don’t really know whether the exception works.
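
A minimal sketch of an exception paired with an example that exercises it. The rule id, name, pattern and sentences are hypothetical and the rule is deliberately simplified; the point is only that the second example must pass through the exception without being flagged:

<rule id="HYPOTHETICAL_A_BEFORE_VOWEL" name="Hypothetical: 'a' before a vowel">
  <pattern>
    <token>a</token>
    <token regexp="yes">[aeiou].*<exception>one</exception></token>
  </pattern>
  <message>Use <suggestion>an \2</suggestion> before a vowel sound.</message>
  <!-- incorrect sentence: the rule should match the marked text -->
  <example correction="an apple">She ate <marker>a apple</marker>.</example>
  <!-- correct sentence that exercises the exception: must not be flagged -->
  <example>There is a one in this number.</example>
</rule>

If the exception were mistyped or removed, the second example should be reported as an unexpected match when the rule is tested.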

How to fix:

Untested corrections

Sometimes the corrections are not quite what the author intended: the ordering of tokens encoded as \1, \2 or <match no="1".../> can be broken, etc.
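
One way to catch this, sketched here with a hypothetical rule (the id, name and sentences are invented), is to state the expected correction in the incorrect example, so that a swapped \1 and \2 shows up as a mismatch when the rule is tested:

<rule id="HYPOTHETICAL_WORD_ORDER" name="Hypothetical word order rule">
  <pattern>
    <token>enough</token>
    <token>good</token>
  </pattern>
  <!-- \1 is the first matched token, \2 the second; here they are swapped on purpose -->
  <message>Did you mean <suggestion>\2 \1</suggestion>?</message>
  <example correction="good enough">This is <marker>enough good</marker>.</example>
  <example>This is good enough.</example>
</rule>

Writing \1 \2 instead would generate "enough good", which no longer agrees with the correction attribute.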

How to fix:

Lack of exceptions for skipping

Skipping enables matching non-contiguous sequences of tokens. However, some sequences (such as noun phrases or verb phrases) might be broken by punctuation characters, intervening connectives, other verb forms, etc. In Constraint Grammar, the notion of a Barrier specifies such breaking elements. In LanguageTool, we use exceptions for skipping (with scope="next"). Add as many exceptions as necessary.
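
A minimal sketch of such a barrier, using a hypothetical pattern fragment (the words and the punctuation set are chosen only for illustration). The first token skips any number of tokens until "nor" is found, but the skipping stops if "or" or sentence punctuation intervenes:

<pattern>
  <!-- skip="-1": skip any number of tokens; the exception applies to the skipped tokens -->
  <token skip="-1">either<exception scope="next" regexp="yes">or|[.;!?]</exception></token>
  <token>nor</token>
</pattern>

Under these assumptions, "either tea nor coffee" would match, while "either tea or coffee, nor milk" would not, because "or" appears among the skipped tokens.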

How to fix: