dev.languagetool.org

Developing a Disambiguator

A disambiguator is useful for a language when the tagger creates many interpretations for a token and grammar rules become very complex because the same set of exceptions has to be repeated everywhere to disambiguate part-of-speech tags.

The disambiguator may be rule-based, as it is for French or English, or it can implement a completely different (for example, statistical) scheme. Note that you cannot simply adapt existing disambiguators, even rule-based ones, as they are designed to make taggers robust. Robustness means that a good tagger should ignore small grammatical problems when tagging. However, we want to recognize such problems rather than hide them from linguistic processing. Still, I found that even automatically created rules (such as the ones generated by training a Brill tagger for English) can be a source of inspiration.

Note that, in contrast to XML grammar rules, the order of disambiguation rules is important (like Brill tagger rules, they are cascaded). They are applied in the order in which they appear in the file, so you can use a step-by-step strategy and build on the results of previous rules in the rules that follow.

The rule-based disambiguator may also be used to add additional markup and thus simplify error-matching rules. For example, you can conditionally mark up some punctuation or phrases. It is also useful to mark up tokens that you would otherwise have to match with lengthy regular-expression disjunctions (word1|word2...|wordn), especially if these disjunctions appear in multiple rules. This is more efficient in terms of processing speed and makes the rules a bit more understandable for a human being.
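As a sketch of this markup technique (the tag name WEEKDAY and the rule id are hypothetical, not part of any real tagset; the add action and the wd element are described in the sections below), a disambiguation rule can tag a whole disjunction of words once:

    <rule id="TAG_WEEKDAYS" name="tag weekdays">
      <pattern>
        <token regexp="yes">Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday</token>
      </pattern>
      <disambig action="add"><wd pos="WEEKDAY"/></disambig>
    </rule>

Grammar rules can then simply match postag="WEEKDAY" instead of repeating the disjunction everywhere.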

To create a new disambiguator, add a class like the following:

    package org.languagetool.tagging.disambiguation.rules.xx;
    import java.io.IOException;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.language.Yyyyyy;
    import org.languagetool.tagging.disambiguation.Disambiguator;
    import org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator;
    
    public class YyyyyyRuleDisambiguator implements Disambiguator {
      
      private final Disambiguator disambiguator = new XmlRuleDisambiguator(new Yyyyyy());
    
      @Override
      public final AnalyzedSentence disambiguate(AnalyzedSentence input)
          throws IOException {
        return disambiguator.disambiguate(input);
      }
    
    }

where:

xx is the two-letter language code

Yyyyyy is the language name

Then make the getDisambiguator() method of your Language subclass return an instance of this class, e.g. with lazy initialization:

    private Disambiguator disambiguator;

    @Override
    public final Disambiguator getDisambiguator() {
      if (disambiguator == null) {
        disambiguator = new YyyyyyRuleDisambiguator();
      }
      return disambiguator;
    }

XML syntax

The rule-based XML disambiguator uses a syntax very similar to that of the XML grammar rules. For example:

    <rule name="determiner + verb/NN -> NN" id="DT_VB_NN">
      <pattern>
        <token postag="DT"><exception postag="PDT" /></token>
        <marker>
          <and>
            <token postag="VB" />
            <token postag="NN"><exception negate_pos="yes" postag="VB|NN" postag_regexp="yes"/></token>
          </and>
        </marker>
      </pattern>
      <disambig postag="NN" />
    </rule>

The only new element here is disambig. It simply assigns a new POS tag to the word being disambiguated. Note that I am using a trick: the rule applies only to words that have both NN and VB tags; in English, there are many much more ambiguous words that require much more complex rules. Without this trick, the disambiguation rule could do more harm than good, as it would garble the tagger output. This is a constant danger when writing disambiguator rules.

Note that by default, disambig is applied to the single token selected with the <marker></marker> elements inside the pattern element. However, you can use the action attribute to select more tokens, for unification or for adding new interpretations.

The possible values of the action attribute are add, filter, filterall, immunize, ignore_spelling, remove, replace, and unify; they are described in the sections below.

Filtering tags

Instead of adding a single tag, as above, you can select an already existing tag (this also retains the old lemma, which would be overwritten by a simple assignment like the one above):

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig><match no="1" postag="PRP\$" postag_regexp="yes" /></disambig>
    </rule>

In this case, we select an existing interpretation (and only that interpretation) from the set of previous interpretations.

You can also assign a lemma if there are multiple interpretations and you don’t want to pick just the first one as supplied by the tagger (this is the default behavior):

    <rule name="Don't|do|don/vb ->don't/vb" id="DONT_VB">
      <pattern>
        <marker>
          <token>don</token>
        </marker>
        <token>'</token>
        <token>t</token>
      </pattern>
      <disambig><match no="1" postag="VBP">do</match></disambig>
    </rule>

In this case, the contracted form of “do” is assigned the proper lemma and form tag. All other interpretations are discarded.

There is another, shorter syntax that you might use for simple forms of filtering:

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig action="filter" postag="PRP\$" />
    </rule>

It is exactly equivalent to the first example. Note that you cannot specify a lemma this way, so you need the full syntax for that. Also note that if “his” is not already tagged as PRP$, this action is not executed; in other words, this disambiguator action presupposes that the new tag matches a POS tag already present on the given token. To add new interpretations or replace existing ones, use the add or replace actions, respectively.
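As a hedged sketch of the replace action (the rule id FAX_NN is hypothetical; this assumes replace works like add but first discards all existing readings of the marked token, so only the supplied reading remains):

    <rule name="tag 'fax' before a noun as noun only" id="FAX_NN">
      <pattern>
        <marker>
          <token>fax</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes"/>
      </pattern>
      <disambig action="replace"><wd lemma="fax" pos="NN">fax</wd></disambig>
    </rule>

Unlike filter, this does not require the new tag to be among the token's existing readings.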

Note: it is also possible to filter out interpretations that you want to remove by using a regular expression with negative lookahead, a trick that enables negation in regular-expression syntax. For example, this rule will remove all interpretations equivalent to PRP$ from the token “his”:

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig><match no="1" postag="^(?!PRP\$).*$" postag_regexp="yes" /></disambig>
    </rule>

As the filter is specified with a regular expression, you can remove multiple interpretations at once.

Filtering multiple tokens

The filterall action works the following way: once a pattern matches, every token inside the <marker> element is filtered by its corresponding POS tag. So if a pattern like “determiner + noun + adjective, masculine singular (with its exceptions)” matches, then all other readings (pronoun, verb, etc.) are removed.

For example:

    <rule>
      <pattern>
        <marker>
          <token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
          <token postag="N.[MC][SN0].*" postag_regexp="yes"/>
          <token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"><exception postag="V[MA]IP3S0" postag_regexp="yes"/></token>
        </marker>
      </pattern>
      <disambig action="filterall"/>
    </rule>

This would be equivalent to three rules (one for every token) like this:

    <rule>
      <pattern>
        <marker>
          <token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
        </marker>
        <token postag="N.[MC][SN0].*" postag_regexp="yes"/>
        <token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"/>
      </pattern>
      <disambig action="filter" postag="D.*"/>
    </rule>

Using unification

Before using unification, you need to define features and their equivalences, as described in Using unification. In the disambiguator file, you add the same unification block as in the rules file (the syntax is the same). Then, in a rule, you can keep only unified tokens, that is, tokens that share the same features. For example, take a simple agreement rule from the Polish disambiguator:

    <rule name="unifikacja przymiotnika z rzeczownikiem" id="unify_adj_subst">
      <pattern>
        <marker>
          <unify>
            <feature id="number"/><feature id="gender"/><feature id="case"/>
            <token postag="adj.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="adj.*"/></token>
            <token postag="subst.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="subst.*"/></token>
          </unify>
        </marker>
      </pattern>
      <disambig action="unify"/>
    </rule>

It uses unification on three features (defined earlier in the file): number, gender, and case. Note that I am using a trick (see Tips and tricks) to make sure that only words marked exclusively as adjectives or substantives are unified (otherwise the rule is too greedy).

There are several important restrictions: you cannot use two unify blocks in a disambiguator rule; only one unify sequence per pattern is allowed. Moreover, the span of tokens selected with <marker></marker> must coincide with the unified sequence. There may of course be more tokens in the rule, but they cannot be selected with <marker></marker> if the disambiguator is supposed to unify the sequence of tokens.

Removing only some interpretations

Sometimes, instead of filtering, you might want to remove only one interpretation from the token. You can do this in the following way:

    <rule name="mają to nie maić" id="MAJA_MAIC">
      <pattern>
        <token>mają</token>
      </pattern>
      <disambig action="remove"><wd lemma="maić" pos="verb:fin:pl:ter:imperf">mają</wd></disambig>
    </rule>

The above code removes one interpretation of the word “mają”: the one with POS equal to “verb:fin:pl:ter:imperf”, token equal to “mają”, and lemma “maić”. You can supply only some of the three parameters to remove an interpretation: all supplied parameters must match, so the fewer parameters you supply, the more interpretations are removed.
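For instance, here is a sketch with fewer parameters (the rule id is hypothetical; the tagset is the Polish one used above). Supplying only the POS removes every reading with that tag, whatever its lemma or token text:

    <rule name="remove finite-verb readings of mają" id="MAJA_REMOVE_POS">
      <pattern>
        <token>mają</token>
      </pattern>
      <disambig action="remove"><wd pos="verb:fin:pl:ter:imperf"/></disambig>
    </rule>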

Adding completely new readings

Adding new readings can be useful to mark up groups, such as noun groups or multi-word expressions. You can add a single reading or many readings to the whole sequence (for example, a start mark, an “inside” mark, and an end mark).

For example:

    <rule name="ciemku" id="ciemku">
      <pattern>
        <token>ciemku</token>
      </pattern>
      <disambig action="add"><wd lemma="po ciemku" pos="adjp">ciemku</wd></disambig>    
    </rule>

The number of wd elements must match the number of tokens selected with <marker></marker>.
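As a sketch of marking up a whole sequence (the phrase tags ADVP_B and ADVP_E are hypothetical begin and end marks, not part of any real tagset), one wd element is supplied per marked token:

    <rule name="mark 'at least' as an adverbial phrase" id="AT_LEAST_ADVP">
      <pattern>
        <marker>
          <token>at</token>
          <token>least</token>
        </marker>
      </pattern>
      <disambig action="add">
        <wd lemma="at least" pos="ADVP_B">at</wd>
        <wd lemma="at least" pos="ADVP_E">least</wd>
      </disambig>
    </rule>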

Adding only POS tags or tokens

You can also add just POS tags without specifying the lemmas or tokens added. This is especially useful when tagging tokens matched by regular expressions or POS tags, where you don't actually know which token you will find. You can add a POS tag simply by supplying a wd element without the lemma attribute or without textual content:

    <rule name="uppercase tag" id="UPTAG">
      <pattern case_sensitive="yes">
        <token regexp="yes">\p{Lu}+</token>
      </pattern>
      <disambig action="add"><wd pos="UP"/></disambig>    
    </rule>

In the above example, I only added the UP tag to uppercase words; the lemma is assumed to be equal to the token content, and the content of the token is not changed. So if the word was “Smiths”, it would be tagged as “UP”, and the lemma would be “Smiths” (although in other readings it could be “Smith”).

If you omit only the token text, it will be equal to the token matched by the current rule (rather than empty).

Immunizing words from matching

Sometimes a sequence of tokens raises false alarms in many rules even though the words are correctly tagged, and adding an exception to all those rules might be overkill (for example, for an idiomatic phrase). You can then immunize the tokens against matching by using the action immunize:

    <pattern>
      <token>dla</token>
      <marker>
        <token>Windows</token>
      </marker>
    </pattern>
    <disambig action="immunize"/>

The above pattern will be immunized, but only for the word “Windows”; this way no XML rule will match it. Java rules can ignore immunization; it is up to their authors to respect it.

Ignoring in spell-checking rules

You can mark up tokens as spelled correctly. For example, you can use a regular expression to make the spell checker accept all Roman numerals:

    <pattern case_sensitive="yes">
       <token regexp="yes">(?:M*(?:D?C{0,3}|C[DM])(?:L?X{0,3}|X[LC])(?:V?I{0,3}|I[VX]))</token>
    </pattern>
    <disambig action="ignore_spelling"/>

There are also very rare words that are not used anywhere but in some specific contexts. We want the spell checker to complain about them in all but these special contexts. For example, in Polish, the word ząb cannot be used in its genitive form zębu unless it appears in a specific context where one speaks of a kind of corn. The rule looks like this:

    <rule id='KONSKI_ZAB' name="końskiego zębu - dobra pisownia">
      <pattern>
        <token>końskiego</token>
        <marker>
          <token>zębu</token>
        </marker>      
      </pattern>
      <disambig action="ignore_spelling"/>      
    </rule>

Possible strategies of disambiguation

Testing disambiguation rules

The best way to test disambiguation rules is to run LanguageTool on a medium-sized corpus (comparable to the Brown corpus for English) and see whether previous false alarms are now fixed and no new false alarms are created. Otherwise, it is very hard to predict the impact of disambiguation rules.

You can test disambiguation rules in a fashion similar to the way grammar rules are tested. Let's look at an example:

    <example type="ambiguous" inputform="What[what/WDT,what/WP,what/UH]" outputform="What[what/WDT]"><marker>What</marker> kind of bread is this?</example>
    <example type="untouched">What are you doing?</example>

In the above snippet, we declare that the sentence “What are you doing?” should be left untouched, i.e., unchanged by the disambiguation rule, contrary to the ambiguous sentence, which will be processed. Using the marker element, we select the token that will be changed. The inputform attribute specifies the input forms of the token in word[lemma/POS] format. The outputform is, of course, what the disambiguation rule should produce.

Note also that in verbose mode (-v on the command-line interface), LT will display a log of all actions of the disambiguator for a given sentence.