dev.languagetool.org

Using Unification

Unification is used to find agreement between words.

More formally…

Unification is used to match sequences of tokens that meet the same criteria or share some features. In formal grammar, unification grammars stipulate that linguistic tokens share certain formally defined features.

In LanguageTool, unification can be used to match several tokens that share a certain set of features while the exact values of those features are unknown. This way, certain rules of agreement can be defined. For example, if the feature to be matched is letter case (all uppercase, all lowercase, an initial uppercase letter followed by lowercase, etc.), you simply specify that all tokens must share this feature, and only such tokens will be matched.

Unification is not limited to matching tokens in XML rules: the support in LanguageTool is general enough to be used in Java rules as well (look at the JUnit tests for some inspiration).

To make it work, you need to first define the feature. You simply need to give a name to it:

    <unification feature="case_sensitivity">
    ...  
    </unification>

Next, add some possible values, or types, of this feature. To do that, specify criteria of equivalence between tokens the same way as in rules, that is, via the token element:

    <unification feature="case_sensitivity">
        <equivalence type="startupper">
          <token regexp="yes">\p{Lu}\p{Ll}+</token>
        </equivalence>
        <equivalence type="lowercase">
          <token regexp="yes">\p{Ll}+</token>
        </equivalence>
    </unification>

Here you can see two possible types of instances of the feature case_sensitivity: startupper and lowercase, both defined with a regular-expression token. The unification block must appear in the XML file before any rules and phrases, immediately after the root element rules.
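
As a rough sketch of that placement, a rule file might be laid out as follows. The lang value and the comment standing in for the rest of the file are assumptions for illustration; the third type, alluppercase, is hypothetical and only shows how a further value could be declared:

    <rules lang="en">
        <!-- unification blocks come first, directly under the root element -->
        <unification feature="case_sensitivity">
            <equivalence type="startupper">
                <token regexp="yes">\p{Lu}\p{Ll}+</token>
            </equivalence>
            <equivalence type="lowercase">
                <token regexp="yes">\p{Ll}+</token>
            </equivalence>
            <!-- hypothetical extra type: words written entirely in uppercase -->
            <equivalence type="alluppercase">
                <token regexp="yes">\p{Lu}+</token>
            </equivalence>
        </unification>
        <!-- phrases and rules follow here -->
    </rules>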

To match tokens that share some feature, you simply write inside the pattern element:

    <unify>
      <feature id="case_sensitivity">
        <type id="startupper"/>
      </feature>
      <token/>
      <token>York</token>
    </unify>

The pattern will match any uppercase-starting word before the word “York” (New York, Old York, Pork York…).
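
For context, here is a sketch of how this fragment might sit inside a complete rule. The rule id, name, message and example sentences are invented for illustration, and the example elements assume the convention in which an incorrect example carries a correction attribute:

    <rule id="CAPITALIZED_BEFORE_YORK" name="Capitalized word before York">
        <pattern>
            <unify>
                <feature id="case_sensitivity">
                    <type id="startupper"/>
                </feature>
                <!-- any token, as long as it is of type startupper -->
                <token/>
                <token>York</token>
            </unify>
        </pattern>
        <!-- hypothetical message text -->
        <message>A capitalized word precedes "York".</message>
        <!-- both tokens start with an uppercase letter: the rule matches -->
        <example correction="">I moved to <marker>New York</marker>.</example>
        <!-- "new" is lowercase, so the tokens do not unify: no match -->
        <example>I moved to new York.</example>
    </rule>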

A slightly less trivial example is unification over two features with several possible values. Take features such as grammatical number and gender: they have different values in different languages (singular / plural / dual; feminine / masculine / neuter…). Inflected languages usually have tagsets that encode such features in POS tags. You can match those features using the token element, and stipulate that the following tokens have the same features as the first one:

    <unify>
      <feature id="gender">
        <type id="masc"/>
      </feature>
      <feature id="number">
        <type id="singular"/>
      </feature>
      <token/>
      <token>foo</token>
    </unify>

This pattern will match two tokens only if they have the same gender and number (masculine and singular). You can also omit the types; in this case, LanguageTool will try to match all possible values defined as equivalences for the features. Note: you cannot omit the features themselves.

    <unify>
       <feature id="gender"/>
       <feature id="number"/>
       <token/>
       <token>foo</token>
    </unify>
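
The gender and number features have to be defined as equivalences before they can be used this way. A minimal sketch, assuming a hypothetical tagset in which POS tags look like subst:sg:m (part of speech, number, gender); the regular expressions must be adapted to the tagset of the actual language:

    <unification feature="number">
        <equivalence type="singular">
            <!-- assumed tag layout: ...:sg:... -->
            <token postag=".*:sg:.*" postag_regexp="yes"/>
        </equivalence>
        <equivalence type="plural">
            <token postag=".*:pl:.*" postag_regexp="yes"/>
        </equivalence>
    </unification>
    <unification feature="gender">
        <equivalence type="masc">
            <!-- assumed tag layout: gender is the last field -->
            <token postag=".*:m" postag_regexp="yes"/>
        </equivalence>
        <equivalence type="fem">
            <token postag=".*:f" postag_regexp="yes"/>
        </equivalence>
    </unification>

Here the equivalences are expressed with POS tag patterns rather than with regular expressions over the word form, as was done for case_sensitivity.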

You can also match all tokens but the ones that share a certain set of features. Simply use negate="yes" on the unify element:

    <unify negate="yes">
       <feature id="gender"/>
       <feature id="number"/>
       <token/>
       <token>foo</token>
    </unify>
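
A typical use of negation is an agreement check. The following sketch flags an adjective followed by a noun when the two do not share gender and number; the POS tag prefixes adj and subst are assumptions borrowed from a hypothetical tagset, not values defined on this page:

    <pattern>
        <unify negate="yes">
            <feature id="gender"/>
            <feature id="number"/>
            <!-- hypothetical tags: any adjective followed by any noun -->
            <token postag="adj:.*" postag_regexp="yes"/>
            <token postag="subst:.*" postag_regexp="yes"/>
        </unify>
    </pattern>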

Unification is also used for disambiguation. It is integrated with the XML-based disambiguator. Simply specify the action attribute of the disambig element as "unify", and only agreeing tokens will be selected.
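
A disambiguation rule using unification might look roughly like this. The rule id, name and the POS tag patterns are hypothetical; the gender and number features are assumed to be defined as equivalences in the disambiguation file:

    <rule id="UNIFY_ADJ_NOUN" name="unify adjective and noun">
        <pattern>
            <unify>
                <feature id="gender"/>
                <feature id="number"/>
                <!-- hypothetical tags: adjective followed by noun -->
                <token postag="adj:.*" postag_regexp="yes"/>
                <token postag="subst:.*" postag_regexp="yes"/>
            </unify>
        </pattern>
        <!-- keep only the readings that agree in gender and number -->
        <disambig action="unify"/>
    </rule>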

Ignoring neutral elements

Sometimes it is useful to exclude some tokens from the agreement check while still matching them as part of the sequence. Think of punctuation, adverbs that have no gender or number, odd idiomatic expressions, or connectives. To silently add these to the unified sequence, simply use:

    <unify>
       <feature id="gender"/>
       <feature id="number"/>
       <token>foo</token>
       <unify-ignore>
          <token>,</token>
       </unify-ignore>
       <token>foo</token>
    </unify>

The comma will then be added without checking whether it agrees with the other tokens (frankly, it cannot, as commas are not inflected).

Beware: The first token inside <unify> cannot be ignored using <unify-ignore> (this will cause bugs). If the first token should be ignored, open <unify> after that token instead.
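
A sketch of that workaround, assuming the token to be ignored is a comma at the start of the sequence:

    <pattern>
        <!-- matched as part of the pattern, but outside the unified sequence -->
        <token>,</token>
        <unify>
            <feature id="gender"/>
            <feature id="number"/>
            <token>foo</token>
            <token>foo</token>
        </unify>
    </pattern>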