Unification is used to find agreement between words.
More precisely, it matches sequences of tokens that meet the same criteria or share some features. In formal grammar, unification grammars stipulate that linguistic tokens share certain formally defined features.
In LanguageTool, unification can be used to match several tokens that share a certain set of features even when the exact values of those features are unknown. This makes it possible to define rules of agreement. For example, if the feature to be matched is letter case (all uppercase, all lowercase, an initial uppercase letter followed by lowercase, and so on), you simply specify that all tokens must share this feature, and only such tokens will be matched.
Unification is not limited to matching tokens in XML rules: the support in LanguageTool is general enough to be used in Java rules as well (see the JUnit tests for some inspiration).
To make it work, you first need to define the feature by giving it a name:
<unification feature="case_sensitivity">
...
</unification>
Next, add the possible values, or types, of the feature. To do that, you specify criteria of equivalence between tokens the same way as in rules, that is, with the <token> element:
<unification feature="case_sensitivity">
  <equivalence type="startupper">
    <token regexp="yes">\p{Lu}\p{Ll}+</token>
  </equivalence>
  <equivalence type="lowercase">
    <token regexp="yes">\p{Ll}+</token>
  </equivalence>
</unification>
Here you can see two possible types of the feature case_sensitivity: startupper and lowercase, both defined with a regular-expression token. The <unification> block must appear in the XML file before any rules and phrases, immediately after the opening tag of the root <rules> element.
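As a rough sketch of the placement described above, a rule file could be laid out as follows. The lang value and the placeholder comment are assumptions made for illustration and are not taken from any shipped rule file:

```xml
<rules lang="en">
  <!-- Unification blocks come first, right after the opening <rules> tag -->
  <unification feature="case_sensitivity">
    <equivalence type="startupper">
      <token regexp="yes">\p{Lu}\p{Ll}+</token>
    </equivalence>
    <equivalence type="lowercase">
      <token regexp="yes">\p{Ll}+</token>
    </equivalence>
  </unification>

  <!-- Phrases, categories and rules follow here (omitted in this sketch) -->
</rules>
```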
To match tokens that share some feature, write the following inside the <pattern> element:
<unify>
  <feature id="case_sensitivity">
    <type id="startupper"/>
  </feature>
  <token/>
  <token>York</token>
</unify>
The pattern will match any capitalized word (an uppercase letter followed by lowercase letters) that comes directly before the word “York” (New York, Old York, Pork York…).
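For context, the pattern could sit inside a rule roughly like this. This is only a sketch: the rule id, name and message are made up for illustration, and the example sentences that a real rule would normally carry are left out:

```xml
<!-- Hypothetical rule: id, name and message are invented for this sketch -->
<rule id="CAPITALIZED_WORD_BEFORE_YORK" name="Capitalized word before 'York' (illustration only)">
  <pattern>
    <unify>
      <!-- Both tokens must be of type startupper -->
      <feature id="case_sensitivity">
        <type id="startupper"/>
      </feature>
      <token/>
      <token>York</token>
    </unify>
  </pattern>
  <message>Both words start with an uppercase letter.</message>
</rule>
```

The rule relies on the case_sensitivity feature having been declared in a <unification> block earlier in the same file.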
A slightly less trivial example is unification over several features with many values. Take features such as grammatical number and gender: they have different values in different languages (singular / plural / dual; feminine / masculine / neuter…). Inflected languages usually have tagsets that specify such features in POS tags. You can match those features using the <token> element and stipulate that the following tokens have the same features as the first one:
<unify>
  <feature id="gender">
    <type id="masc"/>
  </feature>
  <feature id="number">
    <type id="singular"/>
  </feature>
  <token/>
  <token>foo</token>
</unify>
This pattern matches only two tokens that have the same gender and number (masculine and singular). You can also skip specifying the types; in this case, LanguageTool will try to match all possible values defined as equivalences for the features. Note that the features themselves cannot be skipped.
<unify>
  <feature id="gender"/>
  <feature id="number"/>
  <token/>
  <token>foo</token>
</unify>
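The gender and number features used in these patterns have to be declared in <unification> blocks, just like case_sensitivity. A minimal sketch, assuming a hypothetical tagset whose POS tags contain colon-separated values such as sg, pl, m and f (the regular expressions and the fem and plural type names are assumptions; adapt them to the tagset of your language):

```xml
<unification feature="number">
  <equivalence type="singular">
    <!-- Hypothetical tag format, e.g. "noun:sg:m" -->
    <token postag=".*:sg(:.*)?" postag_regexp="yes"/>
  </equivalence>
  <equivalence type="plural">
    <token postag=".*:pl(:.*)?" postag_regexp="yes"/>
  </equivalence>
</unification>
<unification feature="gender">
  <equivalence type="masc">
    <token postag=".*:m(:.*)?" postag_regexp="yes"/>
  </equivalence>
  <equivalence type="fem">
    <token postag=".*:f(:.*)?" postag_regexp="yes"/>
  </equivalence>
</unification>
```

Here the equivalences are defined on POS tags rather than on the token text, so two tokens agree when their tags match the same equivalence type for every listed feature.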
You can also match all tokens except those that share a certain set of features. Simply set negate="yes" on the <unify> element:
<unify negate="yes">
  <feature id="gender"/>
  <feature id="number"/>
  <token/>
  <token>foo</token>
</unify>
Unification is also used for disambiguation, as it is integrated with the XML-based disambiguator. Simply set the action attribute of the <disambig> element to "unify", and only agreeing tokens will be selected.
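A disambiguation rule using this action might look roughly like the following sketch. The rule id, name and POS tag patterns are hypothetical, and the sketch assumes that the gender and number features are declared in <unification> blocks in the disambiguation file:

```xml
<!-- Hypothetical disambiguation rule; id, name and POS tags are made up -->
<rule id="UNIFY_ADJ_NOUN" name="unify adjective and noun (illustration only)">
  <pattern>
    <unify>
      <feature id="gender"/>
      <feature id="number"/>
      <token postag="adj:.*" postag_regexp="yes"/>
      <token postag="noun:.*" postag_regexp="yes"/>
    </unify>
  </pattern>
  <!-- Keep only the agreeing tokens, as described above -->
  <disambig action="unify"/>
</rule>
```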
Sometimes it is useful to exclude some tokens from the agreement check while still keeping them in the matched sequence. Think of punctuation, adverbs that have no gender or number, odd idiomatic expressions, or connectives. To add these to the unified sequence without testing them, wrap them in <unify-ignore>:
<unify>
  <feature id="gender"/>
  <feature id="number"/>
  <token>foo</token>
  <unify-ignore>
    <token>,</token>
  </unify-ignore>
  <token>foo</token>
</unify>
The comma will then be added without checking whether it agrees with the other tokens (frankly, it cannot, as commas are not inflected).
Beware: the first token inside <unify> cannot be ignored using <unify-ignore> (this will cause bugs). If the first token should be ignored, open <unify> after that token instead.
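A minimal sketch of this workaround, reusing the placeholder tokens from the earlier examples: the token that should not take part in agreement is placed before <unify> rather than inside it.

```xml
<pattern>
  <!-- The token to be left out of the agreement check stays outside <unify> -->
  <token>,</token>
  <unify>
    <feature id="gender"/>
    <feature id="number"/>
    <token>foo</token>
    <token>foo</token>
  </unify>
</pattern>
```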