A disambiguator is useful for a language when the tagger produces many interpretations for a token and rules become very complex because the same set of exceptions is used everywhere to disambiguate part-of-speech tags.
The disambiguator may be rule-based, as it is for French or English, or it can implement a completely different scheme (for example, a statistical one). Note that you cannot simply adapt existing disambiguators, even rule-based ones, as they are designed to make taggers robust. Robustness means that a good tagger ignores small grammatical problems when tagging. However, we want to recognize such problems rather than hide them from linguistic processing. Still, I found that even automatically created rules (such as those generated by training a Brill tagger for English) can be a source of inspiration.
Note that, in contrast to XML grammar rules, the order of disambiguation rules is important (like Brill tagger rules, they are cascaded). They are applied in the order in which they appear in the file, so you can use a step-by-step strategy and rely on the results of previous rules in the rules that follow.
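For instance, an early rule can resolve an ambiguity that a later rule then takes for granted. A minimal sketch, using Penn-style tags and made-up rule names (both rules are illustrative, not from an actual rule file):

```xml
<!-- Step 1: after "the", a word ambiguous between only VB and NN is a noun -->
<rule id="STEP_1" name="the + verb/noun -> noun (hypothetical)">
  <pattern>
    <token>the</token>
    <marker>
      <token postag="VB"><exception negate_pos="yes" postag="VB|NN" postag_regexp="yes"/></token>
    </marker>
  </pattern>
  <disambig postag="NN"/>
</rule>
<!-- Step 2: runs later in the file, so it can already rely on the
     unambiguous NN reading produced by step 1 -->
<rule id="STEP_2" name="noun + verb/plural-noun -> verb (hypothetical)">
  <pattern>
    <token postag="NN"><exception negate_pos="yes" postag="NN"/></token>
    <marker>
      <token postag="VBZ"><exception negate_pos="yes" postag="VBZ|NNS" postag_regexp="yes"/></token>
    </marker>
  </pattern>
  <disambig postag="VBZ"/>
</rule>
```

Swapping the two rules would change the result, since step 2 depends on the output of step 1.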
The rule-based disambiguator may also be used to add additional markup and simplify error-matching rules. For example, you can conditionally mark up some punctuation or phrases. It is also useful for marking up tokens that you would otherwise match with lengthy regular-expression disjunctions (word1|word2...|wordn) if these disjunctions appear in multiple rules. This is more efficient in terms of processing speed and makes the rules a bit more understandable for a human being.
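For example, instead of repeating a long disjunction of weekday names in several rules, a single disambiguation rule can attach a marker tag that other rules then match with a short postag condition. A sketch (the _weekday tag and rule name are made up; the rule uses the add action described later in this document):

```xml
<!-- Add a hypothetical _weekday reading once, so other rules can
     match postag="_weekday" instead of the full disjunction -->
<rule id="MARK_WEEKDAY" name="mark weekday names (hypothetical)">
  <pattern>
    <marker>
      <token regexp="yes">Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday</token>
    </marker>
  </pattern>
  <disambig action="add"><wd pos="_weekday"/></disambig>
</rule>
```

Rules that need the word list can then use <token postag="_weekday"/> instead of repeating the whole disjunction.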
In src/main/java:
package org.languagetool.tagging.disambiguation.rules.xx;

import java.io.IOException;

import org.languagetool.AnalyzedSentence;
import org.languagetool.language.Yyyyyy;
import org.languagetool.tagging.disambiguation.Disambiguator;
import org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator;

public class YyyyyyRuleDisambiguator implements Disambiguator {

  private final Disambiguator disambiguator = new XmlRuleDisambiguator(new Yyyyyy());

  @Override
  public final AnalyzedSentence disambiguate(AnalyzedSentence input) throws IOException {
    return disambiguator.disambiguate(input);
  }

}
where:
xx is the two-letter language code
Yyyyyy is the language name
Then edit org.languagetool.Language.Yyyyyy.java and override getDisambiguator():

@Override
public final Disambiguator getDisambiguator() {
  if (disambiguator == null) {
    disambiguator = new YyyyyyRuleDisambiguator();
  }
  return disambiguator;
}
In src/main/resources, add:

org/languagetool/resource/xx/disambiguation.xml
The rule-based XML disambiguator uses a syntax very similar to that of XML grammar rules. For example:
<rule name="determiner + verb/NN -> NN" id="DT_VB_NN">
<pattern>
<token postag="DT"><exception postag="PDT" /></token>
<marker>
<and>
<token postag="VB" />
<token postag="NN" ><exception negate_pos="yes" postag="VB|NN" postag_regexp="yes"/></token>
</and>
</marker>
</pattern>
<disambig postag="NN" />
</rule>
The only new element here is disambig. It simply assigns a new POS tag to the word being disambiguated. Note that I am using a trick: the rule applies only to words that have both the NN and VB tags. In English, there are many words that are far more ambiguous and require much more complex rules. Without the trick, the disambiguation rule could do more damage than good: it would garble the tagger output. This is a constant danger when writing disambiguator rules.
Note that by default disambig is applied to a single token, which is selected with the <marker>…</marker> elements in the pattern element. However, you can use the action attribute to select more tokens for unification or for adding new interpretations. The possible values of the action attribute are:
- replace - the default one, assumed in the above example
- filter - used for filtering single tokens
- filterall - used for filtering multiple tokens by using the postags given in the rule
- unify - used for unification of groups of tokens
- remove - used for removing single tokens
- add - used for adding interpretations
- immunize - used to mark the tokens as immunized, i.e., never matched by any rule
- ignore_spelling - used to mark tokens that should not be marked as misspelled

Instead of adding a single tag, as above, you can select an already existing tag (this also retains the old lemma, which would be overwritten by a simple assignment as above):
<rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
<pattern>
<marker>
<token>his</token>
</marker>
<token postag="NN.*" postag_regexp="yes" />
</pattern>
<disambig><match no="1" postag="PRP\$" postag_regexp="yes" /></disambig>
</rule>
In this case, we select an existing interpretation (and only that interpretation) from the set of previous interpretations.
You can also assign a lemma if there are multiple interpretations and you don’t want to pick just the first one as supplied by the tagger (this is the default behavior):
<rule name="Don't|do|don/vb ->don't/vb" id="DONT_VB">
<pattern>
<marker>
<token>don</token>
</marker>
<token>'</token>
<token>t</token>
</pattern>
<disambig><match no="1" postag="VBP">do</match></disambig>
</rule>
In this case, the contracted form of “do” is assigned the proper lemma and form tag. All other interpretations are discarded.
There is another, shorter syntax that you might use for simple forms of filtering:
<rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
<pattern>
<marker>
<token>his</token>
</marker>
<token postag="NN.*" postag_regexp="yes" />
</pattern>
<disambig action="filter" postag="PRP\$" />
</rule>
It is exactly equivalent to the first example. Note that you cannot specify a lemma this way, so you need the full syntax for that. Note also that if “his” is not tagged as PRP$, this action is not executed. In other words, this disambiguator action presupposes that the new tag matches a POS tag already present on the given token. To add new interpretations or replace existing ones, you need to use the actions add or replace, respectively.
Note: it is also possible to filter out interpretations you need to remove by using a regular expression with negative lookahead, a trick that enables negation in regular-expression syntax. For example, this rule will remove all interpretations equivalent to PRP$ from the token “his”:
<rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
<pattern>
<marker>
<token>his</token>
</marker>
<token postag="NN.*" postag_regexp="yes" />
</pattern>
<disambig><match no="1" postag="^(?!PRP\$).*$" postag_regexp="yes" /></disambig>
</rule>
As the filter is specified using a regular expression, you can remove multiple interpretations at once.
The action filterall works the following way: once a pattern matches, every token inside the <marker> tag is filtered by its corresponding POS tag. So if a pattern like “determiner + noun + adjective, masculine singular (with its exceptions)” is matched, all other readings (pronoun, verb, etc.) are removed.
For example:
<rule>
<pattern>
<marker>
<token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
<token postag="N.[MC][SN0].*" postag_regexp="yes"/>
<token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"><exception postag="V[MA]IP3S0" postag_regexp="yes"/></token>
</marker>
</pattern>
<disambig action="filterall"/>
</rule>
This would be equivalent to three rules (one for every token) like this:
<rule>
<pattern>
<marker>
<token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
</marker>
<token postag="N.[MC][SN0].*" postag_regexp="yes"/>
<token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"/>
</pattern>
<disambig action="filter" postag="D.*"/>
</rule>
Before using unification, you need to define features and their equivalences, as described in Using unification. In the disambiguator file, you add the same unification block as in the rules file (the syntax is the same). Then, in the rule, you can keep only the unified tokens, that is, tokens that share the same features. For example, take a simple agreement rule from the Polish disambiguator:
<rule name="unifikacja przymiotnika z rzeczownikiem" id="unify_adj_subst">
<pattern>
<marker>
<unify>
<feature id="number"><feature id="gender"><feature id="case">
<token postag="adj.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="adj.*"/></token>
<token postag="subst.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="subst.*"/></token>
</unify>
</marker>
</pattern>
<disambig action="unify"/>
</rule>
It uses unification on three features (defined earlier in the file): number, gender, and case. Note that I am using a trick (see Tips and tricks) to make sure that only words marked exclusively as adjectives or substantives are unified (otherwise the rule is too greedy).
There are several important restrictions: you cannot use two unification blocks in the disambiguator file; only one unify sequence per pattern is allowed. Moreover, the length of the matched token sequence (selected with <marker>…</marker>) must match the length of the unified sequence. Of course, there may be more tokens in the rule, but they cannot be selected with <marker>…</marker> if the disambiguator is supposed to unify the sequence of tokens.
Sometimes, instead of filtering, you might want to remove only one interpretation from the token. You can do this in the following way:
<rule name="mają to nie maić" id="MAJA_MAIC">
<pattern>
<token>mają</token>
</pattern>
<disambig action="remove"><wd lemma="maić" pos="verb:fin:pl:ter:imperf">mają</wd></disambig>
</rule>
The above code removes one interpretation of the word “mają”: the one with the POS tag equal to “verb:fin…”, the token equal to “mają”, and the lemma “maić”. You can supply only some of the three parameters: all supplied parameters must match, so the fewer parameters you supply, the more interpretations are removed.
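For instance, a rule that specifies only the POS tag would remove every reading of “mają” carrying that tag, regardless of its lemma. A hypothetical variation on the rule above:

```xml
<rule id="MAJA_REMOVE_BY_POS" name="remove by POS only (hypothetical)">
  <pattern>
    <token>mają</token>
  </pattern>
  <!-- Only pos is supplied, so any reading with this POS tag is removed,
       whatever its lemma or token text -->
  <disambig action="remove"><wd pos="verb:fin:pl:ter:imperf"/></disambig>
</rule>
```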
Adding new readings can be useful to mark up groups, such as noun groups or multi-word expressions. You can add a single reading or many readings to the whole sequence (for example, a start mark, an “inside” mark, and an end mark).
For example:
<rule name="ciemku" id="ciemku">
<pattern>
<token>ciemku</token>
</pattern>
<disambig action="add"><wd lemma="po ciemku" pos="adjp">ciemku</wd></disambig>
</rule>
The number of wd elements must match the number of tokens selected with <marker>…</marker>.
You can also add just POS tags without having to specify the lemmas or tokens. This is especially useful if you are tagging tokens matched by regular expressions or POS tags, so you do not actually know which word you will find. You can add a POS tag by supplying the wd element without the lemma attribute or without textual content:
<rule name="uppercase tag" id="UPTAG">
<pattern case_sensitive="yes">
<token regexp="yes">\p{Lu}+</token>
</pattern>
<disambig action="add"><wd pos="UP"/></disambig>
</rule>
In the above example, only the UP tag is added to uppercase words; the lemma is assumed to be equal to the token content, and the content of the token is not changed. So if the word was “Smiths”, it would be tagged as “UP”, and the lemma would be “Smiths” (although in other readings it could be “Smith”).
If you omit only the token text, it will be set to the token matched by the current rule (rather than being empty).
Sometimes a string of tokens raises false alarms in many rules even when the words are correctly tagged, and adding an exception to all the rules might be overkill (for example, for an idiomatic phrase). You can then immunize the tokens by using the action immunize:
<pattern>
<token>dla</token>
<marker>
<token>Windows</token>
</marker>
</pattern>
<disambig action="immunize"/>
The above pattern will be immunized, but only for the word “Windows”. This way, no XML rule will match it. Java rules can ignore immunization; it is up to their authors to respect it.
One can also mark tokens as spelled correctly. For example, one can use a regular expression to make the spell checker accept all Roman numerals:
<pattern case_sensitive="yes">
<token regexp="yes">(?:M*(?:D?C{0,3}|C[DM])(?:L?X{0,3}|X[LC])(?:V?I{0,3}|I[VX]))</token>
</pattern>
<disambig action="ignore_spelling"/>
There are also very rare words that are not used anywhere but in specific contexts. We want the spell checker to complain about those in all but these special contexts. For example, in Polish, the word ząb cannot be used in its genitive form zębu unless it appears in a specific context where one speaks of a kind of corn. The rule looks like this:
<rule id='KONSKI_ZAB' name="końskiego zębu - dobra pisownia">
<pattern>
<token>końskiego</token>
<marker>
<token>zębu</token>
</marker>
</pattern>
<disambig action="ignore_spelling"/>
</rule>
The best way to test disambiguation rules is to run LanguageTool on a medium-sized corpus (comparable to the Brown corpus for English) and check whether the previous false alarms are now fixed and no new false alarms are created. Otherwise, it is very hard to predict the impact of disambiguation rules.
You can test the disambiguation rules in a fashion similar to the way grammar rules are tested. Let's look at an example.
<example type="ambiguous" inputform="What[what/WDT,what/WP,what/UH]" outputform="What[what/WDT]"><marker>What</marker> kind of bread is this?</example>
<example type="untouched">What are you doing?</example>
In the above snippet, we declare that the sentence “What are you doing?” should be left untouched, i.e., unchanged by the disambiguation rule, contrary to the ambiguous sentence, which will be processed. Using the marker element, we select the token that will be changed. The attribute inputform specifies the input forms of the token, in a word[lemma/POS] format. The outputform is, of course, what the disambiguation rule should produce.
Note also that in verbose mode (-v on the command-line interface), LT will display a log of all disambiguator actions for a given sentence.
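A typical invocation might look as follows (a sketch: the JAR file name and its location vary between releases, and sample.txt stands for any input file):

```shell
# -l selects the language, -v enables verbose output
# including the disambiguator log
java -jar languagetool-commandline.jar -v -l pl sample.txt
```

Inspecting this log while developing a rule is the quickest way to see which readings the disambiguator keeps, adds, or removes for each token.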