dev.languagetool.org

Customizing sentence segmentation in SRX rules

LanguageTool supports specifying sentence segmentation rules in SRX format, thanks to the segment Java library. The rules for all languages are contained in the /resource/segment.srx file, which can also be downloaded directly from git here. The rules are cascading: there are a few universal rules, alternating rules for paragraph breaking (you don’t need to edit them), and rules for specific languages.

The file can be edited using one of the available SRX editors: Ratel (open source, actively developed and used in the project), and SRXEditor (proprietary but free; contains a very helpful example file, which is copyrighted so it can only be consulted for inspiration). You can also use Pangolin, which is a web-based editor using the same code as Ratel.

Please use Ratel to maintain the same file formatting for easy version control. Basically, there are two kinds of rules: break rules and no-break rules.

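No-break rules should precede the break rules in the file. All rules have two parts: beforebreak, which matches the text before the candidate break position, and afterbreak, which matches the text after it.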

Both parts have to be specified using regular expressions. The library we’re using, segment, uses standard Java regular expressions, which are slightly more expressive than what is described in the SRX specification. To maintain portability, avoid very advanced features such as lookahead, which are missing from the spec.
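Why rule order matters can be sketched with plain java.util.regex. The following is a simplified model and not the API of the segment library: the class SrxOrderSketch, its Rule holder, and the two sample patterns are hypothetical names and values chosen for illustration. It assumes that at each position the first rule whose two expressions (beforebreak and afterbreak) both match decides whether the text is split there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SrxOrderSketch {

    /** Hypothetical stand-in for one SRX rule; not a class from the segment library. */
    static final class Rule {
        final boolean isBreak;
        final Pattern before; // must match text ending exactly at the candidate position
        final Pattern after;  // must match text starting exactly at the candidate position

        Rule(boolean isBreak, String beforeBreak, String afterBreak) {
            this.isBreak = isBreak;
            // \z pins the beforebreak match to the end of the text seen so far.
            this.before = Pattern.compile("(?:" + beforeBreak + ")\\z");
            this.after = Pattern.compile(afterBreak);
        }

        boolean matchesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).find()
                    && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    /** Assumption: the first matching rule wins at each position. */
    static List<String> split(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.matchesAt(text, pos)) {
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos));
                        start = pos;
                    }
                    break; // later rules are not consulted at this position
                }
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        Rule noBreak = new Rule(false, "\\bDr\\.\\s", ""); // exception for an abbreviation
        Rule brk = new Rule(true, "[.?!]\\s", "");         // generic sentence end
        String text = "Ask Dr. Smith. He knows.";
        // No-break rule first: two segments, "Dr." does not end a sentence.
        System.out.println(split(text, Arrays.asList(noBreak, brk)));
        // Break rule first: three segments, the exception is never reached.
        System.out.println(split(text, Arrays.asList(brk, noBreak)));
    }
}
```

In this model, listing the break rule first makes the abbreviation exception unreachable, which is the reason for keeping no-break rules ahead of break rules.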

Note: the SRX specification and most available rules share a mistake: sentences that are not the first in a paragraph start with leading whitespace, which is unacceptable for our project. Take the time to check that the afterbreak part of the rule doesn’t contain any whitespace (\s). In most cases, afterbreak should stay empty.
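The effect of whitespace in afterbreak can be illustrated with a small sketch. It is a simplified model of a single break rule built on java.util.regex, not the segment library itself. The class name AfterBreakWhitespaceSketch, the sample patterns, and the sample text (with single spaces between sentences) are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AfterBreakWhitespaceSketch {

    /** Splits text with one break rule; a simplified model, not the segment library. */
    static List<String> split(String text, String beforeBreak, String afterBreak) {
        // \z pins the beforebreak match to the end of the text seen so far.
        Pattern before = Pattern.compile("(?:" + beforeBreak + ")\\z");
        Pattern after = Pattern.compile(afterBreak);
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            boolean breakHere = before.matcher(text.substring(0, pos)).find()
                    && after.matcher(text.substring(pos)).lookingAt();
            if (breakHere) {
                segments.add(text.substring(start, pos));
                start = pos;
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        String text = "It works. Really.";
        // \s in afterbreak: the break falls before the space,
        // so the second sentence starts with leading whitespace.
        System.out.println(split(text, "[.?!]", "\\s"));
        // \s in beforebreak, empty afterbreak: the break falls after the space,
        // so the space stays at the end of the first sentence.
        System.out.println(split(text, "[.?!]\\s", ""));
    }
}
```

Moving the whitespace into beforebreak and leaving afterbreak empty keeps every sentence after the first free of leading whitespace in this model.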

The rules are stored in a single file to make maintenance easier. Please note that the file covers many languages and that rules support inheritance. If you want to change the Java code to support another SRX file, you will need to use one and two rule sets to support proper paragraph breaking.