4 Problems of tokenization

In the previous section, we saw that a monolingual tokenizer must be able to merge rule databases, and that a multilingual tokenizer must be able to do so dynamically.

4.1 The Standard Approach

Tokenization covers both word tokenization and sentence tokenization. (GrefenTapa94) lists the problems arising in text tokenization (dates, references, numbers, acronyms, abbreviations, compound words, punctuation). They also present the standard approach to text tokenization, based on lexical analyser generators that perform pattern matching via regular expressions (e.g. lex and flex).

This solution is one of the most widely used and certainly one of the cleanest, but it has some limits. The rules are declarative but not interpreted: they have to be compiled to generate an automaton in a source file, and this source file must in turn be compiled and linked into the application. Moreover, the rules must be ordered, so inserting a new rule into a base is not straightforward.
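As an illustration, the ordered-rule scheme of a lexical analyser generator can be sketched in Python. The rule base below is a hypothetical toy (not lex output); what matters is that the rules form an ordered list tried first-to-first-match, so inserting a new rule means finding the right position in that list.

```python
import re

# Hypothetical ordered rule base, in the style of a lex specification:
# earlier rules take priority, so the list order is significant.
RULES = [
    ("DATE",   re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),
    ("NUMBER", re.compile(r"\d+(\.\d+)?")),
    ("ABBREV", re.compile(r"(?:Mr|Dr|etc)\.")),
    ("WORD",   re.compile(r"[A-Za-z]+")),
    ("PUNCT",  re.compile(r"[.,;:!?]")),
]

def tokenize(text):
    """Scan left to right; at each position try the rules in order."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():          # skip whitespace between tokens
            i += 1
            continue
        for name, rx in RULES:
            m = rx.match(text, i)
            if m:
                tokens.append((name, m.group()))
                i = m.end()
                break
        else:
            i += 1                     # no rule matched: skip the character
    return tokens
```

Note that swapping DATE below NUMBER would silently change the analysis of "12/05/1994", which is exactly why merging two such ordered bases is delicate.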

The state of the art therefore does not solve our problem. We cannot merge databases, since the rules are ordered; and even if we found a way to merge them, it could not be done dynamically, since compilation makes the result static. Moreover, this lack of flexibility prevents us from adding a new language without recompiling the whole application.

4.2 Improving the Standard Approach

To solve these problems, we consider the dual approach. We note that, in a text, it is often easier to locate tokens by their borders with their environment than by their constituents. So, instead of writing rules that define tokens explicitly, as in the standard approach, we write rules specifying borders: the tokens are implicitly defined as the regions between two cuts.

A rule is a single regular expression formed by three sub-expressions: an optional left-context, a border-window, and an optional right-context. The left-context and the right-context can be seen as look-behind and look-ahead windows respectively. The border between two tokens is defined between the border-window and the look-ahead window (i.e. after the last character matched by the border-window).
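Such a three-part rule maps directly onto the look-around assertions of a modern regex engine. The hypothetical rule below (not taken from the original rule base) places a cut at a sentence border: the left-context is a lowercase letter, the border-window is the full stop, and the right-context is whitespace followed by a capital letter.

```python
import re

# A hypothetical border rule in the three-part form described above:
#   left-context   (?<=[a-z])    -- look-behind window
#   border-window  \.            -- the cut falls after its last character
#   right-context  (?=\s+[A-Z])  -- look-ahead window
sentence_cut = re.compile(r"(?<=[a-z])\.(?=\s+[A-Z])")

text = "The cat sleeps. The dog barks."
m = sentence_cut.search(text)
cut = m.end()                     # the cut is placed just after the border-window
left, right = text[:cut], text[cut:]
```

Since the contexts are look-arounds, they constrain where the border-window may match without being consumed by the match themselves.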

The analysis is done in one pass from left to right. At the beginning of the process, we assume that a cut is located before the first character of the input.

A rule is applicable if at least one match of the whole regular expression can be found in the input and if the first character matched by its border-window is after the last defined cut.

Among the applicable rules, the triggered rule is the one whose border-window's first matched character is nearest after the last defined cut. A new cut is defined after the last character matched by its border-window.

No new cut can be set before the last defined cut: all the applicable rules whose matched border-window starts before the last defined cut have their matches recomputed, either until they become applicable again or until no further match of the whole regular expression can be found.

At the end of the process, since the match of a look-behind window can start before the last defined cut, every match of every regular expression has been found.
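The steps above (initial cut, applicability, triggering, recomputation) can be sketched as follows, under the simplifying assumption that every rule encodes its contexts as look-arounds, so that the whole match is exactly the border-window (here even zero-width). The toy rules and the `next_match` helper are our own illustration, not the original rule base.

```python
import re

# Unordered toy rule base: each whole match is the border-window.
RULES = [
    re.compile(r"(?<=\w)(?=\W)"),   # cut between a word character and a non-word one
    re.compile(r"(?<=\W)(?=\w)"),   # cut between a non-word character and a word one
]

def next_match(rule, text, last_cut):
    """Recompute the rule's match so its border-window ends after the last cut."""
    m = rule.search(text, last_cut)
    if m and m.end() <= last_cut:    # zero-width window sitting on the cut itself
        m = rule.search(text, last_cut + 1)
    return m

def tokenize(text, rules=RULES):
    cuts = [0]                       # a cut before the first character of the input
    while True:
        matches = [m for m in (next_match(r, text, cuts[-1]) for r in rules) if m]
        if not matches:
            break
        # Triggered rule: the border-window nearest after the last cut;
        # the new cut falls after the window's last character.
        cuts.append(min(matches, key=lambda m: m.start()).end())
    cuts.append(len(text))
    # Tokens are the non-blank regions between consecutive cuts.
    regions = (text[a:b].strip() for a, b in zip(cuts, cuts[1:]))
    return [t for t in regions if t]
```

Each iteration moves the last cut strictly to the right, so the loop makes the single left-to-right pass described above, and nothing in it depends on the order of the rules in `RULES`.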

Thus the rules are independent and the rule databases are unordered. This is possible thanks to both the dual approach and the rule triggering. The rules are fully declarative and interpreted; no order is required in the rule base.

4.3 Some remarks about the method

It is interesting to note that specifying tokens via their border is not really new.

The same approach is also used for the Penn Treebank tokenization with sed scripts. The standard Unix command sed (stream editor) applies, in sequence, the editing commands defined in a script to the input stream; the commands modify the input. Tokenization is done by inserting a special character (a border) into the stream.
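The sed-style border insertion can be mimicked with a substitution, sketched here with Python's `re.sub` rather than the actual Treebank sed commands (a toy illustration, not the original scripts): a border character is inserted before each punctuation mark, then the stream is split on borders.

```python
import re

BORDER = " "   # the special character inserted at each border

def insert_borders(text):
    # Detach punctuation from the preceding word by inserting a border,
    # in the spirit of a sed 's/.../.../' editing command.
    return re.sub(r"([.,;:!?])", BORDER + r"\1", text)

tokens = insert_borders("Hello, world!").split()
```

As with the sed scripts, the substitutions are applied in sequence, so this scheme keeps the ordering constraint that our rule triggering removes.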

The improvement of our method is the rule triggering combined with the semantics of the rules, which allows tokenization in one pass with no ordering among the rules.
