Conclusion

In this article, we addressed the tokenization of multilingual text in the framework of Natural Language Diagnosis. Our earlier related work gave an answer to Natural Language Diagnosis, but the tokenization of multilingual text remained a problem, since it was done by an ad-hoc French tokenizer.

Our goal was to find an elegant way of tokenizing multilingual texts, so efficiency was not the first concern. Studying the problem in a multilingual framework clarified the monolingual tokenization process and led us to a new way of solving it. We propose to separate generic rules from language-specific rules. The consequences are clear: each database becomes smaller and, thus, more readable. Adding a new language is quite simple, since we just have to write the few specific rules and merge them, statically or dynamically, with the shared database. Updating a generic rule is also simple, since only one file has to be modified for all the languages.
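The separation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rule names and patterns are invented for the example, and only the merging scheme reflects the architecture discussed here.

```python
# Shared database of generic rules, kept in one place for all languages.
# Patterns are illustrative assumptions, not the paper's rule set.
GENERIC_RULES = {
    "whitespace": r"\s+",        # token border: runs of whitespace
    "punctuation": r"[.,;:!?]",  # token border: common punctuation
}

# Small language-specific databases: only the few rules each language adds.
FRENCH_RULES = {
    "apostrophe": r"[lLdD]'",    # French elision: l', d' split off
}
ENGLISH_RULES = {
    "contraction": r"n't|'s",    # English clitics kept as separate tokens
}

def merge_rules(generic, specific):
    """Merge the shared database with a language-specific one.

    Because no order is assumed among the rules, a plain dictionary
    merge suffices; this merge can happen statically (once, at build
    time) or dynamically (at load time).
    """
    merged = dict(generic)
    merged.update(specific)
    return merged

french_db = merge_rules(GENERIC_RULES, FRENCH_RULES)
english_db = merge_rules(GENERIC_RULES, ENGLISH_RULES)
```

With this layout, updating a generic rule means editing `GENERIC_RULES` once, and every per-language database picks up the change at the next merge.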

Looking for a solution to multilingual text tokenization, we found an elegant architecture for tokenizing monolingual text. Then, studying the multilingual aspects of the problem again, we found that it was possible to merge all the databases in order to tokenize multilingual texts. The condition was the absence of order among the rules in the databases.

Using the fact that it is often easier to define tokens implicitly via their borders rather than by their constituents, we proposed an algorithm that tokenizes an input stream in one pass.
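The idea of defining tokens by their borders can be sketched as follows. This is a hedged illustration, not the algorithm of the paper: the border patterns are assumptions chosen for the example, and only the principle (split at borders in a single left-to-right pass, never enumerate token constituents) is taken from the text.

```python
import re

# Borders are what separates tokens: whitespace and punctuation here.
# The token contents themselves are never described explicitly.
BORDERS = re.compile(r"(\s+|[.,;:!?])")

def tokenize(stream):
    """Single left-to-right pass over the input stream.

    Splitting at border patterns yields the tokens implicitly;
    punctuation borders are kept as tokens, whitespace is dropped.
    """
    tokens = []
    for piece in BORDERS.split(stream):
        if piece and not piece.isspace():
            tokens.append(piece)
    return tokens

# Example: tokenize("Hello, world!") -> ["Hello", ",", "world", "!"]
```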
