In this first experiment, the consequences of using a basic French tokenizer were severe. We had to modify the grammatical-word databases of the Natural Language Diagnosis to take the French tokenizer's errors into account. For instance, they're was split into they' + re and doesn't into doesn' + t, because in French this is the natural way of splitting an elision. We therefore had to treat they' and doesn' as English grammatical words, which was not acceptable from a linguistic point of view, even though the results improved.
Splitting tokens correctly is fundamental for our system. Indeed, we claimed in 2.1 that the more grammatical words appear in a sentence, the better the Natural Language Diagnosis is. It is therefore important for our system to split contractions and inverted pronouns, since this process often makes grammatical words appear.
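To make the difference concrete, the following sketch (our own code, with a hypothetical suffix list, not the system described here) splits English contractions so that the grammatical word survives as its own token, unlike the French-style cuts above:

```python
# Hypothetical sketch: split English contractions so that the
# grammatical word (they, does, ...) survives as its own token.
# The suffix list and function are our own illustration.
CONTRACTION_SUFFIXES = ["n't", "'re", "'ve", "'ll"]

def split_contraction(token):
    """Split a token on a known English contraction suffix."""
    for suffix in CONTRACTION_SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return [token[: -len(suffix)], suffix]
    return [token]

print(split_contraction("they're"))  # ['they', "'re"], not ["they'", 're']
print(split_contraction("doesn't"))  # ['does', "n't"], not ["doesn'", 't']
```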
To solve this problem, we first considered updating the French rules and adding rules to the French tokenizer in order to obtain a multilingual tokenizer, but this was not a good solution. Even though the French rules, which were too basic, did need updating, adding more rules would not only produce an unmaintainable set of multilingual rules; this set would also be redundant with any monolingual tokenization rule database (i.e. the English rules we would add to the multilingual tokenizer would duplicate the rules included in a traditional English tokenizer).
From this point, we could not continue our research without studying precisely what tokenization means in a multilingual framework.
Studying tokenization problems in five Western European natural languages (French, English, German, Spanish and Italian), we found that most of the rules are common to all of them. These rules process tokens bounded by explicit separators such as spaces and punctuation.
The language-specific rules split tokens where no explicit boundary can be located: for instance, verb contractions in English (that's, couldn't) and the Anglo-Saxon genitive of nouns; determiner, pronoun and conjunction contractions in French (l'envie, j'aime, qu'elle); inverted pronouns in French (donne-le, veux-tu); and determiner contractions in Italian (dell'arte).
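By way of illustration, here is a minimal sketch of one such language-specific rule; the regular expression is our own and covers only a few French elisions, not the actual rule database:

```python
import re

# Hypothetical sketch of one French-specific rule: split an elided
# determiner, pronoun or conjunction off the following word.
FR_ELISION = re.compile(r"^(l'|j'|qu'|d'|n'|s')(.+)$")

def split_french_elision(token):
    """Split a French elision where no explicit separator exists."""
    m = FR_ELISION.match(token)
    return [m.group(1), m.group(2)] if m else [token]

print(split_french_elision("l'envie"))   # ["l'", 'envie']
print(split_french_elision("qu'elle"))   # ["qu'", 'elle']
print(split_french_elision("maison"))    # ['maison']
```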
From a conceptual point of view, this multilingual analysis leads us to consider two kinds of tokenization rules: rules shared by all the languages and language-specific rules. From a linguistic point of view, this study clarifies the (monolingual) tokenization mechanism. From a computational point of view, instead of having one unstructured and rather unreadable rule database, the two kinds of rules are physically divided into two databases that are merged to tokenize one particular language. For instance, to tokenize English, the shared database and the English-specific rule database are merged.
To tokenize multilingual texts, we experiment with merging the monolingual tokenizers. Merging the tokenizers means combining their tokenization rules.
We start from the background reported in section 3.2, where we saw that many tokenization rules are common to all languages and that it is worthwhile to put them in a shared database. To tokenize a multilingual text, this set of rules only has to be executed once for all the targeted languages.
The problem of combining the language-specific rules remains. We chose to experiment with the following idea. To tokenize a monolingual text, we merge the shared database and the specific rule database corresponding to the language of the text. To tokenize a multilingual text, a solution can simply consist in merging the shared database with all the language-specific rule databases of the targeted languages, and then applying the rules to the input.
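Under the simplifying assumption that a rule database can be represented as a list of rules, the merge described above can be sketched as follows; the rule names are invented placeholders, not the actual rules:

```python
# Hypothetical sketch: rule databases as lists, merged per target language.
SHARED_RULES = ["split-on-space", "split-on-punctuation"]
SPECIFIC_RULES = {
    "en": ["split-contraction", "split-genitive"],
    "fr": ["split-elision", "split-inverted-pronoun"],
    "it": ["split-determiner-contraction"],
}

def build_tokenizer_rules(languages):
    """Merge the shared database with each target language's rules."""
    rules = list(SHARED_RULES)
    for lang in languages:
        rules.extend(SPECIFIC_RULES[lang])
    return rules

# Monolingual English: shared + English-specific rules.
print(build_tokenizer_rules(["en"]))
# Multilingual French/English text: shared + both specific databases.
print(build_tokenizer_rules(["fr", "en"]))
```

The shared rules appear only once whatever the number of target languages, which matches the remark above that they need to be executed only once.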
Merging the shared database and a language-specific rule database is straightforward, since the two sets of rules match different inputs: the first matches patterns with explicit separators, whereas the second matches patterns without explicit separators.
The risk one could fear lies in the language-specific rules: are there interferences between the specific rules of different languages? So far we have never observed such interferences, and we have some explanations for this.
The specific tokenization rules of a language process tokens with no explicit boundaries. These rules are often written to make grammatical words appear. As the linguistic features described in 2.1 show, grammatical words are not numerous, so only a small proportion of them can be involved in an agglutination via a dash or an elision. Moreover, from one language to another, the grammatical words are on the whole different. In our view, an interference could arise between two rules that each involve a grammatical word and an elision or a dash but do not define the cut at the same place. An example can be found within French itself, where the string rendez-vous can be either the noun rendez-vous or the conjugated verb rendre followed by the inverted pronoun vous.
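The rendez-vous ambiguity can be made concrete with two toy rules that propose different cuts for the same string; both rules below are our own hypothetical illustration, not rules from the system:

```python
# Two hypothetical French rules proposing different cuts for the
# same string: keep the lexicalized noun whole, or split the
# inverted pronoun off the verb form.
def noun_rule(token):
    # Lexicon rule: rendez-vous is a noun, keep it as one token.
    return [token] if token == "rendez-vous" else None

def inverted_pronoun_rule(token):
    # Morphological rule: split an inverted pronoun after the dash.
    if token.endswith("-vous"):
        return [token[: -len("-vous")], "vous"]
    return None

token = "rendez-vous"
# Both rules fire and disagree on where to cut: an interference.
print(noun_rule(token))              # ['rendez-vous']
print(inverted_pronoun_rule(token))  # ['rendez', 'vous']
```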
The problem is now to check whether the standard monolingual approach to tokenization suits these multilingual requirements and allows us to realize such a model. We will see that it does not, and we will show how to modify it.