Abstract

In this article, we address a problem that arises in any multilingual text processing. The problem is interesting because it involves what we can call the stakes of multilinguality.

The major problem in multilingual text processing is that the language can change at any moment, and linguistic processing must handle this event to avoid failure. One way to avoid failure is to use a function called Natural Language Diagnosis, or Categorization According to Language. The idea is to tag each part of the input with the name of its language.

The problem we solve here is that the current language is never known in advance, since discovering it is precisely the goal of our tool. It is therefore impossible to choose the right monolingual tokenization rule database. So how can we perform a blind but efficient tokenization of a multilingual text?

First, we explain why multilinguality matters for our research. Then we illustrate our claims step by step by describing how we solve the problem.

After introducing the linguistic knowledge used for Natural Language Diagnosis, we present the advantages of studying tokenization in a multilingual framework and give an elegant answer to both Monolingual and Multilingual Text Tokenization.

Finally, we present a way of modifying the standard monolingual tokenization approach to take multilingual requirements into account.
