2 The Framework: Natural Language Diagnosis

An answer to Natural Language Diagnosis, also called Categorization According to Language, has been given in (Giguet95a; Giguet95b). The aim of such a tool is to tag sentences with the name of their language.

This research is both parallel and complementary to the standardization efforts carried out for language engineering in large projects such as the Text Encoding Initiative, Eagles and Multext. It is parallel since it is very useful for an NLP system to know which language it is processing. It is complementary since large quantities of text are already available and will never be standardized. Furthermore, we do not believe that authors will, in the future, waste their time tagging their own documents by hand to help an NLP system understand what they write.

One problem we had to solve was the following: "To process an input, we have to tokenize it. But among the known languages, which tokenization rule database should be used, since the Natural Language Diagnosis has not been done yet?"

Before answering this question, we briefly describe the linguistic properties used by our Natural Language Diagnosis tool.

2.1 Linguistic Features used for Natural Language Diagnosis

The process is based on the study of the natural properties of languages. Its resolving power comes from the combination of several methods. For each of them, we try to find linguistic justifications. Efficiency is not the goal; it has to be the consequence of a good linguistic analysis.

The main method is the search for grammatical words. They are specific to each language and, taken as a whole, differ from one language to another. Moreover, they are short and few in number, so an exhaustive list can easily be built. They represent about 50% of the words of a sentence. With this method alone, tested on the discrimination of 4 Western European natural languages, the system achieves perfect categorization for sentences of more than 9 words.
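The grammatical-word method can be sketched as follows. The word lists below are tiny illustrative samples (the paper relies on exhaustive lists, which we do not reproduce here), and the scoring is a simple hit count per language, which is our assumption about how such a clue could be implemented:

```python
# Illustrative sketch of the grammatical-word clue.
# The word sets below are small samples, NOT the exhaustive lists of the paper.
GRAMMATICAL_WORDS = {
    "fr": {"le", "la", "les", "de", "et", "un", "une", "dans", "que", "est"},
    "en": {"the", "of", "and", "a", "in", "that", "is", "to", "it"},
    "de": {"der", "die", "das", "und", "ein", "eine", "zu", "mit", "ist"},
    "es": {"el", "los", "de", "y", "un", "una", "en", "que", "la"},
}

def diagnose(sentence):
    """Count grammatical-word hits per language and return the best match."""
    words = sentence.lower().split()
    scores = {lang: sum(w in vocab for w in words)
              for lang, vocab in GRAMMATICAL_WORDS.items()}
    return max(scores, key=scores.get)

print(diagnose("the cat sat on the mat and looked at it"))  # prints "en"
print(diagnose("le chat est dans la maison"))               # prints "fr"
```

With long sentences, grammatical words are frequent enough that one language accumulates a clearly dominant score, which is consistent with the perfect categorization reported above for sentences of more than 9 words.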

The goal is now to categorize short sentences. To do this, we have to characterize non-grammatical words, because short sentences do not contain enough grammatical words to allow total discrimination. The other methods used are the alphabet and word endings. The alphabet is useful thanks to characters with diacritics. Word endings are a compromise for the characterization of non-grammatical words: it is very difficult to obtain a representative corpus of a language, so relying on other information is quite risky. When these two methods are added, the system achieves perfect categorization for sentences of more than 7 words and a very high discrimination rate for shorter sentences.

Interferences can occur among the different languages, since a grammatical word or a character may exist in several languages (e.g. French on vs. English on), but it is always the convergence of several clues that allows the categorization.
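The convergence of clues can be illustrated with a minimal combination step. The per-method scores below are invented for the sake of the example (the grammatical-word clue alone cannot decide on the ambiguous word "on", but the other clues break the tie):

```python
# Assumed per-method scores for an ambiguous input containing "on":
# the grammatical-word clue is tied between French and English.
grammatical = {"fr": 1.0, "en": 1.0}
alphabet    = {"fr": 0.0, "en": 0.0}   # no diacritics in this input
endings     = {"fr": 0.6, "en": 0.2}   # assumed ending statistics

def combine(*clues):
    """Sum the evidence contributed by each method and pick the best language."""
    total = {}
    for clue in clues:
        for lang, score in clue.items():
            total[lang] = total.get(lang, 0.0) + score
    return max(total, key=total.get)

print(combine(grammatical, alphabet, endings))  # prints "fr"
```

No single clue is decisive on its own; it is the summed evidence across methods that resolves the interference, as the text above states.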

Since research on Natural Language Diagnosis has already been reported in (Giguet95a; Giguet95b), interested readers can find more details in these articles or by getting in touch with me.
