Automatic Identification of Major Text Language

Text-based language identification is the task of automatically recognizing the language of a given text or document. It is an important research area because large quantities of text are processed automatically for tasks such as spelling and grammar checking, information retrieval, search engines, language translation, and text mining. This research presents a mechanism for efficient text-based language identification, with an emphasis on seven major languages used in Ethiopia: Afar, Amharic, Nuer, Oromo, Sidamo, Somali and Tigrigna. These languages were chosen because they are spoken by more than 79.3% of the total population of Ethiopia. Factors affecting accuracy, such as the size and variety of the training data and the length of the string to be identified, are investigated. Three methods are used: a Naïve Bayes classifier, an SVM classifier and a dictionary method. The Naïve Bayes and SVM classifiers are trained using character n-grams of size 3 as the feature set, while the dictionary method uses stopwords. The experiments are conducted on three character-window sizes that represent short, medium and long documents. Overall, the 3-gram Naïve Bayes classifier, the 3-gram SVM classifier and the dictionary method showed average classification accuracies of 98.37%, 99.53% and 90.53% respectively. When trained with homogeneously distributed training data per language, the 3-gram Naïve Bayes and SVM classifiers showed average classification accuracies of 95.16% and 96.2% respectively. To evaluate multilingual identification, an artificial corpus of 1050 documents was constructed. Of these, 45 documents were wrongly classified, which corresponds to 95.71% accuracy. The challenging tasks in the study are identifying closely related languages that share similar character sequences, identifying the language of short excerpts of text, and the unavailability of a standard corpus.
The use of a classification approach combined with linguistically motivated features, such as POS tags and morphological information, is recommended as a way forward for providing empirical evidence on the convergences and divergences of language varieties in terms of lexicon, orthography, morphology and syntax.
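
To illustrate the character 3-gram Naïve Bayes approach described above, the following is a minimal sketch, not the implementation used in the study. It assumes a multinomial model with add-one smoothing, which the abstract does not specify. The names `char_ngrams` and `TrigramNaiveBayes`, the labels `lang_a` and `lang_b`, and the toy training strings are all hypothetical placeholders and are not taken from the study's corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class TrigramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (assumed: add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def train(self, samples):
        """samples is an iterable of (language, text) pairs."""
        for language, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts[language].update(grams)
            self.docs[language] += 1
            self.vocab.update(grams)

    def predict(self, text):
        """Return the language with the highest log posterior for the text."""
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_language, best_score = None, -math.inf
        for language, gram_counts in self.counts.items():
            total = sum(gram_counts.values())
            # Log prior from the share of training texts per language.
            score = math.log(self.docs[language] / total_docs)
            for gram in grams:
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((gram_counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy, synthetic training data (hypothetical, not real language samples).
model = TrigramNaiveBayes()
model.train([
    ("lang_a", "aba aba abab baba"),
    ("lang_b", "xyz zyx xyzz yzx"),
])
print(model.predict("abab"))
print(model.predict("xyzz"))
```

Because the score is a sum over the n-grams in the input, a short string contributes few terms, which is one way to see why short excerpts are harder to identify than long documents.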
