Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Reconstructing the Language Family Tree from Multilingual Corpus Based on Probabilistic Language Modeling
KENJI KITA
JOURNAL FREE ACCESS

1997 Volume 4 Issue 3 Pages 71-82

Abstract
This paper proposes a new method for automatically clustering languages. The basic idea of this method is to develop a probabilistic model for each language from the given linguistic data, and then to compute the distances between languages according to a distance measure defined on the language models. Clustering is performed based on this distance measure. The paper realizes this idea using N-gram language models. The effectiveness of the proposed method was confirmed by evaluation experiments using multilingual texts in nineteen different languages from the ECI Corpus (European Corpus Initiative Multilingual Corpus). The results were very encouraging: the clusters obtained were very close to the family tree of languages established in linguistics.
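The pipeline outlined in the abstract (one language model per language, a distance defined on the models, then clustering) can be illustrated with a minimal sketch. The abstract does not specify the N-gram unit, the smoothing, the distance measure, or the clustering algorithm, so everything below is an assumption made for illustration: character bigrams, add-one smoothing, a symmetric cross-entropy difference as the distance, and average-linkage agglomerative clustering. All function names are hypothetical and the toy strings stand in for real corpus text.

```python
import math
from collections import Counter
from itertools import combinations


def train_bigram(text, alphabet):
    """Return prob(a, b) = add-one-smoothed P(b | a) over `alphabet` (assumed smoothing)."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    vocab_size = len(alphabet)

    def prob(a, b):
        return (pair_counts[(a, b)] + 1) / (context_counts[a] + vocab_size)

    return prob


def cross_entropy(model, text):
    """Average negative log2-probability per character bigram of `text` under `model`."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log2(model(a, b)) for a, b in pairs) / len(pairs)


def distance_table(corpora):
    """Pairwise distances between languages (assumed measure, not the paper's).

    d(i, j) is the extra cost of scoring j's text with i's model rather than
    j's own model, plus the same quantity with the roles swapped.
    """
    alphabet = set("".join(corpora.values()))
    models = {name: train_bigram(text, alphabet) for name, text in corpora.items()}
    h = {(i, j): cross_entropy(models[i], corpora[j])
         for i in corpora for j in corpora}
    return {frozenset((i, j)): (h[(i, j)] - h[(j, j)]) + (h[(j, i)] - h[(i, i)])
            for i, j in combinations(corpora, 2)}


def cluster(names, dist):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf members)

    def linkage(a, b):
        total = sum(dist[frozenset((x, y))] for x in a[1] for y in b[1])
        return total / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Toy stand-ins for per-language text; real input would be corpus samples.
toy_corpora = {
    "A": "abababababab" * 5,
    "B": "abababababba" * 5,
    "C": "xyzxyzxyzxyz" * 5,
}
toy_dist = distance_table(toy_corpora)
toy_tree = cluster(list(toy_corpora), toy_dist)
print(toy_tree)
```

The nested tuple returned by `cluster` plays the role of the reconstructed tree: languages whose models predict each other's text well are merged first. Swapping in a different distance or linkage only requires changing `distance_table` or `linkage`.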
© The Association for Natural Language Processing