Abstract
This paper proposes a new method for automatically clustering languages. The basic idea is to build a probabilistic model of each language from the given linguistic data, and then to compute the distances between languages using a distance measure defined on these language models. Clustering is then performed on the basis of this distance measure. The paper realizes this idea for the case of N-gram language models. The effectiveness of the proposed method was confirmed by evaluation experiments on multilingual texts in nineteen different languages from the ECI Corpus (European Corpus Initiative Multilingual Corpus). The results were very encouraging: the resulting clusters were very close to the family tree of languages established in linguistics.
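
The three steps named in the abstract (one model per language, pairwise distances between models, clustering on those distances) can be illustrated with a minimal sketch. It is not the paper's implementation, and everything specific in it is an assumption made for illustration: character bigrams with add-one smoothing stand in for the N-gram model, a symmetrized excess cross-entropy stands in for the distance measure (the abstract does not define one), and average-linkage agglomerative clustering stands in for the clustering step. The function names `bigram_model`, `cross_entropy`, `distance` and `cluster` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bigram_model(text, alphabet):
    """Add-one-smoothed character bigram probabilities P(b | a).

    Assumption: character bigrams; the abstract only says "N-gram".
    """
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])  # how often each char starts a bigram
    v = len(alphabet)
    return {
        (a, b): (pair_counts[(a, b)] + 1) / (context_counts[a] + v)
        for a in alphabet
        for b in alphabet
    }


def cross_entropy(model, text):
    """Average negative log2-probability of the text's bigrams under model."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log2(model[p]) for p in pairs) / len(pairs)


def distance(text_a, model_a, text_b, model_b):
    """Symmetrized excess cross-entropy (an assumed, illustrative measure)."""
    d_ab = cross_entropy(model_b, text_a) - cross_entropy(model_a, text_a)
    d_ba = cross_entropy(model_a, text_b) - cross_entropy(model_b, text_b)
    return (d_ab + d_ba) / 2


def cluster(texts):
    """Average-linkage agglomerative clustering; returns the merge order."""
    # Shared alphabet so every model assigns a probability to every bigram.
    alphabet = sorted(set("".join(texts.values())))
    models = {name: bigram_model(t, alphabet) for name, t in texts.items()}
    dist = {
        frozenset((a, b)): distance(texts[a], models[a], texts[b], models[b])
        for a, b in combinations(texts, 2)
    }

    def linkage(pair):
        # Mean pairwise distance between the members of two clusters.
        x, y = pair
        total = sum(dist[frozenset((i, j))] for i in x for j in y)
        return total / (len(x) * len(y))

    clusters = [(name,) for name in texts]
    merges = []
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=linkage)
        clusters = [c for c in clusters if c not in (x, y)] + [x + y]
        merges.append((x, y))
    return merges
```

Calling `cluster` on a dictionary that maps language names to text samples (each at least two characters long) returns the sequence of merges, which can be read as a binary tree over the languages. Whether such a tree resembles the linguistic family tree depends on the model order, the smoothing, and the distance measure, none of which this sketch takes from the paper.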