This paper discusses a segmentation approach for Mongolian Cyrillic text for use in machine translation. With this method, one-to-one word permutation between the varieties of Mongolian and other languages, especially Altaic family languages such as Japanese, becomes easier to process. Furthermore, it can be used for two-way conversion between Mongolian texts used in different regions and countries, such as Mongolia and China. Our system is implemented with DP (dynamic programming) matching supported by knowledge-based sequence matching, which refers to a multilingual dictionary and a linguistic rule bank (LRB), and by a data-driven approach based on a target language corpus (TLC). For convenience, NM (New Mongolian) is treated as the source language, and TM (Traditional Mongolian) and Todo as the target languages in this test. Our application was tested on 5,000 manually transcribed sentences aligned in parallel from NM to TM and Todo. We found that our method achieves a transformation accuracy of 91.9% from NM to TM and 94.3% from NM to Todo.
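As a rough illustration of dictionary-backed dynamic programming for segmentation, the following minimal sketch splits a string into lexicon entries with the lowest total cost. It is not the system described above: the function name `segment`, the toy Latin-letter lexicon, and the per-entry costs are all hypothetical, and the linguistic rule bank and target language corpus are not modeled.

```python
from typing import Dict, List, Optional


def segment(word: str, lexicon: Dict[str, float]) -> Optional[List[str]]:
    """Split `word` into lexicon entries with minimal total cost.

    `lexicon` maps each known unit to a cost (hypothetical values).
    Returns None when no full segmentation exists.
    """
    n = len(word)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost of segmenting word[:i]
    back = [-1] * (n + 1)   # back[i]: start index of the last unit in word[:i]
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in lexicon and best[start] + lexicon[piece] < best[end]:
                best[end] = best[start] + lexicon[piece]
                back[end] = start

    if best[n] == inf:
        return None

    # Follow the back-pointers from the end of the word to recover the units.
    pieces: List[str] = []
    i = n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1]


# Toy lexicon (an assumption for illustration only).
toy_lexicon = {"ab": 1.0, "c": 1.0, "abc": 1.5, "d": 1.0}
print(segment("abcd", toy_lexicon))
print(segment("xyz", toy_lexicon))
```

In this toy setting the longer unit `abc` is preferred because its cost is lower than the combined cost of `ab` and `c`; a real system would derive such preferences from its dictionary, rules, and corpus statistics.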
Distributional similarity is a widely adopted concept for capturing the semantic relatedness of words based on their contexts in various NLP tasks. While accurate similarity calculation requires a huge number of context types and co-occurrences, the contribution to the similarity calculation varies across individual context types, and some of them even act as noise. To select well-performing contexts and alleviate the high computational cost, we propose three context selection schemes and investigate their effectiveness: category-based, type-based, and co-occurrence-based selection. Category-based selection is the conventional and simplest method, which limits the context types based on their syntactic category. Finer-grained, type-based selection assigns an importance score to each context type, which we make possible by proposing a novel formalization of distributional similarity as a classification problem and applying feature selection techniques. The finest-grained, co-occurrence-based selection assigns an importance score to each co-occurrence of a word and a context type. We evaluate the effectiveness of each scheme and the trade-off between co-occurrence data size and synonym acquisition performance. Our experiments show that, on the whole, the finest-grained, co-occurrence-based selection achieves better performance, although some of the simple category-based selections show a comparable performance/cost trade-off.
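To make category-based selection concrete, the minimal sketch below builds context vectors from co-occurrence counts and compares two words by cosine similarity, with and without restricting contexts to one syntactic category. The data, the category labels `dobj` and `mod`, the function names, and the use of raw counts with cosine similarity are assumptions for illustration; the abstract does not specify the weighting or similarity measure actually used.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# A context type is (syntactic category, context word); labels are hypothetical.
Context = Tuple[str, str]


def build_vectors(
    cooccurrences: Iterable[Tuple[str, Context, float]],
    allowed: Optional[Set[str]] = None,
) -> Dict[str, Dict[Context, float]]:
    """Accumulate raw counts per word, keeping only contexts whose
    syntactic category is in `allowed` (all contexts when None)."""
    vectors: Dict[str, Dict[Context, float]] = defaultdict(dict)
    for word, context, count in cooccurrences:
        if allowed is None or context[0] in allowed:
            vectors[word][context] = vectors[word].get(context, 0.0) + count
    return vectors


def cosine(u: Dict[Context, float], v: Dict[Context, float]) -> float:
    """Cosine similarity between two sparse context vectors."""
    dot = sum(weight * v.get(ctx, 0.0) for ctx, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy co-occurrence data (invented for illustration).
data = [
    ("tea", ("dobj", "drink"), 4.0),
    ("coffee", ("dobj", "drink"), 3.0),
    ("tea", ("mod", "green"), 2.0),
    ("coffee", ("mod", "black"), 2.0),
]

all_vectors = build_vectors(data)
dobj_vectors = build_vectors(data, allowed={"dobj"})

sim_all = cosine(all_vectors["tea"], all_vectors["coffee"])
sim_dobj = cosine(dobj_vectors["tea"], dobj_vectors["coffee"])
print(f"all contexts: {sim_all:.3f}, dobj only: {sim_dobj:.3f}")
```

Restricting to one category shrinks the co-occurrence data that must be stored and compared, which is the cost side of the trade-off; whether similarity quality improves or degrades depends on the categories kept.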