In this paper, we describe a rule-based mechanism that detects Japanese term variations from textual corpora. The system operates on the basis of meta-rules that map syntactic and morpho-syntactic variants of terms to the original forms of terms. The framework used here has been successfully applied to such languages as English and French, and we show here that it also works well in detecting Japanese term variants, once we properly take into account specific characteristics of the Japanese language. We also discuss the potential of this work for IR-related applications.
We address the problem of automatically transcribing Japanese orthographic words into symbols representing their pronunciations. Such a function is necessary for commercial continuous speech recognition systems, since there is a constant need to create new recognition lexica for new applications or purposes. Simple look-up schemes are not adequate for Japanese, while methods based on morphological analysis require in-depth linguistic knowledge and considerable development effort. In this paper, we propose a statistical approach based on an N-gram language model. The pronunciation of a character is assumed to depend only on the preceding one or two characters and their pronunciations. Given an orthographic word, our method outputs the most likely phonetic transcription. Our approach outperforms the public-domain conversion tool KAKASI on ten out of twelve test sets.
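The N-gram assumption above can be illustrated with a minimal sketch. It is not the system evaluated in the paper: it uses only a bigram over joint (character, pronunciation) units, that is, one preceding character rather than two, with add-alpha smoothing and a Viterbi search. The function names `train`, `log_prob` and `transcribe`, the smoothing constants, and the four-word romanised training set are all hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

BOS = ("<s>", "<s>")  # sentinel unit marking the start of a word

def train(aligned_words):
    """Count bigrams over joint (character, pronunciation) units."""
    bigram = defaultdict(lambda: defaultdict(int))
    readings = defaultdict(set)  # pronunciations observed for each character
    for word in aligned_words:
        prev = BOS
        for unit in word:
            bigram[prev][unit] += 1
            readings[unit[0]].add(unit[1])
            prev = unit
    return bigram, readings

def log_prob(bigram, prev, unit, alpha=0.1, vocab_size=100):
    """Add-alpha smoothed log P(unit | prev); alpha and vocab_size are assumed values."""
    following = bigram.get(prev, {})
    total = sum(following.values())
    return math.log((following.get(unit, 0) + alpha) / (total + alpha * vocab_size))

def transcribe(word, bigram, readings):
    """Viterbi search for the most likely pronunciation sequence of `word`."""
    states = {BOS: (0.0, [])}  # last unit -> (best log score, best path so far)
    for ch in word:
        if ch not in readings:
            raise KeyError(f"no pronunciation observed for {ch!r}")
        new_states = {}
        for pron in readings[ch]:
            unit = (ch, pron)
            score, path = max(
                ((s + log_prob(bigram, prev, unit), p) for prev, (s, p) in states.items()),
                key=lambda item: item[0],
            )
            new_states[unit] = (score, path + [pron])
        states = new_states
    return max(states.values(), key=lambda item: item[0])[1]

# Toy character-aligned training data (illustrative only, not from the paper).
TOY_DATA = [
    [("日", "ni"), ("本", "hon")],
    [("本", "hon"), ("日", "jitsu")],
    [("毎", "mai"), ("日", "nichi")],
    [("日", "nichi"), ("曜", "you")],
]
bigram, readings = train(TOY_DATA)
print(transcribe("日本", bigram, readings))
print(transcribe("本日", bigram, readings))
```

The sketch presupposes training data in which each character is already aligned with its pronunciation; how such alignments are obtained is outside its scope. Extending the context to two preceding units would turn the bigram into a trigram at the cost of a larger state space in the search.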
This paper quantitatively analyses the role of morphemes with respect to their types of origin. Static quantitative analysis of a given data set is not sufficient for this aim, as language data in general and terminological data in particular have the specific characteristic of being “incomplete”, in the sense that many unseen elements are expected in the theoretical population. Thus, the quantitative structure of morphemes in terminology should be analysed dynamically, by observing the growth pattern of morphemes. To this end, we use binomial interpolation and extrapolation. We then present the results of analysing the terminologies of six different domains, which reveal interesting characteristics of the role of morphemes of different types of origin that do not manifest themselves through static quantitative analysis.
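The abstract does not spell out the estimator, so the following is a sketch under an assumption: that binomial interpolation and extrapolation take their standard form over the frequency spectrum, E[V(m)] = V(N) - Σ_f V(f, N) (1 - m/N)^f, where V(f, N) is the number of morpheme types occurring f times in the observed sample of N tokens. For m > N the same expression becomes the alternating-sign extrapolation series. The function names and the toy token list are hypothetical.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency f to V(f, N), the number of types occurring f times."""
    return Counter(Counter(tokens).values())

def expected_types(spectrum, m):
    """Expected number of distinct types in a sample of m tokens.

    m < N interpolates inside the observed sample; m > N extrapolates beyond it.
    """
    n = sum(f * v for f, v in spectrum.items())  # observed sample size N
    v_n = sum(spectrum.values())                 # observed number of types V(N)
    # (1 - m/n) is negative when m > n, which yields the alternating series.
    return v_n - sum(v * (1 - m / n) ** f for f, v in spectrum.items())

# Toy morpheme tokens (illustrative only): "a" x3, "b" x2, "c" x1, so N = 6, V(N) = 3.
spectrum = frequency_spectrum(list("aaabbc"))
for m in (3, 6, 12):
    print(m, expected_types(spectrum, m))
```

Evaluating the function over a range of m traces the growth curve of morpheme types. The extrapolated part should be read with caution: the series is generally considered reliable only up to roughly twice the observed sample size and can diverge beyond that.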