1992 Volume 19 Issue 1 Pages 49-65
This paper introduces a computationally useful aspect of Mandarin revealed by statistical analysis of the 6321 most frequently used Chinese words of Suen (1986). The statistics extracted herein include: (1) the frequency distribution of consonants, vowels, phonemes, and tones; (2) word-length counts in syllables and phonemes; (3) entropies, and primary as well as secondary conditional entropies, of phonemes; (4) the frequency distribution of short-distance words based on consonants, vowels, and/or phonemes; and (5) substitution pairs of consonants, vowels, and phonemes. These statistical properties provide information of fundamental importance in computer processing of the Chinese language. For example, an error-correcting scheme for a single Chinese character, even of known tone and part of speech, is difficult because the number of Chinese characters at a Levenshtein distance of 1 averages 10.38 per word. If we consider two-character words instead, the average number of words at a Levenshtein distance of 1 falls to 3.19 per word, without taking tone and part of speech into account. This can be reduced drastically, to 0.26 per word, if such linguistic information is fully utilized. We believe that the statistical properties of the language thus extracted should be more fully explored in implementing an efficient error-correcting scheme for the machine processing of Mandarin.
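The per-word neighbor counts quoted above can be illustrated with a minimal sketch. It assumes each word is represented as a tuple of phoneme symbols and counts, for every word in a lexicon, how many other words lie at a Levenshtein distance of exactly 1. The names `levenshtein`, `mean_distance_one_neighbors`, and `TOY_LEXICON` are hypothetical, the five-entry lexicon is invented for illustration and is not drawn from Suen (1986), and the paper's actual phoneme segmentation and handling of tone may differ.

```python
from typing import Iterable, Sequence, Tuple

Word = Tuple[str, ...]  # assumption: a word is a tuple of phoneme symbols


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def mean_distance_one_neighbors(lexicon: Iterable[Word]) -> float:
    """Average number of other lexicon words at distance exactly 1."""
    words = list(dict.fromkeys(lexicon))  # drop duplicates, keep order
    if not words:
        return 0.0
    total = 0
    for w in words:
        total += sum(1 for v in words if v != w and levenshtein(w, v) == 1)
    return total / len(words)


# Hypothetical toy lexicon (pinyin-like phoneme tuples, tones ignored).
TOY_LEXICON = [
    ("m", "a"),
    ("b", "a"),
    ("m", "a", "n"),
    ("m", "i"),
    ("sh", "u"),
]

if __name__ == "__main__":
    print(mean_distance_one_neighbors(TOY_LEXICON))
```

A lower average means fewer confusable candidates per word, which is why restricting the comparison by word length, tone, or part of speech makes error correction more tractable. The all-pairs comparison is quadratic in lexicon size, which is acceptable for a few thousand words but would need indexing for larger vocabularies.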