Graph unification has become the central processing mechanism of many natural language systems, owing to the popularity of unification-based theories in computational linguistics. Despite this popularity, graph unification remains the most expensive part of unification-based natural language processing: it alone often takes over 90% of total parsing time. We focus on two criteria in the design of an efficient unification algorithm: 1) elimination of excessive copying, and 2) quick detection of unification failures. We propose a scheme, based on the notion of quasi-destructive unification, that attains both criteria without expensive overhead for reversing the changes made to graph node structures. Experiments using an actual large-scale grammar, as well as a simulated grammar producing different unification success rates, show that the quasi-destructive graph unification algorithm runs roughly twice as fast as Wroblewski's non-destructive unification algorithm.
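The core idea can be sketched as follows: nodes carry scratch fields tagged with a global generation counter, destructive changes are confined to those fields, and bumping the counter after each unification invalidates them all at once, so failed unifications need no explicit undo and a copy is made only on success. This is a minimal illustrative sketch, not the paper's implementation; the class and field names are assumptions, and for brevity shared feature arcs are merged directly rather than through scratch arc lists as a full quasi-destructive algorithm would require.

```python
# Sketch of quasi-destructive unification over feature-structure DAGs.
# Scratch fields (forward/copy) are valid only while their generation tag
# matches the global counter; incrementing the counter "undoes" all of
# them in O(1). Names and structure are illustrative assumptions.

GENERATION = 0

class Node:
    def __init__(self, type_, arcs=None):
        self.type = type_              # atomic type, or 'complex'
        self.arcs = dict(arcs or {})   # feature name -> Node
        self.forward = None            # scratch: temporary forward pointer
        self.forward_gen = -1          # generation when forward was set
        self.copy = None               # scratch: copy made after success
        self.copy_gen = -1

def dereference(node):
    """Follow forward pointers that belong to the current generation."""
    while node.forward is not None and node.forward_gen == GENERATION:
        node = node.forward
    return node

def unify1(a, b):
    """Destructively unify a into b using scratch fields only."""
    a, b = dereference(a), dereference(b)
    if a is b:
        return True
    if a.type != 'complex' and b.type != 'complex':
        if a.type != b.type:
            return False               # atomic clash: quick failure
        a.forward, a.forward_gen = b, GENERATION
        return True
    a.forward, a.forward_gen = b, GENERATION
    for feat, sub in a.arcs.items():
        if feat in b.arcs:
            if not unify1(sub, b.arcs[feat]):
                return False
        else:
            b.arcs[feat] = sub         # simplification: real algorithm uses scratch arcs
    return True

def copy_result(node):
    """Copy the unified result, preserving structure sharing."""
    node = dereference(node)
    if node.copy is not None and node.copy_gen == GENERATION:
        return node.copy
    new = Node(node.type)
    node.copy, node.copy_gen = new, GENERATION
    for feat, sub in node.arcs.items():
        new.arcs[feat] = copy_result(sub)
    return new

def unify(a, b):
    """Unify a and b; return a fresh result graph, or None on failure."""
    global GENERATION
    result = copy_result(a) if unify1(a, b) else None
    GENERATION += 1                    # invalidate all scratch fields at once
    return result
```

On failure nothing is copied and nothing needs to be reversed, which is precisely how the scheme avoids both excessive copying and the overhead of undoing changes.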
Morphological analysis of Japanese differs greatly from that of English, because no spaces are placed between words. The same is true of many other Asian languages, such as Korean, Chinese, and Thai. In the Indo-European family, some languages such as German show a similar phenomenon in forming complex noun phrases. Processing such languages first requires identifying word boundaries, a process often called segmentation. Segmentation is critical, since wrong segmentation causes fatal errors in later stages such as syntactic, semantic, and contextual analysis. However, correct segmentation is not always possible with morphological information alone; syntactic, semantic, and contextual information are also necessary to resolve segmentation ambiguities. This paper proposes a method that integrates morphological and syntactic analysis based on the LR parsing algorithm. An LR table derived from grammar rules is modified on the basis of the connectability of adjacent words, so that the modified table reflects both morphological and syntactic constraints. Using this table with the generalized LR parsing algorithm, morphological and syntactic analysis can be performed efficiently.
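The morphological half of the problem can be illustrated in isolation: segmenting an unspaced string against a dictionary while pruning hypotheses with a connectability matrix between adjacent word categories. This is a hedged sketch under assumptions of my own (a toy lexicon, a BOS/EOS-bracketed connectability relation, and plain recursive enumeration); the paper instead compiles such connectability checks directly into the LR table and parses with the generalized LR algorithm.

```python
# Illustrative sketch: dictionary-driven segmentation of an unspaced
# string, pruned by a connectability relation over adjacent word
# categories. Lexicon and categories are toy assumptions.

DICT = {            # surface form -> category (toy lexicon)
    'a': 'noun', 'ab': 'noun', 'b': 'suffix', 'c': 'verb',
}
CONNECT = {         # (left category, right category) pairs allowed to be adjacent
    ('BOS', 'noun'), ('noun', 'suffix'), ('noun', 'verb'),
    ('suffix', 'verb'), ('noun', 'EOS'), ('verb', 'EOS'), ('suffix', 'EOS'),
}

def segmentations(text):
    """Enumerate all segmentations consistent with DICT and CONNECT."""
    results = []
    def extend(pos, prev_cat, words):
        if pos == len(text):
            if (prev_cat, 'EOS') in CONNECT:
                results.append(list(words))
            return
        for end in range(pos + 1, len(text) + 1):
            surface = text[pos:end]
            cat = DICT.get(surface)
            if cat and (prev_cat, cat) in CONNECT:
                words.append((surface, cat))
                extend(end, cat, words)   # extend this hypothesis
                words.pop()               # backtrack
    extend(0, 'BOS', [])
    return results
```

For the input 'abc' this yields two surviving hypotheses, a|b|c and ab|c, exactly the kind of residual ambiguity that the paper resolves by letting syntactic constraints from the grammar act simultaneously through the modified LR table.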
There have been a number of theoretical studies devoted to the notion of sublanguage, and some successful natural language processing systems have explicitly or implicitly utilized sublanguage restrictions. However, two major problems remain unsolved in exploiting the sublanguage notion: 1) automatic definition of sublanguages and dynamic identification of the sublanguage of a text, and 2) automatic acquisition of linguistic knowledge for a sublanguage. The appearance of large machine-readable corpora now offers new opportunities to address these problems. Although there have been several experiments aimed at the second problem, the first has received much less attention. In previous sublanguage NLP systems, the domain the system deals with was defined by a human. This is indeed one way to define the sublanguage of a text, and in a sense it works well; however, it is not always possible and is sometimes wrong. To maximize the benefit of the sublanguage notion, we need automatic definition and dynamic identification of sublanguages. We report preliminary experiments on sublanguage definition and identification based on lexical appearance. The results show that the proposed methods can be useful in processing a new text. In particular, the finding that the first two sentences can reliably identify a text's sublanguage encourages further investigation along this line of research. In conclusion, the inductive definition of sublanguages and sublanguage identification appear to be beneficial for natural language processing.
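Identification based on lexical appearance, restricted to a text's opening sentences, might be sketched as follows. Everything here is an assumption for illustration: the per-sublanguage word lists, the overlap score, and the sentence splitting are placeholders, not the paper's induced lexicons or its exact procedure.

```python
# Illustrative sketch: identify a text's sublanguage from the lexical
# items appearing in its first few sentences, by overlap with word
# lists assumed to have been induced per sublanguage from a corpus.

SUBLANGUAGE_LEXICON = {   # toy induced lexicons (assumptions)
    'weather': {'rain', 'cloudy', 'temperature', 'wind', 'forecast'},
    'finance': {'stock', 'market', 'shares', 'index', 'trading'},
}

def identify_sublanguage(text, n_sentences=2):
    """Score each sublanguage by lexical overlap with the opening sentences."""
    opening = '.'.join(text.split('.')[:n_sentences]).lower()
    words = set(opening.replace(',', ' ').split())
    scores = {name: len(words & lexicon)
              for name, lexicon in SUBLANGUAGE_LEXICON.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None   # None: no sublanguage evidence
```

Limiting the window to the first two sentences mirrors the reported finding that they already carry enough lexical evidence to identify the sublanguage reliably.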