Answer validation is a component of a question answering system that selects the most reliable answer from the answer candidates extracted by other methods. In this paper, we propose a new approach to answer validation based on the strength of lexical association between keywords in a question sentence and each of the answer candidates, and examine the effectiveness of the approach; we use search engine hit counts to calculate the association score. We propose two answer selection methods based on the association score. The first method extracts appropriate keywords from a given question sentence using word weights and then selects the final answer using the association score. The second method selects the appropriate keywords and the answer candidate simultaneously using the association score. Experimental evaluation shows that a good proportion (79%) of the questions of the multiple-choice quiz "Who Wants to Be a Millionaire" can be solved by combining these two methods. The proposed method significantly outperforms a previous answer selection method based on search engine hits.
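A minimal sketch of a hit-count-based association score, assuming a pointwise-mutual-information-style measure; the abstract does not give the exact formula, and the hit counts here are hypothetical inputs rather than live search engine calls.

```python
import math

def association_score(hits_q, hits_a, hits_qa, total_pages):
    """PMI-style association between a question keyword set and an
    answer candidate, computed from search engine hit counts:
    log P(q, a) / (P(q) P(a)). One common hit-based measure; the
    paper's exact score may differ."""
    if hits_q == 0 or hits_a == 0 or hits_qa == 0:
        return float("-inf")
    p_q = hits_q / total_pages
    p_a = hits_a / total_pages
    p_qa = hits_qa / total_pages
    return math.log(p_qa / (p_q * p_a))

def select_answer(candidates, hit_counts, hits_q, total_pages):
    """Pick the candidate with the strongest association to the
    question keywords. hit_counts maps a candidate to
    (hits for the candidate alone, hits for candidate AND keywords)."""
    return max(candidates,
               key=lambda a: association_score(
                   hits_q, hit_counts[a][0], hit_counts[a][1], total_pages))
```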
This paper focuses on bilingual news articles on WWW news sites as a source for translation knowledge acquisition. We take the approach of acquiring translation knowledge of domain-specific named entities, event expressions, and collocational expressions from the collection of bilingual news articles on WWW news sites. In this framework, pairs of Japanese and English news articles that report identical, or at least closely related, contents are retrieved. Then, a statistical measure is employed to estimate bilingual term correspondences based on the co-occurrence of Japanese and English terms across relevant Japanese and English news articles. We experimentally show that the proposed method is effective in estimating bilingual term correspondences from cross-lingually relevant news articles.
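The abstract does not name the statistical measure; as one common choice for scoring co-occurrence across aligned article pairs, a sketch using the phi coefficient over a 2x2 contingency table of article-pair counts:

```python
import math

def phi_coefficient(n_both, n_ja, n_en, n_total):
    """Phi (mean-square contingency) correlation between a Japanese
    term and an English term over aligned article pairs. Illustrative;
    the paper's actual statistic may differ.

    n_both  : article pairs where both terms appear
    n_ja    : pairs where the Japanese term appears
    n_en    : pairs where the English term appears
    n_total : total number of relevant article pairs
    """
    a = n_both
    b = n_ja - n_both                     # Japanese term only
    c = n_en - n_both                     # English term only
    d = n_total - n_ja - n_en + n_both    # neither term
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0
```

Term pairs with a high score across many cross-lingually relevant article pairs are taken as likely translation correspondences.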
Expressions of the form "prefix O + main verb + auxiliary verb" and "prefix GO + main verb + auxiliary verb" are important verbal honorific expressions in the Japanese language. Past linguistic research has pointed out that the difference between the two types of expressions is that the main verb after "O" is a native Japanese word while the one after "GO" is a Sino-Japanese word. However, there has hardly been any quantitative research on the differences between the two expressions so far. In this study, quantitative analyses were performed to reveal differences in the impressions of politeness between these two types of expressions, using Scheffe's paired comparison method and statistical tests. The results suggest that, with regard to the difference in politeness from a plain form, "prefix GO + Sino-Japanese verb + auxiliary verb" is smaller than "prefix O + native Japanese verb + auxiliary verb." It is suggested that these results are due to the difference between these expressions in how they are recognized as honorific expressions.
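A simplified sketch of the point-estimation step of Scheffe's paired comparison method, which the study uses to place expressions on a politeness scale; the full method also involves an analysis of variance and significance tests, which are omitted here.

```python
def scheffe_scale_values(pref):
    """Estimate psychological scale values from a paired-comparison
    rating matrix (Scheffe's method, point estimates only).
    pref[i][j] is the mean rating when stimulus i is compared with
    stimulus j (positive means i was judged more polite). The scale
    value of i averages its ratings against all other stimuli."""
    n = len(pref)
    return [sum(pref[i][j] - pref[j][i] for j in range(n)) / (2 * n)
            for i in range(n)]
```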
This paper proposes a method of extracting a bilingual pair consisting of a syntactically ambiguous named entity and its counterpart from a sentence-aligned English-Japanese parallel corpus. The method computes the degrees of semantic and phonetic similarity between an English named entity and its translation candidate, and calculates the overall score of the pair as the weighted sum of the two kinds of scores. It avoids extracting English named entities with wrong prepositional phrase attachment and/or a wrong scope of coordination. In an experiment using a parallel corpus of the Yomiuri Shimbun and The Daily Yomiuri, the proposed method achieved an F-value of 0.678, surpassing the 0.583 marked by a baseline method.
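The weighted-sum scoring can be sketched as follows; the weight value and the candidate tuples are hypothetical, and the underlying semantic and phonetic similarity functions are assumed to be given.

```python
def pair_score(semantic_sim, phonetic_sim, w=0.5):
    """Overall score of a bilingual named-entity pair as the weighted
    sum of semantic and phonetic similarity, as described in the
    abstract. The weight w is illustrative, not the paper's tuned value."""
    return w * semantic_sim + (1.0 - w) * phonetic_sim

def best_pair(candidates, w=0.5):
    """candidates: list of (english_ne, translation, semantic_sim,
    phonetic_sim); return the highest-scoring pair."""
    return max(candidates, key=lambda c: pair_score(c[2], c[3], w))
```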
Human-like common sense and judgment are necessary to realize a computer that can communicate with humans, because when people talk to each other, they have the concept of time in mind, consciously or unconsciously. In that case, the ability to call a concept to mind and to associate it with many related concepts becomes important. This paper proposes a method to systematize judgments concerning time, based on a mechanism that associates a concept with many other related concepts. This research aims at everyday time expressions and at an adaptable mechanism that can deal even with unknown expressions. The feature of this work is that it uses a small amount of given knowledge in various ways from the viewpoint of time. As a result, the percentage of correct answers of the time judgment system is approximately 69.4%, and its precision is approximately 81.6%. Therefore, the time judgment system using the technique proposed in this paper is effective.
A simpler distribution that fits empirical word distributions about as well as a negative binomial is the Katz K mixture. In the K mixture model, the basic assumption is that the conditional probability of a repeat for a given word is determined by a constant decay factor that is independent of the number of occurrences that have already taken place. However, the probabilities of repeat occurrences are generally lower than this constant decay factor for content-bearing words with few prior occurrences. To address this deficiency of the K mixture model, we conducted an in-depth exploration of the characteristics of the conditional probabilities of repetitions, of decay factors, and of their influence on modeling term distributions. Based on the results of this study, it appears that both ends of the distribution can be used to fit models: not only document frequencies when the instances of a word are few, but also tail probabilities (the accumulation of document frequencies) when they are many. Both document frequencies for few instances of a word and tail probabilities for many instances are often relatively easy to estimate empirically. Therefore, we propose an effective approach for improving the K mixture model, in which the decay factor is a combination of two possible decay factors interpolated by a function depending on the number of instances of a word in a document. Results show that the proposed model generates statistically significantly better frequency estimates, especially for words with two instances in a document. In addition, the advantages of this approach become more evident in two cases: modeling the term distribution of frequently used content-bearing words, and modeling term distributions for a corpus with a wide range of document lengths.
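For reference, the baseline being improved upon can be sketched from the standard formulation of the Katz K mixture, with parameters fit from collection frequency and document frequency; note that the decay factor beta/(beta+1) is constant in k, which is exactly the assumption the proposed model relaxes.

```python
def k_mixture(k, cf, df, n_docs):
    """Katz K mixture: probability that a word occurs exactly k times
    in a document.
        P(k) = (1 - alpha) * [k == 0] + (alpha / (beta + 1)) * (beta / (beta + 1))**k
    cf     : collection frequency (total occurrences in the corpus)
    df     : document frequency (documents containing the word)
    n_docs : number of documents in the corpus
    """
    lam = cf / n_docs          # average occurrences per document
    beta = (cf - df) / df      # extra occurrences per document containing the word
    alpha = lam / beta if beta else 0.0
    p = (alpha / (beta + 1)) * (beta / (beta + 1)) ** k
    if k == 0:
        p += 1.0 - alpha
    return p
```

With cf = 200, df = 100, n_docs = 1000, the probabilities sum to one and the mean occurrence count equals cf / n_docs, as required.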
We have collected Web newspaper articles of several hundred characters over three years, together with their counterparts distributed for mobile terminals, which consist of fifty to a hundred characters. From these, we automatically extracted a number of candidate paraphrases of the final parts of sentences. First, we aligned the two types of corpora, at the article level and then at the sentence level. Next, we extracted the final part of each mobile article sentence using a morphological analyzer, and collected the counterpart expressions from the Web article sentences. Finally, we extracted candidate morpheme sequences from the final parts of the Web article sentences, and we propose a combination of two methods to improve the extraction accuracy of these sets: 1) ranking based on frequency, branching factor, and string length, and 2) filtering to remove inappropriate expressions that eliminate semantically indispensable nouns.
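The two methods above can be sketched as follows; the product scoring and the noun-dropping check are illustrative assumptions, as the abstract does not give the exact ranking formula or filtering criterion.

```python
def rank_and_filter(pairs, indispensable_nouns):
    """pairs: list of (web_expr, mobile_expr, freq, branching_factor).
    Filtering: drop pairs whose mobile side eliminates an indispensable
    noun that the Web side contains. Ranking: score each remaining pair
    by a simple product of frequency, branching factor, and string
    length (higher is better); the paper's actual combination may differ."""
    kept = [p for p in pairs
            if not any(noun in p[0] and noun not in p[1]
                       for noun in indispensable_nouns)]
    return sorted(kept, key=lambda p: p[2] * p[3] * len(p[0]),
                  reverse=True)
```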
Using currently available Mongolian linguistic resources, such as lists of noun and verb stems and a list of suffixes, this paper proposes a method for morphologically analyzing noun/verb phrases of the Mongolian language. More specifically, we first examine phonological and morphological constraints on connecting noun/verb stems with suffixes, and devise inflection/conjugation rules for nouns and verbs. We experimentally show that, in almost 100% of cases, the correct noun/verb phrase can be found among the candidate phrases generated by the proposed method. Then, we compile a table mapping each stem-suffix pair to the phrase generated from it. Morphological analysis of noun/verb phrases is performed simply by consulting this mapping table. We experimentally show that, with the proposed method, the correct stem-suffix pair can be obtained from a given noun/verb phrase.
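The generate-then-look-up scheme can be sketched as follows; the generation function stands in for the paper's inflection/conjugation rules and is passed in as an assumption, since the actual phonological rules are not reproduced here.

```python
def build_mapping_table(stems, suffixes, generate):
    """Compile a table mapping each generated surface phrase back to
    its (stem, suffix) pair. `generate` applies the inflection/
    conjugation rules; here it is a caller-supplied stand-in. A
    surface form may map to several pairs (ambiguity is preserved)."""
    table = {}
    for stem in stems:
        for suffix in suffixes:
            surface = generate(stem, suffix)
            table.setdefault(surface, []).append((stem, suffix))
    return table

def analyze(phrase, table):
    """Morphological analysis of a noun/verb phrase by table lookup."""
    return table.get(phrase, [])
```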
We propose a method for acquiring, from a single corpus, knowledge of the correspondences between abbreviations and their original words. This is an improvement of our previous method, attaining higher precision for the same recall. This knowledge is useful for tasks such as information retrieval, word sense disambiguation, and summarization. Our method searches for "abbreviation candidates" and the "original word candidates" corresponding to them by using information about the characters composing them. Then, in order to decide on a correspondence between an abbreviation and its original word, the similarity between the abbreviation candidate and the original word candidate is calculated using statistical information from the single corpus. For example, the correspondence between the abbreviation "genpatsu (a nuclear power station)" and the original word "genshiryoku hatsudensho (a nuclear power station)" is extracted by our method. Our method does not presume it is given whether each noun in the corpus is an abbreviation or an original word. Experimental results show that our method is promising, with a precision of 73.4%. We compare our method with our previous method, and the experimental results suggest that our method extracts correspondences between abbreviations and original words more appropriately than the previous one.
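A minimal sketch of the character-based candidate search: an abbreviation candidate is paired with an original-word candidate when its characters appear in the original in the same order. This is only the character-matching step; the paper additionally scores candidate pairs with corpus statistics, which is omitted here.

```python
def is_abbreviation_candidate(abbrev, original):
    """Check whether every character of `abbrev` occurs in `original`
    in the same relative order, e.g. the characters of 原発 appear
    in order within 原子力発電所. The iterator is consumed as we
    scan, so out-of-order characters fail the test."""
    it = iter(original)
    return all(ch in it for ch in abbrev)
```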