This paper describes the microplanner of the SILK system, which generates texts that are appropriate at the discourse level for intermediate non-native users. Four factors (nucleus position, between-text-span punctuation, embedded discourse markers, and punctuation pattern) are taken to affect readability at the discourse level, and it is the preferences among these factors that determine readability. Because the number of possible combinations of these preferences is huge, we use a genetic algorithm to search for suitable ones. We evaluate the system in two ways: first, by assessing its reliability, analysing how often it regenerates corpus texts; second, by having human subjects judge readability. The evaluation results show that the system is reliable and that the generated texts are appropriate at the discourse level for intermediate non-native speakers.
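As an illustration of the search step, the following is a minimal sketch of a genetic algorithm over preference weights for the four factors. It is not the SILK implementation: the integer weight encoding, the `toy_fitness` function, `TOY_TARGET`, and all parameter values are assumptions made only so that the example runs, since the abstract does not specify the actual fitness function or genetic operators.

```python
import random

FACTORS = [
    "nucleus position",
    "between-text-span punctuation",
    "embedded discourse markers",
    "punctuation pattern",
]

# Hypothetical target, used only to make the toy fitness function computable.
TOY_TARGET = [3, 1, 4, 2]
MAX_WEIGHT = 5  # assumed upper bound on a preference weight


def toy_fitness(chromosome):
    """Placeholder fitness: 0 is best, more negative is worse."""
    return -sum((g - t) ** 2 for g, t in zip(chromosome, TOY_TARGET))


def evolve(pop_size=30, generations=60, mutation_rate=0.2, seed=0):
    """Return the fittest preference vector found (pop_size must be >= 3)."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, MAX_WEIGHT) for _ in FACTORS] for _ in range(pop_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        next_gen = [ranked[0][:]]  # elitism: keep the best individual unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three randomly drawn individuals.
            mum = max(rng.sample(population, 3), key=toy_fitness)
            dad = max(rng.sample(population, 3), key=toy_fitness)
            cut = rng.randint(1, len(FACTORS) - 1)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            if rng.random() < mutation_rate:
                # Mutation: reset one randomly chosen weight.
                child[rng.randrange(len(child))] = rng.randint(0, MAX_WEIGHT)
            next_gen.append(child)
        population = next_gen
    return max(population, key=toy_fitness)


if __name__ == "__main__":
    best = evolve()
    for factor, weight in zip(FACTORS, best):
        print(f"{factor}: {weight}")
```

In a real setting the placeholder fitness would be replaced by a measure of how well a preference combination accounts for the corpus texts; elitism is used here only so that the best candidate is never lost between generations.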
Patent processing is important in various fields such as industry, business, and law. We classified patent documents into F-term categories (Schellner 2002) using the k-nearest neighbor method. Because F-term categories are fine-grained, they are useful for classifying patent documents. Through experiments, we addressed three questions: (i) which variation of the k-nearest neighbor method is best for patent classification, (ii) which method of calculating similarity is best for patent classification, and (iii) from which regions of a patent terms should be extracted. In our experiments, we used the patent data from the F-term categorization task of the NTCIR-5 Patent Workshop (NTCIR committee 2005; Iwayama, Fujii, and Kando 2005). Among the variations of the k-nearest neighbor method examined in this study, the most effective was the one that classifies a patent document by adding the scores of the k extracted documents. We also found that SMART (Singhal, Buckley, and Mitra 1996; Singhal, Choi, Hindle, and Pereira 1997), which is known to be effective in information retrieval, was the most effective method of calculating similarity. Finally, for term extraction, using the abstract and claim regions together was the best among all combinations of the abstract, claim, and description regions. These results were confirmed using a statistical test. Moreover, we varied the amount of training data and found that performance improved as more data was used, within the limit of the data provided in the NTCIR-5 Patent Workshop.
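The score-summing variant of k-nearest neighbor classification can be sketched as follows. This is a minimal illustration, not the experimental system: it assumes plain cosine similarity over raw term counts rather than SMART weighting, and the function names, toy documents, and category labels `A01`, `B02`, and `C03` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_sum_scores(query_terms, training, k):
    """Rank categories by the summed similarity of the k nearest documents.

    `training` is a list of (terms, labels) pairs, where `labels` is the set
    of categories assigned to that document.
    """
    query = Counter(query_terms)
    scored = [(cosine(query, Counter(terms)), labels) for terms, labels in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    totals = defaultdict(float)
    for similarity, labels in scored[:k]:
        for label in labels:
            # Each neighbor adds its similarity score to every category it carries.
            totals[label] += similarity
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical toy training set: (terms, category labels).
    training = [
        (["engine", "piston", "valve"], {"A01"}),
        (["engine", "valve", "fuel"], {"A01", "B02"}),
        (["display", "pixel", "screen"], {"C03"}),
        (["fuel", "pump", "tank"], {"B02"}),
    ]
    for label, score in knn_sum_scores(["engine", "fuel", "valve"], training, k=3):
        print(label, round(score, 3))
```

Summing scores, rather than counting votes, lets a few highly similar neighbors outweigh several weakly similar ones, which suits a setting where each document can carry more than one category.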