Abstract
In general, a text consists of multiple sentences that are semantically related to one another. A contiguous range of sentences in a text is widely assumed to form a coherent unit, usually called a discourse segment. Just as the sentences within a segment are semantically related to each other, the segments of a discourse are also related to each other. The global discourse structure of a text can be constructed by relating these segments to one another. Identifying segment boundaries is therefore a first step toward recognizing the structure of a text. Many surface linguistic cues help in identifying the segment boundaries of a text. In this paper, we describe a method for identifying the segment boundaries of a Japanese text with the aid of multiple surface linguistic cues, although our experiments may be small in scale. For each candidate boundary, we calculate a weighted sum of the scores of all cues, where each weight reflects the contribution of its cue to identifying the correct segment boundaries. We also present a method for automatically training the weights of multiple linguistic cues while avoiding the overfitting problem.
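As a rough illustration of the weighted-sum idea, the following minimal sketch scores each candidate boundary by combining its cue scores with per-cue weights. The cue names (`topic_marker`, `conjunction`), the weight values, and the thresholding rule in `pick_boundaries` are assumptions made for this example only; the abstract specifies neither the actual cues nor how a boundary is finally selected.

```python
from typing import Dict, List


def boundary_score(cue_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of cue scores for one candidate boundary.

    Cues without a weight contribute nothing (weight 0.0).
    """
    return sum(weights.get(name, 0.0) * value for name, value in cue_scores.items())


def pick_boundaries(
    candidates: List[Dict[str, float]],
    weights: Dict[str, float],
    threshold: float,
) -> List[int]:
    """Return indices of candidates whose weighted score exceeds the threshold.

    Thresholding is a hypothetical decision rule chosen for illustration.
    """
    return [
        i
        for i, cues in enumerate(candidates)
        if boundary_score(cues, weights) > threshold
    ]


# Hypothetical weights and cue scores, one dict per gap between sentences.
weights = {"topic_marker": 2.0, "conjunction": 1.0}
candidates = [
    {"topic_marker": 1.0},
    {"conjunction": 1.0},
    {"topic_marker": 1.0, "conjunction": 1.0},
]
selected = pick_boundaries(candidates, weights, threshold=1.5)
```

In this sketch the weights are fixed by hand; in the setting described above they would instead be learned from training data.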