Recognizing logical paragraphs and the relations between points in a document helps readers comprehend it. Logical paragraphs contain various segments such as "elaboration", "contrast" and "example". Authors often state their claims in abstract terms, so they illustrate those claims with concrete cases in examples, which helps readers comprehend the document more quickly. In this paper, we identify example segments based on the relations between hyponymy and themes in a document. We consider sentences that contain concrete terms to be of two types: the first type expresses themes, and the second type expresses examples. We calculate the rate of theme terms in a sentence to determine whether the sentence expresses a theme. A sentence that is unlikely to express a theme is then judged likely to express an example. In our experimental evaluation, we confirmed that the proposed method achieved better recall and F-measure than the baseline method.
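The theme-term rate described above can be sketched as follows. This is only an illustration under stated assumptions: the sentence is already tokenized into terms, the set of theme terms is given, and the names `theme_term_rate`, `is_example_candidate` and the threshold of 0.5 are hypothetical choices, not values taken from the paper.

```python
def theme_term_rate(sentence_terms, theme_terms):
    """Fraction of the terms in a sentence that are theme terms."""
    if not sentence_terms:
        return 0.0
    hits = sum(1 for term in sentence_terms if term in theme_terms)
    return hits / len(sentence_terms)


def is_example_candidate(sentence_terms, theme_terms, threshold=0.5):
    """Treat a sentence with a low theme-term rate as a likely example.

    The threshold is an assumed value chosen for illustration only.
    """
    if not sentence_terms:
        return False  # an empty sentence expresses neither a theme nor an example
    return theme_term_rate(sentence_terms, theme_terms) < threshold


themes = {"fruit", "market"}
print(theme_term_rate(["fruit", "apple", "banana", "market"], themes))
print(is_example_candidate(["apple", "banana", "orange", "fruit"], themes))
```

In practice the threshold would have to be tuned on annotated data, and the sentence would first be checked for concrete (hyponym) terms.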
In recent years, topic models have been widely used in applications such as document summarization and document clustering. Labeled latent Dirichlet allocation (LLDA) extends latent Dirichlet allocation (LDA): it regards the tags (i.e., labels) that humans attach to documents as expressing the documents' contents, and uses them as supervised information to estimate the documents' latent topics. LLDA has also been reported to outperform LDA in topic estimation. However, ordinary documents usually come without such tags, so the use of LLDA is considerably limited. In this study, therefore, we generate pseudo labels from the documents whose latent topics are to be estimated, use them in place of human-assigned tags, and aim to make LLDA applicable to all documents.
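The abstract does not say how the pseudo labels are produced. One plausible way, shown here purely as an assumption, is to take the highest-scoring tf-idf terms of each document as its labels. The function name `pseudo_labels` and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def pseudo_labels(docs, k=2):
    """Pick the top-k tf-idf terms of each tokenized document as pseudo labels.

    Assumption: tf-idf is used only for illustration; the source does not
    state which labeling criterion the authors used.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    labels = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        labels.append(ranked[:k])
    return labels


docs = [
    ["topic", "model", "topic", "lda"],
    ["graph", "rank", "graph", "model"],
]
print(pseudo_labels(docs))
```

The resulting label lists could then be passed to an LLDA implementation in place of human-assigned tags.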
We aim to raise the accuracy of multi-class document categorization using graph-based semi-supervised learning (GBSSL). To this end, we propose two methods. The first constructs a similarity graph that uses both surface information and latent information to express the similarity between nodes. The second selects high-quality training data for GBSSL by means of the PageRank algorithm. We conducted experiments on the Reuters-21578 corpus and confirmed that the proposed methods work well for raising the accuracy of multi-class document categorization.
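A similarity graph that mixes surface and latent information might be built as in the sketch below. It assumes each document has one surface vector (for example, word counts) and one latent vector (for example, topic proportions), and that the two cosine similarities are blended linearly with a weight `alpha`. The blending scheme and all names are assumptions made for illustration, not the construction used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(surface_vecs, latent_vecs, alpha=0.5):
    """Return a symmetric weight matrix blending surface and latent similarity."""
    n = len(surface_vecs)
    graph = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = (alpha * cosine(surface_vecs[i], surface_vecs[j])
                      + (1 - alpha) * cosine(latent_vecs[i], latent_vecs[j]))
            graph[i][j] = graph[j][i] = weight
    return graph


surface = [[1, 0], [1, 0], [0, 1]]  # toy word-count vectors
latent = [[1, 0], [0, 1], [0, 1]]   # toy topic-proportion vectors
for row in similarity_graph(surface, latent):
    print(row)
```

Nodes with high PageRank scores on such a graph could then be chosen as training data, as the second proposed method suggests.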
In this paper, we propose a method to raise the accuracy of text classification based on latent topics by reconsidering the techniques necessary for good classification. For example, when deciding which sentences in a document are important, sentences containing important words are usually regarded as important, and tf.idf is often used to decide which words are important. In contrast, we apply the PageRank algorithm to rank the important words in each document. Furthermore, before clustering, we refine the target documents by representing each one as a collection of its important sentences. We then classify the documents based on their latent information. As the clustering method, we employ the k-means algorithm and investigate how well the proposed method supports good clustering.
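Ranking words with PageRank can be sketched as below. The abstract does not specify how the word graph is built, so this example assumes an undirected co-occurrence graph in which two different words are linked when they appear within `max_distance` tokens of each other. The damping factor, the iteration count and the function names are likewise illustrative assumptions.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, max_distance=1):
    """Link different words that occur within max_distance tokens of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + max_distance + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected, unweighted graph."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


tokens = ["topic", "model", "topic", "graph", "topic", "rank"]
scores = pagerank(cooccurrence_graph(tokens))
print(sorted(scores, key=scores.get, reverse=True))
```

Sentences could then be scored by the ranks of the words they contain, and the top-scoring sentences kept as the refined representation of each document.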
Electronic medical records are written by nurses, and the descriptions written by newcomers differ from those written by veterans. In this study, the system creates sets of medical records for newcomers and for veterans, and helps users discover the features of each set and the differences between them. By using the system, newcomers can learn how veterans write electronic medical records. Specifically, the system narrows down the records by the writer's years of experience and by the keywords that the records contain. It then draws and displays maps from the narrowed-down record sets. In addition, the system displays the words contained in the medical record sets or in the map.
When sending a message to a large, unspecified audience, it is desirable that the message not be impolite. This paper aims to extract, from messages on services such as BBS and Twitter, the sentence most likely to contain an impolite expression. The system uses a set of words that are likely to be impolite expressions to judge whether a message is impolite. The system's result prompts the user to reconsider the message contents and helps the user avoid posting impolite expressions.
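A word-set approach of this kind can be sketched as follows. The example assumes a small lexicon that maps words to impoliteness weights and scores each sentence by the average weight of its words. The lexicon entries, the weights, the scoring rule and the function names are all hypothetical and serve only to illustrate the idea.

```python
import re


def split_sentences(message):
    """Split a message into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", message)
    return [part.strip() for part in parts if part.strip()]


def impoliteness_score(sentence, lexicon):
    """Average lexicon weight over the words of a sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(word, 0.0) for word in words) / len(words)


def most_impolite_sentence(message, lexicon):
    """Return the highest-scoring sentence, or None if nothing scores above 0."""
    sentences = split_sentences(message)
    if not sentences:
        return None
    best = max(sentences, key=lambda s: impoliteness_score(s, lexicon))
    return best if impoliteness_score(best, lexicon) > 0 else None


# Hypothetical lexicon with made-up weights.
lexicon = {"stupid": 0.9, "useless": 0.7}
message = "Thanks for the update. That idea is stupid and useless! See you tomorrow."
print(most_impolite_sentence(message, lexicon))
```

The returned sentence could be shown to the user before posting, so that the user can reconsider the wording.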
An exchange of opinions generally proceeds in two phases: a divergence phase, in which ideas are generated broadly, and a convergence phase, in which the many ideas are brought together. In this study, we build a system that supports broad thinking about ideas and their combinations in the divergence phase. Specifically, the system displays as a network the ideas and combinations that participants have enumerated, and presents and suggests new ideas and combinations that have not yet been enumerated. In an evaluation experiment, we confirmed that the content presented and suggested by the system encouraged participants to think broadly.
This paper proposes a system for retrieving time-series data based on a linguistic query given by a user. The proposed system takes a line chart as the query. It first generates a linguistic query by verbalizing the chart, and then retrieves similar charts using that linguistic query.
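The verbalize-then-retrieve flow can be illustrated with a very small sketch. It assumes that a chart is a list of numeric values, that verbalization reduces the chart to a sequence of trend words ("rise", "fall", "flat"), and that retrieval returns the charts whose trend sequence matches exactly. The vocabulary, the `tolerance` parameter and the exact-match rule are assumptions; the actual system may use a richer vocabulary and a graded similarity measure.

```python
def verbalize(series, tolerance=0.0):
    """Describe a series as a sequence of trend words, merging repeats."""
    words = []
    for previous, current in zip(series, series[1:]):
        diff = current - previous
        if diff > tolerance:
            word = "rise"
        elif diff < -tolerance:
            word = "fall"
        else:
            word = "flat"
        if not words or words[-1] != word:
            words.append(word)
    return words


def retrieve(query_series, collection):
    """Return names of stored series whose verbalization equals the query's."""
    query = verbalize(query_series)
    return [name for name, series in collection.items()
            if verbalize(series) == query]


collection = {
    "a": [10, 20, 15],
    "b": [3, 2, 1],
    "c": [1, 2, 3, 1],
}
print(verbalize([1, 2, 3, 3, 2]))
print(retrieve([0, 5, 1], collection))
```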
We aim to support cross-modal information access triggered by photographs. To this end, we propose a method to facilitate information retrieval based on the content (e.g., people, objects, events) or metadata (e.g., date, place) of photographs. Among these, we focus on the date on which a photograph was taken. We introduce Phickle, a system that facilitates time-series information retrieval. When a user focuses on the date of a photograph, the system provides access to other information from the time the photograph was taken. We conducted an experiment to identify points where the system could be improved. The experiment yielded positive opinions about a function for browsing time-series information related to users' photographs. However, we found that the availability of information access based on the date of photographs needs to be improved, because most participants did not use this function.