A Method to Extract Sentences with Protein Functional Information from Literature by Iterative Learning of the Corpus

Md. Ahaduzzaman Munna; Takenao Ohkawa

doi:10.11185/imt.2.89

Abstract

We are developing PROFESS, a system that assists in extracting protein functional site information from the literature on protein structural analysis. The system first extracts the sentences that contain functional information. This paper proposes the complementary use of protein structure data, keywords, and patterns to extract these target sentences. In the proposed method, each sentence in the literature is represented as a vector built from these three features, and the vectors are learned by a support vector machine (SVM). Because the accuracy of the SVM depends on the number of effective vector elements, we propose a method that automatically extracts patterns and adds them as new vector elements to improve accuracy. Patterns fail to match sentences when a proper noun tag appears adjacent to a residue tag, so we defined two rules that eliminate these unnecessary tags and allow the patterns to match. The proposed method was applied to five documents on protein structural analysis to extract sentences with protein functional information; for each of these test documents, eight documents were used for feedback. The average recall and F value were 0.96 and 0.69, respectively. The results confirmed that increasing the number of vector elements leads to higher performance in sentence extraction.
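The vector representation described above might be sketched as follows. This is only an illustration under stated assumptions, not the PROFESS implementation: the keyword list, the two patterns, the `<RESIDUE>` tag format, and the function names are all hypothetical, and a simple residue-mention flag stands in for the protein structure data feature, whose actual form the abstract does not specify.

```python
import re

# Hypothetical keyword and pattern lists; the abstract does not give the real ones.
KEYWORDS = ["binding", "catalytic", "active site"]
PATTERNS = [
    re.compile(r"<RESIDUE> (?:is|are) involved in"),
    re.compile(r"<RESIDUE> binds? to"),
]

# Assumed residue notation: three-letter amino-acid code plus position, e.g. "His57".
RESIDUE_RE = re.compile(
    r"\b(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)"
    r"\s?\d+\b"
)


def tag_residues(sentence):
    """Replace each residue mention with a generic <RESIDUE> tag."""
    return RESIDUE_RE.sub("<RESIDUE>", sentence)


def sentence_to_vector(sentence):
    """Build a binary vector: [residue flag] + keyword flags + pattern flags."""
    tagged = tag_residues(sentence)
    lowered = sentence.lower()
    # Stand-in for the structure-data feature: does the sentence mention a residue?
    structure_feature = [1 if tagged != sentence else 0]
    keyword_features = [1 if k in lowered else 0 for k in KEYWORDS]
    pattern_features = [1 if p.search(tagged) else 0 for p in PATTERNS]
    return structure_feature + keyword_features + pattern_features


if __name__ == "__main__":
    print(sentence_to_vector("His57 is involved in catalytic activity."))
    print(sentence_to_vector("The crystals were grown at 4 degrees."))
```

In this sketch, appending a newly extracted pattern to `PATTERNS` adds one element to every sentence vector, which mirrors the idea of increasing the number of vector elements. The resulting vectors could then be passed to any SVM implementation for training.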
