We proposed an evaluation method for automatic summarization based on multiple correct answer summaries. Conventional evaluation methods suffered from a reliability problem because they adopted a single model answer, even though multiple correct summaries may exist from different points of view. Aiming to increase the reliability of automatic evaluation, we focused on an evaluation method that uses multiple answer summaries. In our method, each summary is represented as a vector, and the score of a target summary is the maximum scalar product between the target vector and a linear combination of the answer summary vectors. To verify the reliability of the method, seven people created summaries for four newspaper articles from the NTCIR-2 summarization test collection. However, the low agreement among these answer summaries showed that the data were inadequate as answers for the evaluation method. The summaries showed a tendency to preserve the original text structure because of anaphoric relations and sentence cohesion; these findings will be valuable for creating model summaries. To verify the feasibility of the evaluation method, several automatic summarization methods were evaluated against the multiple correct summaries. The method judged most effective varied depending on the correct summary used, which supports our premise that multiple correct answers are necessary to evaluate a target summary adequately.
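As a rough illustration of the scoring idea (our own sketch, not code from the paper), assume bag-of-words vectors and a unit-norm constraint on the linear combination of answer vectors; under those assumptions the maximum scalar product reduces to the norm of the orthogonal projection of the normalized target vector onto the span of the answer summary vectors. The vector representation, the normalization, and the toy data below are all assumptions for illustration.

```python
import numpy as np

def projection_score(answer_vecs, target_vec):
    """Illustrative score: best cosine similarity achievable between the
    target summary vector and any linear combination of the answer summary
    vectors, assuming a unit-norm constraint on the combination (an
    interpretation of the abstract, not the paper's exact formulation)."""
    A = np.asarray(answer_vecs, dtype=float).T      # columns = answer summary vectors
    t = np.asarray(target_vec, dtype=float)
    t = t / np.linalg.norm(t)                       # normalize the target summary vector
    w, *_ = np.linalg.lstsq(A, t, rcond=None)       # least-squares combination weights
    proj = A @ w                                    # projection of target onto span of answers
    return float(np.linalg.norm(proj))              # = cosine between target and its projection

# Toy example: three answer summaries and one system summary over a 5-term vocabulary.
answers = [[1, 1, 0, 0, 1],
           [1, 0, 1, 0, 1],
           [0, 1, 1, 1, 0]]
system = [1, 1, 1, 0, 1]
print(projection_score(answers, system))
```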