Evaluation methods whose targets are the system outputs (summaries) themselves are often called "intrinsic methods". Computer-produced summaries have traditionally been evaluated by comparing them with human-written summaries using the F-measure. However, the F-measure is not appropriate when alternative sentences are acceptable in a human-produced extract: if sentences 1 and 2 are interchangeable and sentence 1 appears in the human-produced extract, a system that selects sentence 2 receives a lower score even though its choice is equally valid. Several methods have been proposed to overcome this problem, and in this paper we examine some of them. The utility-based measure is one such method, but it requires considerable human effort to prepare the evaluation data. We first propose a pseudo-utility-based measure that uses human-produced extracts at different compression ratios. To evaluate its effectiveness, we compare it with the F-measure on the data of the Text Summarization Challenge (TSC), a subtask of the NTCIR Workshop 2, and show that the pseudo-utility-based measure resolves the problem. Next, we focus on content-based evaluation. Although the content-based measure is reported to be effective for this problem, it has not been examined from the viewpoint of comparing two extracts produced by different systems. We evaluated computer-produced summaries from the TSC with the content-based measure and compared the results with a subjective evaluation. We found that the judgments of the content-based measure matched those of humans in 93% of the cases when the gap between the content-based scores of two summaries was more than 0.2.
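To make the content-based comparison concrete, the following is a minimal sketch in Python, assuming the content-based score is a cosine similarity over simple term-frequency vectors of a system summary and a human-written summary; the paper's exact weighting and preprocessing may differ, and the texts and the `content_based_score` helper below are illustrative only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a simple term-frequency vector from lowercased word tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def content_based_score(summary_a, summary_b):
    """Cosine similarity between the term-frequency vectors of two texts.

    A score near 1.0 means the two texts share most of their content words,
    regardless of which (possibly interchangeable) sentences were selected.
    """
    va, vb = term_vector(summary_a), term_vector(summary_b)
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative comparison of two system summaries against one reference.
reference = "The government announced new budget cuts affecting public schools."
system_1 = "New budget cuts affecting public schools were announced by the government."
system_2 = "The committee discussed unrelated tourism figures for the year."

score_1 = content_based_score(system_1, reference)
score_2 = content_based_score(system_2, reference)

# Following the paper's observation, a gap of more than 0.2 between two
# systems' content-based scores is treated as a reliable difference.
if abs(score_1 - score_2) > 0.2:
    print("Reliable difference:", round(score_1, 2), "vs", round(score_2, 2))
```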