Extracting attribute-values related to entities from web texts is an important step in numerous web related tasks such as information retrieval, information extraction, and entity disambiguation (namesake disambiguation). For example, for a search query that contains a personal name, we can not only return documents that contain that personal name, but if we have attribute-values such as the organization for which that person works, we can also suggest documents that contain information related to that organization, thereby improving the user's search experience. Despite numerous potential applications of attribute extraction, it remains a challenging task due to the inherent noise in web data -- often a single web page contains multiple entities and attributes. We propose a graph-based approach to select the correct attribute-values from a set of candidate attribute-values extracted for a particular entity. First, we build an undirected weighted graph in which, attribute-values are represented by nodes, and the edge that connects two nodes in the graph represents the degree of relatedness between the corresponding attribute-values. Next, we find the maximum spanning tree of this graph that connects exactly one attribute-value for each attribute-type. The proposed method outperforms previously proposed attribute extraction methods on a dataset that contains 5000 web pages.
It is not easy to test software used in studies of machine learning with statistical frameworks. In particular, software for randomized algorithms such as Monte Carlo methods compromises testing process. Combined with underestimation of the importance of software testing in academic fields, many software programs without appropriate validation are being used and causing problems. In this article, we discuss the importance of writing test codes for software used in research, and present a practical way for testing, focusing on programs using Monte Carlo methods.