Measuring the weight of the relation between a pair of entities is necessary when using social networks for various purposes. Intuitively, some pairs of entities are more strongly related than others and should therefore be weighted higher. We propose a method that uses a Web search engine to compute the weight of the relation existing between a pair of entities. Our method receives as input a pair of entities and various relations that can hold between entities, and outputs a weight for the pair. The method explores how search engine results can be used as evidence for how strongly the two entities pertain to each relation.
The link structure of the Web is generally represented by the webgraph, and it is often used for web structure mining, which mainly aims to find hidden communities on the Web. In this paper, we identify a common frequent substructure, give it a formal graph definition, which we call an isolated star (i-star), and propose an efficient algorithm for enumerating i-stars. We then investigate the structure of the Web by enumerating i-stars from real web data. As a result, we observed that most i-stars correspond to index structures within single domains, while some of them are verified to be candidate communities, which supports the validity of i-stars as a useful substructure for web structure mining and link spam detection. We also observed that the distribution of i-star sizes follows a power law, providing further evidence of the scale-freeness of the webgraph.
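As a rough illustration of i-star enumeration (the paper's formal definition is more involved), the sketch below assumes an i-star is a center node whose leaf neighbors link only to that center; the function name and the minimum-leaves threshold are illustrative, not the authors' algorithm:

```python
def enumerate_istars(adj, min_leaves=2):
    """Hypothetical i-star finder on an undirected graph given as an
    adjacency dict of sets: a center plus leaves whose only neighbor
    is the center (an assumed simplification of the formal definition)."""
    stars = {}
    for center, nbrs in adj.items():
        # a leaf qualifies only if the center is its sole neighbor
        leaves = [v for v in nbrs if adj.get(v, set()) == {center}]
        if len(leaves) >= min_leaves:
            stars[center] = sorted(leaves)
    return stars
```

On a small graph where node 0 has two degree-one neighbors, only node 0 is reported as a center, matching the intuition that i-stars capture hub-and-leaf index structures.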
Community detection in networks has received much attention recently. Most previous work targets unipartite networks composed of only one type of node. In many real-world situations, however, there are bipartite networks composed of two types of nodes. In this paper, we propose a fast algorithm called LP&BRIM for community detection in large-scale bipartite networks. It is based on a joint strategy combining two algorithms: label propagation (LP), a very fast community detection algorithm, and BRIM, an algorithm that generates better community structure by recursively inducing divisions between the two types of nodes in bipartite networks. Through experiments, we demonstrate that this new algorithm finds meaningful community structures in large-scale bipartite networks within a reasonable time.
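The LP half of the strategy can be sketched in its generic form, a minimal sketch of standard label propagation rather than the authors' bipartite implementation (tie-breaking and stopping rules are assumptions):

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=0):
    """Each node starts with its own label; nodes then repeatedly adopt
    the most frequent label among their neighbors until no label changes.
    `adj` maps each node to a set of neighbors."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)               # randomize update order each sweep
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            top = [l for l, c in counts.items() if c == best]
            if labels[v] not in top:     # keep the current label on ties
                labels[v] = rng.choice(top)
                changed = True
        if not changed:
            break
    return labels
```

Its speed comes from each sweep being linear in the number of edges, which is what makes LP attractive as the first stage before BRIM refines the bipartite division.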
This paper proposes a fashion-related image gathering algorithm and a retrieval system. Since it is difficult to define fashion-related images exactly in a mathematical sense, computers cannot recognize whether given images are fashion-related even with computer vision techniques. It is also difficult, for the same reason, to automatically gather and search only fashion-related images on the Internet. To overcome these difficulties, we focus on human computing power, which helps computers find fashion-related images among the enormous number of images on the Internet. This paper provides an algorithm to gather high-quality fashion-related images and proposes a fashion-related image retrieval system, both of which utilize the information and metadata obtained from a fashion-related image sharing site. Evaluation experiments show that the proposed algorithm can gather fashion-related images efficiently and that the proposed retrieval system can find desired images more effectively than Google Image Search.
Directory services are popular among people searching for information of interest on the Web. These services provide hierarchical categories for finding pages a user wants, and pages on the Web are assigned to categories by hand. Many existing studies classify a web page using the text in the page. Recently, some studies have used text not only from the target page to be categorized but also from the pages that link to it. Because linking pages include many text parts unrelated to the target page, the relevant text must be narrowed down. However, these studies always apply a single extraction method to all pages: although web pages differ greatly in format, the extraction method is never adapted. We have already developed a method for extracting anchor-related text, and we use the text parts extracted by our method to classify web pages. Experimental results showed that our extraction method improves classification accuracy.
We propose a machine learning based method of sentiment classification of sentences using word-level polarity. The polarities of words in a sentence are not always the same as the polarity of the sentence, because there can be polarity-shifters such as negation expressions. The proposed method models these polarity-shifters. Our model can be trained in two different ways: word-wise and sentence-wise learning. In sentence-wise learning, the model is trained so that the prediction of sentence polarities is accurate. The model can also be combined with features used in previous work, such as bag-of-words and n-grams. We empirically show that our method improves the performance of sentence-level sentiment classification, especially when only a small amount of training data is available.
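The polarity-shift idea can be illustrated with a deliberately simple rule-based sketch; this is not the paper's learned model, and the lexicon and shifter list are hypothetical:

```python
def sentence_polarity(tokens, lexicon, shifters=frozenset({'not', 'never', 'no'})):
    """Sum word polarities from `lexicon`, flipping the sign of the word
    that immediately follows a negation shifter. A toy illustration of
    why word polarities alone cannot decide sentence polarity."""
    score, flip = 0, False
    for t in tokens:
        if t in shifters:
            flip = True
            continue
        p = lexicon.get(t, 0)
        score += -p if flip else p
        flip = False
    if score > 0:
        return 'positive'
    return 'negative' if score < 0 else 'neutral'
```

Here "not good" contributes a negative score even though "good" is a positive word, which is exactly the effect the learned model captures with trained shifter parameters instead of a fixed rule.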
This paper proposes an interactive information visualization system that supports exploratory data analysis of spatiotemporal trend information. A trend generally means a general direction in which a situation is changing or developing. The recent growth of computer and network systems has made it possible to obtain trend information at low cost, and how to utilize such information has become important. Exploratory data analysis is one of the activities necessary for utilizing trend information: users examine the data space from various viewpoints using various views, notice interesting trends, and find interpretations useful for decision making or problem solving. As exploratory data analysis essentially involves trial and error, an interactive information visualization system that supports it should encourage users' trial and error. Designing such a system requires an adequate interaction model that covers the various actions on the data space. In this paper, the visualization cube is proposed as an abstract data model of spatiotemporal trend information, on which an interaction model for exploratory data analysis of such information is defined. The visualization cube consists of four axes: spatial and temporal axes, a statistical-data axis, and a type-of-views axis. Interactions for generating views are defined as operations on the visualization cube, including drill-down/up, comparison, spin, and transition. An interactive information visualization system for spatiotemporal trend information was developed based on the visualization cube concept. An experiment comparing operating times between users with and without experience of the system shows that operations based on the proposed interaction model are easy to understand without training.
The system was also used in actual classes at an elementary school, and the results show that it is usable enough for 5th-grade elementary school children to perform exploratory data analysis.
Mobile devices are becoming more and more difficult to use due to the sheer number of functions now supported. In this paper, we propose a menu customization system that ranks functions so as to make interesting functions, both frequently used ones and infrequently used ones with the potential to satisfy the user, easy to access. Concretely, we define the features of the phone's functions by extracting keywords from the manufacturer's manual, and propose a method that uses the Ranking SVM (Support Vector Machine) to rank the functions based on the user's operation history. We conducted a one-week home-use test to evaluate the efficiency of customization and the usability of the customized menu. The results show that the average rank on the last day was half that of the first day, and that users found, on average, 3.14 kinds of new functions per day, i.e., functions they did not know about before the test. This shows that the proposed customized menu supports the user by making frequent items easier to access and new interesting functions easier to find. In interviews, almost 70% of the users were satisfied with the ranking provided by menu customization as well as the usability of the resulting menus. The interviews also show that automatic cell phone menu customization is more appropriate for mobile phone beginners than for expert users.
In recent years, Social Networking Services (SNS) and blogs have been growing as new communication tools on the Internet. Several large-scale SNS sites are prospering; meanwhile, many sites of relatively small scale offer services. Such small-scale SNSs realize a small-group, isolated style of communication that neither mixi nor MySpace can provide. However, most studies on SNSs examine particular large-scale SNSs and cannot determine whether their results reflect general features or characteristics specific to those SNSs. For comparative analysis of SNSs, comparing just a handful of sites cannot reach a statistically significant level. We therefore analyze many SNS sites with the aim of classifying them using several approaches. This paper classifies 50,000 small-scale SNS sites and characterizes them in terms of network structure, patterns of communication, and growth rate. The analysis of network structure shows that many SNS sites have the small-world property, with short path lengths and high clustering coefficients. The degree distributions of the SNS sites are close to a power law. This result indicates that small-scale SNS sites have a higher percentage of users with many friends than mixi. According to the analysis of assortativity coefficients, these SNS sites have negative assortativity, meaning that users with high degree tend to connect to users with low degree. Next, we analyze the patterns of user communication. A friend network in an SNS is explicit, while users' communication behaviors define an implicit network. What kind of relationship do these networks have? To address this question, we derive characteristics of users' communication structure and activation patterns on the SNS sites.
Using two new indexes, the friend aggregation rate and the friend coverage rate, we show that SNS sites with a high friend coverage rate have active diary posting and commenting. Moreover, sites with a high friend aggregation rate and a high friend coverage rate become activated when high-degree hub users are not active, whereas sites with a low friend aggregation rate and a high friend coverage rate become activated when hub users are active. Finally, we observe SNS sites whose user counts are increasing considerably, from the viewpoint of network structure, and extract the characteristics of high-growth SNS sites. Using decision tree analysis, we can identify high-growth SNS sites with a high degree of accuracy. This analysis also suggests that mixi and the small-scale SNS sites have different characteristics.
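The small-world analysis above relies on per-node clustering coefficients; a minimal sketch of that standard measure (not the authors' pipeline) is:

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v in an undirected graph
    given as an adjacency dict of sets: the fraction of v's neighbor
    pairs that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))
```

A triangle's nodes score 1.0 while a star's center scores 0.0, so a high average coefficient together with short path lengths is what flags a network as small-world.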
The source of information is one of the crucial elements when judging the credibility of the information. On the current Web, however, information about the source is not readily available to users. In this paper, we formulate the problem of identifying the information source as the problem of identifying the information sender configuration (ISC) of a Web page. An information sender of a Web page is an entity involved in the publication of the information on the page. An information sender configuration of a Web page describes the information senders of the page and the relationships among them. Information sender identification is a sub-problem of identifying the ISC, and we present a method for extracting information senders from Web pages, along with its evaluation. The ISC provides a basis for deeper analysis of information on the Web.
In this paper, we describe a public web service, ``PodCastle'', that provides full-text searching of speech data (Japanese podcasts) on the basis of automatic speech recognition technologies. This is an instance of our research approach, ``Speech Recognition Research 2.0'', which is aimed at providing users with a web service based on Web 2.0 so that they can experience state-of-the-art speech recognition performance, and at promoting speech recognition technologies in cooperation with anonymous users. PodCastle enables users to find podcasts that include a search term, read full texts of their recognition results, and easily correct recognition errors by simply selecting from a list of candidates. Even if a state-of-the-art speech recognizer is used to recognize podcasts on the web, a number of errors will naturally occur. PodCastle therefore encourages users to cooperate by correcting these errors so that those podcasts can be searched more reliably. Furthermore, using the resulting corrections to train the speech recognizer, it implements a mechanism whereby the speech recognition performance is gradually improved. Our experience with this web service showed that user contributions we collected actually improved the performance of PodCastle.
In this paper we propose a method for generating simple but semantically correct replies to user inputs which are not related to a given task of a task-oriented information kiosk or any other natural language interface placed in a public place. We describe our method for retrieving meaningful associations from the Web and adding modality based on chatlog data. After showing the results of the evaluation experiments, we introduce an implementation of an affect analysis algorithm and pun generator to increase users' satisfaction level.
This paper proposes a method for implementing real-time synonym search systems. Our final aim is to provide users with an interface with which they can query the system with a string of any length, and the system returns a list of synonyms of the input string. We propose an efficient algorithm for this operation. The strategy involves indexing documents with suffix arrays and finding strings adjacent to the query by dynamically retrieving its contexts (i.e., strings around the query). The extracted contexts are in turn looked up in the suffix arrays to retrieve the strings around those contexts, which are likely to contain synonyms of the query string.
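The core lookup machinery can be sketched as follows; this is a naive suffix-array build plus the occurrence and right-context retrieval steps, not the paper's optimized system, and function names are illustrative:

```python
def build_suffix_array(text):
    """Sorted start positions of all suffixes (naive O(n^2 log n) build,
    fine for a sketch; real systems use linear-time construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All start positions of `pattern`, via binary search over the
    lexicographically sorted suffixes."""
    n, m = len(sa), len(pattern)
    lo, hi = 0, n
    while lo < hi:                       # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = n
    while lo < hi:                       # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[i] for i in range(first, lo))

def right_contexts(text, sa, query, width=3):
    """Strings immediately following each occurrence of the query;
    these contexts are then searched again to find adjacent strings
    that may be synonyms of the query."""
    return {text[p + len(query):p + len(query) + width]
            for p in occurrences(text, sa, query)}
```

Each occurrence query costs O(m log n) string comparisons, which is what makes the back-and-forth between query, contexts, and candidate synonyms feasible in real time.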
Recently, web pages for mobile devices have spread widely on the Internet, and many people access web pages through search engines from mobile devices as well as personal computers. The summary of a retrieved web page is important because people judge whether the page is relevant to their information need based on it. In particular, the summary must be not only compact but also grammatical and meaningful when users retrieve information with a mobile phone's small screen. Most search engines appear to produce snippets based on the keyword-in-context (KWIC) method. However, this simple method cannot generate a refined summary suitable for mobile phones because of low grammaticality and content overlap with the page title. We propose a more suitable method for generating snippets for mobile devices using sentence extraction and sentence compression. First, sentences are weighted according to whether they include the users' query terms or words relevant to the queries, and whether they avoid overlapping with the page title, based on maximal marginal relevance (MMR). Second, the selected sentences are compressed based on their phrase coverage, measured by word scores, and their phrase connection probability, measured with a language model over the dependency structure converted from the sentence. Experimental results reveal that the proposed method outperformed the KWIC method in terms of relevance judgment, grammaticality, non-redundancy, and content coverage.
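The MMR selection step can be sketched with token-overlap similarity; this is a generic greedy MMR, with a hypothetical Jaccard similarity standing in for the paper's actual scoring of query relevance and title overlap:

```python
def mmr_select(sentences, query, k=2, lam=0.7):
    """Greedy MMR: each pick trades off relevance to the query against
    redundancy with already-selected sentences (in the paper's setting,
    the page title would also count toward redundancy)."""
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / max(1, len(a | b))
    selected = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            rel = jaccard(s, query)
            red = max((jaccard(s, t) for t in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The `lam` parameter controls the relevance/redundancy trade-off: at 1.0 the picker is pure relevance ranking, and lowering it penalizes sentences that repeat what the snippet already says.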
When users seek information about people from the results of Web people searches, they often need to browse many retrieved Web pages and sift through much unnecessary information. This task is time-consuming and complicates understanding of the designated person. We investigate a method that integrates the useful information obtained from Web pages and displays it to help users understand people. We focus on curricula vitae, which are widely used for understanding people, and propose a method that extracts event sentences from Web pages and displays them like a curriculum vitae. An event sentence includes both a time and an event related to a person. Our method is based on the following steps: (1) extracting event sentences using heuristics and filtering them, (2) judging whether event sentences are related to the designated person, mainly using patterns of HTML tags, (3) classifying these sentences into categories with an SVM, and (4) clustering event sentences that share identical times and events. Experimental results revealed the usefulness of our proposed method.
Web technology enables numerous people to collaborate in creation; we designate this massively collaborative creation via the Web. As an example of massively collaborative creation, we examine video development on Nico Nico Douga, a video sharing website popular in Japan. We specifically examine videos featuring Hatsune Miku, a singing-synthesizer application that has inspired not only song creation but also songwriting, illustration, and video editing. Creators interact to create new content through their social network. In this paper, we analyze the process of developing thousands of videos based on creators' social networks and investigate the relationships between creation activity and those networks. The social network reveals interesting features: creators form large and sparse social networks containing some centralized communities, and the members of such centralized communities share special tags. Different categories of creators play different roles in the network's evolution; e.g., songwriters gather more links than other categories, implying that they trigger network evolution.
We propose a method to automatically extract many correspondences between questions and answers from a Web message board. We use Web message boards as information sources because they contain a large number of articles posted by general users. The extracted question-answer correspondences can be used in question answering systems to support natural language sentence input. First, our proposed method classifies the messages of a Web message board into questions and others. Next, it extracts root-node pairs from the thread tree of the board, where the thread tree is defined so that the root is an article classified as a question and the nodes are articles classified as answer candidates. Our method finds correspondences between questions and answers using two clues: (1) the similarity between the articles, and (2) the link count between the articles. We evaluated the proposed method experimentally, discuss the results, and analyze the errors.
We propose a novel multi-document generic summarization model based on the budgeted median problem, which is a facility location problem. The summarization method based on our model is extractive: it selects sentences from the given document cluster to generate a summary. Each sentence in the document cluster is assigned to one of the selected sentences, where the former sentence is supposed to be represented by the latter. Our method selects sentences so as to yield a good sentence assignment and hence cover the whole content of the document cluster. An advantage of this method is that it can incorporate asymmetric relations between sentences, such as textual entailment. Through experiments, we show that the proposed method yields good summaries on the DUC'04 dataset.
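A common way to approach such facility-location objectives is greedy selection; the sketch below is a generic greedy for a budgeted-median-style objective, assuming a precomputed sentence similarity matrix and per-sentence lengths, and is not the paper's exact optimization method:

```python
def greedy_summary(sim, lengths, budget):
    """Greedily pick sentence indices that most improve the total
    similarity of assigning every sentence to its best-matching selected
    sentence, subject to a summary length budget.
    `sim[i][j]` = how well sentence j represents sentence i (may be
    asymmetric, e.g. based on textual entailment)."""
    n = len(sim)
    selected = []
    def objective(chosen):
        if not chosen:
            return 0.0
        return sum(max(sim[i][j] for j in chosen) for i in range(n))
    while True:
        used = sum(lengths[j] for j in selected)
        best, best_gain = None, 0.0
        for j in range(n):
            if j in selected or used + lengths[j] > budget:
                continue
            gain = objective(selected + [j]) - objective(selected)
            # (a cost-scaled gain, gain / lengths[j], is the usual
            #  budgeted variant; plain gain keeps the sketch short)
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        selected.append(best)
    return selected
```

Because `sim` need not be symmetric, the same loop accommodates directed relations such as entailment, which is the advantage the abstract highlights.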
In this paper, we construct a new Brain Computer Interface (BCI) for the purpose of analyzing human investment decision making. The BCI is made up of three functional parts, responsible for measuring brain information, determining market prices in an artificial market, and specifying an investment decision model, respectively. When subjects make decisions, their brain information is conveyed to the part specifying the investment decision model through the measurement part, while their investment orders are sent to the artificial market to form market prices. Both a support vector machine and a three-layer perceptron are used to estimate the investment decision model. To evaluate our BCI, we conduct an experiment in which subjects and a computer trader agent trade shares of stock in the artificial market, and we test how well the computer trader agent can forecast market price formation and investment decisions from the subjects' brain information. The results show that brain information can improve the accuracy of the forecasts, so the computer trader agent can supply market liquidity to stabilize market volatility without incurring losses.
As the web grows larger, knowledge acquisition from the web has gained increasing attention. Web search logs have recently attracted attention as a source of information for applications such as targeted advertisement and query suggestion. However, it may not be appropriate to use the queries themselves, because query strings are often too heterogeneous or unspecific to characterize the interests of the search user population. We therefore propose to use web clickthrough logs to learn semantic categories. We also explore a weakly supervised label propagation method using the graph Laplacian to alleviate the problem of semantic drift. Experimental results show that the proposed method greatly outperforms previous work that uses only web search query logs.
We propose a machine learning-based method for analyzing coordinate structure in Japanese sentences. Effective methods for disambiguating coordination scopes already exist for English, but these methods assume input sentences always contain coordinations. Since detecting coordinations is non-trivial in Japanese, this assumption is often violated. The proposed method mitigates this problem by detecting the presence of coordinations and disambiguating their scopes simultaneously. It builds upon previous work on English coordination that uses alignment graphs to evaluate the similarity of conjuncts. A ``bypass'' is introduced into these graphs to explicitly represent the non-existence of coordination in a sentence, so that the feature weights for coordinations are learned separately from the weights for sentences without coordinations. We also propose making all features dependent on the distance between conjuncts. In an experiment on the EDR corpus, the proposed method outperforms existing methods.
We address the problem of ranking influential nodes in complex social networks by estimating diffusion probabilities from observed information diffusion data using the popular independent cascade (IC) model. For this purpose, we formulate the likelihood of the information diffusion data, which is a set of time-sequence data of active nodes, and propose an iterative method to search for the probabilities that maximize this likelihood. We apply this to two real-world social networks in the simplest setting, where the probability is uniform across all links, and show that when there is a reasonable amount of information diffusion data, the estimated probability is remarkably accurate, and the proposed method predicts highly ranked influential nodes much more accurately than four well-studied conventional heuristic methods.
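The IC model itself is easy to sketch; the code below is a minimal Monte Carlo simulation of cascades with a uniform probability `p` (the quantity being estimated above), not the authors' likelihood-maximization procedure:

```python
import random

def independent_cascade(adj, seeds, p, rng):
    """One IC run on a directed graph: each newly active node gets a
    single chance to activate each inactive out-neighbor with
    probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def influence(adj, seeds, p, runs=1000, seed=0):
    """Expected number of active nodes, estimated by averaging runs;
    ranking nodes by this estimate is how an accurate p translates
    into accurate influence rankings."""
    rng = random.Random(seed)
    return sum(len(independent_cascade(adj, seeds, p, rng))
               for _ in range(runs)) / runs
```

With `p = 1` a seed activates everything reachable from it, and with `p = 0` nothing spreads, so the estimated `p` directly controls which nodes rank as influential.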
The recent explosive increase in Web pages has made it possible to obtain a wide variety of information with a search engine. However, by some estimates, as many as 40% of the pages on the Web are duplicates of other pages, so some search results contain duplicate pages. This paper proposes a method for finding similar pages in a huge collection of Web pages: a hundred million Japanese Web pages. Similar pages are defined as two pages that share some sentences, and are classified into mirror pages, citation pages, plagiaristic pages, etc. First, the content region of each page is extracted, since sentences in non-content regions tend not to be useful for similar-page detection. From the content region of each page, relatively long sentences are extracted, because two pages tend to be related when they share relatively long sentences. A pair of pages with identical sentences is regarded as similar pages. Next, similar pages are classified based on several kinds of information, such as the overlap ratio, the number of inlinks/outlinks, and the URL similarity. We conducted similar-page detection and classification on the large-scale Japanese Web page collection and found mirror pages, citation pages, and plagiaristic pages.
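The pairing step can be sketched with an inverted index from long sentences to page ids; this is a toy illustration under assumed simplifications (period-based sentence splitting, a length threshold standing in for content-region extraction), not the paper's large-scale pipeline:

```python
from collections import defaultdict

def find_similar_pages(pages, min_len=30):
    """Pair pages that share at least one sufficiently long sentence,
    and attach an overlap ratio that later classification (mirror vs.
    citation vs. plagiarism) could use.
    `pages` maps page id -> raw text."""
    index = defaultdict(set)             # sentence -> ids of pages containing it
    sents = {}
    for pid, text in pages.items():
        s = {x.strip() for x in text.split('.') if len(x.strip()) >= min_len}
        sents[pid] = s
        for sent in s:
            index[sent].add(pid)
    pairs = {}
    for sent, pids in index.items():
        ids = sorted(pids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                a, b = ids[i], ids[j]
                shared = sents[a] & sents[b]
                ratio = len(shared) / max(1, min(len(sents[a]), len(sents[b])))
                pairs[(a, b)] = ratio
    return pairs
```

Because only pages sharing an indexed sentence are ever compared, the quadratic all-pairs comparison is avoided, which is what makes the approach plausible at the scale of a hundred million pages.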