With the recent spread of communication via social media, exchanging opinions with one another on the Web has become common irrespective of age and sex. On the other hand, a problem known as “Internet flaming” often occurs as the number of social networking service (SNS) users increases. One of the reasons might be that users do not recognize the meanings, intentions, or emotions expressed by other users’ words. In this study, we focus on slang terms (Internet slang) that are often used on SNS but are not registered in dictionaries, and attempt to convert them into standard words. We also aim to output more appropriate candidates by considering not only semantic similarity but also affective similarity. The proposed method filters and re-ranks the semantically similar candidates obtained from distributed representations in order to detect candidates that are inappropriate as standard words, focusing on two points: (1) features of slang and standard words and (2) affective similarity between the input word and the candidate words. In the evaluation experiment, the proposed method obtained a higher MRR than the baseline method.
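The MRR (mean reciprocal rank) used in the evaluation above can be illustrated with a minimal sketch. The function names and the toy candidate lists are hypothetical and are not taken from the study; the sketch only assumes the standard definition, in which each input contributes 1/rank of its first correct candidate (or 0 if none is returned).

```python
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], correct: Set[str]) -> float:
    """Return 1/rank of the first correct candidate, or 0.0 if none appears."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         answers: Sequence[Set[str]]) -> float:
    """Average the reciprocal ranks over all inputs (standard MRR definition)."""
    if len(rankings) != len(answers):
        raise ValueError("rankings and answers must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, a) for r, a in zip(rankings, answers))
    return total / len(rankings)


# Hypothetical ranked candidate lists for three slang inputs.
toy_rankings = [["x", "y", "z"], ["a", "b"], ["p", "q"]]
# Hypothetical sets of acceptable standard words for the same inputs.
toy_answers = [{"y"}, {"a"}, {"none"}]
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_answers)  # (1/2 + 1 + 0) / 3
```

A re-ranking step that moves an acceptable candidate from rank 2 to rank 1 raises that input's contribution from 0.5 to 1.0, which is how filtering and re-ranking can improve the metric.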
This paper proposes an information visualization system for exploratory analysis of LOD (Linked Open Data). LOD is a framework for making data open to the public. Recently, it has been widely used to publish various kinds of data such as statistical, geographical, and academic data. RDF (Resource Description Framework), which describes data as a set of triples consisting of a subject, a predicate, and an object, is commonly used to publish data as LOD. To use LOD, it is necessary to understand its structure, such as the graph structure of the RDF data and the vocabularies and resources used. Therefore, we often have to conduct exploratory analysis of LOD. To support this analysis, the proposed system analyzes the structure of LOD written in RDF and visualizes the result. As most currently available LOD datasets have a table structure, the proposed system identifies whether a target dataset contains a table structure by using resource sampling with SPARQL queries. The graph structure of the resources obtained by the sampling is visualized by assigning different colors to different tables. A user can not only examine the visualized structure but also conduct exploratory search by selecting a resource in the visualized result. The effectiveness of the proposed system is evaluated by applying it to several LOD resources. Experiments with test participants are also conducted, and the results show that even users who are not familiar with RDF can perform exploratory analysis effectively.
Community detection is one of the methods for network analysis. It is useful for understanding, visualizing, and compressing networks. A popular approach to community detection is to optimize modularity, a function for evaluating the result of community detection. Constrained community detection is a variation of community detection that takes given constraints into account in order to improve accuracy. One method for constrained community detection is to optimize a constrained Hamiltonian, which consists of a Hamiltonian (a generalization of modularity) and a constraint term that takes the given constraints into account. Nakata proposed a method for constrained community detection that optimizes the constrained Hamiltonian with an extended Louvain method, and showed that it is superior to a previous method based on simulated annealing. In this paper, we propose a new method for constrained community detection in multislice networks. Multislice networks are combinations of multiple individual networks and can represent temporal networks and networks with several types of edges. While optimizing Mucha’s modularity is popular for community detection in multislice networks, our method optimizes the constrained Hamiltonian, which we extend to multislice networks. Using the proposed method, we successfully detect communities while taking constraints into account. We also improve the accuracy of community detection by applying our method repeatedly. Our method enables constrained community detection to be carried out interactively in multislice networks.
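As a point of reference for the modularity that the Hamiltonian above generalizes, the following minimal sketch computes standard Newman modularity for a given partition of an unweighted, undirected, single-slice graph. It does not reproduce the constrained Hamiltonian, the constraint term, or the multislice extension; the function name and the toy graph are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], community: Dict[Hashable, Hashable]) -> float:
    """Newman modularity Q = sum_c [ e_c / m - (d_c / (2m))^2 ].

    e_c is the number of edges inside community c, d_c is the sum of the
    degrees of its nodes, and m is the total number of edges.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)      # e_c: edges with both ends in community c
    degree_sum = defaultdict(int)    # d_c: total degree of community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(toy_edges, two_groups)   # 2 * (3/7 - (7/14)**2) = 5/14
q_merged = modularity(toy_edges, one_group)   # 7/7 - (14/14)**2 = 0
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which is the behavior a modularity-optimizing method such as Louvain exploits.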
The purpose of this research is to support information access based on the contents of comic books. To this end, it is necessary to obtain information related to the story and the characters of a comic. We propose a method that extracts such information from reviews on the Web by using term frequency-inverse document frequency (TF-IDF) and hierarchical Latent Dirichlet Allocation (hLDA). Using these methods, we built a prototype system for exploratory comic search. We conducted a user study to observe how participants use the system. The study showed that the system successfully helped the participants find interesting unread comics.
People collect and use information about the real world from the Internet to support their daily activities. In particular, microblogs such as Twitter have so many users that a wide variety of information is available. From microblog posts, users can obtain not only the information they need but also the location indicated by the posted content. While previous approaches are corpus-based or use machine learning and require prior knowledge in areas such as natural language processing and feature engineering, our approach estimates the location without those requirements by using an extension of long short-term memory (LSTM). In our experiment, we apply our approach to geo-tagged tweets posted on Twitter and show that it outperforms a corpus-based method and previous methods that use a support vector machine (SVM) with bag-of-words (BoW) features.
This paper proposes a method for searching cooking recipes by a procedure such as “a tomato is fried.” Most methods for cooking recipe search treat recipe text as a bag-of-words (BoW), and therefore wrongly retrieve a recipe such as “fry an onion deeply and serve it with a tomato cube,” in which the tomato is not heated. Our method automatically converts a procedural text into a flow graph in advance using a dependency parsing technique. In the flow graph, the sequence of actions performed on an ingredient is easily extracted by tracing the path from the node corresponding to the ingredient to the root node corresponding to the last action. We evaluate our method against a task-adapted BoW model as a baseline; the proposed method achieved a precision of 68.8%, while the baseline achieved 61.5%.
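The path-tracing idea above can be sketched as follows. This is a simplified illustration, not the paper's implementation: the flow graph is written by hand rather than produced by dependency parsing, each node is assumed to have at most one outgoing edge, and the node names and the `actions_on` helper are hypothetical.

```python
from typing import Dict, List, Set


def actions_on(ingredient: str,
               next_node: Dict[str, str],
               action_nodes: Set[str]) -> List[str]:
    """Collect the actions met while walking from an ingredient to the root."""
    path: List[str] = []
    visited: Set[str] = set()
    node = next_node.get(ingredient)
    # Follow outgoing edges until the root (a node with no successor) is passed;
    # the visited set guards against malformed, cyclic input.
    while node is not None and node not in visited:
        visited.add(node)
        if node in action_nodes:
            path.append(node)
        node = next_node.get(node)
    return path


# Hand-built flow graph for "fry an onion deeply and serve it with a tomato cube".
toy_flow = {"onion": "fry", "fry": "serve", "tomato": "cut", "cut": "serve"}
toy_actions = {"fry", "cut", "serve"}

tomato_actions = actions_on("tomato", toy_flow, toy_actions)
onion_actions = actions_on("onion", toy_flow, toy_actions)

# A bag-of-words match sees both "tomato" and "fry" in the recipe text,
# whereas the path for the tomato never passes through "fry".
tomato_is_fried = "fry" in tomato_actions
```

With this representation, the query “a tomato is fried” reduces to checking whether the action appears on the ingredient's path, which is what lets the graph-based search reject the onion recipe.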
In recall-oriented tasks, where collecting extensive information from different aspects of a topic is required, searchers often have difficulty formulating queries to explore diverse aspects and deciding when to stop searching. With the goal of helping searchers discover unexplored aspects and find the appropriate timing for search stopping in recall-oriented tasks, this paper proposes a query suggestion interface displaying the amount of missed information (i.e., information that a user potentially misses collecting from search results) for individual queries. We define the amount of missed information for a query as the additional gain that can be obtained from unclicked search results of the query, where gain is formalized as a set-wise metric based on aspect importance, aspect novelty, and per-aspect document relevance and is estimated by using a state-of-the-art algorithm for subtopic mining and search result diversification. Results of a user study involving 24 participants showed that the proposed interface had the following advantages when the gain estimation algorithm worked reasonably: (1) users of our interface stopped examining search results after collecting a greater amount of relevant information; (2) they issued queries whose search results contained more missed information; (3) they obtained higher gain, particularly at the late stage of their sessions; and (4) they obtained higher gain per unit time. These results suggest that the simple query visualization helps make the search process of recall-oriented tasks more efficient, unless inaccurate estimates of missed information are displayed to searchers.
Twitter has evidently popularized the sharing of personal updates. Twitter users can keep up to date with current information on Twitter; however, they cannot obtain the most recent information while browsing web pages, since these are not updated in real time. Meanwhile, many events, such as crowded restaurants and time sales, happen at any time on different floors or in different areas of composite facilities in urban areas. To address this problem, we consider it appropriate to detect tweets about the small-scale facilities within a composite facility and use them to enrich its traditional web pages. We therefore developed a tweet visualization system, based on spatio-temporal analysis of tweets, that helps users grasp events happening over time and space from tweets while they browse any web page. To detect and analyze tweets about a composite facility, the system maps geo-tagged tweets to web pages by matching their location names, and classifies the tweets into categories of small-scale facilities by using machine learning algorithms. The system can thus visualize tweets in a tag cloud associated with a web page, helping users gain a quick overview of events through space and time while they browse that page, and it can also present a list of the most related tweets to help users obtain more detailed information about events. In this paper, we discuss our spatio-temporal analysis method and evaluate both the classification of tweets into small-scale facilities and the generation of tag clouds whose feature words change over time.
In this paper, we propose a method for finding query suggestions from the Web for a verbal query, that is, a query that contains a verb. People sometimes cannot obtain appropriate search results even when they believe they have formulated a query that clearly describes their search intent. The idea of the proposed method is to find the relationship between the verb and the noun in the query and to mine an appropriate representation of the verb based on that relationship. The proposed method estimates the relationship between the verb and the noun from the particles between them. Based on the estimated relationship, we then obtain candidates for the verb in the query by using either Web search results or case frames. Next, we compute the effectiveness of the candidates by considering the similarity between each candidate and the verb and the co-occurrence between the candidates and the noun, and finally rank the candidates to generate queries. To investigate the effectiveness of the proposed method, we conducted an experiment comparing it with the query suggestions of a commercial search engine as the baseline. The experimental results for 20 queries showed that the proposed method, when finding candidates from Web search results, outperformed the baseline in terms of AvgRelNum, which measures the number of relevant pages obtained by the generated queries that can retrieve a relevant page, and achieved similar performance in terms of Contain@10 and MRR@10.
This paper proposes methods for searching for VOCALOID creators who publish music videos on the Niconico video hosting service. In VOCALOID creator search, the user can use three clues: VOCALOID character name, music genre, and impression. We defined the music genres by extending generic digital music genres in consideration of the social tags annotated on VOCALOID music videos. We also implemented an SVM-based music impression estimator that uses viewer comments and achieves F-values above 0.8. We compared the proposed method with three comparison methods on 12 search tasks and clarified the effectiveness of music genres and impressions.
During web search and browsing, people often accept misinformation due to their inattention to information credibility and biases. To obtain correct web information and support effective decision making, it is important to enhance searcher credibility assessment and develop algorithms to detect suspicious information. In this paper, we investigate how credibility alarms for web search results affect searcher behavior and decision making in information access systems. This study focuses on disputed topic suggestion as a credibility alarm approach. We conducted an online user study in which 92 participants performed a search task for health information. Through log analysis and user surveys, we confirmed the following. (1) Disputed topic suggestion in a search results list makes participants spend more time browsing pages than ordinary search conditions, thereby promoting careful information seeking. (2) Disputed topic suggestion during web browsing does not change participant behaviors but works as complementary information. This study contributes to system designs to enhance user engagement in critical and careful information seeking.
The home locations of Twitter users can be estimated using a social network generated from various relationships between users. Many network-based location estimation methods that use such relationships have been proposed. However, the estimation accuracy of the various methods and relationships is unclear. In this study, we estimate users’ home locations using four network-based location estimation methods on four types of social networks in Japan. We obtained two results. (1) Among the location estimation methods, the method that selects the most frequent location among the friends of the user shows the highest precision and recall. (2) Among the four types of social networks, the follower relationship yields the highest precision and recall.
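The best-performing method in result (1), selecting the most frequent location among a user's friends, can be sketched as a simple majority vote. The data structures, names, and toy network below are hypothetical; how friends and known home locations are actually obtained, and how ties are handled in the study, are not specified in the abstract.

```python
from collections import Counter
from typing import Dict, List, Optional


def estimate_home(user: str,
                  friends: Dict[str, List[str]],
                  known_home: Dict[str, str]) -> Optional[str]:
    """Return the most frequent known home location among the user's friends.

    Returns None when no friend has a known location. Ties are resolved in
    favor of the location encountered first (an assumption of this sketch).
    """
    votes = Counter(known_home[f] for f in friends.get(user, [])
                    if f in known_home)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy network (hypothetical users and locations).
toy_friends = {"u1": ["u2", "u3", "u4", "u5"], "u6": ["u7"]}
toy_homes = {"u2": "Tokyo", "u3": "Osaka", "u4": "Tokyo"}  # u5, u7 unknown

u1_home = estimate_home("u1", toy_friends, toy_homes)  # Tokyo wins 2 to 1
u6_home = estimate_home("u6", toy_friends, toy_homes)  # no labeled friends
```

Swapping the `friends` mapping for one built from a different relationship type (for example, followers rather than mutual follows) changes only the input, which is how the same estimator can be compared across several kinds of social network.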
Image databases are an important research topic in image recognition. Manual collection of images leads to biased collections and requires a lot of human effort. Image learning requires a huge image database that stores many varied, unbiased objects. Recent research on image databases attempts to generate them automatically or semi-automatically. Here, it is important to remove noise images, and this paper proposes a method for automatically generating an image database from the Web. The proposed method removes noise images using a hybrid of visual and semantic features. However, which type of feature to use and how to combine the two types are not clear and should be investigated. In this paper, six noise image detection methods are prepared: a method using visual features, a method using semantic features, two methods using both features in parallel, and two methods using both features in series. Comparative experiments confirmed that the method using both visual and semantic features in parallel, focusing on noise images, achieved on average over 82% precision, 76% recall, and 77% F-measure. The usability of the generated database for image recognition was also confirmed experimentally; it was equal to or higher than that of a human-made database. These results confirm that the proposed method constructs a precise image database fully automatically.
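For readers less familiar with the metrics reported above, the following minimal sketch shows how precision, recall, and F-measure are conventionally computed for a noise image detector on a single query. The image identifiers and the function name are hypothetical, and the figures in the abstract are averages over experiments, so they are not reproduced here.

```python
from typing import Set, Tuple


def precision_recall_f(predicted_noise: Set[str],
                       actual_noise: Set[str]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-measure (harmonic mean) for one query."""
    true_positives = len(predicted_noise & actual_noise)
    precision = true_positives / len(predicted_noise) if predicted_noise else 0.0
    recall = true_positives / len(actual_noise) if actual_noise else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical image IDs: the detector flags four images, five are truly noise.
flagged = {"img1", "img2", "img3", "img4"}
truly_noise = {"img2", "img3", "img4", "img5", "img6"}

p, r, f = precision_recall_f(flagged, truly_noise)  # 3/4, 3/5, 2/3
```

Because the F-measure is averaged per experiment rather than recomputed from the averaged precision and recall, an average F-measure need not equal the harmonic mean of the average precision and average recall.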
In this paper, we propose a model that recommends application customers to salespeople in order to reduce their time and cost. The model is built on customer information, premise information, and two new features extracted from salespeople’s feedback. Estimation precision is evaluated with three algorithms: SVM, Decision Tree, and Random Forest (RF). We applied the three algorithms to five data sets. The highest estimation precision was 49.2%, obtained with RF on the data restricted to selected important features (data 5). Using the customer information and property information (data 3), the estimation precision was 38.7%. When the two new features were added to that data (data 4), the estimation precision was 44.2%. Comparing data 3 and data 4, we found that the two new features increased the estimation precision. Comparing data 4 and data 5, we found that selecting important features also increased the estimation precision. In addition, according to feedback from ABC TENPO Inc., with which we are conducting joint research, our model achieved higher precision than a veteran salesperson.
The Total Environment for Text Data Mining (TETDM) has been under construction since 2010 as a near-future challenge theme. This environment has been developed not only for text mining experts but also for everyone who uses electronic texts. TETDM includes both various text mining tools and a framework for knowledge emergence, the process of collecting analysis results and attaching a general interpretation to them. Therefore, TETDM can serve as an environment for collaboration between AI techniques and human beings. This paper describes the social practices that have been carried out using TETDM and the plans for future ones. We expect that this total environment will be used for various purposes in public and private settings in the near future.
Dynamical systems, which are described using differential equations, offer numerous benefits for time-series information processing. They can accommodate continuous changes and dynamic features. However, they are not well suited to processing complex spatiotemporal patterns such as a temporal order of motions. Therefore, they are often combined with symbol-processing systems or discrete-event systems to produce hybrid systems. In this paper, we propose a method of processing sequences of elementary motions based only on distributed representations and a neurodynamical system. To assess the method’s possibilities, we constructed a human motion estimation system using a trajectory attractor model: a recurrent neural network with continuous-time dynamics. This system can deal analogically with novel hand and arm motions based on the similarity between code patterns. Additionally, it can process complex sequences of motions in a robust manner because the network state is attracted to a long trajectory attractor formed in a series of subspaces corresponding to elementary motions. The network then makes stable state transitions along the trajectory. Experimental results obtained from surface myoelectric signals show that the system estimated 15 complex hand and arm motions with an average accuracy of about 86%, demonstrating the great potential of this system.
Machine learning on RDF data has become important in the field of the Semantic Web. However, RDF graph structures are redundantly represented by noisy and incomplete data on the Web. In order to apply SVMs to such RDF data, we propose a kernel function that computes the similarity between resources on RDF graphs. This kernel function is defined by features selected from RDF paths, which eliminate the redundancy of RDF graphs through information gain ratio filtering. Kernel functions are a very flexible framework and can be applied not only to SVMs but also to principal component analysis, canonical correlation analysis, clustering, and so on. However, calculating the proposed kernel function is costly in time and memory because the number of features in RDF graphs increases exponentially. Therefore, we propose an efficient algorithm that calculates the kernel for redundant features from RDF graphs. Our experiments show the performance of the proposed kernel with SVMs on classification tasks for RDF resources and its advantages over existing kernels.
This paper reports progress from 2014 to 2015 on the development of solvers for Japanese comprehension questions in university entrance exams. The target questions are the multiple-choice questions in the essay section (Question No. 1) of Japanese Language (Kokugo) in the National Center Test. In 2014, we introduced a new scoring function using clause boundaries, which are automatically detected by our newly developed tool. The score of a choice is calculated as the average clause similarity between the choice and a selected part of the text body. In 2015, we developed a machine-learning-based method that uses seventeen features to determine the answer. They include surface-similarity-based features, clause-similarity-based features, and choice-discriminative features. In addition to the first formal run of the Torobo Project in 2013, we participated in the two formal runs in 2014 and 2015; so far, we have been the only participant to submit results for Contemporary Japanese Language. After the 2015 formal run, we conducted an experiment using 276 questions to compare all developed solvers with various parameters. The best performance was obtained by a 2015 solver, which produced 117 (42%) correct answers. For the subset of 56 previous official questions from the National Center Test, a 2014 solver was the best, producing 32 (57%) correct answers. However, there is no statistically significant difference between the best 2015 solver and our first solver developed in 2013.
In this paper, we propose an agent-based urban model in which the relationship between a central urban area and a suburban area is expressed simply. The model implements the allocation and bustle of a public facility where people stop off in daily life. We show that residents’ choices of transportation and residence change the urban structure and environment. We also discuss how a compact urban structure and a reduction in carbon dioxide emissions can be achieved through urban development policies and improvements in the attractiveness of the facility for pedestrians and cyclists. In addition, we conduct an experiment in which cars are excluded from the city center. The experimental results confirmed that this automobile control measure would be effective in decreasing the use of automobiles along with a compact urban structure.
We propose a method for extracting semantic structure from procedural texts for more intelligent search and analysis. Procedural texts describe a sequence of procedures for creating an object or bringing an object into a certain state, and have many potential applications in artificial intelligence. Procedural texts are relatively clear, with no modality or dependence on viewpoint. Thus, their procedures can be described using flow graphs. We adopt recipe texts as examples of procedural texts and directed acyclic graphs (DAGs) to represent their semantic structure. The nodes of a flow graph are important terms in a recipe text, and the edges are relationships between the terms, covering language phenomena such as dependency, predicate-argument structure, and coreference. Because trees cannot sufficiently represent the procedures of recipes, DAGs are adopted as the representation. We first apply word segmentation and automatic term recognition, and then convert the entire text into a flow graph. For word segmentation and automatic term recognition, we adopt existing methods. We then propose a method for estimating a flow graph from the term recognition results. Our method is based on the maximum spanning tree algorithm, which is popular in dependency parsing, and deals simultaneously with the language phenomena listed above. We experimentally evaluate our method on a flow graph corpus created from various recipe texts on the Internet.