In this work, we address the topic classification of spoken inquiries in Japanese that are received by a speech-oriented guidance system operating in a real environment. The classification of spoken inquiries is often hindered by automatic speech recognition (ASR) errors, the sparseness of features and the shortness of spontaneous speech utterances. Here, we compare the performances of a support vector machine (SVM) with a radial basis function (RBF) kernel, PrefixSpan boosting (pboost) and the maximum entropy (ME) method, which are supervised learning methods. We also combine their predictions using a stacked generalization (SG) scheme. We also perform an evaluation using words or characters as features for the classifiers. Using characters as features is possible in Japanese owing to the presence of kanji, ideograms originating from Chinese characters that represent not only sounds but also meanings. We performed analyses on the performance of the above methods and their combination in dealing with the indicated problems. Experimental results show an F-measure of 86.87% for the classification of ASR results from children's inquiries with an average performance improvement of 2.81% compared with the performance of individual classifiers, and an F-measure of 93.96% with an average improvement of 1.89% for adults' inquiries when using the SG scheme and character features.
In this paper, we present our work on collecting training texts from the Web for constructing language models in colloquial and spontaneous Chinese automatic speech recognition systems. The selection involves two steps. First, web texts are selected using a perplexity-based approach in which the style-related words are strengthened by omitting infrequent topic words. Second, the selected texts are clustered based on non-noun part-of-speech words, and optimal clusters are chosen by referring to a set of spontaneous seed sentences. With the proposed method, we selected over 3.80M sentences. A qualitative analysis of the selected results shows that colloquial, spontaneous-speech-like texts are effectively selected. The effectiveness of the selection is also quantitatively verified by speech recognition experiments. Using a language model that interpolates a model trained on the selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain colloquial and spontaneous test set. We reduced the character error rate by 4.0% compared with the baseline model, while the word coverage also increased greatly. We also verified that the proposed method is superior to a conventional perplexity-based approach, with a difference of 1.57% in character error rate.
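The perplexity-based first step mentioned above can be illustrated with a minimal sketch. It assumes whitespace-tokenized sentences (Chinese text would need word segmentation first), a unigram model with add-one smoothing, and hypothetical function names; it shows only the conventional perplexity ranking, not the authors' strengthening of style-related words or the clustering step.

```python
import math
from collections import Counter


def train_unigram(seed_sentences):
    """Return a smoothed unigram probability function (add-one smoothing)."""
    counts = Counter(tok for s in seed_sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def perplexity(sentence, prob):
    """Per-token perplexity of a sentence under the unigram model."""
    tokens = sentence.split()
    if not tokens:
        return math.inf
    log_sum = sum(math.log(prob(tok)) for tok in tokens)
    return math.exp(-log_sum / len(tokens))


def select_by_perplexity(candidates, seed_sentences, keep):
    """Keep the `keep` candidates that look most like the seed sentences."""
    prob = train_unigram(seed_sentences)
    return sorted(candidates, key=lambda s: perplexity(s, prob))[:keep]


# Toy data, invented for illustration only.
seeds = ["well you know it is fine", "you know I think so"]
web = ["you know it is fine", "quarterly fiscal revenue report"]
selected = select_by_perplexity(web, seeds, keep=1)
```

Lower perplexity means the candidate is closer in style to the seed sentences, so sorting in ascending order and truncating gives the selected set.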
Spoken Term Detection (STD) that considers the out-of-vocabulary (OOV) problem has generated significant interest in the field of spoken document processing. This study describes STD with false detection control using phoneme transition networks (PTNs) derived from the outputs of multiple speech recognizers. PTNs are similar to subword-based confusion networks (CNs), which are originally derived from a single speech recognizer. Since the PTN-formed index is based on the outputs of multiple speech recognizers, it is robust to recognition errors. Therefore, the PTN should also be robust to recognition errors in an STD task when compared with the CN-formed index from a single speech recognition system. Our PTN-formed index was evaluated on a test collection. The experiment showed that the PTN-based approach effectively detected OOV terms and improved the F-measure value from 0.370 to 0.639 when compared with a baseline approach. Furthermore, we applied two false detection control parameters to the calculation of the detection score: one based on a majority voting scheme and the other a measure of the ambiguity of the CN. By introducing these parameters, the performance of STD was found to be better (0.736 for the F-measure value) than that without any parameters (0.639).
Spam over Internet Telephony (SPIT) will become a serious threat in the near future because of the growing number of Voice over IP (VoIP) users, the ease of spam implementation, and the low cost of VoIP service. Due to the real-time processing requirements of voice communication, SPIT is more difficult to filter than email spam. In this paper, we propose a trust-based mechanism that uses the duration of calls and call direction between users to distinguish legitimate callers from spammers. The trust value is adjustable according to the calling behavior. We also propose a trust inference mechanism in order to calculate a trust value for an unknown caller to a callee. Realistic simulation results show that our approaches are effective in discriminating spam calls from legitimate calls.
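As a rough illustration of the idea above, the sketch below adjusts a trust value from call duration and call direction and infers trust in an unknown caller through an intermediary. The thresholds, step size, update rule, and the product-of-trust inference are all assumptions made for this example; the abstract does not specify the actual formulas.

```python
def update_trust(trust, duration_s, callee_called_back,
                 long_call_s=60.0, step=0.1):
    """Raise trust after a long call or a returned call, lower it otherwise.

    `long_call_s` and `step` are hypothetical parameters.
    """
    if duration_s >= long_call_s or callee_called_back:
        trust += step
    else:
        trust -= step
    return min(1.0, max(0.0, trust))  # clamp to [0, 1]


def infer_trust(trust, caller, callee, default=0.0):
    """Infer the callee's trust in an unknown caller via one intermediary.

    `trust[a][b]` is how much user a trusts user b. The inferred value is the
    best product trust[callee][m] * trust[m][caller] over intermediaries m.
    """
    direct = trust.get(callee, {})
    if caller in direct:
        return direct[caller]
    paths = [t * trust[m][caller]
             for m, t in direct.items()
             if caller in trust.get(m, {})]
    return max(paths, default=default)


# Toy example: bob trusts carol, carol trusts alice, bob has never met alice.
trust = {"bob": {"carol": 0.9}, "carol": {"alice": 0.8}}
inferred = infer_trust(trust, caller="alice", callee="bob")
```

A callee with no path to the caller falls back to `default`, which a real filter would presumably set conservatively.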
In this paper, we propose a method to estimate the node distribution for pedestrians with information terminals. The method enables us to provide situation-aware services such as intelligent navigation that tells the user the best route around congested regions. In the proposed method, each node is supposed to know its location roughly (i.e., within some error range) and to maintain a density map covering its surroundings. This map is updated when a node receives a density map from a neighboring node. Each node also updates the density map in a timely fashion by estimating the change of the density due to node mobility. The node distribution is obtained from the density map by choosing cells with the highest density in a greedy fashion. Simulation experiments showed that the proposed method keeps the average position error below 10 m.
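The greedy extraction of node positions from a density map can be sketched as follows. The map representation (a dictionary from grid cell to expected node count) and the rule of subtracting one node per pick are assumptions for illustration; the abstract only states that the highest-density cells are chosen greedily.

```python
def estimate_node_positions(density, num_nodes):
    """Greedily place `num_nodes` nodes on the densest cells.

    `density` maps a grid cell (x, y) to the expected number of nodes in it.
    After each pick, one node's worth of density is removed from that cell.
    """
    remaining = dict(density)  # do not modify the caller's map
    positions = []
    for _ in range(num_nodes):
        if not remaining:
            break
        cell = max(remaining, key=remaining.get)
        if remaining[cell] <= 0:
            break  # no density left to explain further nodes
        positions.append(cell)
        remaining[cell] -= 1.0
    return positions


# Toy density map, invented for illustration.
density_map = {(0, 0): 2.2, (0, 1): 0.9, (1, 1): 0.1}
positions = estimate_node_positions(density_map, num_nodes=3)
```

Here the cell (0, 0) is picked twice before (0, 1), since its expected count exceeds that of the other cells until two nodes have been placed there.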
For WLANs, the efficiency of the MAC protocol is related to throughput and power saving, and is therefore an important issue for wireless communication with limited bandwidth. Much research has been carried out and some of the proposed schemes are effective. However, most proposals are based on either contention mode or schedule mode, and neither kind has the advantages of both. In this paper, we propose a MAC protocol named OSRAP, a scheduled random access protocol for one-hop WLANs. OSRAP works in two modes, schedule mode and contention mode; it adapts dynamically to the traffic load and achieves a throughput close to the transmission capacity in the saturated case. Unlike conventional hybrid protocols, no node has to intentionally reset any parameter other than its queue length as the traffic load changes. A distinguishing feature of this scheme is the novel way of allowing nodes to work with low delay, as in the contention-based mode, and achieve a high throughput, as in the schedule-based mode, without the complicated on-line estimation required in previous schemes. This makes OSRAP simpler and more reliable. Our analysis shows that the scheme can greatly improve the probability of successful transmission, which means high throughput and low delay.
In wide-area disaster situations, wireless mesh networks lose data communication reachability among arbitrary pairs of base stations due to the loss of routing information propagation and synchronization. This paper uses a Delaunay overlay approach to propose a distributed networking method in which detour overlay paths are incrementally added to a wireless mesh network in wide-area disaster situations. For this purpose, the following functions are added to each base station for wireless multi-hop communication: obtaining the spatial location, exchanging spatial location messages between base stations, transferring data based on spatial locations of base stations. The proposed method always constructs a Delaunay overlay network with detour paths on the condition that a set of wireless links provides a connected graph even if it does not initially provide reachability among arbitrary base stations in the connected graph. This is different from the previous method that assumes a connected graph and reachability. This paper therefore also shows a new convergence principle and implementation guidelines that do not interfere with the existing convergence principle. A simulation is then used to evaluate the detour length and table size of the proposed method. It shows that the proposed method has scalability. This scalability provides adaptable low-link quality and increases the number of nodes in wide-area disaster situations.
A content distribution network (CDN) where the information provider can distribute copies of contents to a group of cache servers is a very useful solution in various on-line services. An application-layer multicasting (ALM) system is a candidate technology for constructing the CDN, and can be achieved by utilizing a Locator/Identifier Separation Protocol (LISP) which is actively discussed in IETF. A mapping system which manages relationship between each multicast group and the group members (i.e., cache servers) is a core component of the system, but the centralized system requires costly resources for handling a large-scale CDN. In this study, we propose a new mapping system for the LISP-based application-layer multicasting system using distributed cloud computing technologies. The proposed system utilizes a distributed hash table (DHT)-based network consisting of a large number of LISP routers to manage the membership information of multicast groups, and shortens the start-up time needed for newly-arrived multicast members to start communicating with other members. This paper considers the performance of the proposed system by using a realistic and a large-scale computer simulation and clarifies that the mapping system can halve the start-up time compared with the simple DHT-based system.
Locating malicious bots in a large network is problematic because the internal firewalls and network address translation (NAT) routers of the network unintentionally contribute to hiding the bots' host address and malicious packets. However, eliminating firewalls and NAT routers merely for locating bots is generally not acceptable. In the present paper, we propose an easy to deploy, easy to manage network security control system for locating a malicious host behind internal secure gateways. The proposed network security control system consists of a remote security device and a command server. The remote security device is installed as a transparent link (implemented as an L2 switch), between the subnet and its gateway in order to detect a host that has been compromised by a malicious bot in a target subnet, while minimizing the impact of deployment. The security device is controlled remotely by ‘polling’ the command server in order to eliminate the NAT traversal problem and to be firewall friendly. Since the remote security device exists in transparent, remotely controlled, robust security gateways, we regard this device as a beneficial bot. We adopt a web server with wiki software as the command server in order to take advantage of its power of customization, ease of use, and ease of deployment of the server.
We present a Bayesian analysis method that estimates the harmonic structure of musical instruments in music signals on the basis of psychoacoustic evidence. Since the main objective of multipitch analysis is joint estimation of the fundamental frequencies and their harmonic structures, the performance of harmonic structure estimation significantly affects fundamental frequency estimation accuracy. Many methods have been proposed for estimating the harmonic structure accurately, but no method has been proposed that satisfies all these requirements: robust against initialization, optimization-free, and psychoacoustically appropriate and thus easy to develop further. Our method satisfies these requirements by explicitly incorporating Terhardt's virtual pitch theory within a Bayesian framework. It does this by automatically learning the valid weight range of the harmonic components using a MIDI synthesizer. The bounds are termed “overtone corpus.” Modeling demonstrated that the proposed overtone corpus method can stably estimate the harmonic structure of 40 musical pieces for a wide variety of initial settings.
The time-span tree in Lerdahl and Jackendoff's theory has been regarded as one of the most dependable representations of musical structure. We first show how to formalize the time-span tree as a feature structure, introducing head and span features. Then, we introduce join and meet operations among them. The span feature represents the temporal length during which the head pitch event is most salient. Here, we regard this temporal length as the amount of information which the pitch event carries; i.e., when the pitch event is reduced, the information comparable to the length is lost. This allows us to define the notion of distance as the sum of lost time-spans. Then, we employ the distance as a promising candidate for a stable and consistent metric of similarity. We show that the distance possesses proper mathematical properties, including the uniqueness of the distance among the shortest paths. After we show examples with concrete music pieces, we discuss how our notion of distance is positioned among other notions of distance/similarity. Finally, we summarize our contributions and discuss open problems.
Given a relatively small selection of guitar scores for a large population of guitarists, there should be a certain demand for systems that can automatically arrange an arbitrary score for guitars. Our aim in this paper is to formulate the “fingering decision” and “arrangement” in a unified framework that can be cast as a decoding problem of a hidden Markov model (HMM). The left hand forms on the fingerboard are considered as the hidden states and the note sequence of a given score as an observed sequence generated by the HMM. Finding the most likely sequence of the hidden states thus corresponds to performing fingering decision or arrangement. The manual setting of HMM parameters reflecting preference of beginner guitarists lets the framework generate natural fingerings and arrangements suitable for beginners. Some examples of fingering and arrangement produced by the proposed method are presented.
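Decoding the most likely sequence of hidden states in an HMM is commonly done with the Viterbi algorithm. The sketch below is a generic log-domain Viterbi decoder applied to a toy model with two hypothetical left-hand forms ("low" and "high" positions); the states, probabilities, and note names are invented for illustration and are not the parameters used in the paper.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state sequence for the observations."""
    # Each column maps a state to (best log score, best previous state).
    columns = [{s: (log_start[s] + log_emit[s].get(observations[0], -math.inf),
                    None)
                for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, columns[-1][p][0] + log_trans[p][s]) for p in states),
                key=lambda item: item[1])
            column[s] = (score + log_emit[s].get(obs, -math.inf), prev)
        columns.append(column)
    # Backtrace from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def logs(probabilities):
    """Convert a dict of positive probabilities to log probabilities."""
    return {key: math.log(value) for key, value in probabilities.items()}


# Toy model: two hand forms, each able to play a few notes.
forms = ["low", "high"]
start = logs({"low": 0.5, "high": 0.5})
trans = {"low": logs({"low": 0.8, "high": 0.2}),    # moving the hand is costly
         "high": logs({"low": 0.2, "high": 0.8})}
emit = {"low": logs({"E": 0.35, "F": 0.35, "G": 0.3}),
        "high": logs({"G": 0.1, "A": 0.45, "B": 0.45})}

fingering = viterbi(["E", "G", "A"], forms, start, trans, emit)
```

In this toy setting the transition probabilities play the role of a preference for small hand movements, which is where manually set parameters reflecting a beginner's preferences would enter.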
In this work, I propose an interface for musical instruments for assigning arbitrary timbres to arbitrary objects including personal belongings such as a table or cup, or actions such as vocalization by audio signal processing, to enable the users to play music as if they were playing the actual acoustic musical instrument which generates the simulated timbres. This system requires no special device, only a standard microphone. The assigned timbres are produced not by a triggered PCM (pulse-code modulation) waveform in response to the detected attacks in the microphone input source but by the modeling process of the system that generates the timbres by modifying the microphone input source itself. It thereby enables the users to play music with very sensitive expression, including very small sounds, fast passages, and the effects of playing style. Additionally, in this system, we can assign separate individual timbres to each of a set of objects at a time and play polyphonic music.
Some patients with dementia repeat stereotypical utterances and/or scream in agitation for several hours. Music therapy is a method known to alleviate the symptoms of dementia. Altshuler explained that a music therapist should first play music that matches the current mood of a patient according to the iso-principle, a principle of music therapy. We thought that if certain types of music can calm patients down, a music therapy system that is usable by musical novices could be useful in nursing homes. Therefore, we present a music therapy system, “MusiCuddle,” that automatically plays a short musical phrase (tune) in response to a caregiver's simple key entry. This music overlaps with the patient's utterances and/or screaming. The first note of the tune is the same as the fundamental pitch (F0) of the patient's utterances. We compiled four types of tunes (chords, cadences, Japanese school songs, and phrases created from the patients' utterances) into a database. The cadences were selected from established music scores and began with an unsteady and/or agitated chord in order to resonate with the patient's mental instability. We conducted a case study to investigate how MusiCuddle changes a patient's behaviors. In the case study, the pitches extracted from the patient's utterances were varied and wide-ranging. We thought her level of agitation might be reflected in her pitches. Pitch differences in the first note affect and change the entire mood of the music. Therefore, it may be said that MusiCuddle can play music that resonates with a patient's mood by extracting the pitch from the patient's utterance, in accordance with the iso-principle. Moreover, we recorded the patient's utterances and compared them with vs. without using MusiCuddle to estimate the influence of MusiCuddle. The results suggested that tunes presented by MusiCuddle may give patients an opportunity to stop repeating stereotypical utterances.
We propose a telepresence system with a real human face-shaped screen. This system tracks the remote user's face and extracts the head motion and the face image. The face-shaped screen moves with three degrees of freedom (DOF) to reflect the user's head gestures. We expect this system to accurately convey the user's non-verbal information in remote communication. In particular, it can transmit the user's gaze direction in 3D space, which is not correctly transmitted by a 2D screen; this problem is known as “the Mona Lisa effect.” To evaluate how this system can contribute to communication, we conducted three experiments. These evaluations showed that the recognizable angles of the face-shaped screen were larger and the recognition of head directions was better than with a flat 2D screen. More importantly, we also found that the face-shaped screen could accurately convey gaze directions and solves the Mona Lisa effect problem even when the screen size is reduced.
In this paper, the authors propose a new remote robot control interface that reduces the complexity of robot control. The proposed interface constrains the robot's movements depending on the target object that the operator wants to observe. The interface displays the constraints to the operator on a screen with the help of Augmented Reality (AR) technology. We named the interface “Object-defined remote robot control interface” because the interface provides suitable procedures for the objects that need to be operated on. The interface receives information about the robot and candidate objects from a camera that has been set up to capture a bird's-eye view of the target environment and displays this information on a touch screen display. When the operator selects an object as the target by touching it on the display, constrained tracks for the robot's movements and their corresponding AR representations are generated on the screen. A block assembly task was conducted to evaluate this interface. The results showed the system's effectiveness in terms of both task completion time and operation time.
We propose new algorithms for computing the n-th root of a quad-double number. We construct an iterative scheme that has quartic convergence and propose algorithms that require only about 50% to 60% of the double-precision arithmetic operations of the existing algorithms. The proposed algorithms perform about 1.7 times faster than the existing algorithms, yet maintain the same accuracy. They are sufficiently effective and efficient to replace the existing algorithms.
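For orientation, the sketch below computes an n-th root to roughly quad-double accuracy (about 64 decimal digits) with the classical Newton iteration, starting from a double-precision guess. It is not the quartically convergent scheme proposed in the paper: Newton's method converges only quadratically, and Python's `decimal` module stands in for quad-double arithmetic. The function name and the `digits` parameter are assumptions for this example.

```python
from decimal import Decimal, localcontext


def nth_root(a, n, digits=64):
    """Approximate a**(1/n) for a >= 0 and integer n >= 1 by Newton iteration."""
    if a < 0 or n < 1:
        raise ValueError("require a >= 0 and n >= 1")
    if a == 0:
        return Decimal(0)
    with localcontext() as ctx:
        ctx.prec = digits + 10               # guard digits for rounding error
        a = Decimal(a)
        x = Decimal(float(a) ** (1.0 / n))   # double-precision initial guess
        for _ in range(10):                  # precision doubles per step
            # Newton step for f(x) = x**n - a
            x_new = ((n - 1) * x + a / x ** (n - 1)) / n
            if x_new == x:
                break
            x = x_new
        return +x                            # round to the working precision


root2 = nth_root(2, 2)
cube_root = nth_root(27, 3)
```

Because the initial guess already carries about 16 correct digits, only a few Newton steps are needed; a higher-order iteration such as the one in the paper would reduce the number of full-precision steps further.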
In this paper, we focus on a monitoring environment with wireless sensor network in which multiple mobile sink nodes traverse a given sensing field in different spatial-temporal patterns and collect various types of environmental data with different deadline constraints. For such an environment, we propose an energy-efficient data collection method that reduces intermediate transmission in multi-hop communication while meeting predetermined deadlines. The basic approach of the proposed method is to temporarily gather (or buffer) the observed data into several sensor nodes around the moving path of the mobile sink that would meet their deadlines at the next visit. Then, the buffered data is transferred to the mobile sink node when it visits the buffering nodes. We also propose a mobile sink-initiated proactive routing protocol with low cost (MIPR-LC) that efficiently constructs routes to the buffering nodes on each sensor node. Moreover, we simulate the proposed collection method and routing protocol to show their effectiveness. Our results confirm that the proposed method can gather almost all of the observed data within the deadline, while reducing the intermediate transmissions by 30%, as compared with an existing method. In addition, the MIPR-LC method can reduce the transmissions for the route construction by up to 12% when compared with a simple routing protocol.
When coupling data mining (DM) and learning agents, one of the crucial challenges is the need for the Knowledge Extraction (KE) process to be lightweight enough so that even resource (e.g., memory, CPU etc.) constrained agents are able to extract knowledge. We propose the Stratified Ordered Selection (SOS) method for achieving lightweight KE using dynamic numerosity reduction of training examples. SOS allows for agents to retrieve different-sized training subsets based on available resources. The method employs ranking-based subset selection using a novel Level Order (LO) ranking scheme. We show representativeness of subsets selected using the proposed method, its noise tolerance nature and ability to preserve KE performance over different reduction levels. When compared to subset selection methods of the same category, the proposed method offers the best trade-off between cost, reduction and the ability to preserve performance.
In this paper, we propose a method to identify high-quality Wikipedia articles by using implicit positive ratings. One of the major approaches for assessing Wikipedia articles is based on the text survival ratio. In this approach, when a text survives multiple edits, the text is assessed as high quality. However, the problem is that many low-quality articles are misjudged as high quality, because editors do not always read the whole article. If there is a low-quality text at the bottom of a long article, and the text has not been seen by the other editors, then the text survives many edits and is assessed as high quality. To solve this problem, we use a section or a paragraph as the unit instead of a whole page. In our method, if an editor edits an article, the system considers that the editor gives positive ratings to the section or paragraph that the editor edits. From an experimental evaluation, we confirmed that the proposed method could improve the accuracy of quality values for articles.
Fast computation of multiple reflections and scattering among complex objects is very important in photorealistic rendering. This paper applies the plane-parallel scattering theory to the rendering of densely distributed objects such as trees. We propose a simplified plane-parallel scattering model that has very simple analytic solutions, allowing efficient evaluation of multiple scattering. A geometric compensation method is also introduced to cope with the infinite plane condition, required by the plane-parallel model. The scattering model was successfully applied to tree rendering. Comparison with a Monte Carlo method was made and reasonable agreement was confirmed. A rendering system based on the model was implemented and multiple inter-reflections were effectively obtained. The view-independent feature of the model allows fast display of scenes. The pre-computation is also modest, permitting interactive control of lighting conditions.
Recently several large-scale databases of motion-capture data streams have been constructed. We present a novel method to index motion-capture data streams in such databases. We pay attention to posture variation; the impression of the visual aspect of the whole body is regarded as important. The spatial distribution of body segments is statistically summarized as a feature vector having only 12 dimensions. The experimental results showed that the feature vector we introduced provided properties comparable to those of the methods previously proposed, even though its dimensionality is extremely low.
This paper presents two passive pointing systems for a distant screen based on an acoustic position estimation technology. These systems are designed to interact with a distant screen such as a television set at home or digital signage in public as an alternative to a touch screen. The first system consists of a distant screen, three loudspeakers set around the screen, and two microphones as a pointing device. The second system consists of a distant screen, two loudspeakers set around the screen, and a smartphone equipped with a microphone and a gravity sensor as a pointing device. The position of the pointer on the screen is theoretically determined by the position and direction of the pointing device in space. The second system approximates the position and direction by the two-dimensional position of the microphone horizontally and the pitch angle from the gravity sensor vertically. In this paper, we report experiments to evaluate the performance of these systems. The loudspeakers of these systems radiate burst signals from 18 to 24 kHz. The position of the microphone is estimated at a frame rate of 15 frames per second with a latency of 0.4 s. The accuracy of the pointer was measured as an angle error below 10 degrees for 100% of all frames. We confirmed that the systems are accurate enough to point to one of several partitioned areas on the screen.