IPSJ Online Transactions
Online ISSN : 1882-6660
ISSN-L : 1882-6660
Volume 2
Displaying 1-21 of 21 articles from this issue
  • Takafumi Miyata, Yusaku Yamamoto, Shao-Liang Zhang
    Article type: Numerical Computation
    Subject area: Regular Paper
    2009 Volume 2 Pages 1-14
    Published: 2009
    Released on J-STAGE: January 05, 2009
    JOURNAL FREE ACCESS
    In this paper, we propose a fully pipelined multishift QR algorithm to compute all the eigenvalues of a symmetric tridiagonal matrix on parallel machines. Existing approaches for parallelizing the tridiagonal QR algorithm, such as the conventional multishift QR algorithm and the deferred shift QR algorithm, have suffered from either inefficiency of processor utilization or deterioration of convergence properties. In contrast, our algorithm realizes both efficient processor utilization and improved convergence properties at the same time by adopting a new shifting strategy. Numerical experiments on a shared memory parallel machine (Fujitsu PrimePower HPC2500) with 32 processors show that our algorithm is up to 1.9 times faster than the conventional multishift algorithm and up to 1.7 times faster than the deferred shift algorithm.
    Download PDF (589K)
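For context on the entry above: the paper builds on the classic shifted QR iteration for symmetric tridiagonal matrices. Below is a minimal NumPy sketch of the serial single-shift baseline (Wilkinson shift plus deflation); the paper's fully pipelined multishift strategy itself is not reproduced, and the dense `np.linalg.qr` call is purely illustrative.

```python
import numpy as np

def wilkinson_shift(T):
    """Eigenvalue of the trailing 2x2 block closest to T[-1, -1]."""
    a, b, c = T[-2, -2], T[-2, -1], T[-1, -1]
    d = (a - c) / 2.0
    return c - np.sign(d if d != 0 else 1.0) * b**2 / (abs(d) + np.hypot(d, b))

def tridiagonal_qr_eigenvalues(T, tol=1e-12, max_iter=10000):
    T = T.astype(float).copy()
    eigenvalues = []
    n = T.shape[0]
    for _ in range(max_iter):
        if n == 1:
            eigenvalues.append(T[0, 0])
            break
        # Deflate when the last off-diagonal entry is negligible.
        if abs(T[n - 2, n - 1]) < tol * (abs(T[n - 2, n - 2]) + abs(T[n - 1, n - 1])):
            eigenvalues.append(T[n - 1, n - 1])
            n -= 1
            T = T[:n, :n]
            continue
        mu = wilkinson_shift(T)        # shift chosen from the trailing 2x2 block
        Q, R = np.linalg.qr(T - mu * np.eye(n))
        T = R @ Q + mu * np.eye(n)     # similarity transform preserves eigenvalues
    return np.array(eigenvalues)

# Example: 1-D Laplacian; exact eigenvalues are 2 - 2*cos(k*pi/(n+1)).
n = 6
T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
print(np.sort(tridiagonal_qr_eigenvalues(T)))
```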
  • Hoang Anh Tuan, Katsuhiro Yamazaki, Shigeru Oyanagi
    Article type: Special Purpose System
    Subject area: Regular Paper
    2009 Volume 2 Pages 15-26
    Published: 2009
    Released on J-STAGE: January 05, 2009
    JOURNAL FREE ACCESS
The MD5 (Message Digest 5) hash algorithm is useful for verifying the correctness and integrity of an arbitrary message, but the data dependency in the critical path of its iterations causes a long computational delay and reduces the system's throughput. This paper describes three-stage and four-stage pipelined MD5 implementations (3SMD5 and 4SMD5) on an FPGA, which remove the data dependency within each iteration by data forwarding and break the single-step computation into three or four pipeline stages. The four-stage pipeline, with both the keys and the constant table located in BRAM, can operate at the highest frequency because its critical path at every stage is shortened to one adder and some data movement. Processing two messages in alternating order enables the four-stage pipeline architecture to achieve a higher frequency and throughput than related fine-grained pipelining architectures. Thus, the implementations achieve a good trade-off between hardware size and throughput.
    Download PDF (1243K)
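To make the data dependency mentioned in the entry above concrete, here is one MD5 step in plain Python: every step consumes the register value produced by the previous step, so the adder chain sits on the critical path. The round-1 constants and shifts are the genuine values from RFC 1321; the message words are a made-up example only.

```python
def rotl(x, s):
    return ((x << s) | (x >> (32 - s))) & 0xFFFFFFFF

def md5_step_round1(a, b, c, d, m, k, s):
    f = (b & c) | (~b & d & 0xFFFFFFFF)       # round-1 nonlinear function F
    return (b + rotl((a + f + m + k) & 0xFFFFFFFF, s)) & 0xFFFFFFFF

# Initial chaining values from RFC 1321.
a, b, c, d = 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476
K = [0xD76AA478, 0xE8C7B756, 0x242070DB, 0xC1BDCEEE]  # first four sine constants
S = [7, 12, 17, 22]                                    # first-round shift amounts
M = [0x00000080, 0, 0, 0]                              # example message words only

for i in range(4):
    # The new value becomes the next step's b: a serial dependency that a
    # naive pipeline cannot hide without forwarding.
    new = md5_step_round1(a, b, c, d, M[i], K[i], S[i])
    a, b, c, d = d, new, b, c

print(hex(a), hex(b), hex(c), hex(d))
```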
  • Chirag Shah, Koji Eguchi
    Article type: Research Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 27-35
    Published: 2009
    Released on J-STAGE: January 07, 2009
    JOURNAL FREE ACCESS
Several information organization, access, and filtering systems can benefit from document representations different from those used in traditional Information Retrieval (IR). Topic Detection and Tracking (TDT) is an example of such a domain. In this paper we demonstrate that traditional methods for term weighting do not capture topical information, which leads to an inadequate representation of documents for TDT applications. We present several hypotheses regarding the factors that can help improve document representation for Story Link Detection (SLD), a core task of TDT. These hypotheses are tested using various TDT collections. From our experiments and analysis we found that to obtain a faithful representation of documents in the TDT domain, we not only need to capture a term's importance in the traditional IR sense, but also to evaluate its topical behavior. Along with defining this behavior, we propose a novel measure that captures a term's importance at the collection level as well as its discriminating power for topics. This new measure leads to a much better document representation, as reflected by significant improvements in the results.
    Download PDF (239K)
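As a reference point for the entry above, here is conventional TF-IDF weighting plus a *hypothetical* topical factor in the spirit of the abstract (the fraction of a term's occurrences that fall in one topic). The paper's actual measure is not reproduced here; the corpus and labels below are invented for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "quake hits city rescue teams deployed".split(),
    "d2": "quake aftershocks rescue continues".split(),
    "d3": "election results city mayor wins".split(),
}
topics = {"d1": "quake", "d2": "quake", "d3": "election"}  # example labels

N = len(docs)
df = Counter(t for words in docs.values() for t in set(words))

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term)
    return tf * math.log(N / df[term])

def topical_factor(term, topic):
    # Hypothetical: fraction of documents containing `term` that lie in `topic`.
    on = sum(1 for d, w in docs.items() if topics[d] == topic and term in w)
    return on / df[term]

for term in ("quake", "city", "rescue"):
    print(term, round(tf_idf(term, "d1"), 3), round(topical_factor(term, "quake"), 2))
```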
  • Dong Jin, Tatsuo Tsuji, Ken Higuchi
    Article type: Research Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 36-48
    Published: 2009
    Released on J-STAGE: January 07, 2009
    JOURNAL FREE ACCESS
Data cube construction is a commonly used operation in data warehouses. Since both the volume of data stored and analyzed in a data warehouse and the amount of computation involved in data cube construction are very large, incremental maintenance of data cubes is highly effective. In this paper, we employ an extendible multidimensional array model to maintain data cubes. Such an array enables incremental cube maintenance without relocating any previously stored data, while computing the data cube efficiently by exploiting the fast random-access capability of arrays. We first present our data cube scheme and the related maintenance methods, and then present the corresponding physical implementation scheme. We developed a prototype system based on this physical implementation scheme and performed evaluation experiments on the prototype.
    Download PDF (1348K)
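The key property claimed above, growth without relocation, can be shown with a toy two-dimensional extendible array: each extension allocates its own contiguous segment, so earlier data never moves. This is a deliberate simplification of the paper's scheme, which handles n dimensions and builds cube maintenance on top.

```python
class ExtendibleArray2D:
    def __init__(self):
        self.rows = 0
        self.cols = 0
        self.segments = []   # (kind, fixed_index, storage), in extension order

    def extend_row(self):
        # The new row spans the current columns; one flat segment, no copying.
        self.segments.append(("row", self.rows, [0] * self.cols))
        self.rows += 1

    def extend_col(self):
        # The new column spans the current rows; one flat segment, no copying.
        self.segments.append(("col", self.cols, [0] * self.rows))
        self.cols += 1

    def _locate(self, i, j):
        # Cell (i, j) lives in the later of the extensions that created row i
        # and column j; scan extensions from newest to oldest.
        for kind, idx, seg in reversed(self.segments):
            if kind == "row" and idx == i and j < len(seg):
                return seg, j
            if kind == "col" and idx == j and i < len(seg):
                return seg, i
        raise IndexError((i, j))

    def __getitem__(self, ij):
        seg, off = self._locate(*ij)
        return seg[off]

    def __setitem__(self, ij, v):
        seg, off = self._locate(*ij)
        seg[off] = v

a = ExtendibleArray2D()
for _ in range(2):
    a.extend_row()
for _ in range(3):
    a.extend_col()
a[1, 2] = 42
print(a[1, 2], a[0, 0])   # -> 42 0
```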
  • Hideto Kasuya, Masahiko Sakai, Kiyoshi Agusa
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 49-70
    Published: 2009
    Released on J-STAGE: April 02, 2009
    JOURNAL FREE ACCESS
The present paper discusses the head-needed strategy and its decidable classes for higher-order rewrite systems (HRSs), an extension of the head-needed strategy for term rewriting systems (TRSs). We discuss strongly sequential and NV-sequential classes having the following three properties, which are mandatory for practical use: (1) the strategy reducing a head-needed redex is head normalizing, (2) whether a redex is head-needed is decidable, and (3) whether an HRS belongs to the class is decidable. The main difficulty in realizing (1) is caused by the β-reductions induced by the higher-order reductions. Since β-reduction changes the structure of higher-order terms, the definition of descendants for HRSs becomes complicated. To overcome this difficulty, we introduce a function, PV, to follow occurrences moved by β-reductions. We present a concrete definition of descendants for HRSs using PV and then show property (1) for orthogonal systems. We also show properties (2) and (3) using tree automata techniques, a ground tree transducer (GTT), and the recognizability of redexes.
    Download PDF (426K)
  • Hideto Kasuya, Masahiko Sakai, Kiyoshi Agusa
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 71-80
    Published: 2009
    Released on J-STAGE: April 02, 2009
    JOURNAL FREE ACCESS
It is known that the set of all redexes of a left-linear term rewriting system is recognizable by a tree automaton; that is, we can construct a tree automaton that accepts exactly the redexes. The present paper extends this result to Nipkow's higher-order rewrite systems, in which every left-hand side is a linear fully-extended pattern. A naive extension of the first-order method would require the automata to have infinitely many states in order to distinguish bound variables in λ-terms, even closed ones. To avoid this problem, it is natural to adopt de Bruijn notation, in which bound variables are represented as natural numbers (built from finitely many symbols, such as 0, s(0), and s(s(0))). We propose a variant of de Bruijn notation in which only bound variables are represented as natural numbers, since it is unnecessary to encode free variables this way.
    Download PDF (257K)
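A small converter illustrates the variant of de Bruijn notation described in the entry above: bound variables become indices (distance to their binder) while free variables keep their names. The tuple-based term encoding is an assumption made for this sketch.

```python
# Terms: ("lam", x, body), ("app", f, a), ("var", x).
def to_de_bruijn(term, env=()):
    kind = term[0]
    if kind == "var":
        name = term[1]
        if name in env:
            return ("bound", env.index(name))   # 0 = innermost enclosing binder
        return ("free", name)                    # free variables stay named
    if kind == "lam":
        _, x, body = term
        return ("lam", to_de_bruijn(body, (x,) + env))
    _, f, a = term
    return ("app", to_de_bruijn(f, env), to_de_bruijn(a, env))

# \x. \y. x (y z)  -->  lam lam (1 (0 z)), with z left free
t = ("lam", "x", ("lam", "y",
     ("app", ("var", "x"), ("app", ("var", "y"), ("var", "z")))))
print(to_de_bruijn(t))
# ('lam', ('lam', ('app', ('bound', 1), ('app', ('bound', 0), ('free', 'z')))))
```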
  • Shimpei Sato, Naoki Fujieda, Akira Moriya, Kenji Kise
    Article type: Processor Architecture
    Subject area: Regular Paper
    2009 Volume 2 Pages 81-92
    Published: 2009
    Released on J-STAGE: April 06, 2009
    JOURNAL FREE ACCESS
We developed SimCell, a new open-source multi-core processor simulator, from scratch. SimCell is modeled on the Cell Broadband Engine. In this paper, we describe the advantages of the functional-level simulator SimCell. By measuring the simulation speed, we confirm that SimCell achieves a practical simulation speed. We also present a cycle-accurate version of SimCell called SimCell/CA (CA stands for cycle accurate). The gap in execution cycles between SimCell/CA and the IBM simulator is 0.8% on average. Through a real case study using SimCell, we demonstrate its usefulness for processor architecture research.
    Download PDF (1047K)
  • Yasunori Yakiyama, Niwat Thepvilojanapong, Masayuki Iwai, Oru Mihirogi ...
    Article type: Numerical Algorithms
    Subject area: Regular Paper
    2009 Volume 2 Pages 93-106
    Published: 2009
    Released on J-STAGE: April 06, 2009
    JOURNAL FREE ACCESS
Although human activities on the World Wide Web are increasing rapidly due to the advent of many online services and applications, we still need to appraise how things such as merchandise in a store or pictures in a museum receive attention in the real world. To measure people's attention in the physical world, we propose SPAL, a Sensor of Physical-world Attention using Laser scanning. Using a laser scanner is challenging because it captures only the front-side contour of any detected object in the measurement area. Unlike cameras, a laser scanner poses no privacy problem because it does not recognize or record individuals. SPAL incorporates the important factors for calculating people's attention, i.e., lingering time, the direction people face, and the distance to a target object. To obtain this information, we develop three processing modules that extract it from the raw data measured by a laser scanner. We define two attention metrics and two measurement models to compute people's attention. To validate the proposed system, we implemented a prototype of SPAL and conducted experiments in a real-world environment. The results show that the proposed system is a good candidate for determining people's attention.
    Download PDF (1481K)
  • Yuxin Wang, Keizo Oyama
    Article type: Research Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 107-121
    Published: 2009
    Released on J-STAGE: July 09, 2009
    JOURNAL FREE ACCESS
We propose a web page classification method that is suitable for building web page collections and show its effectiveness through experiments. First, we describe a model that represents the structure of a group of surrounding pages, taking link relations and directory hierarchy relations into consideration, and a method for extracting features based on the model. The method is tested through classification experiments on two data sets using a support vector machine (SVM) as the classification algorithm, and its effectiveness is confirmed through comparison with a baseline and with the results of previous studies. The contribution of each part of the surrounding pages is also analyzed. Next, we test the method's performance over the whole recall-precision range and find that it is superior in the high-recall range. Finally, we estimate the performance of a three-grade classifier composed with the method and the amount of manual assessment required to build a web page collection.
    Download PDF (821K)
  • Jun Yao, Kosuke Ogata, Hajime Shimada, Shinobu Miwa, Hiroshi Nakashima ...
    Article type: Processor Architecture
    Subject area: Regular Paper
    2009 Volume 2 Pages 122-139
    Published: 2009
    Released on J-STAGE: July 14, 2009
    JOURNAL FREE ACCESS
To reduce processor energy consumption during low-workload, low-clock-frequency execution, one possible solution is ALU cascading while keeping the supply voltage unchanged. This cascading scheme uses a single cycle to execute multiple ALU instructions that have a data dependence between them, thus saving clock cycles over the whole execution. Since processor energy consumption is the product of power and execution time, ALU cascading is expected to help optimize energy in microprocessors operating at low frequency. To implement ALU cascading in a current superscalar processor, a specific instruction scheduler is required that wakes up a pair of cascadable instructions simultaneously despite the data dependence between them. Furthermore, because ALU cascading is only applied in the low-clock-frequency execution mode, the instruction scheduler must also support standard scheduling for normal-frequency execution. In this paper, we propose an instruction scheduling method that enables these additional wakeup features for ALU cascading without large hardware extensions. With this scheduler, the average IPC improves by 3.7% in SPECint2000 and 6.4% in MediaBench, compared to the baseline execution. The delay of the additional hardware required for ALU cascading is also evaluated to study its complexity.
    Download PDF (1932K)
  • Jorji Nonaka, Kenji Ono, Hideo Miyachi
    Article type: System Evaluation
    Subject area: Regular Paper
    2009 Volume 2 Pages 140-148
    Published: 2009
    Released on J-STAGE: July 14, 2009
    JOURNAL FREE ACCESS
This paper presents a performance evaluation of large-scale parallel image compositing on a T2K Open Supercomputer. Traditional image compositing algorithms were not designed to exploit the combined message-passing and shared-address-space parallelism provided by systems such as the T2K Open Supercomputer. In this study, we investigate the Binary-Swap image compositing method because of its promising potential for scalability. We propose some improvements to the Binary-Swap method aimed at fully exploiting the hybrid programming model. We obtained encouraging results from the performance evaluation conducted on the Todai Combined Cluster, a T2K Open Supercomputer at the University of Tokyo. The proposed improvements also show high potential for tackling the large-scale image compositing problem on leading-edge HPC systems, where an ever-increasing number of processing cores is involved.
    Download PDF (728K)
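For readers unfamiliar with Binary-Swap, the entry above can be grounded with a serial simulation of the exchange pattern: in each round, paired processes swap halves of their owned pixel span and composite, so each process ends with n/p fully composited pixels. To stay self-contained, the compositing operator here is per-pixel maximum (as in maximum-intensity projection) rather than the alpha-blending "over" operator, and "processes" are list entries rather than MPI ranks.

```python
import numpy as np

def binary_swap(images):
    """Composite a list of 1-D images; len(images) must be a power of two."""
    p, n = len(images), images[0].size
    imgs = [im.astype(float).copy() for im in images]
    lo, hi = [0] * p, [n] * p                # span of pixels each process owns
    step = 1
    while step < p:
        prev = [im.copy() for im in imgs]    # snapshot = "messages in flight"
        for r in range(p):
            partner = r ^ step               # pairwise exchange partner
            mid = (lo[r] + hi[r]) // 2
            a, b = (lo[r], mid) if r < partner else (mid, hi[r])
            imgs[r][a:b] = np.maximum(prev[r][a:b], prev[partner][a:b])
            lo[r], hi[r] = a, b              # owned span halves each round
        step *= 2
    out = np.empty(n)
    for r in range(p):                       # final gather of n/p pixels each
        out[lo[r]:hi[r]] = imgs[r][lo[r]:hi[r]]
    return out

# Four "rendered" partial images of 8 pixels each.
rng = np.random.default_rng(0)
parts = [rng.random(8) for _ in range(4)]
assert np.allclose(binary_swap(parts), np.maximum.reduce(parts))
```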
  • Eric M. Heien, Yoshiyuki Asai, Taishin Nomura, Kenichi Hagihara
    Article type: Parallel Computing
    Subject area: Regular Paper
    2009 Volume 2 Pages 149-161
    Published: 2009
    Released on J-STAGE: July 14, 2009
    JOURNAL FREE ACCESS
    Recent work in biophysical science increasingly focuses on modeling and simulating human biophysical systems to better understand the human physiome. One program to generate such models is insilicoIDE. These models may consist of thousands or millions of components with complex relations. Simulations of such models can require millions of time steps and take hours or days to run on a single machine. To improve the speed of biophysical simulations generated by insilicoIDE, we propose techniques for augmenting the simulations to support parallel execution in an MPI-enabled environment. In this paper we discuss the methods involved in efficient parallelization of such simulations, including classification and identification of model component relationships and work division among multiple machines. We demonstrate the effectiveness of the augmented simulation code in a parallel computing environment by performing simulations of large scale neuron and cardiac models.
    Download PDF (1213K)
  • Yoshiharu Kojima, Masahiko Sakai, Naoki Nishida, Keiichirou Kusakari, ...
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 162-174
    Published: 2009
    Released on J-STAGE: July 23, 2009
    JOURNAL FREE ACCESS
The reachability problem for an initial term, a goal term, and a term rewriting system (TRS) is to decide whether the goal term is reachable from the initial term by the TRS. A term is shallow if each variable in the term occurs at depth 0 or 1. Innermost reduction is a strategy that rewrites innermost redexes, and context-sensitive reduction is a strategy in which the rewritable positions are indicated by specifying arguments of function symbols. In this paper, we show that the reachability problem under context-sensitive innermost reduction is decidable for linear right-shallow TRSs. Our approach is based on the tree automata technique commonly used for analyzing reachability and related properties. We give a procedure that constructs tree automata accepting the sets of terms reachable from a given term by context-sensitive innermost reduction of a given linear right-shallow TRS.
    Download PDF (626K)
  • Keisuke Nakano, Sebastian Maneth
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 175-186
    Published: 2009
    Released on J-STAGE: September 09, 2009
    JOURNAL FREE ACCESS
    Macro tree transducers are a classical formal model for structural-recursive tree transformation with accumulative parameters. They have recently been applied to model XML transformations and queries. Typechecking a tree transformation means checking whether all valid input trees are transformed into valid output trees, for the given regular tree languages of input and output trees. Typechecking macro tree transducers is generally based on inverse type inference, because of the advantageous property that inverse transformations effectively preserve regular tree languages. It is known that the time complexity of typechecking an n-fold composition of macro tree transducers is non-elementary. The cost of typechecking can be reduced if transducers in the composition have special properties, such as being deterministic or total, or having no accumulative parameters. In this paper, the impact of such properties on the cost of typechecking is investigated. Reductions in cost are achieved by applying composition and decomposition constructions to tree transducers. Even though these constructions are well-known, they have not yet been analyzed with respect to the precise sizes of the transducers involved. The results can directly be applied to typechecking XML transformations, because type formalisms for XML are captured by regular tree languages.
    Download PDF (267K)
  • Takashi Yokota, Kanemitsu Ootsu, Takanobu Baba
    Article type: Network
    Subject area: Regular Paper
    2009 Volume 2 Pages 187-199
    Published: 2009
    Released on J-STAGE: October 06, 2009
    JOURNAL FREE ACCESS
This paper addresses a methodology for the quantitative evaluation of interconnection networks. In the conventional method, performance curves are drawn from a series of simulation runs, and networks are compared via the shapes of these curves. We present the Ramp Load Method, which does not require repetitive simulation runs and produces continuous performance curves. Based on the continuous curves, we give a formal definition of the critical load ratio. Furthermore, we introduce a feature quantity that represents both throughput and average latency, and propose a new measure called the Network Performance Measure. Detailed evaluation and several application examples confirm the effectiveness of the proposed methodology.
    Download PDF (1302K)
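A toy illustration of the Ramp Load Method described in the entry above: instead of one simulation run per load point, a single run ramps the injected load from zero past saturation and records latency continuously. A lone FIFO server stands in for the interconnection network here; the paper applies the idea to full network simulations and reads the critical load ratio off the resulting curve.

```python
import random

random.seed(1)
SERVICE_PROB = 0.7          # server capacity: about 0.7 packets per cycle
CYCLES = 200_000
queue = []                  # injection timestamps of waiting packets
samples = []                # (offered load, measured latency) pairs

for cycle in range(CYCLES):
    load = cycle / CYCLES                    # offered load ramps 0.0 -> 1.0
    if random.random() < load:               # Bernoulli packet injection
        queue.append(cycle)
    if queue and random.random() < SERVICE_PROB:
        samples.append((load, cycle - queue.pop(0)))

# Binning by load turns the single run into one continuous latency curve;
# latency blows up near the capacity of 0.7, marking the critical load.
bins = {}
for load, lat in samples:
    bins.setdefault(round(load, 1), []).append(lat)
for load in sorted(bins):
    print(f"load {load:.1f}: mean latency {sum(bins[load]) / len(bins[load]):8.1f} cycles")
```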
  • Tetsuya Yoshida, Hiroshi Yamada, Kenji Kono
    Article type: Virtualization
    Subject area: Regular Paper
    2009 Volume 2 Pages 200-214
    Published: 2009
    Released on J-STAGE: October 06, 2009
    JOURNAL FREE ACCESS
Time-sensitive software is widely used in embedded systems such as mobile phones and portable video players. Embedded software is usually developed in parallel with the real hardware devices due to tight time-to-market constraints, which makes it quite difficult to verify the sensory responsiveness of time-sensitive applications such as GUIs and multimedia players. To verify responsiveness, it is useful for developers to observe the software's behavior in a test environment in which the software runs in real time rather than in simulation time. To provide such a test environment, we need a mechanism that slows down the CPU speed of test machines, because test machines are usually equipped with high-end desktop CPUs. A CPU slowdown mechanism needs to provide various CPU speeds, keep the CPU speed constant in the short term, and remain sensitive to hardware interrupts. Although there are several ways of slowing down CPU speed, none satisfies all of these requirements. This paper describes FoxyLargo, which smoothly slows down CPU speed using a virtual machine monitor (VMM). FoxyLargo carefully schedules a virtual machine (VM) to create the illusion, from the viewpoint of time-sensitive applications, that the VM is running slowly. For this purpose, FoxyLargo combines three techniques: 1) fine-grained, 2) interrupt-sensitive, and 3) clock-tick-based VM scheduling. We applied these techniques to the Xen VMM and conducted three experiments. The experimental results show that FoxyLargo adequately meets all of the above requirements. We also successfully reproduced the decoding behavior of an MPEG player, demonstrating that FoxyLargo can reproduce the behavior of real applications.
    Download PDF (1360K)
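The core duty-cycle idea behind the slowdown in the entry above, sketched in user space: run the workload only a fraction f of each short tick, so over any window longer than a few ticks the effective speed is f of the real speed. FoxyLargo does this at the VMM level with interrupt-aware VM scheduling; this is only a user-space analogy of the principle, with made-up parameters.

```python
import time

def run_slowed(work_step, f=0.25, tick=0.01, total_steps=200):
    """Execute work_step() repeatedly at roughly f of full speed."""
    steps = 0
    start = time.perf_counter()
    while steps < total_steps:
        slice_end = time.perf_counter() + f * tick
        while time.perf_counter() < slice_end:   # "VM is scheduled"
            work_step()
            steps += 1
        time.sleep((1 - f) * tick)               # "VM is descheduled"
    return steps / (time.perf_counter() - start)

def unit_of_work():
    sum(i * i for i in range(1000))

full = run_slowed(unit_of_work, f=1.0)   # baseline: no enforced idle time
slow = run_slowed(unit_of_work, f=0.25)
print(f"steps/s at f=1.0: {full:.0f}, at f=0.25: {slow:.0f}")
```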
  • Hiroya Matsuba, Yutaka Ishikawa
    Article type: Virtualization
    Subject area: Regular Paper
    2009 Volume 2 Pages 215-224
    Published: 2009
    Released on J-STAGE: October 06, 2009
    JOURNAL FREE ACCESS
In a cluster of clusters used for parallel computing, it is important to fully utilize the inter-cluster network. Existing MPI implementations for clusters of clusters have two issues: 1) a single point-to-point communication cannot utilize the bandwidth of the high-bandwidth inter-cluster network, because a Gigabit Ethernet interface is used at each node for inter-cluster communication while more bandwidth is available between clusters; 2) heavy packet loss and performance degradation occur in the TCP/IP protocol when many nodes generate short-term burst traffic. To overcome these issues, this paper proposes a novel method called the aggregate router method. In this method, multiple router nodes are set up in each cluster and inter-cluster communication is performed via these router nodes. By striping a single message across multiple routers, the bottleneck caused by the network interfaces is reduced. The packet congestion issue is also avoided by using the high-speed intra-cluster interconnect instead of the TCP/IP protocol. The aggregate router method is evaluated using the HPC Challenge Benchmarks and the NAS Parallel Benchmarks. The results show that the proposed method outperforms the existing method by 24% in the best case.
    Download PDF (358K)
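The striping at the heart of the aggregate router method above, reduced to plain Python: a message is cut into R stripes, each stripe travels through a different router node, and the receiver reassembles them. In the paper the routers are cluster nodes bridging a fast intra-cluster interconnect and the inter-cluster link; here they are just queues, and round-robin byte striping is an illustrative choice.

```python
from collections import deque

def stripe(message: bytes, r: int):
    """Round-robin byte-level striping across r routers: (seq, stripe) pairs."""
    return [(i, message[i::r]) for i in range(r)]

def reassemble(stripes, total_len: int) -> bytes:
    out = bytearray(total_len)
    r = len(stripes)
    for seq, data in stripes:
        out[seq::r] = data
    return bytes(out)

routers = [deque() for _ in range(4)]        # one queue per router node
msg = b"a single large MPI message striped over multiple router nodes"

for seq, data in stripe(msg, len(routers)):  # sender side
    routers[seq].append((seq, data))

received = [q.popleft() for q in routers]    # receiver side
assert reassemble(received, len(msg)) == msg
```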
  • Hernán Aguirre, Kiyoshi Tanaka
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 225-239
    Published: 2009
    Released on J-STAGE: December 24, 2009
    JOURNAL FREE ACCESS
This work proposes a method to enhance selection in multiobjective evolutionary algorithms, aiming to improve their performance on many-objective optimization problems. The proposed method uses a randomized sampling procedure combined with ε-dominance to refine the ranking of solutions after they have been ranked by Pareto dominance. The sampling procedure chooses a subset of initially equally ranked solutions and gives them a selective advantage, favoring a good distribution of the sample based on dominance regions wider than conventional Pareto dominance. We enhance NSGA-II with the proposed method and analyze its performance on a wide range of non-linear problems using MNK-Landscapes with up to M = 10 objectives. Experimental results show that the convergence and diversity of the solutions found improve remarkably on problems with 3 ≤ M ≤ 10 objectives.
    Download PDF (1771K)
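To see how ε widens a dominance region, compare plain Pareto dominance with standard additive ε-dominance (for maximization). The paper's exact formulation may differ; this is only the textbook definition that lets ties among equally ranked solutions be broken, as the entry above describes.

```python
def dominates(p, q, eps=0.0):
    """True if p ε-dominates q: p + ε is no worse everywhere, better somewhere."""
    no_worse = all(pi + eps >= qi for pi, qi in zip(p, q))
    better = any(pi + eps > qi for pi, qi in zip(p, q))
    return no_worse and better

p, q = (0.50, 0.52), (0.52, 0.50)
print(dominates(p, q))            # False: mutually non-dominated under Pareto
print(dominates(p, q, eps=0.05))  # True: within ε, p's region covers q
```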
  • Neil Rubens, Ryota Tomioka, Masashi Sugiyama
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 240-249
    Published: 2009
    Released on J-STAGE: December 24, 2009
    JOURNAL FREE ACCESS
We address the task of active learning for linear regression models in collaborative settings. The goal of active learning is to select training points that allow accurate prediction of output values. We propose a new active learning criterion aimed at directly improving the accuracy of the output value estimates by analyzing the effect of new training points on those estimates. The advantages of the proposed method stand out in collaborative settings, in which most of the data points are missing and the number of training data points is much smaller than the number of parameters of the model.
    Download PDF (405K)
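A variance-reduction selection rule in the spirit of the entry above: pick the next training point so that the estimated output values at the points we care about become most certain. This is a generic A-optimal-style sketch for ridge regression with invented data, not the paper's actual criterion.

```python
import numpy as np

def predictive_variance(X_train, X_eval, lam=1e-3):
    """Sum of output-value variances at X_eval under ridge regression."""
    d = X_train.shape[1]
    A = X_train.T @ X_train + lam * np.eye(d)
    cov = X_eval @ np.linalg.solve(A, X_eval.T)   # posterior output covariance
    return np.trace(cov)

rng = np.random.default_rng(0)
pool = rng.normal(size=(50, 5))      # unlabeled candidate inputs
X_eval = rng.normal(size=(20, 5))    # inputs whose outputs we must predict well
X_train = rng.normal(size=(3, 5))    # small initial training set

# Greedily pick the candidate whose inclusion shrinks predictive variance most.
scores = [predictive_variance(np.vstack([X_train, x]), X_eval) for x in pool]
best = int(np.argmin(scores))
print("pick candidate", best, "-> remaining variance", round(scores[best], 3))
```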
  • Satoshi Matsumoto, Yusuke Suzuki, Takayoshi Shoudai, Tetsuhiro Miyahar ...
    Article type: Regular Papers
    Subject area: Regular Paper
    2009 Volume 2 Pages 250-260
    Published: 2009
    Released on J-STAGE: December 24, 2009
    JOURNAL FREE ACCESS
The exact learning model of Angluin (1988) is a mathematical model of learning via queries in computational learning theory. A term tree is a tree pattern consisting of ordered tree structures and structured variables that may occur more than once. Term trees are thus suited to representing common tree structures in tree-structured data, such as HTML and XML files on the Web. In this paper, we consider the learnability of finite unions of term trees with repeated variables in the exact learning model. We present polynomial-time learning algorithms for finite unions of term trees with repeated variables that use superset and restricted equivalence queries. Moreover, we show that there exists no polynomial-time learning algorithm for finite unions of term trees that uses only restricted equivalence, membership, and subset queries. This result indicates the hardness of learning finite unions of term trees in the exact learning model.
    Download PDF (802K)
  • Taku Shimosawa, Yutaka Ishikawa
    Article type: Operating Systems
    Subject area: Regular Paper
    2009 Volume 2 Pages 261-279
    Published: 2009
    Released on J-STAGE: December 29, 2009
    JOURNAL FREE ACCESS
We propose a new inter-kernel communication mechanism that allows multiple kernels within a single machine to communicate on multicore architectures. Using this mechanism, multiple kernels share I/O devices such as network and disk devices. The mechanism has been integrated into another mechanism, called SHIMOS, that partitions the CPUs, the memory, and the I/O devices. Multiple Linux kernels have been run on multicore architectures using the integrated SHIMOS mechanism. Several sets of benchmark results demonstrate that SHIMOS is faster than modern virtual machines. For system calls, SHIMOS is about seven times faster than the Xen virtual machine. When two Linux compilation jobs run on two Linux kernels, SHIMOS is 1.35 and 1.005 times faster than Xen and native single-kernel Linux, respectively.
    Download PDF (1205K)