IPSJ Transactions on System and LSI Design Methodology
Online ISSN : 1882-6687
ISSN-L : 1882-6687
Volume 4
  • Hidetoshi Onodera
    Article type: Editorial
    Subject area: Editorial
    2011 Volume 4 Pages 1
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
  • Subhasish Mitra, Hyungmin Cho, Ted Hong, Young Moon Kim, Hsiao-Heng Ke ...
    Article type: Invited Paper
    Subject area: Reliable Circuit Design
    2011 Volume 4 Pages 2-30
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    Robust system design is essential to ensure that future electronic systems perform correctly despite rising complexity and increasing disturbances. In contrast, today's mainstream systems typically assume that transistors and interconnects operate correctly over their useful lifetime. Future systems cannot rely on such assumptions for several reasons: 1. With enormous complexity, future systems are significantly vulnerable to design flaws. 2. For coming generations of silicon technologies, several causes of hardware failures, largely benign in the past, are becoming significant at the system-level. 3. Emerging nanotechnologies, such as carbon nanotubes, are inherently highly subject to imperfections. At the same time, there is explosive growth in our dependency on electronic systems. This paper addresses the following major robust system design goals: 1. New approaches to thorough validation that can cope with tremendous growth in complexity. 2. Cost-effective tolerance and prediction of failures in hardware during system operation. 3. Practical ways to overcome substantial inherent imperfections in emerging nanotechnologies. Significant recent progress in robust system design impacts almost every aspect of future systems, from ultra-large-scale computing and storage systems, all the way to their nanoscale components.
  • Kiyoung Choi
    Article type: Invited Paper
    Subject area: System-Level Synthesis
    2011 Volume 4 Pages 31-46
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    Coarse-grained reconfigurable arrays, or CGRAs for short, have drawn increasing attention recently due to their performance and flexibility. They provide flexibility through reconfiguration, which is not attainable with fixed hardware such as traditional ASICs. They also provide performance through a highly parallel architecture, which is hard to achieve with essentially sequential software running on full-blown processors. There has been much research on CGRAs, and many of them have been commercialized or are in practical use. However, they still face some challenges that must be addressed before they can be widely used. In this paper, we survey existing CGRA architectures as well as existing approaches to mapping applications onto these architectures.
  • Ryuta Nara, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki
    Article type: Regular Paper
    Subject area: Circuit-Level Security Analysis
    2011 Volume 4 Pages 47-59
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    A scan-path test is one of the most important testing techniques, but it can be used as a side-channel attack against a cryptography circuit. Scan-based attacks are techniques to decipher a secret key using scanned data obtained from a cryptography circuit. Public-key cryptography, such as RSA and elliptic curve cryptosystems (ECC), is extensively used, but conventional scan-based attacks cannot be applied to it because it has a complicated algorithm as well as a complicated architecture. This paper proposes a scan-based attack which enables us to decipher a secret key in ECC. The proposed method is based on detecting intermediate values calculated in ECC. We focus on a 1-bit sequence which is specific to some intermediate values. By monitoring the 1-bit sequence in the scan path, we can find the register position specific to the intermediate value and determine whether this intermediate value is calculated in the target ECC circuit. By using several intermediate values, we can decipher a secret key. The experimental results demonstrate that a secret key in a practical ECC circuit can be deciphered using 29 points over the elliptic curve E within 40 seconds.
  • Youhei Tsukamoto, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa
    Article type: Regular Paper
    Subject area: Arithmetic Design Optimization
    2011 Volume 4 Pages 60-69
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    Large-scale network and multimedia application LSIs include application-specific arithmetic units. A multiply-accumulate (MAC) unit, one of these optimized units, arranges partial products and reduces carry propagation. However, there is no MAC-like method for executing “subtract-multiplication”. In this paper, we propose a high-speed subtract-multiplication unit that decreases the latency of the subtract operation by bit-level transformation using selector logic. With bit-level transformation, its partial products are calculated directly. The proposed subtract-multiplication units can be applied to any type of system that uses subtract-multiplications, and the butterfly operation in an FFT is one suitable application. We apply them effectively to radix-2 and radix-4 butterfly units. Experimental results show that our proposed operation units using selector logic improve performance by up to 13.92% compared to a conventional approach.
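For readers unfamiliar with the operation, a radix-2 decimation-in-frequency butterfly computes a sum together with a subtract-multiplication, (a - b) * w. The sketch below shows this in software only; it does not model the selector-logic hardware the abstract proposes, and the function names `subtract_multiply`, `radix2_butterfly`, and `fft_dif` are hypothetical.

```python
import cmath


def subtract_multiply(a, b, w):
    """Return (a - b) * w, the operation the abstract calls subtract-multiplication."""
    return (a - b) * w


def radix2_butterfly(a, b, w):
    """Radix-2 decimation-in-frequency butterfly: (a + b, (a - b) * w)."""
    return a + b, subtract_multiply(a, b, w)


def fft_dif(x):
    """Recursive radix-2 DIF FFT built from the butterfly.

    len(x) must be a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    top, bottom = [], []
    for k in range(half):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_n^k
        s, d = radix2_butterfly(x[k], x[k + half], w)
        top.append(s)
        bottom.append(d)
    out = [0] * n
    out[0::2] = fft_dif(top)     # even-indexed outputs X[0], X[2], ...
    out[1::2] = fft_dif(bottom)  # odd-indexed outputs X[1], X[3], ...
    return out
```

Every butterfly in the transform performs one subtract-multiplication, which is why a faster unit for that operation would affect the whole FFT.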
  • Hiroaki Yoshida, Masahiro Fujita
    Article type: Regular Paper
    Subject area: Logic Synthesis
    2011 Volume 4 Pages 70-79
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    This paper presents an exact method which finds the minimum factored form of an incompletely specified Boolean function. The problem is formulated as a Quantified Boolean Formula (QBF) and is solved by a general-purpose QBF solver. We also propose a novel graph structure, called an X-B (eXchanger Binary) tree, which compactly and implicitly enumerates binary trees. Leveraging this graph structure, the factoring problem is transformed into a QBF. Using three sets of benchmark functions (artificially created, randomly generated, and ISCAS 85 benchmark functions), we empirically demonstrate the quality of the solutions and the runtime complexity of the proposed method.
  • Hiroki Noguchi, Yusuke Iguchi, Hidehiro Fujiwara, Shunsuke Okumura, Ko ...
    Article type: Regular Paper
    Subject area: Low-Power Circuit Design
    2011 Volume 4 Pages 80-90
    Published: 2011
    Released on J-STAGE: February 08, 2011
    JOURNAL FREE ACCESS
    As process technology is scaled down, large-capacity SRAMs will be used, and their power must be lowered. The Vth variation of the deep-submicron process affects SRAM operation and power. This paper compares the macro area, readout power, and operating frequency of three dual-port SRAMs for multimedia applications: an 8T SRAM, a 10T single-end SRAM, and a 10T differential SRAM. The 8T SRAM has the lowest transistor count and is the most area-efficient. However, its readout power becomes large and its access time increases because of peripheral circuits. The 10T single-end SRAM, in which a dedicated inverter and transmission gate are appended as a single-end read port, can reduce the readout power by 74%. The operating frequency is improved by 195% over the 8T SRAM. However, the 10T differential SRAM can operate fastest (256% faster than the 8T SRAM) because its small differential voltage of 50 mV achieves high-speed operation. In terms of power efficiency, however, the readout current is affected by the Vth variation and the timing of sense cannot be optimized singularly among all memory cells in a 45-nm technology. The readout power remains 34% lower than that of the 8T SRAM (33% higher than the 10T single-end SRAM); even its operating voltage is the lowest of the three. The 10T single-end SRAM always consumes less readout power than the 8T or 10T differential SRAM.
  • Philip Axer, Jonas Diemer, Mircea Negrean, Maurice Sebastian, Simon Sc ...
    Article type: Invited Paper
    Subject area: System-Level Design
    2011 Volume 4 Pages 91-116
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Multi-Processor Systems-on-Chips (MPSoCs) are emerging as the predominant platform in embedded real-time applications. A large variety of ubiquitous services should be implemented by embedded systems in a cost- and power-efficient way, while providing a maximum degree of performance, usability and dependability. By using a scalable Network-on-Chip (NoC) architecture, which replaces the traditional point-to-point and bus connections, in conjunction with high-performance IP cores, it is possible to use the available performance to consolidate functionality on a single MPSoC platform. But especially when non-critical best-effort applications (e.g., entertainment) and critical applications (e.g., pedestrian detection, electronic stability control) are combined on the same architecture (mixed-criticality), validation faces new challenges. Due to complex resource sharing in MPSoCs, the timing behavior becomes more complex and requires new analysis methods. Additionally, applications that may exhibit multiple behaviors corresponding to different operating modes (e.g., initialization mode, fault-recovery mode) also need to be considered in the design of mixed-critical MPSoCs. In this paper, challenges in the design of mixed-critical systems are discussed, and formal analysis solutions which consider shared resources, NoC communication, multi-mode applications and their reliabilities are proposed.
  • Seiji Kajihara, Satoshi Ohtake, Tomokazu Yoneda
    Article type: Invited Paper
    Subject area: Delay Testing
    2011 Volume 4 Pages 117-130
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Delay testing is one of the key processes in production test to ensure high quality and high reliability for logic circuits. Test escapes, in which defective chips go undetected, can be reduced by introducing delay testing. On the other hand, we need to be concerned about yield loss caused by delay testing, i.e., over-testing. Many methods and techniques have been developed to solve problems in delay testing. In this paper, we introduce fundamental techniques of delay testing and survey recent problems and solutions. In particular, we focus on techniques to enhance test quality, to avoid over-testing, and to make test design efficient by treating circuits described at the register transfer level.
  • Hirotaka Kawashima, Naofumi Takagi
    Article type: Regular Paper
    Subject area: Arithmetic Design Optimization
    2011 Volume 4 Pages 131-139
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    We propose a novel method to generate partial products for reduced-area parallel multipliers. Our method reduces the total number of partial product bits of parallel multiplication by about half. We call partial products generated by our method Compound Partial Products (CPPs). Each CPP has four candidate values: zero, a part of the multiplicand, a part of the multiplier, and a part of the sum of the operands. Our method selects one of the four candidates according to a pair consisting of a multiplicand bit and a multiplier bit. Multipliers employing the CPPs are approximately 30% smaller than array multipliers without radix-4 Booth's method, and up to approximately 10% smaller than array multipliers with radix-4 Booth's method. We also propose a method for accelerating multipliers that use CPPs.
  • Kiyonori Matsumoto, Kazuteru Namba, Hideo Ito
    Article type: Regular Paper
    Subject area: Testing
    2011 Volume 4 Pages 140-149
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Scan architecture is one of the design-for-test (DFT) techniques. In scan architecture, some or all of the flip-flops (FFs) in a circuit are serially connected and form a scan chain. The Chiba-scan is one of the scan architectures that facilitate delay testing. The Chiba-scan has many advantages, such as small area overhead comparable to that of the standard scan architecture and complete fault coverage for robustly testable path delay faults. However, its test volume is much larger than that of other scan architectures. This paper presents a test volume reduction method for robust path delay fault testing on the Chiba-scan. In this method, scan FFs are reordered. The experimental results give evidence that the proposed method reduces the number of test vectors by 18.4% for ISCAS89 benchmark circuits. Furthermore, the proposed method enables testing with 16.8% shorter test application time (TAT) and 18.3% lower required memory size for automatic test equipment (ATE) compared with those for the enhanced scan architecture.
  • Sho Tanaka, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa
    Article type: Regular Paper
    Subject area: Behavioral Synthesis
    2011 Volume 4 Pages 150-165
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    As device feature size decreases, improving reliability against soft errors becomes necessary. A fault-secure system, in which concurrent error detection is realized, is one of the solutions to this problem. On the other hand, average interconnection delays exceed gate delays, which leads to a serious timing closure problem. By using a regular-distributed-register architecture (RDR architecture), we can estimate interconnection delays very accurately, and their influence can be much reduced even in behavioral-level design. In this paper, we propose a fault-secure high-level synthesis algorithm for an RDR architecture. In fault-secure high-level synthesis, a recomputation CDFG as well as a normal-computation CDFG must be scheduled to control steps and bound to functional units. Firstly, our algorithm re-uses vacant areas on RDR islands to allocate additional functional units for the recomputation CDFG. Secondly, we propose an efficient edge-break algorithm which considers the scheduling/binding of comparison nodes. We can obtain small-latency scheduling/binding for both the normal CDFG and the recomputation CDFG. Our algorithm reduces the required control steps by up to 53% compared with the conventional approach.
  • Masashi Tawada, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa
    Article type: Regular Paper
    Subject area: Architectural Performance Analysis
    2011 Volume 4 Pages 166-181
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Since the target applications running on an embedded processor are very limited in embedded systems, we can optimize its cache configuration based on the number of sets, block size, and associativity. An extremely fast cache configuration simulation method, CRCB (Configuration Reduction approach by the Cache Behavior), has been recently proposed which can calculate cache hit/miss counts accurately for possible cache configurations when the three parameters above are changed. The CRCB method assumes an LRU-based (Least Recently Used-based) cache, but many recent processors use a FIFO-based (First In First Out-based) cache or a PLRU-based (Pseudo LRU-based) cache because of hardware cost. In this paper, we propose exact and fast L1 cache configuration simulation algorithms for embedded applications that use PLRU or FIFO as a cache replacement policy. Firstly, we prove that the CRCB method can be applied not only to LRU but also to other cache replacement policies including FIFO and PLRU. Secondly, we prove several properties for FIFO- and PLRU-based caches, and we propose associated cache simulation algorithms which can accurately simulate more than one cache configuration with different cache associativities simultaneously for FIFO or PLRU. Finally, many experimental results demonstrate that our cache configuration simulation algorithms obtain accurate cache hit/miss counts and run up to 249 times faster than a conventional cache simulator.
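To make the three configuration parameters and the LRU/FIFO distinction concrete, the following is a minimal sketch of a conventional simulator that evaluates one configuration at a time. It is the kind of baseline such methods aim to speed up, not the CRCB method or the authors' algorithms; PLRU is omitted, and the function name `simulate_cache` and its parameters are hypothetical.

```python
from collections import OrderedDict


def simulate_cache(addresses, num_sets, block_size, assoc, policy="LRU"):
    """Count (hits, misses) for a set-associative cache under LRU or FIFO."""
    if policy not in ("LRU", "FIFO"):
        raise ValueError("policy must be 'LRU' or 'FIFO'")
    # One ordered dict per set; the first key is always the next victim.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            hits += 1
            if policy == "LRU":
                lines.move_to_end(tag)  # refresh recency; FIFO keeps insertion order
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict the oldest entry in this set
            lines[tag] = True
    return hits, misses
```

On the trace `[0, 1, 0, 2, 0]` with one set, one-byte blocks, and two ways, LRU yields 2 hits and FIFO yields 1, because FIFO evicts block 0 even though it was just reused.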
  • Zhao Lei, Daisuke Ikebuchi, Kimiyoshi Usami, Mitaro Namiki, Masaaki Ko ...
    Article type: Regular Paper
    Subject area: Architectural Low-Power Design
    2011 Volume 4 Pages 182-192
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    In this paper, we present a prototype MIPS R3000 processor, which integrates the fine-grained power gating technique into its functional units. To reduce the leakage power consumption, functional units such as the multiplier and divider can be power-gated individually according to the workload of the executing program. The prototype chip, Geyser-1, has been implemented with Fujitsu's 65nm CMOS technology, and to facilitate the design process with fine-grained power gating, a fully automated design flow has also been proposed. Comprehensive real-chip evaluations have been performed to verify the leakage reduction efficiency. According to the evaluation results with benchmark programs, the fine-grained power gating can reduce the power of the processor by 5% at 25°C and 23% at 80°C.
  • Ratna Krishnamoorthy, Saptarsi Das, Keshavan Varadarajan, Mythri Alle, ...
    Article type: Regular Paper
    Subject area: System-Level Synthesis
    2011 Volume 4 Pages 193-209
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Coarse Grain Reconfigurable Architectures (CGRA) support spatial and temporal computation to speed up execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. The task of partitioning is governed by the opposing forces of exposing as much parallelism as possible and reducing communication time. We extend the Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and biological networks, to partition the instructions of a dataflow graph. We also implement several other partitioning algorithms from the literature and compare the execution time obtained by each of these partitioning algorithms on a CGRA called REDEFINE. The centrality-based partitioning scheme outperforms several other schemes with 6-20% execution time speedup for various cryptographic kernels. REDEFINE using centrality-based partitioning performs 9× better than a General Purpose Processor, as opposed to 7.76× better without using centrality-based partitioning. Similarly, centrality improves the execution time comparison of AES-128 Decryption from 11× to 13.2×.
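As background for the partitioning idea, the sketch below shows the generic edge-betweenness step used in community detection: compute how many shortest paths cross each edge, then remove the highest-scoring edges until the graph splits. It assumes an undirected, unweighted graph without self-loops and is not the authors' extension for dataflow graphs; the function names `edge_betweenness`, `components`, and `split_once` are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness; adj maps each node to a set of neighbours."""
    score = {}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + share
                delta[v] += share
    # Each node pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}


def components(adj):
    """Return the connected components of adj as a list of node sets."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts


def split_once(adj):
    """Remove highest-betweenness edges until the component count grows by one."""
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    target = len(components(adj)) + 1
    while len(components(adj)) < target:
        scores = edge_betweenness(adj)
        if not scores:
            break
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

Edges that bridge otherwise dense clusters carry the most shortest paths, so cutting them first tends to keep tightly communicating nodes in the same partition.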
  • Yosuke Kakiuchi, Tomofumi Nakagawa, Kiyoharu Hamaguchi, Tadaaki Tanimo ...
    Article type: Regular Paper
    Subject area: System-level Verification
    2011 Volume 4 Pages 210-221
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Message sequence charts (MSCs) and high-level MSCs (HMSCs) have been standardized to model interactions of parallel processes as message exchanges. We can flexibly express parallel behaviors with MSCs, but this flexibility also makes it possible to introduce unintended orders of messages into the MSCs. This paper focuses on detecting such unintended orders, which we call discord. We propose an encoding scheme in which the analysis of an HMSC is converted into a Boolean SAT problem. Experimental results show that our approach achieves efficient analysis of HMSCs which have a large number of processes or large graphs. This contributes to the efficient analysis of specifications of complex interactions.
  • Karaduman Arda, Stubdal Iver, Hideharu Amano
    Article type: Regular Paper
    Subject area: Architectural Design
    2011 Volume 4 Pages 222-231
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    Code size is an important issue for embedded systems. Reducing the code size helps us reduce the memory requirements of our programs. This enables us to design power-efficient cores which can make maximum use of resources. In this paper, we propose implementation schemes for a new type of instruction called the Echo instruction, and propose a solution to reduce the performance overhead of using Echo instructions. On average, we can reduce the performance overhead from 4.3% to 2.24% with our method.
  • Keisuke Inoue, Mineo Kaneko
    Article type: Regular Paper
    Subject area: Behavioral Synthesis
    2011 Volume 4 Pages 232-244
    Published: 2011
    Released on J-STAGE: August 10, 2011
    JOURNAL FREE ACCESS
    In recent application-specific integrated circuit design, using transparent latches as storage elements has been intensively studied, since designs using latches (latch-based design) have a large potential to improve the performance yield. However, latch-based design is prone to violating the hold constraint because it is difficult for latches to hold the output data when the input switches during the transparent state. To address the hold problem, this paper proposes a novel hold guarantee framework in latch-based design, based on the lifetime extension approach. Since the lifetime extension-based design suffers from an increase in registers due to the strict sharing condition, another method referred to as minimum-delay compensation (MDC) is introduced to promote register sharing. However, excessive use of MDC increases the total design cost. Therefore, this paper formulates the MDC cost (especially, area cost) minimization problem under a specified number of registers, and reveals the computational complexities of the problem. To tackle the problem, two algorithms are proposed: a left-edge-based algorithm and an integer linear programming-based algorithm. They are applied to benchmark circuits, and experimental results show that the proposed framework can reduce the area cost compared to conventional design while keeping the hold constraint.
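For context on the "left-edge-based" algorithm, the sketch below shows the textbook left-edge algorithm for register sharing: variables are sorted by the start of their lifetimes and packed greedily into registers whose previous occupant has already expired. It is only the classic baseline, under the assumption that lifetimes are half-open intervals; it does not model hold constraints, lifetime extension, or MDC cost, and the function name `left_edge` is hypothetical.

```python
def left_edge(lifetimes):
    """Assign variables to registers with the classic left-edge algorithm.

    lifetimes maps a variable name to (start, end), meaning the value is live
    in the half-open interval [start, end). Returns a list of registers, each
    given as the list of variable names that share it.
    """
    # Sort by left edge (start time), breaking ties by end time.
    pending = sorted(lifetimes.items(), key=lambda item: (item[1][0], item[1][1]))
    registers = []
    while pending:
        current, last_end, rest = [], None, []
        for name, (start, end) in pending:
            if last_end is None or start >= last_end:
                current.append(name)  # fits after the previous occupant
                last_end = end
            else:
                rest.append((name, (start, end)))  # try again in the next register
        registers.append(current)
        pending = rest
    return registers
```

For example, lifetimes `a=(0, 2)`, `b=(1, 3)`, `c=(2, 4)`, `d=(3, 5)` pack into two registers, one shared by `a` and `c` and one shared by `b` and `d`.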