IPSJ Transactions on System and LSI Design Methodology

Message from the Editor-in-Chief

Hidetoshi Onodera

Article type: Editorial
Subject area: Editorial
2011 Volume 4 Pages 1
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.1

JOURNAL FREE ACCESS

Robust System Design

Subhasish Mitra, Hyungmin Cho, Ted Hong, Young Moon Kim, Hsiao-Heng Ke ...

Article type: Invited Paper
Subject area: Reliable Circuit Design
2011 Volume 4 Pages 2-30
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.2

Robust system design is essential to ensure that future electronic systems perform correctly despite rising complexity and increasing disturbances. In contrast, today's mainstream systems typically assume that transistors and interconnects operate correctly over their useful lifetime. Future systems cannot rely on such assumptions for several reasons: 1. With enormous complexity, future systems are significantly vulnerable to design flaws. 2. For coming generations of silicon technologies, several causes of hardware failures, largely benign in the past, are becoming significant at the system-level. 3. Emerging nanotechnologies, such as carbon nanotubes, are inherently highly subject to imperfections. At the same time, there is explosive growth in our dependency on electronic systems. This paper addresses the following major robust system design goals: 1. New approaches to thorough validation that can cope with tremendous growth in complexity. 2. Cost-effective tolerance and prediction of failures in hardware during system operation. 3. Practical ways to overcome substantial inherent imperfections in emerging nanotechnologies. Significant recent progress in robust system design impacts almost every aspect of future systems, from ultra-large-scale computing and storage systems, all the way to their nanoscale components.

Coarse-Grained Reconfigurable Array: Architecture and Application Mapping

Kiyoung Choi

Article type: Invited Paper
Subject area: System-Level Synthesis
2011 Volume 4 Pages 31-46
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.31

Coarse-grained reconfigurable arrays, or CGRAs for short, have drawn increasing attention recently due to their performance and flexibility. They provide flexibility through reconfiguration, which is not attained with fixed hardware such as traditional ASICs. They also provide performance through highly parallel architecture, which is hardly achieved with basically sequential software running on full-blown processors. There has been much research on CGRAs, and many CGRAs have been commercialized or are in practical use. However, they still face some challenges that are to be addressed for their widespread use. In this paper, we survey existing CGRA architectures as well as existing approaches to the mapping of applications onto the architectures.

Scan Vulnerability in Elliptic Curve Cryptosystems

Ryuta Nara, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki

Article type: Regular Paper
Subject area: Circuit-Level Security Analysis
2011 Volume 4 Pages 47-59
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.47

A scan-path test is one of the most important testing techniques, but it can be used as a side-channel attack against a cryptography circuit. Scan-based attacks are techniques to decipher a secret key using scanned data obtained from a cryptography circuit. Public-key cryptography, such as RSA and elliptic curve cryptosystem (ECC), is extensively used but conventional scan-based attacks cannot be applied to it, because it has a complicated algorithm as well as a complicated architecture. This paper proposes a scan-based attack which enables us to decipher a secret key in ECC. The proposed method is based on detecting intermediate values calculated in ECC. We focus on a 1-bit sequence which is specific to some intermediate values. By monitoring the 1-bit sequence in the scan path, we can find out the register position specific to the intermediate value in it and we can know whether this intermediate value is calculated or not in the target ECC circuit. By using several intermediate values, we can decipher a secret key. The experimental results demonstrate that a secret key in a practical ECC circuit can be deciphered using 29 points over the elliptic curve E within 40 seconds.

A Fast Selector-Based Subtract-Multiplication Unit and Its Application to Butterfly Unit

Youhei Tsukamoto, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa

Article type: Regular Paper
Subject area: Arithmetic Design Optimization
2011 Volume 4 Pages 60-69
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.60

Large-scale network and multimedia application LSIs include application-specific arithmetic units. A multiply-accumulate (MAC) unit, which is one of these optimized units, arranges partial products and decreases carry propagations. However, there is no method similar to MAC to execute “subtract-multiplication”. In this paper, we propose a high-speed subtract-multiplication unit that decreases the latency of a subtract operation by bit-level transformation using selector logic. By using bit-level transformation, its partial products are calculated directly. The proposed subtract-multiplication units can be applied to any type of system using subtract-multiplications, and a butterfly operation in FFT is one of their suitable applications. We apply them effectively to Radix-2 butterfly units and Radix-4 butterfly units. Experimental results show that our proposed operation units using selector logic improve the performance by up to 13.92% compared to a conventional approach.

Exact Minimum Factoring of Incompletely Specified Logic Functions via Quantified Boolean Satisfiability

Hiroaki Yoshida, Masahiro Fujita

Article type: Regular Paper
Subject area: Logic Synthesis
2011 Volume 4 Pages 70-79
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.70

This paper presents an exact method which finds the minimum factored form of an incompletely specified Boolean function. The problem is formulated as a Quantified Boolean Formula (QBF) and is solved by a general-purpose QBF solver. We also propose a novel graph structure, called an X-B (eXchanger Binary) tree, which compactly and implicitly enumerates binary trees. Leveraged by this graph structure, the factoring problem is transformed into a QBF. Using three sets of benchmark functions (artificially created, randomly generated, and ISCAS 85 benchmark functions), we empirically demonstrate the quality of the solutions and the runtime complexity of the proposed method.

Design Choice in 45-nm Dual-Port SRAM — 8T, 10T Single End, and 10T Differential

Hiroki Noguchi, Yusuke Iguchi, Hidehiro Fujiwara, Shunsuke Okumura, Ko ...

Article type: Regular Paper
Subject area: Low-Power Circuit Design
2011 Volume 4 Pages 80-90
Published: 2011
Released on J-STAGE: February 08, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.80

As process technology is scaled down, a large-capacity SRAM will be used. Its power must be lowered. The Vth variation of the deep-submicron process affects the SRAM operation and its power. This paper compares the macro area, readout power, and operating frequency among dual-port SRAMs: an 8T SRAM, 10T single-end SRAM, and 10T differential SRAM considering the multi-media applications. The 8T SRAM has the lowest transistor count, and is the most area efficient. However, the readout power becomes large and the access time increases because of peripheral circuits. The 10T single-end SRAM, in which a dedicated inverter and transmission gate are appended as a single-end read port, can reduce the readout power by 74%. The operating frequency is improved by 195%, over the 8T SRAM. However, the 10T differential SRAM can operate fastest (256% faster than the 8T SRAM) because its small differential voltage of 50mV achieves high-speed operation. In terms of the power efficiency, however, the readout current is affected by the Vth variation and the timing of sense cannot be optimized singularly among all memory cells in a 45-nm technology. The readout power remains 34% lower than that of the 8T SRAM (33% higher than the 10T single-end SRAM); even its operating voltage is the lowest of the three. The 10T single-end SRAM always consumes less readout power than the 8T or 10T differential SRAM.

Mastering MPSoCs for Mixed-critical Applications

Philip Axer, Jonas Diemer, Mircea Negrean, Maurice Sebastian, Simon Sc ...

Article type: Invited Paper
Subject area: System-Level Design
2011 Volume 4 Pages 91-116
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.91

Multi-Processor Systems-on-Chips (MPSoCs) emerge as the predominant platform in embedded real-time applications. A large variety of ubiquitous services should be implemented by embedded systems in a cost- and power-efficient way, yet providing a maximum degree of performance, usability and dependability. By using a scalable Network-on-Chip (NoC) architecture, which replaces the traditional point-to-point and bus connections, in conjunction with performant IP cores, it is possible to use the available performance to consolidate functionality on a single MPSoC platform. But especially when uncritical best-effort applications (e.g., entertainment) and critical applications (e.g., pedestrian detection, electronic stability control) are combined on the same architecture (mixed-criticality), validation faces new challenges. Due to complex resource sharing in MPSoCs, the timing behavior becomes more complex and requires new analysis methods. Additionally, applications that may exhibit multiple behaviors corresponding to different operating modes (e.g., initialization mode, fault-recovery mode) also need to be considered in the design of mixed-critical MPSoCs. In this paper, challenges in the design of mixed-critical systems are discussed, and formal analysis solutions which consider shared resources, NoC communication, multi-mode applications and their reliabilities are proposed.

Delay Testing: Improving Test Quality and Avoiding Over-testing

Seiji Kajihara, Satoshi Ohtake, Tomokazu Yoneda

Article type: Invited Paper
Subject area: Delay Testing
2011 Volume 4 Pages 117-130
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.117

Delay testing is one of the key processes in production test to ensure high quality and high reliability for logic circuits. Test escapes (defective chips that are missed) can be reduced by introducing delay testing. On the other hand, we need to be concerned about yield loss caused by delay testing, i.e., over-testing. Many methods and techniques have been developed to solve problems in delay testing. In this paper, we introduce fundamental techniques of delay testing and survey recent problems and solutions. In particular, we focus on techniques to enhance test quality, to avoid over-testing, and to make test design efficient by treating circuits described at the register transfer level.

Partial Product Generation Utilizing the Sum of Operands for Reduced Area Parallel Multipliers

Hirotaka Kawashima, Naofumi Takagi

Article type: Regular Paper
Subject area: Arithmetic Design Optimization
2011 Volume 4 Pages 131-139
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.131

We propose a novel method to generate partial products for reduced-area parallel multipliers. Our method reduces the total number of partial product bits of parallel multiplication by about half. We call partial products generated by our method Compound Partial Products (CPPs). Each CPP has four candidate values: zero, a part of the multiplicand, a part of the multiplier, and a part of the sum of the operands. Our method selects one of the four candidates according to a pair consisting of a multiplicand bit and a multiplier bit. Multipliers employing the CPPs are approximately 30% smaller than array multipliers without radix-4 Booth's method, and up to approximately 10% smaller than array multipliers with radix-4 Booth's method. We also propose an acceleration method for the multipliers using CPPs.

Scan FF Reordering for Test Volume Reduction in Chiba-scan Architecture

Kiyonori Matsumoto, Kazuteru Namba, Hideo Ito

Article type: Regular Paper
Subject area: Testing
2011 Volume 4 Pages 140-149
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.140

Scan architecture is one of the design-for-test (DFT) techniques. In scan architecture, some or all of the flip-flops (FFs) in a circuit are serially connected and form a scan chain. The Chiba-scan is one of the scan architectures facilitating delay testing. The Chiba-scan has many advantages, such as small area overhead comparable to that of the standard scan architecture and complete fault coverage for robust testable path delay faults. However, its test volume is much larger than that of other scan architectures. This paper presents a test volume reduction method for robust path delay fault testing on the Chiba-scan. In this method, scan FFs are reordered. The experimental results give evidence that the proposed method reduces the number of test vectors by 18.4% for ISCAS89 benchmark circuits. Furthermore, the proposed method enables testing with 16.8% shorter test application time (TAT) and 18.3% lower required memory size for automatic test equipment (ATE) compared with those for the enhanced scan architecture.

A Fault-Secure High-Level Synthesis Algorithm for RDR Architectures

Sho Tanaka, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa

Article type: Regular Paper
Subject area: Behavioral Synthesis
2011 Volume 4 Pages 150-165
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.150

As device feature size decreases, improving reliability against soft errors becomes increasingly necessary. A fault-secure system, in which concurrent error detection is realized, is one of the solutions to this problem. On the other hand, average interconnection delays exceed gate delays, which leads to a serious timing closure problem. By using a regular-distributed-register architecture (RDR architecture), we can estimate interconnection delays very accurately, and their influence can be much reduced even in behavioral-level design. In this paper, we propose a fault-secure high-level synthesis algorithm for an RDR architecture. In fault-secure high-level synthesis, a recomputation CDFG as well as a normal-computation CDFG must be scheduled to control steps and bound to functional units. Firstly, our algorithm re-uses vacant areas on RDR islands to allocate additional functional units for the recomputation CDFG. Secondly, we propose an efficient edge-break algorithm which considers comparison nodes' scheduling/binding. We can thus obtain small-latency scheduling/binding for both the normal CDFG and the recomputation CDFG. Our algorithm reduces the required control steps by up to 53% compared with the conventional approach.

Exact, Fast and Flexible L1 Cache Configuration Simulation for Embedded Systems

Masashi Tawada, Masao Yanagisawa, Tatsuo Ohtsuki, Nozomu Togawa

Article type: Regular Paper
Subject area: Architectural Performance Analysis
2011 Volume 4 Pages 166-181
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.166

Since the target applications running on an embedded processor are very limited in embedded systems, we can optimize its cache configuration based on the number of sets, block size, and associativity. An extremely fast cache configuration simulation method, CRCB (Configuration Reduction approach by the Cache Behavior), has been recently proposed which can calculate cache hit/miss counts accurately for possible cache configurations when the three parameters above are changed. The CRCB method assumes an LRU-based (Least Recently Used-based) cache, but many recent processors use a FIFO-based (First In First Out-based) cache or a PLRU-based (Pseudo LRU-based) cache due to its hardware cost. In this paper, we propose exact and fast L1 cache configuration simulation algorithms for embedded applications that use PLRU or FIFO as a cache replacement policy. Firstly, we prove that the CRCB method can be applied not only to LRU but also to other cache replacement policies including FIFO and PLRU. Secondly, we prove several properties for FIFO- and PLRU-based caches, and we propose associated cache simulation algorithms which can accurately simulate more than one cache configuration with different cache associativities simultaneously for FIFO or PLRU. Finally, many experimental results demonstrate that our cache configuration simulation algorithms obtain accurate cache hit/miss counts and run up to 249 times faster than a conventional cache simulator.

Design and Implementation Fine-grained Power Gating on Microprocessor Functional Units

Zhao Lei, Daisuke Ikebuchi, Kimiyoshi Usami, Mitaro Namiki, Masaaki Ko ...

Article type: Regular Paper
Subject area: Architectural Low-Power Design
2011 Volume 4 Pages 182-192
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.182

In this paper, we present a prototype MIPS R3000 processor, which integrates the fine-grained power gating technique into its functional units. To reduce the leakage power consumption, functional units such as the multiplier and divider can be power-gated individually according to the workload of the executing program. The prototype chip, Geyser-1, has been implemented with Fujitsu's 65nm CMOS technology, and to facilitate the design process with fine-grained power gating, a fully automated design flow has also been proposed. Comprehensive real-chip evaluations have been performed to verify the leakage reduction efficiency. According to the evaluation results with benchmark programs, the fine-grained power gating can reduce the power of the processor by 5% at 25°C and 23% at 80°C.

Data Flow Graph Partitioning Algorithms and Their Evaluations for Optimal Spatio-temporal Computation on a Coarse Grain Reconfigurable Architecture

Ratna Krishnamoorthy, Saptarsi Das, Keshavan Varadarajan, Mythri Alle, ...

Article type: Regular Paper
Subject area: System-Level Synthesis
2011 Volume 4 Pages 193-209
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.193

Coarse Grain Reconfigurable Architectures (CGRAs) support spatial and temporal computation to speed up execution and reduce reconfiguration time. Thus compilation involves partitioning instructions spatially and scheduling them temporally. The task of partitioning is governed by the opposing forces of exposing as much parallelism as possible and reducing communication time. We extend the Edge-Betweenness Centrality scheme, originally used for detecting community structures in social and biological networks, to partition the instructions of a dataflow graph. We also implement several other partitioning algorithms from the literature and compare the execution time obtained by each of these partitioning algorithms on a CGRA called REDEFINE. The centrality-based partitioning scheme outperforms several other schemes with 6-20% execution time speedup for various cryptographic kernels. REDEFINE using centrality-based partitioning performs 9× better than a General Purpose Processor, as opposed to 7.76× better without using centrality-based partitioning. Similarly, centrality improves the execution time comparison of AES-128 Decryption from 11× to 13.2×.

Symbolic Discord Computation for Efficient Analysis of Message Sequence Charts

Yosuke Kakiuchi, Tomofumi Nakagawa, Kiyoharu Hamaguchi, Tadaaki Tanimo ...

Article type: Regular Paper
Subject area: System-level Verification
2011 Volume 4 Pages 210-221
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.210

Message sequence charts (MSCs) and high-level MSCs (HMSCs) have been standardized to model interactions of parallel processes as message exchanges. We can flexibly express parallel behaviors with MSCs, but this flexibility also makes it possible to put unintended orders of messages into the MSCs. This paper focuses on the detection of such unintended orders, referred to as discord. We propose an encoding scheme in which the analysis of an HMSC is converted into a Boolean SAT problem. Experimental results show that our approach achieves efficient analysis of HMSCs which have a large number of processes or large graphs. This contributes to efficient analysis of specifications of complex interactions.

Design and Implementation of Echo Instructions for an Embedded Processor

Karaduman Arda, Stubdal Iver, Hideharu Amano

Article type: Regular Paper
Subject area: Architectural Design
2011 Volume 4 Pages 222-231
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.222

Code size is an important issue for embedded systems. Reducing the code size helps us reduce the memory requirements of our programs. This enables us to design power efficient cores which can make maximum use of resources. In this paper, we propose implementation schemes for a new type of instruction called the Echo instruction, and propose a solution to reduce performance overhead when using Echo instructions. On average, we can reduce the performance overhead from 4.3% to 2.24% with our method.

Framework for Latch-based High-level Synthesis Using Minimum-delay Compensation

Keisuke Inoue, Mineo Kaneko

Article type: Regular Paper
Subject area: Behavioral Synthesis
2011 Volume 4 Pages 232-244
Published: 2011
Released on J-STAGE: August 10, 2011

DOI: https://doi.org/10.2197/ipsjtsldm.4.232

In recent application-specific integrated circuit design, using transparent latches as storage elements has been intensively studied, since designs using latches (latch-based designs) have a large potential to improve the performance yield. However, latch-based design is prone to violating the hold constraint because it is difficult for latches to hold the output data when the input switches during the transparent state. To address the hold problem, this paper proposes a novel hold guarantee framework in latch-based design, based on the lifetime extension approach. Since the lifetime extension-based design suffers from an increase in registers due to the strict sharing condition, another method referred to as minimum-delay compensation (MDC) is introduced to accelerate the register sharing. Because excessive use of MDC instead increases the total design cost, this paper formulates the MDC cost (especially, area cost) minimization problem under the specified number of registers, and reveals the computational complexity of the problem. To tackle the problem, two algorithms are proposed: the left-edge-based algorithm and the integer linear programming-based algorithm. They are applied to benchmark circuits, and experimental results show that the proposed framework can reduce the area cost compared to conventional design while keeping the hold constraint.
