IPSJ Transactions on System LSI Design Methodology
Online ISSN : 1882-6687
ISSN-L : 1882-6687
Volume 8
  • Hiroyuki Tomiyama
    Type: Editorial
    Subject area: Editorial
    2015 Volume 8 Pages 1
    Published: 2015
    Released: February 12, 2015
    JOURNALS FREE ACCESS
  • Jishen Zhao, Cong Xu, Ping Chi, Yuan Xie
    Type: Invited Paper
    Subject area: Architectural Design
    2015 Volume 8 Pages 2-11
    Published: 2015
    Released: February 12, 2015
    JOURNALS FREE ACCESS
    The memory and storage system, including processor caches, main memory, and storage, is an important component of various computer systems. The memory hierarchy is becoming a fundamental performance and energy bottleneck, due to the widening gap between the increasing bandwidth and energy demands of modern applications and the limited performance and energy efficiency provided by traditional memory technologies. As a result, computer architects are facing significant challenges in developing high-performance, energy-efficient, and reliable memory hierarchies. New byte-addressable nonvolatile memories (NVMs) are emerging with unique properties that are likely to open doors to novel memory hierarchy designs to tackle the challenges. However, substantial advancements in redesigning the existing memory and storage organizations are needed to realize their full potential. This article reviews recent innovations in rearchitecting the memory and storage system with NVMs, producing high-performance, energy-efficient, and scalable computer designs.
  • Zhiru Zhang, Deming Chen, Steve Dai, Keith Campbell
    Type: Invited Paper
    Subject area: High Level Synthesis
    2015 Volume 8 Pages 12-25
    Published: 2015
    Released: February 12, 2015
    JOURNALS FREE ACCESS
    Power and energy efficiency have emerged as first-order design constraints across the computing spectrum from handheld devices to warehouse-sized datacenters. As the number of transistors continues to scale, effectively managing design complexity under stringent power constraints has become an imminent challenge of the IC industry. The manual process of power optimization in RTL design has been increasingly difficult, if not already unsustainable. Complexity scaling dictates that this process must be automated with robust analysis and synthesis algorithms at a higher level of abstraction. Along this line, high-level synthesis (HLS) is a promising technology to improve design productivity and enable new opportunities for power optimization for higher design quality. By allowing early access to the system architecture, high-level decisions during HLS can have a significant impact on the power and energy efficiency of the synthesized design. In this paper, we will discuss the recent research development of using HLS to effectively explore a multi-dimensional design space and derive low-power implementations. We provide an in-depth coverage of HLS low-power optimization techniques and synthesis algorithms proposed in the last decade. We will also describe the key power optimization challenges facing HLS today and outline potential opportunities in tackling these challenges.
  • Salita Sombatsiri, Yoshinori Takeuchi, Masaharu Imai
    Type: Regular Paper
    Subject area: System-Level Design
    2015 Volume 8 Pages 26-37
    Published: 2015
    Released: February 12, 2015
    JOURNALS FREE ACCESS
    This paper proposes an efficient performance estimation method for a configurable multi-layer bus-based SoC, which evaluates system performance at an early stage of the design process. The proposed method uses data flow information obtained from system-level profiling, an architecture-independent loosely-timed transaction-level simulation, and constructs a system-level execution dependency graph. Then, based on each architecture-level model, the architecture-level execution dependency graph is constructed and analyzed to estimate the performance of each architecture. In the analysis, the behavior details of the shared buses and the multi-layer bus are determined based on the analyzed dynamic bus contention and the features of the bus protocols. Experiments were conducted by modeling the multi-layer AHB and applying the method to estimate the performance of architectures executing a JPEG encoder application. The proposed method estimates the performance of the SoC with an error of less than 8% compared to the results of accurate RTL simulations.
  • Arif Ullah Khan, Tsuyoshi Isshiki, Dongju Li, Hiroaki Kunieda
    Type: Regular Paper
    Subject area: System-Level Design
    2015 Volume 8 Pages 38-50
    Published: 2015
    Released: February 12, 2015
    JOURNALS FREE ACCESS
    In order to meet the increased computational requirements of today's consumer portable devices, heterogeneous multiprocessor system-on-chip (MPSoC) architectures have become widespread. These MPSoCs include not only multiple processors but also multiple dedicated hardware accelerators. Due to the increased complexity of the MPSoC, fast and accurate design space exploration (DSE) for the best system performance at an early stage of the design process is desired. Any DSE solution should provide the best system partitioning scheme for the best performance with efficient area utilization. In this paper we propose a design space exploration framework for heterogeneous MPSoCs based on the tightly-coupled thread (TCT) parallel programming model, which can handle system partition exploration and HW synthesis exploration. The proposed framework drastically reduces the exponential-size design space to near-linear size by utilizing accurate HW timing models as the indicator of the system bottleneck and guiding the enumeration process of HW version combinations. Experimental results show the accuracy of the proposed method, with an average estimation error of 1.38% for the HW timing of each thread and a 2.80% estimation error for the system-level simulation, where the simulation speedup factor was on the order of 5,000 times. Currently the proposed framework partially depends on the high-level synthesis (HLS) tool eXCite, but other HLS tools can be easily integrated into the proposed framework.
  • Tulika Mitra
    Type: Invited Paper
    Subject area: Architectural Design
    2015 Volume 8 Pages 51-62
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    Transistor count continues to increase for silicon devices following Moore's Law. But the failure of Dennard scaling has brought the computing community to a crossroads where power has become the major limiting factor. Thus future chips can have many cores, but only a fraction of them can be switched on at any point in time. This dark silicon era, where a significant fraction of the chip real estate remains dark, has necessitated a fundamental rethinking of architectural designs. In this context, heterogeneous multi-core architectures combining a functionality- and performance-wise divergent mix of processing cores (CPU, GPU, special-purpose accelerators, and reconfigurable computing) offer a promising option. Heterogeneous multi-cores can potentially provide energy-efficient computation, as only the cores most suitable for the current computation need to be switched on. This article presents an overview of the state of the art in the heterogeneous multi-core landscape.
  • Matthias Jung, Christian Weis, Norbert Wehn
    Type: Invited Paper
    Subject area: Architectural Design
    2015 Volume 8 Pages 63-74
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    In systems ranging from mobile devices to servers, Dynamic Random Access Memories (DRAMs) have a big impact on performance and contribute a significant part of the total consumed power. Conventional DDR3-based solutions are stretched thin, as their maximum bandwidth is limited by the I/O count and interface speed. As new solutions are coming onto the market (JEDEC DDR4, JEDEC WIDE I/O, Micron's hybrid memory cube: HMC, or JEDEC's high bandwidth memory: HBM), it is critical to evaluate the performance of these solutions and assess their suitability for specific applications. Furthermore, in systems with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. It is crucial to have a flexible and holistic DRAM subsystem framework for exhaustive design space explorations, which can handle all these different types of memories, as well as the aspects of performance, power and temperature.
  • Ran Zhang, Tieyuan Pan, Li Zhu, Takahiro Watanabe
    Type: Regular Paper
    Subject area: Physical Design
    2015 Volume 8 Pages 75-84
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    In recent printed circuit board (PCB) design, due to the high density of integration, the signal propagation delay or skew has become an important factor in circuit performance. As the routing delay is proportional to the wire length, controllability of the wire length is usually the focus. In this research, a heuristic algorithm to obtain equal-length routing for disordered pins in PCB design is proposed. The approach initially checks the longest common subsequence of the source and target pin sets to assign layers to pins. Single commodity flow is then carried out to generate the base routes. Finally, considering the target length requirement and the available routing region, R-flip and C-flip are adopted to adjust the wire length. The experimental results show that the proposed method is able to obtain routes with better wire length balance and smaller worst length error in reasonable CPU time.
  • Lian Zeng, Xin Jiang, Takahiro Watanabe
    Type: Regular Paper
    Subject area: Architectural Design
    2015 Volume 8 Pages 85-94
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    With rapid progress in semiconductor technology, Network-on-Chip (NoC) has become an attractive solution for future systems on chip (SoCs). The network performance depends critically on the performance of packet routing. Router delay and packet contention can significantly affect network latency and throughput. As the network becomes more congested, packets are blocked more frequently, which degrades the network performance. In this article, we propose an innovative dual-switch allocation (DSA) design. By introducing the DSA design, we can make the utmost use of idle output ports to reduce packet contention delay without increasing router delay. Experimental results show that our design achieves a significant performance improvement in terms of throughput and latency at the cost of very little power and area overhead.
  • Yuki Ando, Yukihito Ishida, Shinya Honda, Hiroaki Takada, Masato Edahi ...
    Type: Short Paper
    Subject area: System-Level Design
    2015 Volume 8 Pages 95-99
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    This paper introduces an automatic synthesis technique and tool to implement inter-heterogeneous-processor communication for programmable system-on-chips (PSoCs). PSoCs have an ARM-based hard processor system connected to an FPGA fabric. By implementing soft processors in the FPGA fabric, PSoCs realize heterogeneous multiprocessors. Since the number and type of soft processors are configurable, PSoCs can form various heterogeneous multiprocessors. However, inter-heterogeneous-processor communications are not supported by single-binary operating systems. The proposed method automatically synthesizes the inter-heterogeneous-processor communications at the application layer from a general model description. The case study shows that the automatically generated inter-heterogeneous-processor communication correctly runs the system on heterogeneous multiprocessors.
  • Takuya Hatayama, Hideki Takase, Kazuyoshi Takagi, Naofumi Takagi
    Type: Short Paper
    Subject area: System-Level Design
    2015 Volume 8 Pages 100-104
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    In this paper, we propose the use of a memory system which has a partially reliable scratch-pad memory (SPM). The reliable region of the SPM, which employs ECC, has higher soft-error tolerance but larger energy consumption than the normal region. We propose an allocation method to optimize energy consumption while ensuring the required reliability. The allocation of instructions and data to the proposed memory system is formulated as an integer linear programming problem, whose solution achieves optimal energy consumption and the required reliability. Evaluation results show that the proposed method is effective when the overhead for error correction is large.
  • Takaaki Miyajima, David Thomas, Hideharu Amano
    Type: Regular Paper
    Subject area: System-Level Design
    2015 Volume 8 Pages 105-115
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    Computationally intensive applications using an open-source library such as OpenCV, BLAS or FFT are widely used in various research and industry applications. Although optimized code of such libraries has been prepared for accelerators, off-loading is difficult for non-expert users, especially when only the binary of an application can be accessed. This paper presents a new toolchain for application acceleration called Courier. It requires only an executable binary of the target application and a corresponding function code for an accelerator. It requires neither the source code of the application nor re-compilation of the binary. The workflow of Courier is simple and intended for non-expert users. It extracts runtime information from the running binary, generates a task graph, and then replaces the original function with a corresponding accelerator function. Many steps of the application acceleration process are executed automatically. Users can refer to the acceleration result and modify the task graph if needed. In our case studies, Courier was used to accelerate three applications: image processing, matrix multiplication and spectrum analysis. Functions are off-loaded to a GPU without any modification to the original source code. The applications are sped up 8.89, 8.16 and 1.23 times, respectively.
  • Motoki Amagasaki, Qian Zhao, Masahiro Iida, Morihiro Kuga, Toshinori S ...
    Type: Short Paper
    Subject area: Emerging Technology
    2015 Volume 8 Pages 116-122
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    To balance between cost and performance, and to explore 3D field-programmable gate array (FPGA) with realistic 3D integration processes, we propose spatially distributed and functionally distributed types of 3D FPGA architectures. The functionally distributed architecture consists of two wafers, a logic layer and a routing layer, and is stacked by a face-down process technology. Since vertical wires pass through microbumps, no TSVs are needed. In contrast, the spatially distributed architecture is divided into multiple layers with the same structure, unlike in the functionally distributed type. This architecture can be expanded to more than two layers by stacking multiples of the same die. The goal of this paper is to elucidate the advantages and disadvantages of these two types of 3D FPGAs. According to our evaluation, when only two layers are used, the functionally distributed architecture is more effective. When higher performance is achieved by using more than two layers, the spatially distributed architecture achieves better performance.
  • Yusaku Hirai, Shinya Yano, Toshimasa Matsuoka
    Type: Regular Paper
    Subject area: Analog Circuit Design
    2015 Volume 8 Pages 123-130
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    An application of stochastic A/D conversion to multi-bit delta-sigma modulators is considered, and a novel correction technique for D/A converter (DAC) error is proposed. Stochastic A/D conversion can reduce the area of the quantizer and allows large mismatches. The proposed calibration technique corrects DAC errors using a programmable quantizer. The programmable quantizer has a non-linear characteristic that cancels DAC errors. Using this technique, we can decrease the influence of DAC errors without using conventional dynamic element matching. This A/D converter has a non-linear quantization characteristic, so the output digital code must be corrected using a programmable encoder. This code correction and the setting of the quantization levels are carried out based on calibration data obtained using a genetic algorithm.
  • Shinichi Nishizawa, Tohru Ishihara, Hidetoshi Onodera
    Type: Short Paper
    Subject area: Physical Design
    2015 Volume 8 Pages 131-135
    Published: 2015
    Released: August 01, 2015
    JOURNALS FREE ACCESS
    This paper discusses a standard cell layout generator that can be used to generate a standard cell library optimized for a target application. It can generate an area-efficient layout from a virtual-grid symbolic layout, with flexible grid positioning that considers the local design rules enforced in a scaled technology. The generator reduces the cost of library design and enables optimization of each cell with detailed layout information that can be used to estimate the performance of the cell under design. A standard cell library has been generated for a commercial 28-nm FDSOI CMOS process using the proposed layout generator and used for circuit design. Correct operation of the designed circuit is observed from a fabricated chip test.