IPSJ Transactions on System LSI Design Methodology
Online ISSN : 1882-6687
ISSN-L : 1882-6687
  • Nozomu Togawa
    Type: Editorial
    Subject area: Editorial
    2019 Volume 12 Pages 1
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS
    Download PDF (32K)
  • A.K.M. Mahfuzul Islam, Hidetoshi Onodera
    Type: Invited Paper
    2019 Volume 12 Pages 2-12
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    In the quest to operate a circuit at its minimum energy, both design-time and run-time optimizations are being employed aggressively in mobile devices today. This paper overviews some of the energy reduction techniques and highlights the role of device-circuit interaction in achieving minimum energy operation. We show that run-time energy minimization can be achieved by keeping the ratio between static energy and dynamic energy at an optimum value. Focusing on ease of design, we then review several delay-based sensor circuits to monitor transistor performance, leakage current, and temperature. We also discuss activity monitors to estimate the dynamic energy. We then discuss a cell-based design approach for both the sensing and the back-gate voltage tuning circuits for area and cost reduction. Using the sensing and the voltage tuning circuits, the system can operate at its minimum energy all the time.

    Download PDF (1074K)
  • Kenshu Seto
    Type: Regular Paper
    Subject area: High Level Synthesis
    2019 Volume 12 Pages 13-21
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    Scalar replacement is one of the effective array access optimizations that can be applied before high-level synthesis (HLS). The successful application of scalar replacement removes local memories and, as a result, decreases hardware area. In addition, scalar replacement reduces the number of hardware execution cycles by reducing memory access conflicts. In scalar replacement, shift registers are introduced to remove local arrays, and reuse distances correspond to the lengths of the shift registers. Previous scalar replacement methods implement the shift registers with chains of registers, so the hardware area becomes large when the reuse distances are large. In addition, when reuse distances are unknown at compile time, previous scalar replacement methods require multiplexers with large numbers of inputs, which further increase hardware area. In this paper, we propose a new technique to resolve these issues. In particular, we implement the shift registers with circular buffers instead of chains of registers. Large shift registers implemented by RAM-based circular buffers are more compact than those implemented by chains of registers. We also show that the proposed method requires no multiplexers to realize scalar replacement for loops with statically unknown reuse distances, which leads to an area-efficient hardware implementation. We developed a tool that implements the method and applied it to benchmark programs that require large shift registers or have statically unknown reuse distances. We found that the proposed method reduces the hardware area compared to the previous method without sacrificing hardware performance. We conclude that the proposed method is an area-efficient scalar replacement method for programs that have large or unknown reuse distances at compile time.
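
    The circular-buffer idea in this abstract can be illustrated in software. The sketch below is not the paper's tool or its generated hardware; the class `CircularShiftRegister` and the helper `sum_with_delayed` are hypothetical names chosen for illustration. It assumes a loop that reuses the value read `d` iterations earlier, and shows that a memory plus one wrapping index behaves like a shift register of length `d`, where `d` may be chosen at run time.

```python
class CircularShiftRegister:
    """Delay line of a given length, backed by a RAM-like list and one index.

    Hypothetical illustration: a chain of registers would move every element
    on each shift; here only one memory word and the index change.
    """

    def __init__(self, length, fill=0):
        if length < 1:
            raise ValueError("length must be at least 1")
        self.mem = [fill] * length  # stands in for a RAM block
        self.head = 0               # position of the oldest element

    def shift(self, value):
        """Store `value` and return the value stored `length` shifts ago."""
        oldest = self.mem[self.head]
        self.mem[self.head] = value
        self.head = (self.head + 1) % len(self.mem)
        return oldest


def sum_with_delayed(a, d):
    """Compute a[i] + a[i - d] (treating a[i - d] as 0 when i < d).

    The reuse distance `d` is an ordinary run-time argument, so the same
    code works when the distance is not known in advance.
    """
    reg = CircularShiftRegister(d)
    out = []
    for x in a:
        delayed = reg.shift(x)
        out.append(x + delayed)
    return out
```

    For example, `sum_with_delayed([1, 2, 3, 4, 5], 2)` reads each input element once and obtains the element from two iterations earlier from the buffer rather than from the array.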

    Download PDF (894K)
  • Salita Sombatsiri, Seiya Shibata, Yuki Kobayashi, Hiroaki Inoue, Takas ...
    Type: Regular Paper
    Subject area: System-Level Design
    2019 Volume 12 Pages 22-37
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    The performance of recent CNN accelerators falls behind their peak performance because they fail to maximize parallel computation in every convolutional layer, as the available parallelism varies throughout the CNN. Furthermore, the exploitation of multiple types of parallelism may reduce the ability to skip calculations. This paper proposes a convolution core for sparse CNNs that leverages multiple types of parallelism and weight sparsity efficiently to achieve high performance. It adapts the dataflow and scheduling of parallel computation to the available parallelism of each convolutional layer by exploiting both intra- and inter-output parallelism to maximize multiplier utilization. In addition, it eliminates multiply-accumulate (MACC) operations made redundant by weight sparsity. The proposed convolution core enables both abilities with easy dataflow control by using a parallelism controller for scheduling parallel MACCs on the processing elements (PEs) and a weight broadcaster for broadcasting non-zero weights to the PEs according to the schedule. The proposed convolution core was evaluated on 13 convolutional layers in a sparse VGG-16 benchmark. It outperforms the baseline architecture for dense CNNs that exploits intra-output parallelism with a 4x speedup. It achieves 3x the effective GMACS of prior CNN accelerators in total performance.
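
    The zero-weight skipping described above can be sketched in a few lines. This is a simplified software illustration, not the paper's architecture: it assumes a 1-D convolution, models each output position as one hypothetical PE, and the function name `conv1d_skip_zero_weights` is invented for the example. Each non-zero weight is "broadcast" to all PEs, and zero weights cost no MACC operations at all.

```python
def conv1d_skip_zero_weights(x, weights):
    """1-D valid convolution (correlation form) that skips zero weights.

    Returns the outputs and the number of MACC operations performed.
    Each output position stands in for one hypothetical processing element.
    """
    n_out = len(x) - len(weights) + 1
    acc = [0] * n_out   # one accumulator per hypothetical PE
    maccs = 0
    for k, w in enumerate(weights):
        if w == 0:
            continue    # zero weight: never broadcast, no PE does any work
        # Broadcast the non-zero weight; in hardware the PEs would run in parallel.
        for p in range(n_out):
            acc[p] += w * x[p + k]
            maccs += 1
    return acc, maccs
```

    With `x = [1, 2, 3, 4, 5]` and `weights = [2, 0, 1]`, a dense computation would perform 9 MACCs, while this sketch performs 6 because the middle weight is skipped.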

    Download PDF (2339K)
  • Koichi Fujiwara, Kazushi Kawamura, Masao Yanagisawa, Nozomu Togawa
    Type: Short Paper
    Subject area: Architectural Design
    2019 Volume 12 Pages 38-41
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    In order to reduce the effects of interconnection delays in recent FPGA chips, circuit designs based on distributed-register architectures (which we call DR-based circuit designs) are important. Several methods for DR-based circuit designs have been proposed, in which high-level synthesis techniques are effectively utilized. However, no methods have yet been proposed to practically implement DR-based circuits on FPGA chips. In this paper, we propose an FPGA implementation method based on DR architectures and apply it to a DR-based circuit. The implementation result shows that it operates on an FPGA chip 21% faster than the circuit based on a traditional architecture.

    Download PDF (668K)
  • Seiya Shirakuni, Ittetsu Taniguchi, Hiroyuki Tomiyama
    Type: Short Paper
    Subject area: Architectural Design
    2019 Volume 12 Pages 42-45
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    Due to advances in semiconductor technologies, recent FPGA devices are able to implement a number of CPU cores to realize high-performance embedded systems. This paper presents a case study on the design, implementation, and evaluation of manycore architectures on an FPGA. Two types of 32-core architectures with different topologies, i.e., asymmetric and symmetric architectures, are designed and implemented on an FPGA, together with an OpenCL-based software framework. The performance of the two architectures is evaluated based on actual measurement using various application programs.

    Download PDF (442K)
  • Takafumi Miyazaki, Shunsuke Takai, Ittetsu Taniguchi, Hiroyuki Tomiyam ...
    Type: Short Paper
    Subject area: Architectural Design
    2019 Volume 12 Pages 46-49
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    This paper presents an OpenCL-based software framework which we have developed for a heterogeneous multicore architecture on Zynq-7000 SoC. In this work, the heterogeneous architecture is designed with two hard-macro Cortex-A9 cores and two soft-macro MicroBlaze cores. A major advantage of our OpenCL framework is that it can execute OpenCL kernel programs in three ways. Experiments show the usefulness of the OpenCL framework.

    Download PDF (356K)
  • Hiroki Koyasu, Yasuhiro Takahashi
    Type: Short Paper
    Subject area: Circuit Design
    2019 Volume 12 Pages 50-52
    Published: 2019
    Released: February 22, 2019
    JOURNALS FREE ACCESS

    We propose a new adiabatic logic for cryptographic circuits, called the Current Pass Optimized Symmetric Pass Gate Adiabatic Logic (CPO-SPGAL). The proposed circuit realizes a flat current waveform by considering the current path. The simulation results demonstrate that the proposed circuit can reduce the current fluctuation by approximately 84% and the energy consumption fluctuation by approximately 79% compared to the existing SPGAL circuits. This shows that it is more resistant to differential power analysis attacks than conventional circuits.

    Download PDF (342K)
  • Chaofei Yang, Ximing Qiao, Yiran Chen
    Type: Invited Paper
    2019 Volume 12 Pages 53-64
    Published: 2019
    Released: August 01, 2019
    JOURNALS FREE ACCESS

    The end of Moore's Law and the von Neumann bottleneck motivate researchers to seek alternative architectures that can fulfill the increasing demand for computation resources, which cannot be easily met by the traditional computing paradigm. As one important practice, neuromorphic computing systems (NCS) are proposed to mimic the biological behaviors of neurons and synapses and to accelerate the computation of neural networks. Traditional CMOS-based implementations of NCS, however, are subject to the large hardware cost required to precisely replicate the biological properties. In the most recent decade, emerging nonvolatile memory (eNVM) was introduced to NCS design due to its high computing efficiency and integration density. Similar to circuits built on other nanoscale devices, eNVM-based NCS also suffer from many reliability issues. In this paper, we give a short survey of CMOS- and eNVM-based NCS, including their basic implementations and their training and inference schemes in various applications. We also discuss the design challenges of these NCS and introduce some techniques that can improve the reliability, precision, scalability, and security of the NCS. At the end, we provide our insights on the design trends and future challenges of NCS.

    Download PDF (1653K)
  • Kana Shimada, Ittetsu Taniguchi, Hiroyuki Tomiyama
    Type: Regular Paper
    Subject area: High Level Synthesis
    2019 Volume 12 Pages 65-73
    Published: 2019
    Released: August 01, 2019
    JOURNALS FREE ACCESS

    Task scheduling has a significant impact on multicore computing systems. This paper studies the scheduling of data-parallel tasks on multicore architectures. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. In this paper, the inter-task communication overhead is taken into account during scheduling. Communication happens if the main threads of two tasks with data dependencies are mapped onto different processors. This paper proposes two methods for data-parallel task scheduling with communication overhead. One is a two-step method, which schedules tasks without communication and then assigns the threads of each task to cores. The other is an integrated method, which performs task scheduling and thread assignment simultaneously. Both methods are based on integer linear programming. The proposed methods are evaluated through experiments, and encouraging results are obtained.

    Download PDF (1227K)
  • Yang Liu, Lin Meng, Hiroyuki Tomiyama
    Type: Short Paper
    Subject area: High Level Synthesis
    2019 Volume 12 Pages 74-77
    Published: 2019
    Released: August 01, 2019
    JOURNALS FREE ACCESS

    This paper proposes a genetic algorithm for scheduling of multiple data-parallel tasks on multicores. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. Experimental results show the effectiveness of the proposed algorithm over state-of-the-art algorithms.

    Download PDF (379K)
  • Nobutaka Kito, Kazuyoshi Takagi, Naofumi Takagi
    Type: Short Paper
    Subject area: Logic Circuit Design
    2019 Volume 12 Pages 78-80
    Published: 2019
    Released: August 01, 2019
    JOURNALS FREE ACCESS

    A method for converting a netlist consisting of conventional logic gates for realization as a superconducting rapid single flux quantum (RSFQ) circuit is proposed. The method detects OR gates that can be replaced with confluence buffers (CBs), which merge their input pulses into their outputs. The problem of detecting replaceable OR gates is treated as a SAT problem. Each OR gate requires a clock input in RSFQ circuits. By replacing OR gates with CBs, the wiring for clocking those OR gates is eliminated and the number of active devices, known as Josephson junctions, is reduced.
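
    The detection step can be pictured with a small check. The abstract does not state the replacement condition, so the sketch below assumes one: an OR gate is treated as replaceable when its two inputs can never both be 1 for any primary-input assignment (so that a pulse-merging buffer would never receive two pulses at once). Exhaustive enumeration stands in for the SAT solver here, and the function name and the example sub-functions are hypothetical.

```python
from itertools import product


def or_inputs_never_both_high(f, g, num_inputs):
    """Return True if no primary-input assignment makes f and g both 1.

    `f` and `g` model the logic functions feeding the two inputs of an OR
    gate. This brute-force loop plays the role a SAT query would play:
    the gate is "replaceable" (under the stated assumption) exactly when
    f AND g is unsatisfiable.
    """
    for bits in product((0, 1), repeat=num_inputs):
        if f(*bits) and g(*bits):
            return False  # found a satisfying assignment of f AND g
    return True


# Hypothetical netlist fragment: out = (a AND b) OR (a AND NOT b)
def and_ab(a, b):
    return a & b


def and_a_notb(a, b):
    return a & (1 - b)


# Hypothetical fragment where both inputs can be high: out = a OR b
def just_a(a, b):
    return a


def just_b(a, b):
    return b
```

    Under this assumption, the OR gate fed by `and_ab` and `and_a_notb` passes the check because the two terms are mutually exclusive, while the plain `a OR b` gate does not.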

    Download PDF (203K)