In the quest to operate circuits at their minimum energy, both design-time and run-time optimizations are employed aggressively in today's mobile devices. This paper overviews several energy reduction techniques and highlights the role of device-circuit interaction in achieving minimum-energy operation. We show that run-time energy minimization can be achieved by keeping the ratio between static energy and dynamic energy at an optimum value. Focusing on ease of design, we then review several delay-based sensor circuits that monitor transistor performance, leakage current, and temperature. We also discuss activity monitors that estimate dynamic energy. We then discuss a cell-based design approach for both the sensing circuits and the back-gate voltage tuning circuits to reduce area and cost. Using these sensing and voltage tuning circuits, the system can operate at its minimum energy at all times.
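The trade-off behind the optimum static/dynamic energy ratio can be illustrated with a toy numeric model (my own illustration, not the paper's model or constants): dynamic energy per cycle falls quadratically with the supply voltage, while static energy grows as the clock period stretches at low voltage, so a sweep of the supply voltage exposes a minimum-energy point.

```python
import math

# Illustrative toy model (not from the paper): E_dyn ~ C * Vdd^2 falls as
# Vdd drops, while E_static ~ I_leak * Vdd * T rises because the cycle
# time T grows rapidly at low Vdd.  All constants below are made up for
# illustration.

def dynamic_energy(vdd, c_eff=1.0):
    return c_eff * vdd ** 2                 # switching energy per cycle

def static_energy(vdd, i_leak=0.05, k=5.0):
    period = math.exp(k * (1.0 - vdd))      # delay blows up at low Vdd
    return i_leak * vdd * period            # leakage integrated over a cycle

def total_energy(vdd):
    return dynamic_energy(vdd) + static_energy(vdd)

# Sweep the supply voltage to locate the minimum-energy point and the
# static/dynamic ratio at that point.
vdds = [0.3 + 0.01 * i for i in range(70)]
v_opt = min(vdds, key=total_energy)
ratio = static_energy(v_opt) / dynamic_energy(v_opt)
print(f"V_opt = {v_opt:.2f}, static/dynamic ratio = {ratio:.2f}")
```

Under this model the minimum sits strictly between the extremes, which is the situation a run-time controller can track by steering the measured ratio toward its optimum.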
Scalar replacement is an effective array-access optimization that can be applied before high-level synthesis (HLS). Successful application of scalar replacement removes local memories and, as a result, decreases hardware area. In addition, scalar replacement reduces the number of hardware execution cycles by reducing memory access conflicts. In scalar replacement, shift registers are introduced to remove local arrays, and the reuse distances correspond to the lengths of the shift registers. Previous scalar replacement methods implement the shift registers as chains of registers, so the hardware area becomes large when the reuse distances are large. In addition, when reuse distances are unknown at compile time, previous methods require multiplexers with large numbers of inputs, which further increases hardware area. In this paper, we propose a new technique to resolve these issues. In particular, we implement the shift registers with circular buffers instead of chains of registers. Large shift registers implemented as RAM-based circular buffers are more compact than those implemented as chains of registers. We also show that the proposed method requires no multiplexers to realize scalar replacement for loops with statically unknown reuse distances, which leads to an area-efficient hardware implementation. We developed a tool that implements the method and applied it to benchmark programs that require large shift registers or have statically unknown reuse distances. We found that the proposed method reduces hardware area compared to the previous method without sacrificing hardware performance. We conclude that the proposed method is an area-efficient scalar replacement method for programs whose reuse distances are large or unknown at compile time.
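The core idea can be sketched in software (a minimal sketch with names of my own choosing, not the paper's implementation): a chain of N registers moves every element on each shift, whereas a circular buffer of the same capacity only advances a head pointer, which maps naturally onto a RAM; reading at an arbitrary reuse distance then becomes an address computation rather than a wide multiplexer over all register outputs.

```python
# Sketch of a RAM-based circular buffer standing in for a shift register.
# shift_in() performs one write per cycle instead of moving N values,
# and read(distance) turns a variable reuse distance into an address
# offset, with no multiplexer over register outputs.

class CircularShiftRegister:
    def __init__(self, capacity):
        self.buf = [0] * capacity
        self.head = 0                        # index of the newest element

    def shift_in(self, value):
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = value          # single write, no data movement

    def read(self, distance):
        # The element inserted `distance` shifts ago.
        return self.buf[(self.head + distance) % len(self.buf)]

reg = CircularShiftRegister(capacity=8)
for v in range(10):
    reg.shift_in(v)
print(reg.read(0), reg.read(3))  # newest value, and the one from 3 shifts ago
```

Because `distance` is just an address offset, it may also be a run-time value, which mirrors how the method handles statically unknown reuse distances without multiplexers.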
The performance of recent CNN accelerators falls short of their peak because they fail to maximize parallel computation in every convolutional layer, since the available parallelism varies throughout the CNN. Furthermore, exploiting multiple forms of parallelism may reduce the ability to skip calculations. This paper proposes a convolution core for sparse CNNs that efficiently leverages multiple types of parallelism and weight sparsity to achieve high performance. It adapts the dataflow and the scheduling of parallel computation to the available parallelism of each convolutional layer, exploiting both intra- and inter-output parallelism to maximize multiplier utilization. In addition, it eliminates multiply-accumulate (MACC) operations that are redundant due to weight sparsity. The proposed convolution core enables both capabilities with simple dataflow control by using a parallelism controller that schedules parallel MACCs on the processing elements (PEs) and a weight broadcaster that broadcasts non-zero weights to the PEs according to the schedule. The proposed convolution core was evaluated on the 13 convolutional layers of a sparse VGG-16 benchmark. It achieves a 4x speedup over a baseline dense-CNN architecture that exploits only intra-output parallelism, and 3x the effective GMACS of prior CNN accelerators in total performance.
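The interaction of weight broadcasting and output parallelism can be illustrated with a simplified software sketch (a 1-D convolution of my own construction, not the paper's core): each non-zero weight is broadcast once, groups of "PEs" accumulate it into several output positions in parallel, and zero weights generate no MACC operations at all.

```python
# Illustrative sketch: broadcast each non-zero weight to a group of PEs
# that each accumulate one output position (inter-output parallelism);
# zero weights are skipped, eliminating their MACCs entirely.

def sparse_conv1d(inputs, weights, num_pes=4):
    out_len = len(inputs) - len(weights) + 1
    outputs = [0.0] * out_len
    macc_count = 0
    for k, w in enumerate(weights):
        if w == 0:
            continue                         # calculation skip: no broadcast
        # Each stripe of up to num_pes outputs is updated "in parallel".
        for start in range(0, out_len, num_pes):
            for pe in range(min(num_pes, out_len - start)):
                o = start + pe
                outputs[o] += w * inputs[o + k]
                macc_count += 1
    return outputs, macc_count

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.5, 0.0, 0.0, 0.25]                    # 50% sparse kernel
y, maccs = sparse_conv1d(x, w)
print(y, maccs)                              # only non-zero weights cost MACCs
```

With half the kernel zero, the MACC count halves relative to a dense loop while the outputs are unchanged, which is the effect the weight broadcaster exploits in hardware.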
In order to reduce the effects of interconnection delays in recent FPGA chips, circuit designs based on distributed-register architectures (which we call DR-based circuit designs) are important. Several methods for DR-based circuit design have been proposed in which high-level synthesis techniques are effectively utilized. However, no method has yet been proposed to practically implement DR-based circuits on FPGA chips. In this paper, we propose an FPGA implementation method based on DR architectures and apply it to a DR-based circuit. The implementation result shows that the circuit operates on an FPGA chip 21% faster than a circuit based on a traditional architecture.
Thanks to advances in semiconductor technologies, recent FPGA devices can implement a large number of CPU cores to realize high-performance embedded systems. This paper presents a case study on the design, implementation, and evaluation of manycore architectures on an FPGA. Two types of 32-core architectures with different topologies, i.e., asymmetric and symmetric architectures, are designed and implemented on an FPGA, together with an OpenCL-based software framework. The performance of the two architectures is evaluated through actual measurements using various application programs.
This paper presents an OpenCL-based software framework which we have developed for a heterogeneous multicore architecture on Zynq-7000 SoC. In this work, the heterogeneous architecture is designed with two hard-macro Cortex-A9 cores and two soft-macro MicroBlaze cores. A major advantage of our OpenCL framework is that it can execute OpenCL kernel programs in three ways. Experiments show the usefulness of the OpenCL framework.
We propose a new adiabatic logic for cryptographic circuits, called the Current Pass Optimized Symmetric Pass Gate Adiabatic Logic (CPO-SPGAL). The proposed circuit realizes a flat current waveform by considering the current path. Simulation results demonstrate that the proposed circuit reduces the current fluctuation by approximately 84% and the energy consumption fluctuation by approximately 79% compared to existing SPGAL circuits. These results indicate that the proposed circuit is more resistant to differential power analysis attacks than conventional circuits.
The end of Moore's Law and the von Neumann bottleneck motivate researchers to seek alternative architectures that can fulfill the increasing demand for computational resources, which cannot easily be met by traditional computing paradigms. As one important approach, neuromorphic computing systems (NCS) have been proposed to mimic the biological behaviors of neurons and synapses and to accelerate the computation of neural networks. Traditional CMOS-based implementations of NCS, however, incur a large hardware cost to precisely replicate these biological properties. In the past decade, emerging nonvolatile memory (eNVM) has been introduced into NCS design for its high computing efficiency and integration density. Like circuits built on other nanoscale devices, however, eNVM-based NCS suffer from many reliability issues. In this paper, we give a short survey of CMOS- and eNVM-based NCS, including their basic implementations and their training and inference schemes in various applications. We also discuss the design challenges of these NCS and introduce techniques that can improve their reliability, precision, scalability, and security. Finally, we provide our insights on design trends and future challenges for NCS.
Task scheduling has a significant impact on multicore computing systems. This paper studies the scheduling of data-parallel tasks on multicore architectures. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. Inter-task communication overhead is taken into account during scheduling: communication occurs when the main threads of two tasks with data dependencies are mapped onto different processors. This paper proposes two methods for data-parallel task scheduling with communication overhead. One is a two-step method, which first schedules tasks without considering communication and then assigns each task's threads to cores. The other is an integrated method, which performs task scheduling and thread assignment simultaneously. Both methods are based on integer linear programming. The proposed methods are evaluated through experiments, and encouraging results are obtained.
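The decision space of the integrated method can be illustrated with a tiny brute-force stand-in (my simplification, not the paper's ILP formulation): jointly choose each task's core count and the processor hosting its main thread, charging a communication cost whenever data-dependent tasks' main threads land on different processors. Tasks are serialized in dependency order here, a deliberate simplification of the makespan model.

```python
from itertools import product

def schedule(tasks, deps, num_cores, comm_cost):
    # tasks: ordered {name: work}; a task on c cores takes work / c time.
    # deps: list of (src, dst) pairs; tasks are listed in dependency order.
    names = list(tasks)
    best_span, best_plan = float("inf"), None
    for alloc in product(range(1, num_cores + 1), repeat=len(names)):
        for homes in product(range(num_cores), repeat=len(names)):
            span = 0.0
            for i, name in enumerate(names):
                # Charge comm_cost for each predecessor whose main
                # thread sits on a different processor.
                span += sum(comm_cost for (src, dst) in deps
                            if dst == name
                            and homes[names.index(src)] != homes[i])
                span += tasks[name] / alloc[i]
            if span < best_span:
                best_span = span
                best_plan = {n: {"cores": c, "home": h}
                             for n, c, h in zip(names, alloc, homes)}
    return best_span, best_plan

span, plan = schedule({"A": 8.0, "B": 4.0}, deps=[("A", "B")],
                      num_cores=2, comm_cost=1.0)
print(span, plan)
```

Even in this two-task example the optimizer co-locates the dependent tasks' main threads to avoid the communication penalty, which is exactly the coupling between scheduling and thread assignment that the integrated method captures and the two-step method may miss.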
This paper proposes a genetic algorithm for scheduling multiple data-parallel tasks on multicores. Unlike traditional task scheduling, this work allows individual tasks to run on multiple cores in a data-parallel fashion. Experimental results show the effectiveness of the proposed algorithm over state-of-the-art algorithms.
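A generic GA skeleton for this kind of problem looks as follows (a hedged sketch of my own, not the paper's algorithm or encoding): a chromosome assigns a core count to each task, fitness is the makespan under a simple serialized model, and infeasible allocations are penalized.

```python
import random

# Illustrative GA sketch: chromosome = per-task core counts; fitness =
# makespan with tasks run one after another, a task on c cores taking
# work / c time.  Workloads, budget, and GA parameters are made up.

random.seed(0)
WORK = [6.0, 4.0, 2.0]       # per-task work (illustrative)
NUM_CORES = 4
BUDGET = 6                   # total core-allocation budget (illustrative)

def makespan(chromo):
    if sum(chromo) > BUDGET:                  # penalize infeasible allocations
        return float("inf")
    return sum(w / c for w, c in zip(WORK, chromo))

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chromo):
    child = list(chromo)
    child[random.randrange(len(child))] = random.randint(1, NUM_CORES)
    return child

pop = [[random.randint(1, NUM_CORES) for _ in WORK] for _ in range(20)]
for _ in range(100):
    pop.sort(key=makespan)                    # elitist selection
    elite = pop[:10]
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(10)]
best = min(pop, key=makespan)
print(best, makespan(best))
```

For this toy instance the GA converges quickly because the search space is tiny; the appeal of the approach is that the same skeleton scales to task graphs where exhaustive or ILP search becomes expensive.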
We propose a method for converting a netlist of conventional logic gates for realization as a superconducting rapid single flux quantum (RSFQ) circuit. The method detects OR gates that can be replaced with confluence buffers (CBs), which merge their input pulses into their outputs. The problem of detecting replaceable OR gates is formulated as a SAT problem. In RSFQ circuits, each OR gate requires a clock input; by replacing OR gates with CBs, the wiring for clocking those OR gates is eliminated and the number of active devices, known as Josephson junctions, is reduced.
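The replaceability condition can be sketched as follows (my simplification, not the paper's exact SAT encoding): an OR gate can become a confluence buffer if its two fan-in signals can never be 1 at the same time, since a CB merely merges non-colliding pulses. The paper hands this query to a SAT solver; here, brute-force enumeration over primary inputs stands in for one.

```python
from itertools import product

# Toy netlist evaluator and replaceability check.  A netlist is an
# ordered {name: (op, in1, in2)} dict in topological order; NOT gates
# ignore their second operand.

def eval_gate(gate, values):
    op, a, b = gate
    if op == "NOT":
        return 1 - values[a]
    if op == "AND":
        return values[a] & values[b]
    if op == "OR":
        return values[a] | values[b]
    raise ValueError(op)

def replaceable_or_gates(netlist, primary_inputs):
    or_gates = [n for n, g in netlist.items() if g[0] == "OR"]
    candidates = set(or_gates)
    for bits in product([0, 1], repeat=len(primary_inputs)):
        values = dict(zip(primary_inputs, bits))
        for name, gate in netlist.items():
            values[name] = eval_gate(gate, values)
        for name in list(candidates):
            _, a, b = netlist[name]
            if values[a] and values[b]:       # inputs can collide: keep the OR
                candidates.discard(name)
    return candidates

netlist = {
    "n":  ("NOT", "a", None),
    "g1": ("AND", "a", "b"),
    "g2": ("AND", "n", "b"),
    "o1": ("OR", "g1", "g2"),   # fan-ins are mutually exclusive
    "o2": ("OR", "a", "b"),     # a = b = 1 makes the inputs collide
}
print(replaceable_or_gates(netlist, ["a", "b"]))
```

Here `o1` is replaceable because its fan-ins are gated by `a` and its complement, while `o2` is not; a SAT solver answers the same question symbolically, without enumerating all input assignments.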