Scalar replacement is an effective technique to improve the performance of the RTL code generated by high-level synthesis (HLS) from C programs with intensive array accesses. In scalar replacement, data accessed from arrays are stored into shift registers, and later array accesses on the same data are replaced with the accesses to the shift registers instead of the arrays. Namely, scalar replacement replaces array accesses with shift register accesses. Since arrays in C programs are usually mapped to RAMs with limited numbers of ports, reducing array accesses with scalar replacement leads to the memory access reduction, which in turn improves the performance of the resulting RTL code. In real-life C programs, sometimes, shift registers must be initialized conditionally using multiple array accesses, which increases the number of array accesses in main loops. To reduce the conditional array access in the main loops, the previous scalar replacement method proposed the use of a loop transformation called loop peeling. Loop peeling brings significant increase in code size, leading to the negative impacts on performance or circuit area of the synthesized hardware. In this paper, we propose a new method to initialize shift registers without loop peeling. The proposed method works as a preprocessing of the input C program prior to scalar replacement. With experimental results, we demonstrate the proposed method reduces the numbers of execution cycles of the synthesized hardware compared to the previous method.
Stochastic computing is a computation method which can implement arithmetic operations by simple logic circuits. Stochastic numbers are used in this method, whose values are defined by their bit streams' appearance rates of 1's. As a nature of stochastic computing, changing the length of the input stochastic numbers will change the whole circuit's accuracy. However, in some implementations with re-convergence paths, the circuit itself will cause errors, i.e., the length of the input stochastic numbers does not change that circuit's accuracy. This paper proposes a stochastic number duplicator whose outputs differ every time and are all independent. This stochastic number duplicator has a scalable structure by changing the numbers of flip-flops for bit re-arrangement. From the experimental evaluations and discussions, we clarify that the proposed stochastic number duplicator enables accuracy-flexible circuits.
With the progress of semiconductor process miniaturization, delay degradation by aging increases and threatens the reliability of fabricated chips. The amount of delay degradation is known to be circuit and workload dependent, but previous evaluations are based on simulations, and delay degradation measurement of real circuit under realistic workload has not been reported yet. This paper proposes real circuit delay measurement method, which achieves enough accuracy to measure circuit and workload dependent delay degradation. In the proposed method, on-chip oscillator supplies fine resolution variable frequency clock to internal circuit. Internal circuit execute test pattern to activate critical paths at various frequency and determine the maximum frequency at which correct results can be obtained. The maximum frequency corresponds to the delay of the critical paths activated by the test pattern. Clock multiplication improves delay resolution, and repetitive measurement reduces measurement error caused by time dependent random delay variation. The proposed method has been implemented on a 65nm low power process test chip. Variable frequency oscillator utilizes only standard cells and is designed with automatic layout flow without any timing tuning. The area overhead of the proposed method is 0.09% of the total random logic. The evaluation result show that 0.18% average measurement accuracy has been achieved.
Recently, there have been more chances to calculate matrix-vector multiplication due to the growing use of the neural network. We have proposed the method to automatically synthesize the optimum parallel algorithm for the given environment and synthesized an algorithm for matrix-vector multiplication of a specific size matrix with 4 nodes connected in a oneway ring. This paper proposes a method to generalize the synthesized algorithm to deal with any size matrix. We generalized the synthesized algorithm for the 32×32 matrix to calculate N × N matrix-vector multiplication.
In this paper, we propose a logic optimization method to remove the redundancy in the circuit. The incremental Automatic Test Pattern Generation method is used to find the redundant multiple faults. In order to remove as many redundancies as possible, instead of removing the redundant single faults first, we clear up the redundant faults from higher cardinality to lower cardinality. The experiments prove that the proposed method can successfully eliminate more redundancies comparing to the redundancy removal command in the synthesis tool SIS.
This paper presents experimental evaluations of operating speed, power consumption, and chip temperature under various load conditions on commercial FPGAs. To measure the operating speed of FPGAs, we count the oscillation frequency of a ring oscillator (RO), which is implemented on one 4-input look-up table (LUT). The load condition on FPGAs can be controlled by the number of activated ROs in parallel around the measurement RO. The measurement results show that operating speed may be decreased more than10% with 142 ROs as load circuits and the degradation rates depend on the measurement location in an FPGA. The evaluation of die-to-die variations using four more FPGA boards show that there exists a non-negligible speed variation.