Addition is a key fundamental function for many error-tolerant applications. Approximate addition is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes a carry-maskable adder whose accuracy can be configured at runtime. The proposed scheme can dynamically select the length of the carry propagation to satisfy the quality requirements flexibly. Compared with a conventional ripple carry adder and a conventional carry look-ahead adder, the proposed 16-bit adder reduced the power consumption by 54.1% and 57.5%, respectively, and the critical path delay by 72.5% and 54.2%, respectively. In addition, results from an image processing application indicate that the quality of processed images can be controlled by the proposed adder. Good scalability of the proposed adder is demonstrated from the evaluation results using a 32-bit length.
In this paper, we describe a novel low-delay 4K 120-fps real-time HEVC decoder with a parallel processing architecture that conforms to the HEVC main 4:2:2 10 profile. It supports the hierarchical temporal scalable streams required for Ultra High Definition high-frame-rate broadcasting and also supports low-delay and high-bitrate decoding for video transmission uses. To achieve this support, the decoding processes are parallelized and pipelined at the frame level, slice level, and coding tree unit row level. The proposed decoder was implemented on three FPGAs operated at 133 and 150 MHz, and it achieved 300-Mbps stream decoding and 37-msec end-to-end delay with our concurrently developed 4K 120-fps encoder.
The advancement of multicore technology has made hundreds or even thousands of cores processor on a single chip possible. However, on a larger scale multicore, a hardware-based cache coherency mechanism becomes overwhelmingly complicated, hot, and expensive. Therefore, we propose a software coherence scheme managed by a parallelizing compiler for shared-memory multicore systems without a hardware cache coherence mechanism. Our proposed method is simple and efficient. It is built into OSCAR automatic parallelizing compiler. The OSCAR compiler parallelizes the coarse grain task, analyzes stale data and line sharing in the program, then solves those problems by simple program restructuring and data synchronization. Using our proposed method, we compiled 10 benchmark programs from SPEC2000, SPEC2006, NAS Parallel Benchmark (NPB), and MediaBench II. The compiled binaries then are run on Renesas RP2, an 8 cores SH-4A processor, and a custom 8-core Altera Nios II system on Altera Arria 10 FPGA. The cache coherence hardware on the RP2 processor is only available for up to 4 cores. The RP2's cache coherence hardware can also be turned off for non-coherence cache mode. The Nios II multicore system does not have any hardware cache coherence mechanism; therefore, running a parallel program is difficult without any compiler support. The proposed method performed as good as or better than the hardware cache coherence scheme while still provided the correct result as the hardware coherence mechanism. This method allows a massive array of shared memory CPU cores in an HPC setting or a simple non-coherent multicore embedded CPU to be easily programmed. For example, on the RP2 processor, the proposed software-controlled non-coherent-cache (NCC) method gave us 2.6 times speedup for SPEC 2000 “equake” with 4 cores against sequential execution while got only 2.5 times speedup for 4 cores MESI hardware coherent control. Also, the software coherence control gave us 4.4 times speedup for 8 cores with no hardware coherence mechanism available.
Utilization of local memory from real-time embedded systems to high performance systems with multi-core processors has become an important factor for satisfying hard deadline constraints. However, challenges lie in the area of efficiently managing the memory hierarchy, such as decomposing large data into small blocks to fit onto local memory and transferring blocks for reuse and replacement. To address this issue, this paper presents a compiler optimization method that automatically manage local memory of multi-core processors. The method selects and maps multi-dimensional data onto software specified memory blocks called Adjustable Blocks. These blocks are hierarchically divisible with varying sizes defined by the features of the input application. Moreover, the method introduces mapping structures called Template Arrays to maintain the indices of the decomposed multi-dimensional data. The proposed work is implemented on the OSCAR automatic parallelizing compiler and evaluations were performed on the Renesas RP2 8-core processor. Experimental results from NAS Parallel Benchmark, SPEC benchmark, and multimedia applications show the effectiveness of the method, obtaining maximum speed-ups of 20.44 with 8 cores utilizing local memory from single core sequential versions that use off-chip memory.
This paper analyzes and compares two methods to estimate electromagnetically coupled noises introduced to an antenna due to the nearby circuits at a circuit design stage. One of them is to estimate the power spectrum, and the other one is to estimate the active S11 parameter at the victim antenna, respectively, and both of them use simulated standard S-parameters for the electromagnetic coupling in the circuit. They also need the assumed or measured excitation of noise sources. To confirm the validness of the two methods, an evaluation board consisting of an antenna and noise sources were designed and fabricated in which voltage controlled oscillator (VCO) chips are placed as noise sources. The generated electromagnetic noises are transferred to an antenna via loop-shaped transmission lines, degrading the performance of the antenna. In this paper, detailed analysis procedures are described using the evaluation board, and it is shown that the two methods are equivalent to each other in terms of the induced voltages in the antenna. Finally, a procedure to estimate antenna performance degradation at the design stage is summarized.
The degradation of a SiC-MOSFET-based DC-AC converter-circuit efficiency due to aging of the electrically active devices is investigated. The newly developed compact aging model HiSIM_HSiC for high-voltage SiC-MOSFETs is used in the investigation. The model considers explicitly the carrier-trap-density increase in the solution of the Poisson equation. Measured converter characteristics during a 3-phase line-to-ground (3LG) fault is correctly reproduced by the model. It is verified that the MOSFETs experience additional stress due to the high biases occurring during the fault event, which translates to severe MOSFET aging. Simulation results predict a 0.5% reduction of converter efficiency due to a single 70ms-3LG, which is equivalent to a year of operation under normal conditions, where no additional stress is applied. With the developed compact model, prediction of the efficiency degradation of the converter circuit under prolonged stress, for which measurements are difficult to obtain and typically not available, is also feasible.
Here, we present a novel spectroscopic imaging method based on the boundary-extraction scheme for wide-beam terahertz (THz) three-dimensional imaging. Optical-lens-focusing systems for THz subsurface imaging generally require the depth of the object from the surface to be input beforehand to achieve the desired azimuth resolution. This limitation can be alleviated by incorporating a wide-beam THz transmitter into the synthetic aperture to automatically change the focusing depth in the post-signal processing. The range point migration (RPM) method has been demonstrated to have significant advantages in terms of imaging accuracy over the synthetic-aperture method. Moreover, in the RPM scheme, spectroscopic information can be easily associated with each scattering center. Thus, we propose an RPM-based terahertz spectroscopic imaging method. The finite-difference time-domain-based numerical analysis shows that the proposed algorithm provides accurate target boundary imaging associated with each frequency-dependent characteristic.