This paper presents the design and implementation of an on-line quality control method for a TRNG (True Random Number Generator) on an FPGA. It is based on a TRNG with RS latches and a temporal XOR corrector, which can trade throughput against randomness quality by changing the number of XOR accumulations. The goal of our method is to increase throughput as far as possible while maintaining the quality of the output random numbers. To detect signs of quality loss in the TRNG in parallel with random number generation, our method distinguishes random bitstrings to be tested from those to be output. The test bitstring is generated with fewer accumulations than the output bitstring. The number of accumulations is increased if the test bitstring fails the randomness test. We designed and evaluated a prototype on-line quality control system using a Zynq-7000 FPGA SoC. The results indicate that the TRNG with the proposed method achieved a throughput of 1.91-2.63 Mbits/s with 16 latches, following changes in the quality of the output random numbers. The total number of logic elements in the prototype system with 16 latches was comparable to that of an existing system with 256 latches and no quality control capabilities.
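The throughput/quality trade-off of a temporal XOR corrector can be illustrated with a minimal software sketch. It assumes the raw latch output is available as a list of 0/1 values; the function name `xor_accumulate` and the parameter `n_acc` are hypothetical and do not come from the paper, whose corrector is implemented in FPGA logic.

```python
from functools import reduce
from operator import xor

def xor_accumulate(raw_bits, n_acc):
    """Fold every n_acc consecutive raw bits into one output bit by XOR.

    A larger n_acc yields fewer output bits per raw bit (lower throughput)
    in exchange for more accumulation per output bit.
    """
    if n_acc < 1:
        raise ValueError("n_acc must be at least 1")
    # Drop an incomplete trailing group so every output bit uses n_acc inputs.
    usable = len(raw_bits) - len(raw_bits) % n_acc
    return [reduce(xor, raw_bits[i:i + n_acc]) for i in range(0, usable, n_acc)]
```

Under this sketch, a test bitstring would be produced with a smaller `n_acc` than the output bitstring, so it is expected to fail a randomness test before the output bitstring degrades.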
In the Robot Operating System (ROS), a major middleware for robots, the Transform Library (TF) is a mandatory package that manages transformation information between coordinate systems by using a directed forest data structure and providing methods for registering and computing the information. However, the structure has two fundamental problems. The first is its poor scalability: because it uses a single giant lock for mutual exclusion, it admits only one thread at a time, so access to the tree is sequential. The second is a lack of data freshness: it retrieves non-latest synthetic data when computing coordinate transformations because it prioritizes temporal consistency over data freshness. In this paper, we propose methods based on transactional techniques, which allow us to avoid anomalies, achieve high performance, and obtain fresh data. These transactional methods achieve up to 429 times higher throughput than the conventional method on a read-only workload and up to 1276 times higher freshness on a read-write combined workload.
One of the performance bottlenecks of a processor is the front-end that supplies instructions. Various techniques, such as cache replacement algorithms and hardware prefetching, have been investigated to facilitate smooth instruction supply at the front-end and to improve processor performance. In these approaches, one of the most important factors has been the reduction in the number of instruction cache misses. By using the number of instruction cache misses or derived factors, previous studies have explained the performance improvements achieved by their proposed methods. However, we found that the number of instruction cache misses does not always explain performance changes well in modern processors. This is because the front-end in modern processors handles subsequent instruction cache misses in overlap with earlier ones. Based on this observation, we propose a novel factor: the number of miss regions. We define a region as a sequence of instructions from one branch misprediction to the next, while we define a miss region as a region that contains one or more instruction cache misses. At the boundary of each region, the pipeline is flushed owing to a branch misprediction. Thus, cache misses after this boundary are not handled in overlap with cache misses before the boundary. As a result, the number of miss regions is equal to the number of cache misses that are processed without overlap. In this paper, we demonstrate that the number of miss regions can well explain the variation in performance through mathematical models and simulation results. The results show that the model explains cycles per instruction with an average error of 1.0% and maximum error of 4.1% when applying an existing prefetcher to the instruction cache. The idea of miss regions highlights that instruction cache misses and branch mispredictions interact with each other in processors with a decoupled front-end. We hope that considering this interaction will motivate the development of fast performance estimation methods and new microarchitectural methods.
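The definition of a miss region can be made concrete with a short sketch that counts them in an event trace. The trace format (a list of the strings "miss" and "mispredict") and the function name `count_miss_regions` are assumptions made for illustration; the paper's simulator and its trace format are not described in the abstract. The sketch also assumes that the trailing instructions after the last misprediction form a region of their own.

```python
def count_miss_regions(events):
    """Count regions that contain at least one instruction cache miss.

    A region ends at each "mispredict" event (the pipeline flush);
    any other event label is ignored.
    """
    miss_regions = 0
    seen_miss = False  # has the current region seen a cache miss yet?
    for ev in events:
        if ev == "miss":
            seen_miss = True
        elif ev == "mispredict":
            if seen_miss:
                miss_regions += 1
            seen_miss = False  # a new region starts after the flush
    if seen_miss:  # assumed: the unfinished last region also counts
        miss_regions += 1
    return miss_regions
```

For example, a trace with four misses of which two fall in the same region yields three miss regions, which is the quantity the abstract argues tracks performance better than the raw miss count.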
Annealing computation has recently attracted attention as it can efficiently solve combinatorial optimization problems using an Ising spin-glass model. Stochastic cellular automata annealing (SCA) is a promising algorithm that can realize fast spin-update by utilizing its parallel computing capability. However, SCA requires pinning effect control to suppress the spin-flip probability, which, depending on the problem, makes escaping from local minima more difficult than in serial spin-update algorithms. This paper proposes a novel approach called APC-SCA (Autonomous Pinning effect Control SCA), where the pinning effect can be controlled autonomously by focusing on individual spin-flips. The evaluation results using max-cut, N-queen, and traveling salesman problems demonstrate that APC-SCA can obtain better solutions than the original SCA that uses pinning effect control pre-optimized by a grid search. In particular, for traveling salesman problems, we confirm that the tour distance obtained by APC-SCA is up to 56.3% closer to the best-known solution than that obtained by the conventional approach.
The Order/Radix Problem (ORP) is an optimization problem that can be solved to find an optimal network topology in distributed memory systems. It is important to find the optimum number of switches in the ORP. In the case of a regular graph, a good estimation of the preferred number of switches has been proposed, and it has been shown that simulated annealing (SA) finds a good solution given a fixed number of switches. However, the optimal graph generally does not satisfy the regularity condition, which greatly increases the computational cost required to find a good solution with a suitable number of switches for each case. This study presents an improved SA-based method to find a suitable number of switches. By introducing neighborhood searches in which the number of switches is increased or decreased, our method can optimize a graph by changing the number of switches adaptively during the search. In numerical experiments using instances presented by Graph Golf, an international ORP competition, we verified that our method closely approximates the best setting for the number of switches and can simultaneously generate a graph with a small host-to-host average shortest path length.
As more and more programs handle personal information, the demand for secure handling of data is increasing. The protocol that satisfies this demand is called secure function evaluation (SFE) and has attracted much attention from a privacy protection perspective. In two-party SFE, two mutually untrustworthy parties compute an arbitrary function on their respective secret inputs without disclosing any information other than the output of the function. For example, it is possible to execute a program while protecting private information, such as genomic information. The garbled circuit (GC), a method of program obfuscation in which the program is divided into gates and the output is calculated using a symmetric key cipher for each gate, is an efficient method for this purpose. However, GC is computationally expensive and has a significant overhead even with an accelerator. We focus on hardware acceleration because GC is limited to certain types of calculations, such as encryption and XOR. In this paper, we propose an architecture, based on the latest FPGA-based GC accelerator, that accelerates garbling by running multiple garbling engines simultaneously. In this architecture, managers are introduced to perform multiple rows of pipeline processing simultaneously. We also propose an optimized implementation of RAM for this FPGA accelerator. As a result, the architecture achieves an average performance improvement of 26% in garbling the same set of programs, compared to the state-of-the-art (SOTA) garbling accelerator.
FPGA clusters are a promising platform for future computing, not only in the cloud but also in 5G wireless base stations with limited power supply, owing to their significant power efficiency. However, almost no power analyses of real systems have been reported. This work reports detailed power consumption analyses of two FPGA clusters, the FiC and M-KUBOS clusters, by introducing power measurement tools and running real applications. From the detailed analyses, we find that the number of activated links mainly determines the total power consumption of the systems, regardless of whether the links are used. To improve application performance while reducing power consumption, we should increase the clock frequency of the applications, use the minimum number of links, and apply link aggregation. We also propose a power model for both clusters based on the results of the analyses; this model can estimate the total power consumption of both FPGA clusters at the design stage with a maximum error of 15%.
Binary Neural Networks (BNNs) have binarized neuron and connection values so that their accelerators can be realized by extremely efficient hardware. However, there is a significant accuracy gap between BNNs and networks with wider bit-widths. Conventional BNNs binarize feature maps by static, globally unified thresholds, which makes the produced bipolar image lose local details. This paper proposes a multi-input activation function to enable adaptive thresholding for binarizing feature maps: (a) At the algorithm level, instead of operating on each input pixel independently, adaptive thresholding dynamically changes the threshold according to the pixels surrounding the target pixel. When optimizing weights, adaptive thresholding is equivalent to an accompanying depth-wise convolution between the normal convolution and binarization. The accompanying weights in the depth-wise filters are ternarized and optimized end-to-end. (b) At the hardware level, adaptive thresholding is realized through a multi-input activation function, which is compatible with common accelerator architectures. Compact activation hardware with only one extra accumulator is devised. When the proposed method is implemented on an FPGA, a 4.1% accuracy improvement over the original BNN is achieved with only 1.1% extra LUT resources. Compared with state-of-the-art methods, the proposed idea further increases network accuracy by 0.8% on the CIFAR-10 dataset and 0.4% on the ImageNet dataset.
Widely adopted by machine learning and graph processing applications nowadays, sparse matrix-vector multiplication (SpMV) is a very popular algorithm in linear algebra. This is especially the case for fully-connected MLP layers, which dominate many SpMV computations and play a substantial role in diverse services. As a consequence, a large fraction of data center cycles is spent on SpMV kernels. Meanwhile, despite having efficient storage formats for sparse data (such as CSR or CSC), SpMV kernels still suffer from limited memory bandwidth during data transfer because of the memory hierarchy of modern computing systems. In more detail, we find that both integer and floating-point data used in SpMV kernels are handled plainly without any pre-processing. Therefore, we believe bandwidth conservation techniques, such as data compression, may dramatically help SpMV kernels when data is transferred between the main memory and the Last Level Cache (LLC). Furthermore, we also observe that convergence conditions in some typical scientific computation benchmarks (based on SpMV kernels) are not degraded when adopting lower-precision floating-point data. Based on these findings, in this work, we propose a simple yet effective data compression scheme that can be extended to general-purpose computing architectures or, preferably, HPC systems. When it is adopted, a best-case speedup of 1.92x is achieved. Moreover, evaluations with both the CG kernel and the PageRank algorithm indicate that our proposal introduces negligible overhead on both the convergence speed and the accuracy of final results.
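To show which data an SpMV kernel streams from memory, the following is a minimal sketch of SpMV over the standard CSR format. It is a textbook kernel, not the paper's compression scheme; the function and argument names are illustrative. The integer arrays `col_idx` and `row_ptr` and the floating-point array `values` correspond to the integer and floating-point data that the abstract says are transferred without pre-processing.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : non-zero entries of A, row by row (floating point)
    col_idx : column index of each entry in values (integer)
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]] (integer)
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

Each non-zero is read exactly once and contributes a single multiply-add, so the kernel performs little arithmetic per byte moved, which is consistent with the abstract's point that memory bandwidth is the limiting factor.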
Similarity search for data streams has attracted much attention for information recommendation. In this context, recent leading works regard the latest W items in a data stream as an evolving set and reduce similarity search for data streams to set similarity search. Whereas they consider standard sets composed of items, this paper studies similarity search for text streams and treats evolving sets whose elements are texts. Specifically, we formulate a new continuous range search problem named the CTS problem (Continuous similarity search for Text Sets). The task of the CTS problem is to find all the text streams in the database whose similarity to the query exceeds a threshold ε. It abstracts a scenario in which a user-based recommendation system searches for similar users on social networking services. The CTS problem is important because it allows both the query and the database to change dynamically. We develop a fast pruning-based algorithm for the CTS problem. Moreover, we discuss how to speed it up with an inverted index.
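The reduction used by the prior work mentioned above, treating the latest W items of a stream as an evolving set, can be sketched as follows. This illustrates only that baseline setting with plain items, not the CTS algorithm for text sets. The class name `WindowSet` is hypothetical, and Jaccard similarity is an assumed choice, since the abstract does not name the similarity measure.

```python
from collections import Counter, deque

class WindowSet:
    """Maintain the set of distinct items among the latest w stream items."""

    def __init__(self, w):
        self.w = w
        self.window = deque()    # items in arrival order
        self.counts = Counter()  # multiplicity of each item in the window

    def push(self, item):
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.w:
            old = self.window.popleft()  # expire the oldest item
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def items(self):
        return set(self.counts)

def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

A continuous range query would then report every stream whose window set has similarity to the query above the threshold ε after each update; recomputing this naively per update is the cost that a pruning-based algorithm aims to avoid.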
The significance of individuals' location information has been increasing recently, and the utilization of such data has become indispensable for businesses and society. The possible uses of location information include personalized services (maps, restaurant searches, and weather forecast services) and business decisions (deciding where to open a store). However, considering that the data could be exploited, users should add random noise using their terminals before providing location data to collectors. In numerous instances, the level of privacy protection a user requires depends on their location. Therefore, in our framework, we assume that users can specify different privacy protection requirements for each location utilizing the adversarial error (AE), and the system computes a mechanism to satisfy these requirements. To guarantee some utility for data analysis, the maximum error in outputting the location should also be output. In most privacy frameworks, the mechanism for adding random noise is public; however, in this problem setting, the privacy protection requirements and the mechanism must be confidential because they include sensitive information. We propose two mechanisms to address privacy personalization. The first mechanism is the individual exponential mechanism, which uses the exponential mechanism in the differential privacy framework. However, in the individual exponential mechanism, the maximum error for each output can be used to narrow down candidates of the actual location by observing outputs from the same location multiple times. The second mechanism, called the donut mechanism, addresses this deficiency: it uniformly outputs a random location near the points whose distance from the user's actual location equals the user-specified AE distance. Considering potential attacks against the donut mechanism that utilize the maximum error, we extended the mechanism to counter these attacks. We compare these two mechanisms through experiments using maps constructed from artificial and real-world data.
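The geometric idea behind the donut mechanism can be sketched as sampling uniformly from a ring centered on the true location. This is only an illustration under strong assumptions: it works in a continuous plane, whereas the paper evaluates on maps; the ring width `width` is a hypothetical parameter; and the extension against maximum-error attacks is not modeled.

```python
import math
import random

def donut_sample(x, y, ae, width, rng=random):
    """Return a point drawn uniformly (by area) from the ring of radii
    [ae - width / 2, ae + width / 2] centered on the true location (x, y)."""
    r_in = max(0.0, ae - width / 2)
    r_out = ae + width / 2
    # Sampling r^2 uniformly makes the point density uniform over the ring area.
    r = math.sqrt(rng.uniform(r_in ** 2, r_out ** 2))
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return x + r * math.cos(theta), y + r * math.sin(theta)
```

Because every output lies at roughly the AE distance from the true location, repeated outputs do not concentrate around the true location itself, which is the intuition behind the name.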
Universities collect and process a massive amount of Personally Identifiable Information (PII) at registration and throughout their interactions with individuals. However, student PII can be exposed to the public when documents are uploaded along with university notices without consent or awareness, which could put individuals at risk of a variety of scams, such as identity theft, fraud, or phishing. In this paper, we perform an in-depth analysis of student PII leakage at Vietnamese universities. To the best of our knowledge, we are the first to conduct a comprehensive study on student PII leakage in higher educational institutions. We find that 52.8% of Vietnamese universities leak student PII, including one or more types of personal data, in documents on their websites. It is important to note that the compromised PII includes sensitive types of data, such as student medical records and religion. Also, student PII leakage is not a new phenomenon; it has happened year after year since 2005. Finally, we present a study with 23 Vietnamese university employees who have worked on student PII to gain a deeper understanding of this situation and envisage concrete solutions. The results are surprising: the employees are highly aware of the concept of student PII. However, student PII leakage still happens due to their working habits or the lack of a management system and regulations. Therefore, Vietnamese universities should take a more active stand to protect student data in this situation.
The aim of the computer-aided drawing therapy system in this work is to associate the drawings a client makes with the client's mental state in quantitative terms. A case study is conducted on experimental data that contain both pastel drawings and mental state scores obtained from the same client in a psychotherapy program. To perform such association through colors, we translate a drawing into a color feature by measuring its representative colors as primary color rates. A primary color rate of a color is defined with respect to a psychological primary color such that it expresses the rate at which the emotional properties of that psychological primary color are supposed to affect the color. To obtain several informative colors as representative ones of a drawing, we define two kinds of color: approximate colors extracted by color reduction, and area-averaged colors calculated from the approximate colors. A color analysis method for extracting representative colors from each drawing in a drawing sequence under the same conditions is presented. To estimate how closely a color feature is associated with a concurrent mental state, we propose a method that utilizes machine-learning classification. A practical way of building a classification model through training and validation on a very small dataset is presented. The classification accuracy reached by the model is considered as the degree of association of the color feature with the mental state scores given in the dataset. Experiments were carried out on given clinical data. Several kinds of color feature were compared in terms of their association with the same mental state. As a result, we identified the color feature with the highest degree of association. Also, primary color rates proved more effective in representing colors in psychological terms than RGB components. The experiments provide evidence that colors can be associated quantitatively with states of the human mind.
Recently, global navigation satellite system (GNSS) positioning has been widely used in various applications (e.g., car navigation systems, smartphone map applications, and autonomous driving). In GNSS positioning, coordinates are calculated from observed satellite signals. The observed signals contain various errors, so the calculated coordinates also contain errors. Double-difference is one of the widely used techniques to reduce the errors of the observed signals. Although double-difference can remove many kinds of errors from the observed signals, some errors still remain (e.g., multipath error). In this paper, we define the remaining error as “double-difference-error (DDE)” and propose a method for estimating DDE using machine learning. In addition, we attempt to improve differential GNSS (DGNSS) positioning by feeding back the estimated DDE. Previous research applying machine learning to GNSS has focused on classifying whether the signal is LOS (Line Of Sight) or NLOS (Non Line Of Sight), and, as far as we know, no study has attempted to estimate the amount of error itself. Furthermore, previous studies had the limitation that their datasets were recorded at only a few locations in the same city. This is because these studies mainly aimed at improving the positioning accuracy of vehicles, and collecting large amounts of data using vehicles is costly. To avoid this problem, in this research, we use a large amount of openly available stationary point data for training. Through experiments, we confirmed that the proposed method can reduce the DGNSS positioning error. Even though the DDE estimator was trained only on stationary point data, the proposed method improved the DGNSS positioning accuracy not only for stationary points but also for a mobile rover. In addition, by comparing with the previous (detect and remove) approach, we confirmed the effectiveness of the DDE feedback approach.
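The double-difference operation referred to above can be sketched in its standard textbook form: a between-receiver difference followed by a between-satellite difference. The dictionary-based interface and the names `rover`, `base`, `sat`, and `ref_sat` are assumptions made for illustration; the abstract does not describe the paper's data format or its DDE estimator.

```python
def double_difference(rover, base, sat, ref_sat):
    """Double-differenced observation for satellite `sat` against `ref_sat`.

    rover, base : dicts mapping a satellite id to the range observed by
                  that receiver (same unit for all entries, e.g. meters)
    """
    # Between-receiver single differences cancel each satellite's clock error.
    sd_sat = rover[sat] - base[sat]
    sd_ref = rover[ref_sat] - base[ref_sat]
    # Differencing across satellites cancels both receiver clock errors.
    return sd_sat - sd_ref
```

Errors that differ by both receiver and satellite, such as multipath, do not cancel in this operation; that residual is what the abstract calls the DDE.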
In this paper, we apply two machine learning methods, dropout and semi-supervised learning, to a recently proposed method called CSQ-SDL, which uses deep neural networks to evaluate shift quality from time-series measurement data. When developing a new Automatic Transmission (AT), calibration takes place, in which many parameters of the AT are adjusted to realize a pleasant driving experience in all situations that occur on roads around the world. Calibration requires an expert to visually assess the shift quality from the time-series measurement data of the experiments each time the parameters are changed, which is iterative and time-consuming. CSQ-SDL was developed to shorten the time consumed by the visual assessment, and its effectiveness depends on acquiring a sufficient number of data points. In practice, however, the amount of data is often insufficient. The methods proposed here can handle such cases. For cases in which only a small number of labeled data points is available, we propose a method that uses dropout. For cases in which the number of labeled data points is small but the number of unlabeled data points is sufficient, we propose a method that uses semi-supervised learning. Experiments show that while the former gives a moderate improvement, the latter offers a significant performance improvement.
Image-to-image translation aims to learn a mapping between the source and target domains. To improve visual quality, the majority of previous works adopt multi-stage techniques to refine coarse results in a progressive manner. In this work, we present a novel approach for generating plausible details by introducing only a group of intermediate supervisions, without cascading multiple stages. Specifically, we propose a Laplacian Pyramid Transformation Generative Adversarial Network (LapTransGAN) to simultaneously transform components at different frequencies from the source domain to the target domain within a single stage. Hierarchical perceptual and gradient penalization are utilized for learning consistent semantic structures and details at each pyramid level. The proposed model is evaluated based on various metrics, including the similarity in feature maps, reconstruction quality, segmentation accuracy, similarity in details, and qualitative appearance. Our experiments show that LapTransGAN can achieve much better quantitative performance than both the supervised pix2pix model and the unsupervised CycleGAN model. Comprehensive ablation experiments are conducted to study the contribution of each component.
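The frequency decomposition that a Laplacian pyramid provides can be sketched on a 1-D signal for brevity. This shows only the classical pyramid, not LapTransGAN itself. The pair-averaging downsampler and the sample-repeating upsampler are simplifying assumptions (image pyramids normally use Gaussian filtering), and the signal length is assumed to be divisible by 2 to the power of `levels`.

```python
def downsample(sig):
    """Halve the resolution by averaging adjacent pairs."""
    return [(sig[i] + sig[i + 1]) / 2 for i in range(0, len(sig) - 1, 2)]

def upsample(sig):
    """Double the resolution by repeating every sample."""
    return [v for v in sig for _ in range(2)]

def build_pyramid(sig, levels):
    """Return [band_0, ..., band_{levels-1}, residual], finest band first."""
    bands, cur = [], list(sig)
    for _ in range(levels):
        low = downsample(cur)
        # The band keeps the detail that the coarser level cannot represent.
        bands.append([a - b for a, b in zip(cur, upsample(low))])
        cur = low
    bands.append(cur)  # low-frequency residual
    return bands

def reconstruct(bands):
    """Invert build_pyramid by adding the bands back, coarse to fine."""
    cur = bands[-1]
    for band in reversed(bands[:-1]):
        cur = [a + b for a, b in zip(upsample(cur), band)]
    return cur
```

Because the decomposition is exactly invertible, a model that translates each band separately can still be supervised at every pyramid level and then recombined into a full-resolution result.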
Text detection is a crucial pre-processing step in optical character recognition (OCR) for the accurate recognition of text, including both fonts and handwritten characters, in documents. While current deep learning-based text detection tools can detect text regions with high accuracy, they often treat multiple lines of text as a single region. To perform line-based character recognition, it is necessary to divide the text into individual lines, which requires a line detection technique. This paper focuses on the development of a new approach to single-line detection in OCR that is based on the existing Character Region Awareness For Text detection (CRAFT) model and incorporates a deep neural network specialized in line segmentation. However, this new method may still detect multiple lines as a single text region when multi-line text with narrow spacing is present. To address this, we also introduce a post-processing algorithm to detect single text regions using the output of the single-line segmentation. Our proposed method successfully detects single lines, even in multi-line text with narrow line spacing, and hence improves the accuracy of OCR.
Recent studies have shown that concurrent transmission with precise time synchronization enables reliable and efficient flooding in wireless networks. However, most of them require all nodes in the network to forward packets a fixed number of times to reach the destination, which leads to unnecessary energy consumption in both one-to-one and many-to-one communication scenarios. In this letter, we propose G1M to address this issue by reducing redundant packet forwarding in concurrent transmissions. The evaluation of G1M shows that, compared with LWB, the average energy consumption of one-to-one and many-to-one transmission is reduced by 37.89% and 25%, respectively.