Barcode reading mobile applications to identify products from pictures acquired by mobile devices are widely used by customers from all over the world to perform online price comparisons or to access reviews written by other customers. Most of the currently available 1D barcode reading applications focus on effectively decoding barcodes and treat the underlying detection task as a side problem that needs to be solved using general purpose object detection methods. However, the majority of mobile devices do not meet the minimum working requirements of those complex general purpose object detection algorithms and most of the efficient specifically designed 1D barcode detection algorithms require user interaction to work properly. In this work, we present a novel method for 1D barcode detection in camera captured images, based on a supervised machine learning algorithm that identifies the characteristic visual patterns of 1D barcodes' parallel bars in the two-dimensional Hough Transform space of the processed images. The method we propose is angle invariant, requires no user interaction and can be effectively executed on a mobile device; it achieves excellent results for two standard 1D barcode datasets: WWU Muenster Barcode Database and ArTe-Lab 1D Medium Barcode Dataset. Moreover, we prove that it is possible to enhance the performance of a state-of-the-art 1D barcode reading library by coupling it with our detection method.
Features play a crucial role in the performance of classifiers for object detection in high-resolution remote sensing images. In this paper, we implemented two types of deep learning methods, deep convolutional neural network (DNN) and deep belief net (DBN), and compared their performance with that of traditional methods (handcrafted features with a shallow classifier) in the task of aircraft detection. These methods learn robust features from a large set of training samples to obtain better performance. The depth of their layers (>6 layers) grants them the ability to extract stable and large-scale features from the image. Our experiments show that both deep learning methods reduce the false alarm rate of the traditional methods (HOG, LBP+SVM) by at least 40%, and that DNN performs slightly better than DBN. We also fed several differently preprocessed images simultaneously to one DNN model, and found that this practice clearly improves the performance of the model with no additional computational burden.
We propose a human lower body pose estimation method for team sport videos, which is integrated with a tracking-by-detection technique. The proposed Label-Grid classifier uses the grid histogram feature of the tracked window from the tracker and estimates the lower body joint position of a specific joint as the class label of the multi-class classifiers, whose classes correspond to the candidate joint positions on the grid. By learning various types of player poses and scales of Histogram-of-Oriented Gradients features within one team sport, our method can estimate poses even if the player images are motion-blurred and low-resolution, without requiring a motion-model regression or part-based model, which are popular vision-based human pose estimation techniques. Moreover, our method can estimate poses with part-occlusions and non-upright side poses, which part-detector-based methods find difficult to estimate with only one model. Experimental results show the advantage of our method for side running poses and non-walking poses. The results also show the robustness of our method for a large variety of poses and scales in team sports videos.
Processing Ultra High Definition TV videos requires a lot of resources in terms of memory and computation time. In this paper we consider a block-propagation background subtraction (BPBGS) method which spreads to neighboring blocks if a part of an object is detected on the borders of the current block. This allows us to avoid processing unnecessary areas which do not contain any object, thus saving memory and computational time. The results show that our method is particularly efficient in sequences where objects occupy a small portion of the scene, even when there are a lot of background movements. At the same scale, our BPBGS performs much faster than the state-of-the-art methods for a similar detection quality.
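The block-propagation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the seed blocks, the simple per-pixel differencing against a static background, the threshold value, and the function name `block_propagation_bgs` are all assumptions made for illustration. Only the propagation rule (visit a neighboring block when foreground touches the shared border) comes from the abstract.

```python
from collections import deque

def block_propagation_bgs(frame, background, block, seeds, thresh=30):
    """Return (foreground mask, set of processed block coordinates).

    frame, background: 2D lists of grayscale values with equal size.
    block: block side length; image sides are assumed divisible by it.
    seeds: hypothetical list of (block_row, block_col) starting blocks.
    """
    h, w = len(frame), len(frame[0])
    rows, cols = h // block, w // block
    mask = [[0] * w for _ in range(h)]
    queue = deque(seeds)
    visited = set(seeds)
    while queue:
        br, bc = queue.popleft()
        y0, x0 = br * block, bc * block
        y1, x1 = y0 + block, x0 + block
        # Background subtraction restricted to the current block
        # (plain differencing is an assumption, not the paper's model).
        for y in range(y0, y1):
            for x in range(x0, x1):
                if abs(frame[y][x] - background[y][x]) > thresh:
                    mask[y][x] = 1
        # For each neighbor, check whether foreground lies on the shared border.
        borders = {
            (br - 1, bc): any(mask[y0][x] for x in range(x0, x1)),
            (br + 1, bc): any(mask[y1 - 1][x] for x in range(x0, x1)),
            (br, bc - 1): any(mask[y][x0] for y in range(y0, y1)),
            (br, bc + 1): any(mask[y][x1 - 1] for y in range(y0, y1)),
        }
        for (nr, nc), touched in borders.items():
            inside = 0 <= nr < rows and 0 <= nc < cols
            if touched and inside and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append((nr, nc))
    return mask, visited
```

Blocks that are never reached are never read, which is where the memory and time savings would come from when objects cover only a small part of the scene.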
Saliency maps as visual attention computational models can reveal novel regions within a scene (as in the human visual system), which can decrease the amount of data to be processed in task specific computer vision applications. Most of the saliency computation models do not take advantage of prior spatial memory by giving priority to spatial or object based features to obtain bottom-up or top-down saliency maps. In our previous experiments, we demonstrated that spatial memory regardless of object features can aid detection and tracking tasks with a mobile robot by using a 2D global environment memory of the robot and local Kinect data in 2D to compute the space-based saliency map. However, in complex scenes where 2D space-based saliency is not enough (e.g., a subject lying on a bed), 3D scene analysis is necessary to extract novelty within the scene by using spatial memory. Therefore, in this work, to improve the detection of novelty in a known environment, we propose a space-based spatial saliency with 3D local information by improving 2D space-based saliency with height as prior information about the specific locations. Moreover, the algorithm can also be integrated with other bottom-up or top-down saliency computational models to improve the detection results. Experimental results demonstrate that high accuracy for novelty detection can be obtained, and computational time can be reduced for existing state-of-the-art detection and tracking models with the proposed algorithm.
Dichromats are color-blind persons missing one of the three cone systems. We consider a computer simulation of color confusion for dichromats for any colors on any video device, which transforms the color in each pixel into a representative color among the set of its confusion colors. As a guiding principle of the simulation we adopt the proportionality law between the pre-transformed and post-transformed colors, which ensures that the same colors are not transformed to two or more different colors apart from intensity. We show that such a simulation algorithm with the proportionality law is unique for video displays whose projected gamut onto the plane perpendicular to the color confusion axis in the LMS space is a hexagon. Almost all video displays, including sRGB, satisfy this condition, and we demonstrate this unique simulation on an sRGB video display. As a corollary we show that it is impossible to build an appropriate algorithm if we demand the additivity law, which is mathematically stronger than the proportionality law and enables the additive mixture among post-transformed colors as well as for dichromats.
We propose a new image denoising method with shrinkage. In the proposed method, small blocks in an input image are projected to the space that makes projection coefficients sparse, and the explicitly evaluated sparsity degree is used to control the shrinkage threshold. On average, the proposed method obtained higher quantitative evaluation values (PSNRs and SSIMs) compared with one of the state-of-the-art methods in the field of image denoising. The proposed method removes random noise effectively from natural images while preserving intricate textures.
In this paper, we propose a fast and accurate object detection algorithm based on binary co-occurrence features. In our method, co-occurrences of all the possible pairs of binary elements in a block of binarized HOG are enumerated by logical operations, e.g., circular shift and XOR. This results in extremely fast co-occurrence extraction. Our experiments revealed that our method can process a VGA-size image at 64.6 fps, which is two times faster than the camera frame rate (30 fps), on only a single core of a CPU (Intel Core i7-3820 3.60 GHz), while at the same time achieving a higher classification accuracy than the original (real-valued) HOG in a pedestrian detection task.
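A minimal Python sketch of how circular shift and XOR can enumerate pairwise relations in a binary block is given here. It assumes a block of binarized HOG is packed into a single `width`-bit integer; the packing, the choice of shift range, and the function names `rotl` and `binary_cooccurrence` are illustrative assumptions rather than the paper's actual feature layout.

```python
def rotl(bits, shift, width):
    """Circularly rotate a width-bit integer left by shift positions."""
    shift %= width
    full = (1 << width) - 1
    return ((bits << shift) | (bits >> (width - shift))) & full

def binary_cooccurrence(bits, width):
    """Compare every pair of bits in a packed binary block.

    For shift s, bit i of (bits XOR rotl(bits, s)) is 1 when bit i and
    bit (i - s) mod width differ. Shifts 1..width//2 cover every
    unordered pair of positions at least once.
    """
    feats = []
    for s in range(1, width // 2 + 1):
        feats.append(bits ^ rotl(bits, s, width))
    return feats
```

For example, `binary_cooccurrence(0b1011, 4)` yields `[0b1100, 0b0101]`. Each shift handles `width` pairs with two machine-level operations, which is consistent with the speed the abstract reports, although the real implementation may organize the computation differently.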
This paper presents a mobile Lidar system for efficiently and accurately capturing the 3D shape of the bas-reliefs in Angkor Wat. The sensor system consists of two main components: 1) a panoramic camera and 2) a 2D 360-degree laser line scanner, which moves slowly on rails parallel to the reliefs. In this paper, we first propose a new but simple method to accurately calibrate the panoramic camera to the 2D laser scan lines. Then the sensor motion can be estimated from the sensor-fused system using a 2D/3D feature tracking method. Furthermore, to reduce the drifting error of the sensor motion, we adopt bundle adjustment to globally optimize and smooth the moving trajectories. In experiments, we demonstrate that our moving Lidar system achieves substantially better accuracy and efficiency than the traditional stop-and-go methods.
In this paper, we propose an audio-visual speech recognition system for a person with an articulation disorder resulting from severe hearing loss. In the case of a person with this type of articulation disorder, the speech style is quite different from that of people without hearing loss, with the result that a speaker-independent model for unimpaired persons is hardly useful for recognizing it. We investigate in this paper an audio-visual speech recognition system for a person with severe hearing loss in noisy environments, where a robust feature extraction method using a convolutive bottleneck network (CBN) is applied to audio-visual data. We confirmed the effectiveness of this approach through word-recognition experiments in noisy environments, where the CBN-based feature extraction method outperformed the conventional methods.
Nowadays, the design of the representation of images is one of the most crucial factors in the performance of visual categorization. A common pipeline employed in most recent studies for obtaining an image representation consists of two steps: the encoding step and the pooling step. In this paper, we introduce the Mahalanobis metric to the two popular image patch encoding modules, Histogram Encoding and Fisher Encoding, that are used for the Bag-of-Visual-Word method and the Fisher Vector method, respectively. Moreover, for the proposed Fisher Vector method, a closed-form approximation of the Fisher Vector can be derived with the same assumption used in the original Fisher Vector, and the codebook is built without resorting to time-consuming EM (Expectation-Maximization) steps. Experimental evaluation of multi-class classification demonstrates the effectiveness of the proposed encoding methods.
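To make the role of the metric concrete, the following Python sketch performs hard-assignment histogram encoding with a Mahalanobis distance instead of the Euclidean one. It assumes each codeword carries its own diagonal covariance; that assumption, the `(mean, var)` codebook layout, and the function names are illustrative and are not taken from the paper.

```python
def mahalanobis_sq(x, mean, var):
    """Squared Mahalanobis distance under a diagonal covariance (assumed)."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))

def histogram_encode(descriptors, codebook):
    """Encode patch descriptors as a normalized codeword histogram.

    codebook: list of (mean, var) pairs, one per visual word.
    Each descriptor votes for the codeword with the smallest
    Mahalanobis distance.
    """
    hist = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda j: mahalanobis_sq(d, *codebook[j]))
        hist[nearest] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

With codewords `((0, 0), (1, 1))` and `((4, 0), (16, 1))`, the descriptor `(1.5, 0)` is closer to the first mean in Euclidean terms but is assigned to the second codeword, because that codeword has a much larger variance along the first axis.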
Most gait recognition approaches rely on silhouette-based representations due to high recognition accuracy and computational efficiency, and a key problem for those approaches is how to accurately extract individuality-preserved silhouettes from real scenes, where foreground colors may be similar to background colors and the background is cluttered. We therefore propose a method of individuality-preserving silhouette extraction for gait recognition using standard gait models (SGMs) composed of clean silhouette sequences of a variety of training subjects as a shape prior. We first match the multiple SGMs to a background subtraction sequence of a test subject by dynamic programming and select the training subject whose SGM fits the test sequence the best. We then formulate our silhouette extraction problem in a well-established graph-cut segmentation framework while considering a balance between the observed test sequence and the matched SGM. More specifically, we define an energy function to be minimized by the following three terms: (1) a data term derived from the observed test sequence, (2) a smoothness term derived from spatio-temporally adjacent edges, and (3) a shape-prior term derived from the matched SGM. We demonstrate through experiments using 56 subjects that the proposed method successfully extracts individuality-preserved silhouettes and improves gait recognition accuracy.
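The three-term energy can be written schematically as below. The notation is an assumption made for illustration: $\mathbf{x}$ stands for the foreground/background labeling over the whole sequence, and the weights $\lambda_{s}$ and $\lambda_{p}$ are hypothetical balancing parameters; the abstract names the three terms but does not give their exact form.

```latex
E(\mathbf{x}) = E_{\mathrm{data}}(\mathbf{x})
              + \lambda_{s}\, E_{\mathrm{smooth}}(\mathbf{x})
              + \lambda_{p}\, E_{\mathrm{shape}}(\mathbf{x})
```

Under this reading, a larger $\lambda_{p}$ pulls the result toward the matched SGM, while a smaller one trusts the observed test sequence more.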
This paper presents automatic Martian dust storm detection from multiple wavelength data based on decision level fusion. In our proposed method, visual features are first extracted from multiple wavelength data, and optimal features are selected for Martian dust storm detection based on the minimal-Redundancy-Maximal-Relevance algorithm. Second, the selected visual features are used to train Support Vector Machine classifiers, one constructed for each type of data. Furthermore, as a main contribution of this paper, the proposed method integrates the multiple detection results obtained from heterogeneous data based on decision level fusion, while considering each classifier's detection performance to obtain accurate final detection results. Consequently, the proposed method realizes successful Martian dust storm detection.
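One simple form of decision level fusion that takes classifier performance into account is a weighted vote, sketched below in Python. The abstract does not state the fusion rule, so the weighting by validation accuracy and the function name `fuse_decisions` are assumptions for illustration only.

```python
def fuse_decisions(decisions, accuracies):
    """Fuse per-classifier decisions by a performance-weighted vote.

    decisions: +1 (dust storm) or -1 (no dust storm), one per classifier.
    accuracies: hypothetical validation accuracies used as vote weights.
    Returns +1 when the weighted sum is positive, otherwise -1.
    """
    score = sum(a * d for d, a in zip(decisions, accuracies))
    return 1 if score > 0 else -1
```

With decisions `[1, -1, -1]` and accuracies `[0.9, 0.3, 0.4]`, the fused result is `1`: the single reliable classifier outweighs the two weaker ones, whereas an unweighted majority vote would return `-1`.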
In this work, we propose a simple yet effective method for improving the performance of local feature matching among equirectangular cylindrical images, which brings more stable and complete 3D reconstruction by incremental SfM. The key idea is to explicitly generate synthesized images by rotating the spherical panoramic images and to detect and describe features only from the less distorted area in the rectified panoramic images. We demonstrate that the proposed method is advantageous for both rotational and translational camera motions compared with the standard methods on the synthetic data. We also demonstrate that the proposed feature matching is beneficial for incremental SfM through experiments on the Pittsburgh Research dataset.
An aerial 3D laser scanner is needed for scanning areas that cannot be observed from the ground. Since laser scanning takes time, the obtained range data is distorted due to the sensor motion while scanning. This paper presents a rectification method for the distorted range data by aligning each scan line to the 3D data obtained from the ground. To avoid the instability and ambiguity of the line-based alignment, the parameters to be optimized are selected alternately, and a smoothness constraint is introduced by assuming that the sensor motion is smooth. The experimental results show that the proposed method achieves good accuracy on both simulated and actual data.
This paper investigates the performance of silhouette-based and depth-based gait authentication considering practical sensor settings where sensors are installed in an environment after the fact and usually have to be located quite near to people. To realize a fair comparison between different sensors and methods, we construct full-body volumes of walking people in a multi-camera environment so as to reconstruct virtual silhouette and depth images at arbitrary sensor positions. In addition, we also investigate the performance when we have to authenticate between frontal and rear views. Experimental results confirm that the depth-based methods outperform the silhouette-based ones in realistic situations. We also confirm that by introducing the Depth-based Gait Feature, we can authenticate between the frontal and rear views.
Facial part labeling, which parses a face into its semantic components, enables high-level facial image analysis and contributes greatly to face recognition, expression recognition, animation, and synthesis. In this paper, we propose a cost-alleviative learning method that uses a weighted cost function to improve the performance of certain classes during facial part labeling. As the conventional cost function handles the error in all classes equally, the error in a class with a slightly biased prior probability tends not to be propagated. The weighted cost function enables the training coefficient for each class to be adjusted. In addition, the boundaries of each class may be recognized after fewer iterations, which will improve the performance. In facial part labeling, the recognition performance of the eye class can be significantly improved using cost-alleviative learning.
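The idea of weighting the cost per class can be sketched in Python as a class-weighted cross-entropy. This is a generic formulation under stated assumptions, not the paper's exact cost function: the use of cross-entropy, the weight values, and the function name `weighted_cross_entropy` are illustrative.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy where each sample's loss is scaled by its class weight.

    probs: per-sample lists of predicted class probabilities.
    labels: true class index for each sample.
    class_weights: hypothetical per-class coefficients; raising the weight
    of a rare class (e.g. eyes) makes its errors contribute more.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -class_weights[y] * math.log(p[y])
    return total / len(labels)
```

With uniform weights this reduces to the ordinary cross-entropy, which treats errors in all classes equally; increasing one class's weight scales the loss, and therefore the gradient, for samples of that class.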
Facial expression recognition (FER) is a crucial technology and a challenging task for human-computer interaction. Previous methods have used different feature descriptors for FER, and comparative studies are lacking. In this paper, we aim to identify the best feature descriptor for FER by empirically evaluating five feature descriptors, namely Gabor, Haar, Local Binary Pattern (LBP), Histogram of Oriented Gradients (HOG), and Binary Robust Independent Elementary Features (BRIEF) descriptors. We examine each feature descriptor by considering six classification methods, including k-Nearest Neighbors (k-NN), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), and Adaptive Boosting (AdaBoost), with four unique facial expression datasets. In addition to test accuracies, we present confusion matrices of FER. We also analyze the effect of combined features and image resolutions on FER performance. Our study indicates that the HOG descriptor works the best for FER when the image resolution of a detected face is higher than 48×48 pixels.
We propose a per-frame upper body pose estimation method for sports players captured in low-resolution team sports videos. Using the head-center-aligned upper body region appearance in each frame from the head tracker, our framework estimates (1) 2D spine pose, composed of the head center and the pelvis center locations, and (2) the orientation of the upper body in each frame. Our framework is composed of three steps. In the first step, the head region of the subject player is tracked with a standard tracking-by-detection technique for upper body appearance alignment. In the second step, the relative pelvis center location from the head center is estimated by our newly proposed poselet-regressor in each frame to obtain spine angle priors. In the last step, the body orientation is estimated by the upper body orientation classifier selected by the spine angle range. Owing to the alignment of the body appearance and the usage of multiple body orientation classifiers conditioned by the spine angle prior, our method can robustly estimate the body orientation of a player with a large variation of visual appearances during a game, even during side-poses or self-occluded poses. We tested the performance of our method in both American football and soccer videos.
In the realm of multi-modal visual recognition, the reliability of the data acquisition system is often a concern due to the increased complexity of the sensors. One of the major issues is the accidental loss of one or more sensing channels, which poses a major challenge to current learning systems. In this paper, we examine one of these specific missing data problems, where we have a main modality/view along with an auxiliary modality/view present in the training data, but merely the main modality/view in the test data. To effectively leverage the auxiliary information to train a stronger classifier, we propose a collaborative auxiliary learning framework based on a new discriminative canonical correlation analysis. This framework reveals a common semantic space shared across both modalities/views through enforcing a series of nonlinear projections. Such projections automatically embed the discriminative cues hidden in both modalities/views into the common space, and better visual recognition is thus achieved on the test data. The efficacy of our proposed auxiliary learning approach is demonstrated through four challenging visual recognition tasks with different kinds of auxiliary information.
This paper presents a new application that improves communication between digital media and customers at a point of sale. The system uses several methods from various areas of computer vision such as motion detection, object tracking, behavior analysis and recognition, semantic description of behavior, and scenario recognition. Specifically, the system is divided into three parts: low-level, mid-level, and high-level analysis. Low-level analysis detects and tracks moving objects in the scene. Then mid-level analysis describes and recognizes the behavior of the tracked objects. Finally, high-level analysis produces a semantic interpretation of the detected behavior and recognizes predefined scenarios. Our research aims to build a real-time application that recognizes human behaviors while shopping. Specifically, the system detects customer interests and interactions with various products at a point of sale.
In moving camera videos, motion segmentation is often achieved by determining the motion coherence of each moving object. However, it is a nontrivial task on optical flow due to two problems: 1) The optical flow of camera motion in the 3D world consists of three primary 2D motion flows: translation, rotation, and radial flow. Their coherence analysis is done by a variety of models, and further requires plenty of priors in existing frameworks; 2) A moving camera introduces 3D motion, and depth discontinuities cause motion discontinuities that severely break down the coherence. Meanwhile, the mixture of the camera motion and moving objects' motions makes it difficult to clearly identify foreground and background. In this work, our solution is to transform the optical flow into a potential space where the coherence of the background flow field is easily modeled by a low order polynomial. To this end, we first amend the Helmholtz-Hodge Decomposition by adding coherence constraints, which can transform translation, rotation, and radial flow fields to two potential surfaces under a unified framework. Secondly, we introduce an Incoherence Map and a progressive Quad-Tree partition to reject moving objects and motion discontinuities. Finally, the low order polynomial is obtained from the remaining flow samples on the two potentials. We present results on more than twenty videos from four benchmarks. Extensive experiments demonstrate better performance in dealing with challenging scenes with complex backgrounds. Our method improves the segmentation accuracy of state-of-the-art methods by 10%∼30%.
We propose a novel interest point detector stemming from the intuition that image patches which are highly dissimilar over a relatively large extent of their surroundings hold the property of being repeatable and distinctive. This concept of contextual self-dissimilarity reverses the key paradigm of recent successful techniques such as the Local Self-Similarity descriptor and the Non-Local Means filter, which build upon the presence of similar - rather than dissimilar - patches. Moreover, our approach extends to contextual information the local self-dissimilarity notion embedded in established detectors of corner-like interest points, thereby achieving enhanced repeatability, distinctiveness and localization accuracy. As the key principle and machinery of our method are amenable to a variety of data kinds, including multi-channel images and organized 3D measurements, we delineate how to extend the basic formulation in order to deal with range and RGB-D images, such as those provided by consumer depth cameras.
Local feature detection has been an essential part of many methods for computer vision applications like large scale image retrieval, object detection, or tracking. Recently, structure-guided feature detectors have been proposed, exploiting image edges to accurately capture local shape. Among them, the WαSH detector [Varytimidis et al., 2012] starts from sampling binary edges and exploits α-shapes, a computational geometry representation that describes local shape in different scales. In this work, we propose a novel image sampling method, based on dithering smooth image functions other than intensity. Samples are extracted on image contours representing the underlying shapes, with sampling density determined by image functions like the gradient or Hessian response, rather than being fixed. We thoroughly evaluate the parameters of the method, and achieve state-of-the-art performance on a series of matching and retrieval experiments.