A two-dimensional continuous dynamic programming (2DCDP) method is proposed for two-dimensional (2D) spotting recognition of images. Spotting recognition is the simultaneous segmentation and recognition of an image by optimal pixel matching between a reference image and an input image. The proposed method performs optimal pixel-wise image matching and 2D pixel alignment, which are not available in conventional algorithms. Experimental results show that 2DCDP precisely matches the pixels of nonlinearly deformed images.
In this article we investigate ‘real-time’ watermarking of single-sensor digital camera images (often called ‘raw’ images) and blind watermark detection in demosaicked images. We describe the software-only implementation of simple additive spread-spectrum embedding in the firmware of a digital camera. For blind watermark detection, we develop a scheme which adaptively combines the polyphase components of the demosaicked image, taking advantage of the interpolated image structure. Experimental results show the benefits of the novel detection approach for several demosaicking techniques.
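The additive spread-spectrum embedding and correlation-based blind detection described above can be sketched as follows. This is a minimal illustration only: the key, the strength `alpha`, and the plain correlation detector are assumptions for the sketch, and the CFA/demosaicking-aware combination of polyphase components that the article actually develops is not modeled here.

```python
import numpy as np

def embed_watermark(image, key, alpha=3.0):
    """Add a key-dependent pseudo-random +/-1 pattern, scaled by strength alpha."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=image.shape)
    return np.clip(image.astype(np.float64) + alpha * pattern, 0, 255)

def detect_watermark(image, key):
    """Blind detection: correlate the mean-removed image with the regenerated pattern."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=image.shape)
    img = image.astype(np.float64)
    return float(np.mean((img - img.mean()) * pattern))

# Usage: the score is near alpha for a marked image, near zero for an unmarked one.
host = np.random.default_rng(0).integers(0, 256, size=(64, 64)).astype(np.float64)
marked = embed_watermark(host, key=42)
score_marked = detect_watermark(marked, key=42)
score_clean = detect_watermark(host, key=42)
```

The detector never sees the original image; it only regenerates the pattern from the shared key, which is what makes the detection blind.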
This paper presents a novel visual speech recognition (VSR) system based on Hidden Markov Models (HMMs) and a new representation that extends the standard viseme concept, referred to in this paper as the Visual Speech Unit (VSU). Visemes are regarded as the smallest elements of visual speech and have been widely applied to visual speech modeling, but they are problematic when applied to continuous visual speech recognition. To circumvent these problems, we propose a new visual speech representation that includes not only the data associated with the articulation of each viseme but also the transitory information between consecutive visemes. To fully evaluate the appropriateness of the proposed representation, we conducted an extensive set of experiments comparing the performance of visual speech units with that of the standard MPEG-4 visemes. The experimental results indicate that the developed VSR application achieved up to 90% correct recognition when applied to the identification of 60 classes of VSUs, whereas the recognition rate for the standard set of MPEG-4 visemes was only in the range of 62-72%.
This paper addresses the detection of humans in images. We propose a method for extracting feature descriptors consisting of co-occurrence histograms of oriented gradients (CoHOG). By including co-occurrences at various positional offsets, the descriptors can express complex object shapes through local and global distributions of gradient orientations. Our method is evaluated with a simple linear classifier on two well-known human detection benchmarks: the DaimlerChrysler pedestrian classification benchmark dataset and the INRIA person dataset. The results show that our method halves the miss rate compared with HOG and outperforms state-of-the-art methods on both datasets. Furthermore, as an example of a practical application, we applied our method to a surveillance video eight hours in length; the result shows that our method halves the false positives compared with HOG. In addition, CoHOG can be computed 40% faster than HOG.
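The core CoHOG idea can be illustrated with a small sketch: quantize gradient orientations into bins, then, for each positional offset, count how often bin i at a pixel co-occurs with bin j at the offset pixel. The bin count, offset set, and border handling here are arbitrary choices for illustration, not the paper's exact formulation.

```python
import numpy as np

def cohog(image, n_bins=8, offsets=((0, 1), (1, 0), (1, 1))):
    """Concatenated co-occurrence histograms of quantized gradient orientations."""
    gy, gx = np.gradient(image.astype(np.float64))
    theta = np.arctan2(gy, gx)                                   # in [-pi, pi]
    bins = ((theta + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    H, W = bins.shape
    feats = []
    for dy, dx in offsets:                                       # non-negative offsets only
        h = np.zeros((n_bins, n_bins))
        src = bins[:H - dy, :W - dx]                             # pixel
        dst = bins[dy:, dx:]                                     # offset pixel
        np.add.at(h, (src.ravel(), dst.ravel()), 1)              # co-occurrence counts
        feats.append(h.ravel())
    return np.concatenate(feats)

# Usage: a horizontal ramp has a uniform orientation, so mass concentrates in one cell.
feat = cohog(np.tile(np.arange(16.0), (16, 1)))
```

Each offset contributes an n_bins x n_bins histogram, so the descriptor grows linearly with the number of offsets considered.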
In this paper, we propose a new wavelet denoising method with edge preservation for digital images. Most denoising methods assume additive white Gaussian noise or a particular statistical model; we make no such assumption here. Briefly, the proposed method combines dyadic lifting schemes with edge-preserving wavelet thresholding. The dyadic lifting schemes have free parameters, enabling us to construct filters that preserve important image features. Our method learns these free parameters from training images with and without noise. The learned wavelet filters preserve important features of the original training images while removing noise from noisy images. We describe in detail how to determine these parameters and the edge-preserving denoising algorithm. Numerical image denoising experiments demonstrate the high performance of our method.
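The lifting-plus-thresholding structure can be sketched with a plain Haar lifting step in 1-D (the paper's filters are instead learned from training data, and the threshold here is an arbitrary value): decompose via predict/update steps, soft-threshold the detail coefficients, and invert the lifting.

```python
import numpy as np

def haar_lift_forward(x):
    """One Haar lifting level: predict (detail) then update (approximation)."""
    even, odd = x[::2].astype(float), x[1::2].astype(float)
    d = odd - even            # predict step: detail coefficients
    s = even + d / 2.0        # update step: approximation coefficients
    return s, d

def haar_lift_inverse(s, d):
    """Undo the lifting steps in reverse order."""
    even = s - d / 2.0
    odd = d + even
    out = np.empty(2 * s.size)
    out[::2], out[1::2] = even, odd
    return out

def soft_threshold(d, t):
    """Shrink detail coefficients toward zero by t (kills small, noisy details)."""
    return np.sign(d) * np.maximum(np.abs(d) - t, 0.0)

def denoise_1d(x, t):
    s, d = haar_lift_forward(x)
    return haar_lift_inverse(s, soft_threshold(d, t))

# Usage: t = 0 gives perfect reconstruction; a large t removes all detail.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

Because lifting is invertible by construction, any learned choice of predict/update filters still yields perfect reconstruction when no thresholding is applied.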
A very compact algorithm is presented for fundamental matrix computation from point correspondences over two images. The computation is based on the maximum likelihood (ML) principle, minimizing the reprojection error. The rank constraint is incorporated by the EFNS procedure. Although our algorithm produces the same solution as all existing ML-based methods, it is probably the most practical of all, being small and simple. By numerical experiments, we confirm that our algorithm behaves as expected.
In this paper, we present a projector-camera system for virtually altering the surface reflectance of a real object by projecting images onto it. The surface of the object is assumed to have an arbitrary shape and a diffuse reflectance whose quantitative information is unknown. The system consists of multiple projectors and a camera. The proposed method first estimates the object surface along with the internal and external parameters of the projectors and the camera, based on the projection of structured patterns. It then improves the accuracy of surface normals by using the method of photometric stereo, where the same projectors are used as point sources of illumination. Owing to the combination of triangulation based on structured light projection and the method of photometric stereo, the surface normals of the object along with its surface shape can be accurately measured, which enables high-quality synthesis of virtual appearance. Our experimental system succeeded in giving a number of viewers a visual experience in which several plaster objects appeared as if their surfaces were made of different materials such as metals.
In this paper we describe a new technique for live video segmentation of human regions from dynamic backgrounds. Correct segmentations are produced in real time even under severe background changes caused by camera movement and illumination changes. There are three key contributions. The first is the use of a thermal cue, which proves highly effective when fused with color. Second, we propose a sped-up GraphCut algorithm that incorporates Bayesian estimation. The third contribution is a novel online learning method using cumulative histograms. The segmentation accuracy and speed are well suited to live video segmentation.
Fourth-order nonlinear diffusion denoising filters provide a good combination of noise smoothing and edge preservation without creating staircase artifacts in the filtered image. However, finding an optimal choice of model parameters (i.e., the threshold value in the diffusivity function and the time step-size for stability of the numerical solver) is a challenging problem, and these model parameters are generally image-content dependent. In this paper, a fourth-order diffusion filter is proposed in which the diffusivity is a function of the modulus of the gradient of the image. It is shown that this choice of diffusivity leads to a robust and fast-convergent filter in which the model parameters are reduced to a single threshold value in the diffusivity function, which can be estimated. A data-independent time step-size has been analytically derived to guarantee the convergence of the numerical solver of the proposed filter. Although this time step-size is smaller than those typically used, it is shown that the numerical solver of the proposed filter converges significantly faster than the classical filter, owing to the improved selective smoothing obtained with the proposed diffusivity function. Simulation results demonstrate that the quality of denoised images obtained by the proposed filter is noticeably higher than that of existing filters.
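One explicit update step of a fourth-order filter whose diffusivity depends on the gradient modulus, as described above, might look like the following sketch. The Perona-Malik-style diffusivity form 1/(1 + (|∇u|/k)²), the periodic boundary handling, and the parameter values are assumptions of this sketch, not the paper's derived scheme.

```python
import numpy as np

def laplacian(u):
    """5-point Laplacian with periodic boundaries."""
    return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
            + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4.0 * u)

def fourth_order_step(u, k, dt):
    """One explicit update  u <- u - dt * lap( c(|grad u|) * lap(u) )."""
    gy, gx = np.gradient(u)
    c = 1.0 / (1.0 + (np.hypot(gx, gy) / k) ** 2)   # assumed diffusivity form
    return u - dt * laplacian(c * laplacian(u))

# Usage: a flat image is a steady state, and the mean intensity is preserved.
u = np.full((8, 8), 5.0)
v = fourth_order_step(u, k=10.0, dt=0.1)
w = fourth_order_step(np.arange(64.0).reshape(8, 8), k=10.0, dt=0.1)
```

The nesting of two Laplacians is what makes the filter fourth-order, and it is also why the stable explicit step-size is smaller than for second-order diffusion.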
The topic of this paper is wide area structure from motion. We first describe recent progress in obtaining large-scale 3D visual models from images. Our approach consists of a multi-stage processing pipeline, which can process a recorded video stream in real-time on standard PC hardware by leveraging the computational power of the graphics processor. The output of this pipeline is a detailed textured 3D model of the recorded area. The approach is demonstrated on video data recorded in Chapel Hill containing more than a million frames. While GPS and inertial sensor data were used for these results, we further explore the possibility of extracting the information necessary for consistent 3D mapping over larger areas from images only. In particular, we discuss our recent work on estimating the absolute scale of motion from images and on finding intersections where the camera path crosses itself to effectively close loops in the mapping process. For this purpose we introduce viewpoint-invariant patches (VIP) as a new 3D feature that we extract from 3D models locally computed from the video sequence. These 3D features have important advantages over traditional 2D SIFT features, such as much stronger viewpoint invariance, a relative pose hypothesis from a single match, and a hierarchical matching scheme naturally robust to repetitive structures. In addition, we briefly discuss related work on absolute scale estimation and multi-camera calibration.
Structure from motion (SfM) and appearance-based segmentation have played an important role in the interpretation of road scenes. Integrating these approaches can improve interpretation performance, since the relation between 3D spatial structure and 2D semantic segmentation can be taken into account. This paper presents a new integration framework using an SfM module and a bag-of-textons method for road scene labeling. By using a multiband image, which consists of a near-infrared and a visible color image, we can generate more discriminative textons than those generated from a color image alone. Our SfM module can accurately estimate the ego motion of the vehicle and reconstruct a 3D structure of the road scene. The bag of textons is computed over local rectangular regions whose size depends on the distance of the textons. Therefore, the 3D bag-of-textons method can help to effectively recognize the objects of a road scene because it considers each object's 3D structure. For the labeling problem, we employ a pairwise conditional random field (CRF) model: the unary potential is informed by the SfM results, and the pairwise potential is optimized using the multiband image intensity. Experimental results show that the proposed method can effectively classify the objects in a 2D road scene with 3D structures. The proposed system can substantially advance 3D scene understanding for vehicle environment perception.
This paper proposes a new color calibration method for multi-viewpoint images captured by sparsely and convergently arranged cameras. The main contribution of our method is its practical and efficient procedure, whereas traditional methods are known to be labor-intensive. Because our method automatically extracts 3D points in the scene for color calibration, we do not need to capture color calibration objects such as a Macbeth chart. This enables us to calibrate a set of multi-viewpoint images whose capture environment is not available. Experiments with real images show that our method can minimize the difference in pixel values (1) quantitatively by leave-one-out evaluation, and (2) qualitatively by rendering a 3D video.
This paper proposes a method for acquiring the prior probability of human existence by using past human trajectories and the color of an image. Such priors play an important role in human detection as well as in scene understanding. The proposed method is based on the assumption that a person is likely to appear again in an area where he/she has appeared in the past. In order to acquire the priors efficiently, a high prior probability is assigned to areas having the same color as past human trajectories. We use a particle filter for representing and updating the prior probability, which allows a complex prior to be represented with only a few parameters. Through experiments, we confirmed that our proposed method can acquire the prior probability efficiently and use it to realize highly accurate human detection.
We propose a new method for background modeling based on a combination of multiple models. Our method consists of three complementary approaches. The first, pixel-level background modeling, approximates the background model with a probability density function (PDF) estimated non-parametrically by Parzen density estimation; this allows it to adapt to periodic changes in pixel values. The second, region-level background modeling, is based on the evaluation of the local texture around each pixel, which reduces the effects of variations in lighting and adapts to gradual changes in pixel values. The third, frame-level background modeling, detects sudden, global changes in image brightness and estimates the present background image from the input image by referring to a model background image; foreground objects can then be extracted by background subtraction. Integrating these approaches realizes robust object detection under varying illumination, and several experiments demonstrate its effectiveness.
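The pixel-level Parzen component above can be illustrated with a small sketch: estimate the background PDF of a pixel from its recent history with a Gaussian kernel, and flag the pixel as foreground when its current value is unlikely under that PDF. The kernel bandwidth and threshold here are made-up values for illustration.

```python
import numpy as np

def parzen_density(samples, value, h=5.0):
    """Non-parametric (Parzen / kernel) density estimate with a Gaussian kernel."""
    z = (value - np.asarray(samples, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2.0 * np.pi)))

def is_foreground(history, pixel, h=5.0, thresh=1e-3):
    """Foreground when the current value is unlikely under the background PDF."""
    return parzen_density(history, pixel, h) < thresh

# Usage: past values cluster near 100, so 100 is background and 200 is foreground.
history = [98.0, 99.0, 100.0, 101.0, 102.0]
```

Because the density is a mixture of kernels centered on past samples, a multimodal history (e.g., a flickering pixel alternating between two values) is handled naturally, which is the advantage over a single-Gaussian background model.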
We propose a method to capture 3D video of an object that moves in a large area using active cameras. Our main ideas are to partition a desired target area into hexagonal cells, and to control active cameras based on these cells. Accurate camera calibration and continuous capture of the object with at least one third of the cameras are guaranteed regardless of the object's motion. We show advantages of our method over an existing capture method using fixed cameras. We also show that our method can be applied to a real studio.
We propose a novel wide angle imaging system inspired by compound eyes of animals. Instead of using a single lens, well compensated for aberration, we used a number of simple lenses to form a compound eye which produces practically distortion-free, uniform images with angular variation. The images formed by the multiple lenses are superposed on a single surface for increased light efficiency. We use GRIN (gradient refractive index) lenses to create sharply focused images without the artifacts seen when using reflection based methods for X-ray astronomy. We show the theoretical constraints for forming a blur-free image on the image sensor, and derive a continuum between 1 : 1 flat optics for document scanners and curved sensors focused at infinity. Finally, we show a practical application of the proposed optics in a beacon to measure the relative rotation angle between the light source and the camera with ID information.
Omnidirectional multi-camera systems cannot capture entire fields of view because of their inability to view areas directly below them. Such invisible areas in omnidirectional video decrease the resulting realistic sensation experienced when using a telepresence system. In this study, we generate omnidirectional video without invisible areas using an image completion technique. The proposed method compensates for the change in appearance of textures caused by camera motion and searches for appropriate exemplars considering three-dimensional geometric information. In our experiments, the effectiveness of our proposed method has been demonstrated by successfully filling in missing regions in real video sequences captured using an omnidirectional multi-camera system.
We present a novel technique for enhancing an image captured in low light by using near-infrared flash images. The main idea is to combine a color image with near-infrared flash images captured at the same time without causing any interference with the color image. In this work, near-infrared flash images are effectively used for removing annoying effects that are commonly observed in images of dimly lit environments, namely, image noise and motion blur. Our denoising method uses a pair of color and near-infrared flash images captured simultaneously. Therefore it is applicable to dynamic scenes, whereas existing methods assume stationary scenes and require a pair of flash and no-flash color images captured sequentially. Our deblurring method utilizes a set of near-infrared flash images captured during the exposure time of a single color image and directly acquires a motion blur kernel based on optical flow. We implemented a multispectral imaging system and confirmed the effectiveness of our technique through experiments using real images.
This paper introduces a novel, efficient partial shape matching method named IS-Match. We use points sampled from the silhouette as the shape representation. The sampled points can be ordered, which in turn allows the matching step to be formulated as an order-preserving assignment problem. We propose an angle descriptor between shape chords that combines the advantages of global and local shape description. An efficient integral-image-based implementation of the matching step is introduced, which allows partial matches to be detected an order of magnitude faster than with comparable methods. We further show how the proposed algorithm is used to calculate a globally optimal Pareto frontier to define a partial similarity measure between shapes. Shape retrieval experiments on standard shape datasets such as MPEG-7 show that state-of-the-art results are achieved at reduced computational cost.
We propose a method for extracting a shadow matte from a single image. The removal of shadows from a single image is a difficult problem to solve unless additional information is available; we use user-supplied hints to solve it. The proposed method estimates a fractional shadow matte using a graph cut energy minimization approach. We present a new hierarchical graph cut algorithm that efficiently solves multi-labeling problems, allowing our approach to run at interactive speeds. The effectiveness of the proposed shadow removal method is demonstrated on various natural images, including aerial photographs.
The state of the art in image retrieval on large-scale databases is achieved by work inspired by text retrieval approaches. A key step of these methods is the quantization stage, which maps high-dimensional feature vectors to discriminative visual words. This paper proposes a distance-based multiple-path quantization (DMPQ) algorithm to reduce the quantization loss of vocabulary-tree-based methods. In addition, a more efficient way to build a vocabulary tree is presented, using sub-vectors of features. The algorithm is evaluated on both a benchmark object recognition database and a location recognition database. The experimental results demonstrate that the proposed algorithm effectively improves the image retrieval performance of vocabulary-tree-based methods on both databases.
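A generic multiple-path descent of a vocabulary tree can be sketched as follows: instead of greedily following the single nearest child at each level (where a feature near a cell boundary may be routed to the wrong word), keep the n nearest children. This illustrates the multiple-path idea only; the distance-based path selection of DMPQ itself, and the tree layout, are not reproduced here.

```python
import numpy as np

class Node:
    """Vocabulary-tree node; leaves carry a visual-word id."""
    def __init__(self, centroid, children=None, word_id=None):
        self.centroid = np.asarray(centroid, dtype=float)
        self.children = children or []
        self.word_id = word_id

def multipath_quantize(root, feature, n_paths=2):
    """Descend the tree, keeping the n_paths nearest children at every level."""
    frontier, words = [root], []
    while frontier:
        nxt = []
        for node in frontier:
            if not node.children:                 # reached a leaf: emit its word
                words.append(node.word_id)
                continue
            d = [np.linalg.norm(feature - c.centroid) for c in node.children]
            nxt.extend(node.children[i] for i in np.argsort(d)[:n_paths])
        frontier = nxt
    return words

# Usage: a tiny two-level tree over 1-D features.
root = Node([0.0], children=[
    Node([0.0], children=[Node([0.0], word_id=0), Node([2.0], word_id=1)]),
    Node([10.0], children=[Node([9.0], word_id=2), Node([11.0], word_id=3)]),
])
```

With n_paths = 1 this reduces to standard greedy vocabulary-tree quantization; larger values trade extra distance computations for a lower chance of quantization loss at cell boundaries.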