Stereo reconstruction serves many outdoor applications and thus sometimes faces foggy weather. The quality of reconstruction by state-of-the-art algorithms is then degraded, because scattering reduces contrast with distance. However, as single-image defogging algorithms show, fog provides an extra depth cue in the gray level of faraway objects. Our idea is thus to take advantage of both the stereo and the atmospheric veil depth cues to achieve better stereo reconstruction in foggy weather. To our knowledge, this subject has not previously been investigated by the computer vision community. We thus propose a Markov Random Field model of the stereo reconstruction and defogging problem, which can be optimized iteratively using the α-expansion algorithm. The outputs are a dense disparity map and an image in which contrast is restored. The proposed model is evaluated on synthetic images. This evaluation shows that the proposed method achieves very good results on both stereo reconstruction and defogging compared to standard stereo reconstruction and single-image defogging.
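The link between the two depth cues can be illustrated with a small sketch. It assumes Koschmieder's law for homogeneous fog, which the abstract does not name, and a rectified stereo pair; the parameter names (`beta` for the extinction coefficient, `airlight` for the atmospheric veil intensity, `focal_px`, `baseline_m`) are hypothetical. It is not the paper's MRF model, only the per-pixel relation that such a model could build on.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m):
    """Depth of each pixel for a rectified stereo pair (assumed geometry)."""
    return focal_px * baseline_m / disparity

def add_fog(clear, depth, beta, airlight):
    """Apply Koschmieder's law: attenuate the scene and add the atmospheric veil."""
    transmission = np.exp(-beta * depth)
    return clear * transmission + airlight * (1.0 - transmission)

def remove_fog(foggy, depth, beta, airlight):
    """Invert Koschmieder's law when depth, beta and airlight are known."""
    transmission = np.exp(-beta * depth)
    return (foggy - airlight * (1.0 - transmission)) / transmission

# Two pixels with the same clear intensity but different disparities.
disparity = np.array([8.0, 2.0])
depth = depth_from_disparity(disparity, focal_px=500.0, baseline_m=0.5)
foggy = add_fog(np.array([0.2, 0.2]), depth, beta=0.01, airlight=1.0)
restored = remove_fog(foggy, depth, beta=0.01, airlight=1.0)
```

Under this model the farther pixel is pulled closer to the airlight value, which is why the gray level of distant objects carries depth information.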
This paper presents a new approach to estimate the kinematic structure underlying a sequence of 3D dynamic surfaces reconstructed from multi-view video. The key idea is a mesoscopic surface characterization with a tree-structure constraint. Combined with different levels of surface characterizations, namely macroscopic and microscopic characterizations, our mesoscopic surface characterization can cope with shape estimation errors and global topology changes of 3D surfaces from the real world to estimate kinematic structure. The macroscopic analysis focuses on global surface topology to perform temporal segmentation of 3D video sequence into topologically-coherent sub-sequences. The microscopic analysis operates at the mesh structure level to provide temporally consistent mesh structures using a surface alignment method on each of the topologically-coherent sub-sequences. Then, the mesoscopic analysis extracts rigid parts from the preprocessed 3D video segments to establish partial kinematic structures, and integrates them into a single unified kinematic model. Quantitative evaluations using synthesized and real data demonstrate the performance of the proposed algorithm for kinematic structure estimation.
A fundamental problem in conventional photography is that movement of the camera or captured object causes motion blur in the image. In this research, we propose coding motion-invariant blur using a programmable aperture camera. The camera realizes virtual camera motion by translating the opening, and as a result, we obtain a coded image in which motion blur is invariant with respect to object velocity. Therefore, we can reduce motion blur without having to estimate motion blur kernels or requiring knowledge of the object speed. We model a projection of the programmable aperture camera and also demonstrate that our proposed coding works using a prototype camera.
This paper presents a novel image processing method that enhances the appearance of the micro-structure of a living-organ mucosa using polarized lighting and imaging. A new technique is presented that uses two pairs of parallel and crossed nicol polarimetric images captured under two different linearly polarized lightings, and an averaged subtracted polarization image (AVSPI), which is calculated from these four images, is introduced. Feasibility experiments were performed on excised porcine stomachs using a prototype of the polarimetric endoscope hardware.
It is known that the time-to-contact toward objects can be estimated just from changes in the object size in camera images, without any additional information such as the distance to the objects, the camera speed, or the camera parameters. However, existing methods cannot compute the time-to-contact if there are no geometric features in the images. In this paper, we propose a new method for computing the time-to-contact by using photometric information. When a light source moves in the scene, the observed intensity changes according to the motion of the light source. We analyze the change in intensity in camera images and show that the time-to-contact can be estimated just from this change. Our method does not need any additional information, such as the radiance of the light source, the reflectance of the object, or the orientation of the object surface. The proposed method can be used in various applications, such as vehicle driver assistance.
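The known size-based estimate mentioned at the start of the abstract can be sketched as follows. This illustrates only the geometric baseline, not the proposed photometric method, and it assumes a pinhole camera and a constant approach speed; the function name and arguments are hypothetical. With apparent size s proportional to 1/Z, the time-to-contact is s / (ds/dt), and for two frames taken dt apart the expression below is exact at the time of the second frame.

```python
def time_to_contact(size_prev, size_curr, dt):
    """Time-to-contact at the second frame from two apparent sizes.

    Assumes apparent size is inversely proportional to distance and that
    the distance shrinks at a constant rate. No distance, speed, or
    camera parameter is needed, only the two sizes and the frame gap.
    """
    growth = size_curr - size_prev
    if growth <= 0:
        # The object is not approaching, so contact never occurs.
        return float("inf")
    return dt * size_prev / growth

# An object at distance 20 approaching by 2 units per frame (dt = 1):
# apparent sizes are k/20 and k/18 for some unknown constant k.
k = 1000.0
tau = time_to_contact(k / 20.0, k / 18.0, dt=1.0)
```

In the example the remaining distance is 18 and the speed is 2 per frame, so the estimate corresponds to 9 frames until contact, even though neither quantity is passed to the function.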
This paper presents a 3D surface capture algorithm for underwater objects using multiple projectors and cameras with flat housings. We use a pixel-wise varifocal model to realize an efficient forward projection that explicitly accounts for the refraction caused by flat housings and improves appearance-based correspondence estimation. We propose a practical calibration procedure for underwater projectors and show a real system that demonstrates our concept.
This paper addresses the problem of recognizing defocused patterns. Recognition algorithms assume that input images are focused and sharp, but this does not always hold for actual camera-captured images. Thus, a recognition method that can handle defocused patterns is required. In this paper, we propose a novel recognition framework for defocused patterns that relies on a single camera without a depth sensor. The framework is based on the coded aperture, which can recover a less-degraded image from a defocused image if depth is available. However, in the problem setting of “a single camera without a depth sensor,” estimating depth is ill-posed, and an assumption is required to estimate it. To solve the problem, we introduce a new assumption suitable for pattern recognition: templates are known. It is based on the fact that in pattern recognition, all templates must be available in advance for training. The experiments confirmed that the proposed method is fast and robust to defocus and scaling, especially for heavily defocused patterns.
This paper describes a quality-dependent score-level fusion framework for face, gait, and height biometrics from a single walking image sequence. Person authentication accuracy with face, gait, and height biometrics generally degrades when the spatial resolution (image size) and temporal resolution (frame rate) of the input image sequence decrease, and the degree of degradation differs among the individual modalities. We therefore set the optimal weights of the individual modalities, based on a linear logistic regression framework, depending on a pair of spatial and temporal resolutions, which are called qualities in this paper. However, it is not realistic to compute and store the optimal weights for all possible qualities in advance, and the optimal weights change across the qualities in a nonlinear way. We thus propose a method to estimate the optimal weights for arbitrary qualities from a limited number of training pairs of optimal weights and qualities, based on Gaussian process regression with a nonlinear kernel function. Experiments using a publicly available large-population gait database with 1,935 subjects under various qualities showed that person authentication accuracy improved by successfully estimating the weights depending on the qualities.
We seek to localize a query panorama with a wide field of view given a large database of street-level geotagged imagery. This is a challenging task because of significant changes in appearance due to viewpoint, season, occluding people, or newly constructed buildings. An additional key challenge is computational and memory efficiency, given the planet-scale size of the available geotagged image databases. The contributions of this paper are two-fold. First, we develop a compact image representation for scalable retrieval of panoramic images that represents each panorama as an ordered set of vertical image tiles. Two panoramas are matched by efficiently searching for their optimal horizontal alignment while respecting the tile ordering constraint. Second, we collect a new challenging query test dataset from Shibuya, Tokyo, containing more than a thousand panoramic and perspective query images with manually verified ground-truth geolocation. We demonstrate significant improvements of the proposed method compared to the standard bag-of-visual-words and VLAD baselines.
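The alignment search described above can be sketched as a brute-force scan over horizontal shifts. This is a minimal illustration, not the paper's representation or its efficient search: it assumes each tile is summarized by a hypothetical L2-normalized descriptor vector, that the database panorama covers a full 360 degrees so shifts wrap around, and that the query has no more tiles than the database panorama.

```python
import numpy as np

def best_alignment(query_tiles, pano_tiles):
    """Find the horizontal shift that best aligns ordered query tiles.

    query_tiles: (Tq, D) array, pano_tiles: (Tp, D) array, with Tq <= Tp.
    Query tile i is compared with database tile (i + shift) mod Tp, so
    the left-to-right tile order is preserved for every candidate shift.
    """
    tq, tp = len(query_tiles), len(pano_tiles)
    similarity = query_tiles @ pano_tiles.T      # (Tq, Tp) dot products
    rows = np.arange(tq)
    scores = np.array(
        [similarity[rows, (rows + shift) % tp].sum() for shift in range(tp)]
    )
    shift = int(scores.argmax())
    return shift, float(scores[shift])

# Hypothetical descriptors: 12 tiles of dimension 16 for one panorama.
rng = np.random.default_rng(0)
pano = rng.normal(size=(12, 16))
pano /= np.linalg.norm(pano, axis=1, keepdims=True)
query = pano[5:9]                                # a 4-tile view of the same place
shift, score = best_alignment(query, pano)
```

Ranking database panoramas by the returned score would then give a retrieval result; the shift additionally indicates the relative heading of the query.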
We propose a novel method to estimate the head orientation of a pedestrian. Many methods for head orientation estimation are based on the facial textures of pedestrians. It is, however, impossible to apply these methods to low-resolution images captured by a surveillance camera at a distance. To deal with this problem, we construct a method that is based not on facial textures but on gait features, which can be obtained robustly even from low-resolution images. In our method, size-normalized silhouette images of pedestrians are first generated from the captured images. We then obtain the Gait Energy Image (GEI) from the silhouette images as a gait feature. Finally, we generate a discriminant model to classify head orientation. For this training step, we build a dataset consisting of gait images of over 100 pedestrians and their head orientations. In evaluation experiments using the dataset, we classified head orientation with the proposed method. We confirmed that gait changes of the whole body are effective for the estimation in very low-resolution images, which existing methods cannot handle because of the lack of facial textures.
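The GEI step can be sketched in a few lines: a Gait Energy Image is the pixel-wise average of the size-normalized binary silhouettes of a sequence. The sketch assumes the silhouettes are already extracted, aligned, and of equal size; the function name is hypothetical, and silhouette extraction and the discriminant model are not shown.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average a stack of aligned binary silhouettes into one GEI.

    silhouettes: sequence of (H, W) arrays with values 0 (background)
    and 1 (body). Returns an (H, W) float array with values in [0, 1].
    """
    stack = np.asarray(silhouettes, dtype=np.float64)
    if stack.ndim != 3:
        raise ValueError("expected a sequence of equally sized 2D silhouettes")
    return stack.mean(axis=0)

# Two tiny 2x2 silhouettes standing in for a gait sequence.
frames = [
    np.array([[1, 0], [1, 1]]),
    np.array([[1, 0], [0, 1]]),
]
gei = gait_energy_image(frames)
```

Pixels that belong to the body in every frame stay at 1, while pixels covered only part of the time (typically limbs) take intermediate values, so the image encodes motion as well as shape.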
This paper proposes a background estimation method from a single omnidirectional image sequence for removing undesired regions such as moving objects, specular regions, and uncaptured regions caused by the camera's blind spot without manual specification. The proposed method aligns multiple frames using a reconstructed 3D model of the environment and generates background images by minimizing an energy function for selecting a frame for each pixel. In the energy function, we introduce patch similarity and camera positions to remove undesired regions more correctly and generate high-resolution images. In experiments, we demonstrate the effectiveness of the proposed method by comparing the result given by the proposed method with those from conventional approaches.
In recent years, many 3D microscopes have been developed for medical analysis. Among existing 3D microscopes, the confocal laser scanning microscope (CLSM) is the most widely used, and it can measure the 3D shape of biological tissues at high resolution. However, the scanning required during measurement is time-consuming, so it is difficult to measure living biological tissues such as red blood cells. In addition, the measurement accuracy of the CLSM is about 1 µm, which is not high for 3D measurement. To resolve these problems of the CLSM, we propose a 3D measurement method using single-shot phase-shift digital holography. Using the proposed method, we also developed a 3D microscope for 3D measurement of red blood cells with a measurement accuracy of 5.5 nm and a measurement time of 80 ms. This paper describes the details of the proposed 3D measurement method and the 3D microscope.
In this paper, we propose a method for obtaining the positions and poses of multiple cameras, and for temporally synchronizing them, by using blinking calibration patterns. In the proposed method, calibration patterns are shown on tablet PCs or monitors and are observed by multiple cameras. By observing several frames from the cameras, we can obtain the camera positions, poses, and frame correspondences among cameras. The proposed calibration patterns are based on pseudo random volumes (PRV), a 3D extension of pseudo random sequences, and it is the use of PRV that makes the proposed method possible. We believe our method is useful not only for multiple camera systems but also for AR applications for multiple users.
In this work, we propose what is, to the best of our knowledge, the first stand-alone large-scale image classification system running on an Android smartphone. The objective of this work is to show that mobile large-scale image classification requires no communication with external servers. To that end, we propose a scalar-based compression method for the weight vectors of linear classifiers. As an additional characteristic, the proposed method does not need to uncompress the compressed vectors to evaluate the classifiers, which saves recognition time. We have implemented a large-scale image classification system on an Android smartphone, which can perform 1000-class classification for a given image in 0.270 seconds. In the experiment, we show that compressing the weights to 1/8 led to only a 0.80% performance loss for 1000-class classification with the ILSVRC2012 dataset. In addition, the experimental results indicate that weight vectors compressed to a low number of bits, even in the binarized case (bit = 1), are still valid for classification of high-dimensional vectors.
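The idea of evaluating a linear classifier directly on compressed weights can be sketched with plain uniform scalar quantization. This is a stand-in chosen for illustration, since the abstract does not specify the compression scheme; the function names and the per-vector `lo` and `step` parameters are hypothetical. Because each weight is approximated as `lo + step * code`, the score w·x can be computed from the integer codes without rebuilding the weight vector.

```python
import numpy as np

def quantize(weights, bits):
    """Uniformly quantize a weight vector to integer codes of `bits` bits."""
    if not 1 <= bits <= 8:
        raise ValueError("this sketch stores codes in uint8, so bits must be 1..8")
    levels = 2 ** bits - 1
    lo, hi = float(weights.min()), float(weights.max())
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((weights - lo) / step).astype(np.uint8)
    return codes, lo, step

def score_compressed(codes, lo, step, x):
    """Approximate w.x using only the codes and two scalars.

    With w_i ~ lo + step * code_i, the dot product splits into
    lo * sum(x) + step * (codes . x), so w is never decompressed.
    """
    return lo * x.sum() + step * np.dot(codes, x)

rng = np.random.default_rng(0)
w = rng.normal(size=64)          # hypothetical classifier weights
x = rng.normal(size=64)          # hypothetical feature vector
codes, lo, step = quantize(w, bits=4)
approx = score_compressed(codes, lo, step, x)
exact = float(w @ x)
```

Each weight is off by at most half a quantization step, so the score error is bounded by `0.5 * step * sum(|x|)`; lowering `bits` trades accuracy for memory.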
For 3D active measurement methods using a video projector, there is an implicit limitation that the projected patterns must be in focus on the target object. This limitation places a severe constraint on the possible depth range for reconstruction. To overcome the problem, this paper proposes a Depth from Defocus (DfD) method that uses multiple patterns with different in-focus depths to expand the depth range. With the method, not only is the depth range extended, but the shape can also be recovered even if there is an obstacle between the projector and the target, because of the large aperture of the projector. Furthermore, because DfD does not require a baseline between the cameras and the projector, occlusion does not occur with the method. To verify the effectiveness of the method, several experiments were conducted with an actual system to estimate the depth of several objects.
This paper introduces a novel method for image classification using local feature descriptors. The method utilizes linear subspaces of local descriptors for characterizing their distribution and extracting image features. The extracted features are transformed into more discriminative features by the linear discriminant analysis and employed for recognizing their categories. Experimental results demonstrate that this method is competitive with the Fisher kernel method in terms of classification accuracy.
Given an image sequence with corrupted pixels, usually big holes that span several frames, we propose to complete the missing parts using an iterative optimization approach that minimizes an optical flow functional and propagates the color information simultaneously. Within one iteration of the optical flow estimation, we use the solved motion field to propagate the color and then feed the newly inpainted color back into the brightness constraint of the optical flow functional. We then introduce a spatially dependent blending factor, called the mask function, to control the effect of the newly propagated color. We also add a trajectory constraint by solving the forward and backward flow simultaneously using three frames. Finally, we minimize the functional by using the alternating direction method of multipliers.
In this paper, we propose a method for recognizing the actions of people whose bodies are only partially within a given viewing angle. The method compensates for motion features that fall outside the viewing angle by using a regression estimate based on a correlation between the motion features of partially visible human bodies. This compensation is useful in situations where parts of a person's body protrude beyond the edges of the viewing angle, and it contributes to enlarging the region covered for action recognition. In the proposed method, the motion features and the position of the acting person in a depth image are calculated first. Second, the deficit length protruding outside the viewing angle is calculated according to the position of the person. Finally, the motion features of the entire body are estimated from the observed motion features using a regression estimate, with the regression coefficients selected according to the deficit length. The effectiveness of the method in improving the F-measure is confirmed using three kinds of motion features in a fundamental laboratory experiment. The experimental results show that the F-measure was improved by more than 12.5% with motion feature compensation, compared to without compensation, when the part of the person from the floor up to 630 mm above it cannot be seen within the viewing angle.
This paper proposes a method to estimate a mobile camera's position and orientation by referring to the corresponding points between aerial-view images from a GIS database and mobile camera images. The mobile camera images are taken from the user's viewpoint, and the aerial-view images include the same region. To increase the correspondence accuracy, we generate a virtual top-view image that virtually captures the target region overhead of the user by using the intrinsic parameters of the mobile camera and the inertia (gravity) information. We find corresponding points between the virtual top-view and aerial-view images and estimate a homography matrix that transforms the virtual top-view image into aerial-view image. Finally, the mobile camera's position and orientation are estimated by analyzing the matrix. In some cases, however, it is difficult to obtain a sufficient number of correct corresponding points to estimate the correct homography matrix by capturing only a single virtual top-view image. We solve this problem by stitching virtual top-view images to represent a larger ground region. We experimentally implemented our method on a tablet PC and evaluated its effectiveness.
This paper focuses on initializing 3-D reconstruction from scratch without any prior scene information. Traditionally, this has been done from two-view matching, which is prone to the degeneracy called “imaginary focal lengths.” We overcome this difficulty by using three images, but we do not require three-view matching; all we need is three fundamental matrices separately computed from pair-wise image matching. We exploit the redundancy of the three fundamental matrices to optimize the camera parameters and the 3-D structure. The main theme of this paper is to give an analytical procedure for computing the positions and orientations of the three cameras and their internal parameters from three fundamental matrices. The emphasis is on resolving the ambiguity of the solution resulting from the sign indeterminacy of the fundamental matrices. We run numerical simulations to show that imaginary focal lengths are less likely with our three-view method, resulting in higher accuracy than the conventional two-view method. We also test the degeneracy tolerance of our method by using endoscopic intestine tract images, for which the camera configuration is almost always nearly degenerate. We demonstrate that our method allows us to obtain more detailed intestine structures than two-view reconstruction and observe how our three-view reconstruction is refined by bundle adjustment. Our method is expected to broaden medical applications of endoscopic images.
This paper describes part of an ongoing comprehensive research project that is aimed at generating a MathML format from images of mathematical expressions that have been extracted from scanned PDF documents. A MathML representation of a scanned PDF document reduces the document's storage size and encodes the mathematical notation and meaning. The MathML representation then becomes suitable for vocalization and accessible through the use of assistive technologies. In order to achieve an accurate layout analysis of a scanned PDF document, all textual and non-textual components must be recognised, identified and tagged. These components may be text or mathematical expressions and graphics in the form of images, figures, tables and/or diagrams. Mathematical expressions are one of the most significant components within scanned scientific and engineering PDF documents and need to be machine readable for use with assistive technologies. This research is a work in progress and includes multiple different modules: detecting and extracting mathematical expressions, recursive primitive component extraction, non-alphanumerical symbols recognition, structural semantic analysis and merging primitive components to generate the MathML of the scanned PDF document. An optional module converts MathML to audio format using a Text to Speech engine (TTS) to make the document accessible for vision-impaired users.
The technique of “renormalization” for geometric estimation attracted much attention when it appeared in the early 1990s for having higher accuracy than any other method known at the time. The key fact is that it directly specifies equations to solve, rather than minimizing some cost function. This paper expounds this “non-minimization approach” in detail and exploits this principle to modify renormalization so that it outperforms the standard reprojection error minimization. Doing a precise error analysis in the most general situation, we derive a formula that maximizes the accuracy of the solution; we call it hyper-renormalization. Applying it to ellipse fitting, fundamental matrix computation, and homography computation, we confirm its accuracy and efficiency for sufficiently small noise. Our emphasis is on the general principle, rather than on individual methods for particular problems.