Semantic indexing, or assigning semantic tags to video samples, is a key component for content-based access to video documents and collections. The Semantic Indexing task was run at TRECVid from 2010 to 2015 with the support of NIST and the Quaero project. Like the earlier High-Level Feature detection task, which ran from 2002 to 2009, the semantic indexing task aims at evaluating methods and systems for detecting visual, auditory, or multi-modal concepts in video shots. In addition to the main semantic indexing task, four secondary tasks were proposed, namely the "localization" task, the "concept pair" task, the "no annotation" task, and the "progress" task. The task attracted over 40 research teams during its running period and was conducted using a total of 1,400 hours of video data drawn from Internet Archive videos with Creative Commons licenses gathered by NIST. 200 hours of new test data were made available each year, plus 200 more hours as development data in 2010. The number of target concepts to be detected started at 130 in 2010 and was extended to 346 in 2011. The increases in both the volume of video data and the number of target concepts favored the development of generic and scalable methods. Over 8 million direct shot×concept annotations, plus over 20 million indirect ones, were produced by the participants and the Quaero project on a total of 800 hours of development data. Significant progress was accomplished during the period, as accurately measured in the context of the progress task and also in some of the participants' contrast experiments. This paper describes the data, protocol, and metrics used for the main and the secondary tasks, the results obtained, and the main approaches used by participants.
Video semantic indexing, which aims to detect objects, actions, and scenes in video data, is one of the important research topics in multimedia information processing. In the Text Retrieval Conference Video Retrieval Evaluation (TRECVID) workshop, many fundamental techniques for video processing have been developed and shown to be effective on real data such as Internet videos. These include extensions to video data of deep learning techniques and of image recognition techniques such as bag of visual words. This paper reviews TRECVID activities involving these techniques for semantic indexing. We also describe the TokyoTech system, which uses Gaussian-mixture-model (GMM) supervectors and deep convolutional neural networks (CNNs), together with its experimental evaluation at TRECVID 2014.
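As a minimal illustration of the bag-of-visual-words representation mentioned above (not the TokyoTech pipeline itself), a visual vocabulary can be learned by k-means over local descriptors, and each image or shot described by a normalized histogram of quantized descriptors. All names and sizes here are illustrative:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Learn a visual vocabulary by plain k-means over local descriptors."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

def bovw_histogram(descriptors, centers):
    """Quantize descriptors against the vocabulary; return an L1-normalized histogram."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()
```

In practice the descriptors would be local features (e.g., SIFT) extracted from keyframes, and the vocabulary would be far larger than in this sketch.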
Multimedia event detection (MED) and evidence hunting are two primary topics in the area of multimedia event search. The former serves to retrieve a list of relevant videos given an event query, whereas the latter reasons about why, and to what degree, a retrieved video answers that query. Common practice treats these two topics with separate methods; in this paper, however, we combine MED and evidence hunting into a joint framework. We propose a refined semantic representation, named object pooling, which can dynamically extract visual snippets corresponding to when and where evidence might appear. The main idea of object pooling is to adaptively sample regions from frames to generate object histograms that can be efficiently rolled up and back. Experiments conducted on the large-scale TRECVID MED 2014 dataset demonstrate the effectiveness of the proposed object pooling approach for both event detection and evidence hunting.
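The rolled-up object histograms described above can be sketched roughly as follows. This is a hedged simplification of the idea, not the paper's method: per-frame object detections (label, confidence) are accumulated into score-weighted histograms, which are then rolled up into a single video-level histogram. The vocabulary and detections are hypothetical:

```python
def frame_histogram(detections, vocab):
    """Score-weighted object histogram for one frame.
    detections: list of (object_label, confidence) pairs from a detector."""
    h = {w: 0.0 for w in vocab}
    for label, score in detections:
        if label in h:
            h[label] += score
    return h

def roll_up(histograms, vocab):
    """Aggregate frame-level histograms into one normalized video-level histogram."""
    total = {w: 0.0 for w in vocab}
    for h in histograms:
        for w in vocab:
            total[w] += h[w]
    s = sum(total.values()) or 1.0
    return {w: v / s for w, v in total.items()}
```

Keeping the frame-level histograms around is what allows "rolling back": a high-scoring object in the video-level histogram can be traced to the frames and regions that contributed it, which is the evidence-hunting side of the framework.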
The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos, making them unsearchable by current search engines. Content-based video retrieval (CBVR) tackles this metadata-scarcity problem by directly analyzing the visual and audio streams of each video. CBVR encompasses multiple research topics, including low-level feature design, feature fusion, semantic detector training, and video search/reranking. We present novel strategies in these topics that enhance both the accuracy and the speed of CBVR under different query inputs, including pure textual queries and queries by video example. Our proposed strategies were incorporated into our submission for the TRECVID 2014 Multimedia Event Detection evaluation, where our system outperformed the other submissions on both text queries and video-example queries, demonstrating the effectiveness of our proposed approaches.
Existing instance search methods based on spatial verification still suffer from limited efficiency and from sensitivity to large 3D viewpoint changes. To address these issues, we incorporate two new spatial verification methods into local-feature-based instance search. The first, called ensemble of weak geometric relations, imposes multiple pairwise geometric constraints on pairs of feature correspondences. It leverages the coherence of the spatial neighborhood to reduce its complexity from quadratic to linear time in the number of correspondences. The second, called angle-free object information retrieval, converts each image into a set of affine-transformed images to augment the information used for search and to neutralize viewpoint changes between two images. Extensive experiments on two TRECVID instance search datasets show the superiority of our methods, which are highly robust to large 3D viewpoint changes, small instances, and occlusions.
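A generic sketch of pairwise geometric verification with a neighborhood restriction follows. This is not the paper's exact ensemble of constraints; it illustrates one weak pairwise relation (agreement on scale ratio and orientation difference between matched keypoints), checked only against each correspondence's k nearest spatial neighbors so the total work stays linear in the number of correspondences rather than quadratic. The match-row layout is an assumption for the example:

```python
import numpy as np

def pairwise_consistency_score(matches, k=5, tol_scale=0.3, tol_angle=0.4):
    """Score each correspondence by how many of its k nearest neighbors
    (in the query image) agree on a weak pairwise geometric relation:
    similar log scale ratio and similar orientation difference.
    matches: rows of (xq, yq, sq, aq, xd, yd, sd, ad) giving query/database
    keypoint position, scale, and angle (hypothetical layout)."""
    m = np.asarray(matches, float)
    log_ratio = np.log(m[:, 6] / m[:, 2])                 # log(scale_db / scale_q)
    d_angle = (m[:, 7] - m[:, 3] + np.pi) % (2 * np.pi) - np.pi
    scores = np.zeros(len(m))
    for i in range(len(m)):
        # Restrict checks to the spatial neighborhood (linear-time overall).
        dists = np.hypot(m[:, 0] - m[i, 0], m[:, 1] - m[i, 1])
        nbrs = np.argsort(dists)[1:k + 1]
        ok = (np.abs(log_ratio[nbrs] - log_ratio[i]) < tol_scale) & \
             (np.abs(d_angle[nbrs] - d_angle[i]) < tol_angle)
        scores[i] = ok.sum()
    return scores
```

Correspondences with low scores (few geometrically agreeing neighbors) are treated as likely outliers and down-weighted or discarded before ranking.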
This paper provides an extensive study on the suitability of image representations based on convolutional networks (ConvNets) for the task of visual instance retrieval. Beyond the choice of convolutional layers, we present an efficient pipeline exploiting multi-scale schemes to extract local features, in particular by taking geometric invariance, i.e., positions, scales, and spatial consistency, explicitly into account. In experiments on five standard image retrieval datasets, we demonstrate that generic ConvNet image representations can outperform other state-of-the-art methods when extracted appropriately.
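The multi-scale local-feature extraction described above can be sketched generically: given a ConvNet activation map, regional descriptors are obtained by max-pooling over grids of regions at several scale levels and L2-normalizing each one, so that positions and scales are represented explicitly. This is a hedged, simplified scheme in the spirit of such pipelines, not the paper's exact one; the random array stands in for real ConvNet activations:

```python
import numpy as np

def multiscale_regional_features(fmap, levels=(1, 2, 3)):
    """Extract local descriptors from a conv feature map of shape (H, W, C)
    by max-pooling over an L x L grid of regions at each scale level L,
    then L2-normalizing each regional descriptor."""
    H, W, C = fmap.shape
    feats = []
    for L in levels:
        hs = np.linspace(0, H, L + 1).astype(int)
        ws = np.linspace(0, W, L + 1).astype(int)
        for i in range(L):
            for j in range(L):
                region = fmap[hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                v = region.max(axis=(0, 1))          # max-pool over the region
                feats.append(v / (np.linalg.norm(v) + 1e-12))
    return np.stack(feats)                            # (sum of L^2, C) descriptors
```

With levels (1, 2, 3) this yields 1 + 4 + 9 = 14 descriptors per image, which can then be matched or aggregated with spatial-consistency checks at query time.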
Diminished reality (DR) refers to interactive techniques for deleting or diminishing undesirable objects from a perceived environment, whereas augmented/mixed reality seamlessly merges real and virtual scenes. In this paper, we introduce data acquisition facilities and an evaluation workflow for benchmarking DR methods. In the proposed facilities, simulated indoor and outdoor scenes are constructed, illuminated, and photographed using full-scale and miniature sets, a cinematography-based lighting system, and a camera attached to a 6-degrees-of-freedom industrial robot arm, respectively. These facilities enable the acquisition of paired image sequences with and without the target objects of interest, i.e., source and ground-truth image sequences, for evaluating DR methods in indoor and outdoor scenarios. Through operational tests, several datasets were recorded, and representative DR methods were evaluated on them to show that such data is usable for qualitative and quantitative evaluation of DR methods.
Until now, the 700 MHz band has been used for the field pickup unit (FPU), which is used for the live broadcasting of events such as marathons and long-distance relay races. However, the frequency band is slated to migrate to the 1.2 and 2.3 GHz bands based on an action plan for radio spectrum reallocation developed by the Ministry of Internal Affairs and Communications, Japan. With the frequency migration of the FPU, transmitting antennas can be downsized, so 1.2 and 2.3 GHz band antennas can also be mounted on professional-use wireless cameras. In this study, we measured the specific absorption rate (SAR) in the body of an operator exposed to electromagnetic waves radiated from the transmission antenna of a wireless camera in the 1.2 GHz band. We also calculated the SAR to confirm the validity of the measurement method and compared the measured results with the calculated ones. The measured and calculated SAR distributions were nearly identical, so it is possible to evaluate the SAR using the method suggested in this paper.
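For reference, the local SAR at a point in tissue is defined as SAR = σ|E|²/ρ, where σ is the tissue conductivity (S/m), E the RMS internal electric field (V/m), and ρ the tissue mass density (kg/m³). A trivial helper makes the units explicit; the tissue values in the example are illustrative, not the paper's measurement conditions:

```python
def sar(sigma, e_rms, rho):
    """Local specific absorption rate, SAR = sigma * |E|^2 / rho, in W/kg.
    sigma: tissue conductivity (S/m)
    e_rms: RMS internal electric field strength (V/m)
    rho:   tissue mass density (kg/m^3)"""
    return sigma * e_rms ** 2 / rho

# Illustrative values: sigma = 1.0 S/m, E = 10 V/m RMS, rho = 1000 kg/m^3
# gives SAR = 1.0 * 100 / 1000 = 0.1 W/kg.
```

Measurement systems estimate the internal E-field with a probe scanned through a tissue-equivalent phantom, then average the resulting SAR over a specified mass (e.g., 10 g) for comparison against exposure limits.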
Augmented reality using optical see-through head-mounted displays (OSTHMDs) provides the user with a more realistic experience than smartphones or tablet devices. The positional relationship between the user's eye and the virtual screen must be calibrated using input from the user, but conventional calibration methods are highly sensitive to input errors. In this paper, we propose a vision-based robust calibration (ViRC) method using a fiducial marker, which can be applied to any OSTHMD equipped with a camera. The ViRC method decomposes the 11-DoF calibration parameters into device-dependent and user-dependent parameters. Once the device-dependent parameters are calculated, the user only has to perform a calibration phase to estimate the 4-DoF user-dependent parameters. Experiments show that the ViRC method decreases reprojection error by 83% compared with the conventional method; consequently, users can observe correctly aligned superimpositions of computer graphics with little distortion.
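The reprojection error reported above is the standard quantity such calibrations minimize: the pixel distance between where a 3D point projects under the estimated eye/display model and where it was actually observed. A generic sketch under a pinhole model (assumed notation, not the ViRC parameterization):

```python
import numpy as np

def project(points_3d, K, R, t):
    """Project 3D points of shape (N, 3) to pixels with intrinsics K
    and pose (R, t) mapping world coordinates to eye/display coordinates."""
    cam = points_3d @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

def mean_reprojection_error(points_3d, observed_px, K, R, t):
    """Mean pixel distance between projected and observed 2D points,
    the quantity a calibration procedure seeks to minimize."""
    proj = project(points_3d, K, R, t)
    return np.linalg.norm(proj - observed_px, axis=1).mean()
```

Decomposing the parameters as in ViRC means K-like device-dependent quantities are fixed once per device, so the per-user phase only has to refine the few pose-like parameters that depend on where the user's eye sits behind the display.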