2013 Volume 1 Issue 2 Pages 190-198
Analyzing video for semantic content is essential for finding a desired video within the vast amount of accumulated video data. One conventional method for detecting objects depicted in video, the bag-of-visual-words method, is based on the occurrence frequencies of local features. We propose a method that improves on the detection accuracy of the traditional method by dividing video frames into overlapping sub-regions of various sizes. The method computes local and global features for each sub-region so that spatial position is reflected in the feature vectors, making the method robust to variations in the size and position of objects appearing in the video. We also propose a training framework based on semi-supervised learning that starts from a small number of labeled data points and efficiently generates additional labeled training data with few errors. Experiments on a video data set confirmed improved detection accuracy over earlier methods.
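The core representation described above, a bag-of-visual-words histogram computed over overlapping multi-scale sub-regions and concatenated into one feature vector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the region scales, stride, codebook size, and the random keypoints standing in for quantized local descriptors are all illustrative assumptions.

```python
import numpy as np

def subregions(h, w, sizes=(0.5, 0.75), stride_frac=0.25):
    """Yield overlapping sub-region boxes (y0, x0, y1, x1) at several
    scales; scales and stride here are assumed values for illustration."""
    boxes = []
    for s in sizes:
        rh, rw = int(h * s), int(w * s)
        sy, sx = max(1, int(rh * stride_frac)), max(1, int(rw * stride_frac))
        for y0 in range(0, h - rh + 1, sy):
            for x0 in range(0, w - rw + 1, sx):
                boxes.append((y0, x0, y0 + rh, x0 + rw))
    return boxes

def bovw_histogram(kp_yx, word_ids, box, k):
    """Normalized occurrence histogram of the k visual words whose
    keypoints fall inside the given box."""
    y0, x0, y1, x1 = box
    mask = ((kp_yx[:, 0] >= y0) & (kp_yx[:, 0] < y1) &
            (kp_yx[:, 1] >= x0) & (kp_yx[:, 1] < x1))
    hist = np.bincount(word_ids[mask], minlength=k).astype(float)
    n = hist.sum()
    return hist / n if n else hist

# Stand-in data: keypoint positions and their quantized visual-word ids
# (in practice these would come from a local-feature detector and a
# learned codebook, e.g. k-means over descriptors).
rng = np.random.default_rng(0)
k = 50                                    # assumed codebook size
kp_yx = rng.uniform([0, 0], [120, 160], size=(500, 2))
word_ids = rng.integers(0, k, size=500)

boxes = subregions(120, 160)
# Concatenating per-region histograms embeds spatial position in the
# final feature vector, which is what makes the representation sensitive
# to where objects appear while the overlap keeps it robust to shifts.
feature = np.concatenate([bovw_histogram(kp_yx, word_ids, b, k)
                          for b in boxes])
```

Because each sub-region contributes its own histogram slice, an object that moves or changes size still lands inside some overlapping region at a matching scale, which is the intuition behind the robustness claim in the abstract.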