2007 Volume 2007 Issue DMSM-A701 Pages 06-
With the development of ubiquitous sensing, electronic documents, and multimedia technologies, data sets consisting of massive numbers of high-dimensional instances have become available in various practical fields. Efficient evaluation of similarity measures among such instances, e.g., correlations and kernels, is one of the most important tasks required by major data mining techniques such as instance queries and clustering. However, the direct computation for n objects has a computational complexity of O(n^2), which is practically intractable for high-dimensional and/or massive data and for complex similarity measures. Moreover, some scientific similarity measurements among objects are costly and time-consuming to obtain, as in gene expression experiments. The objective of this paper is to provide an efficient remedy to this problem. We propose a fast approach that estimates the similarity measures among n instances from a subset of similarity measures that are actually computed and/or observed, together with a mathematical constraint called "Positive Semi-Definiteness (PSD)" that governs the similarity measures. The superior performance of our approach in both efficiency and accuracy of the estimation is demonstrated through an evaluation based on artificial and real-world data sets.
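To give a feel for why the PSD constraint carries information about similarities that were never computed, the following minimal sketch may help. It is not the estimation method proposed in the paper; it only illustrates the constraint on the smallest possible case. It assumes correlation-like similarities with unit diagonal, and the function names `pair_count` and `psd_bounds` are hypothetical names chosen for this illustration. For three instances with known similarities s12 and s13, requiring the 3x3 similarity matrix to be PSD restricts the unknown s23 to a closed interval.

```python
import math


def pair_count(n):
    """Number of distinct instance pairs, the source of the O(n^2) cost."""
    return n * (n - 1) // 2


def psd_bounds(s12, s13):
    """Interval of s23 values that keep the 3x3 matrix
    [[1, s12, s13], [s12, 1, s23], [s13, s23, 1]] positive semi-definite.

    Assumption (not from the paper): similarities are correlation-like,
    i.e. the diagonal is 1 and every entry lies in [-1, 1].
    The interval follows from requiring the determinant
    1 + 2*s12*s13*s23 - s12**2 - s13**2 - s23**2 to be non-negative.
    """
    if not (-1.0 <= s12 <= 1.0 and -1.0 <= s13 <= 1.0):
        raise ValueError("similarities must lie in [-1, 1]")
    center = s12 * s13
    radius = math.sqrt((1.0 - s12 ** 2) * (1.0 - s13 ** 2))
    return center - radius, center + radius


if __name__ == "__main__":
    # Direct evaluation of all pairs grows quadratically with n.
    print(pair_count(1000))
    # Two observed similarities already narrow down the unobserved one.
    low, high = psd_bounds(0.9, 0.8)
    print(low, high)
```

The stronger the observed similarities, the narrower the admissible interval becomes; when s12 equals 1, the unobserved s23 is forced to equal s13. This is the kind of structure an estimator can exploit when only part of the similarity matrix is available.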