Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval
Hiroshi Murata*, Takashi Onoda*, and Seiji Yamada**
*Central Research Institute of Electric Power Industry (CRIEPI), 2-11-1 Iwado kita, Komae-shi, Tokyo 201-8511, Japan
**National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Support Vector Machines (SVMs) have been applied to interactive document retrieval with active learning. In such a retrieval system, the degree of relevance is evaluated using the signed distance from the optimal hyperplane. It is not clear, however, how this signed distance relates to the vector space model. We therefore formulated the degree of relevance in terms of the signed distance in SVMs and compared it analytically with a conventional Rocchio-based method. Although vector normalization has long been used as preprocessing for document retrieval, few studies have explained why it is effective. Based on our comparative analysis, we show theoretically why normalizing document vectors is effective in SVM-based interactive document retrieval. We then propose a cosine kernel suited to SVM-based interactive document retrieval. The method was compared experimentally with conventional relevance feedback for Boolean, Term Frequency, and Term Frequency-Inverse Document Frequency representations of document vectors. Experimental results on a Text REtrieval Conference data set showed that the cosine kernel is effective for all document representations, especially the Term Frequency representation.
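The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes that the cosine kernel is the standard inner product of L2-normalized vectors and that the signed distance is taken in the linear case; the abstract does not give the exact definitions, so these are assumptions rather than the authors' formulation. The function names and the term-frequency vectors are hypothetical.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_kernel(u, v):
    """Inner product of the L2-normalized vectors (assumed definition).

    Returns 0.0 if either vector is all zeros.
    """
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

def signed_distance(w, b, x):
    """Signed distance of x from the hyperplane w.x + b = 0 (linear case)."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))

# Hypothetical term-frequency vectors over a four-term vocabulary.
doc_short = [2, 1, 0, 0]
doc_long = [20, 10, 0, 0]    # same term proportions, ten times longer
reference = [1, 1, 0, 1]     # e.g. a document already judged relevant

# The linear kernel grows with document length ...
lin_short = dot(doc_short, reference)
lin_long = dot(doc_long, reference)

# ... while the cosine kernel depends only on direction.
k_short = cosine_kernel(doc_short, reference)
k_long = cosine_kernel(doc_long, reference)
```

In this sketch the raw inner product of the longer document is ten times larger, whereas the two cosine kernel values agree up to floating-point rounding. This is one way to see why length normalization matters most for raw term-frequency vectors.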