Latent Topic Estimation Based on Events in a Document
Risa Kitajima and Ichiro Kobayashi
Advanced Sciences, Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-8610, Japan
Latent topic models such as Latent Semantic Indexing (LSI), Probabilistic LSI (pLSI), and Latent Dirichlet Allocation (LDA) are widely used for text analysis. These methods assign topics to individual words, however, and therefore do not consider the relationships between words in a document. To address this, we propose a latent topic extraction method that assigns topics to events, which represent relations between words in a document. An event can be defined in several ways, and the accuracy of latent topic estimation depends on the definition chosen. We therefore propose five event types and use a common document retrieval task to examine which type works best for estimating the latent topics of a document. As an application of the proposed method, we also demonstrate multidocument summarization based on latent topics. These experiments confirmed that the proposed method achieves higher accuracy than the conventional method.