Paper:
Topic Tracking Based on Identifying Proper Number of the Latent Topics in Documents
Midori Serizawa and Ichiro Kobayashi
Advanced Sciences, Graduate School of Humanities and Sciences, Ochanomizu University, 2-1-1 Ohtsuka, Bunkyo-ku, Tokyo 112-8610, Japan
In this paper, we propose a method for detecting and tracking topics in newspaper articles based on the latent semantics of the documents. We use Latent Dirichlet Allocation (LDA) to extract latent topics. LDA requires the number of latent topics in the target documents to be specified in advance, and perplexity is widely used as a metric for estimating this number. Alternatively, Hierarchical Dirichlet Process LDA (HDP-LDA) can estimate the number of latent topics without any prior information. We propose a method that estimates the number of latent topics in the target documents by calculating the similarity among extracted topics, and we conduct an experiment on three data sets to compare it with these two representative methods, i.e., HDP-LDA and LDA using perplexity. The experimental results confirm that our method provides results similar to those of HDP-LDA. We also detect and track topics using the proposed method and confirm that it is useful.
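The two ideas mentioned in the abstract, perplexity as a selection metric and similarity among extracted topics, can be illustrated with a minimal Python sketch. This is not the authors' procedure, which the abstract does not spell out. The function names, the use of cosine similarity, the greedy grouping, the threshold of 0.9, and the toy topic-word distributions are all assumptions made for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of held-out tokens: exp of the negative mean log-probability."""
    log_lik = sum(math.log(p) for p in token_probs)
    return math.exp(-log_lik / len(token_probs))

def cosine(u, v):
    """Cosine similarity between two topic-word distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def count_distinct_topics(topics, threshold=0.9):
    """Count topics after greedily discarding near-duplicates.

    Assumption: two topics are treated as the same latent topic when their
    cosine similarity reaches `threshold` (a hypothetical value).
    """
    representatives = []
    for topic in topics:
        if not any(cosine(topic, rep) >= threshold for rep in representatives):
            representatives.append(topic)
    return len(representatives)

# Toy topic-word distributions over a 4-word vocabulary (illustrative only,
# not taken from the paper's data sets).
topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.68, 0.22, 0.05, 0.05],  # nearly identical to the first topic
    [0.05, 0.05, 0.60, 0.30],
]

print("estimated number of topics:", count_distinct_topics(topics))
print("perplexity of a uniform 4-word model:", perplexity([0.25] * 4))
```

In this toy setting, an LDA run with too many topics would tend to produce near-duplicate topics, so the count of mutually dissimilar topics serves as a rough estimate of the proper number; a lower perplexity on held-out text indicates a better-fitting model.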
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.