Midori Serizawa and Ichiro Kobayashi
In this paper, we propose a method for detecting and tracking topics of newspaper articles based on the latent semantics of the documents. We use Latent Dirichlet Allocation (LDA) to extract latent topics. In using LDA, we have to provide the number of latent topics in target documents in advance. To do so, perplexity is widely used as a metric for estimating the number of latent topics in documents. As a solution, we estimate the number of latent topics without any prior information in the case of using Hierarchical Dirichlet Process LDA (HDP-LDA). We propose a method to estimate the number of latent topics in target documents based on calculating the similarity among extracted topics, and conduct an experiment with three data sets to compare the method with the above two representative methods, i.e., HDP-LDA and LDA using perplexity. From experimental results, we confirmed that our method can provide results similar to that of HDP-LDA. We also detect and track topics by means of our proposed method and confirm that our method is useful.
Keywords: topic extraction, topic tracking, latent Dirichlet allocation, topic similarity, term-score