Automatic Keyword Annotation System Using Newspapers
Tomoki Takada, Mizuki Arai, and Tomohiro Takagi
Department of Computer Science, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan
An increasingly large amount of information exists on the web, and finding the necessary information quickly is becoming increasingly difficult for users. Information retrieval systems such as Google and recommendation systems such as Amazon's address this problem. In this paper, we focus on information retrieval systems. These systems require index terms, and the choice of index terms affects retrieval precision. Index terms are generally decided in one of two ways. One is to analyze the text using natural language processing and select index terms based on various statistics. The other is to have a person choose document keywords as index terms. However, the latter method requires considerable time and effort and becomes more impractical as the amount of information grows. Therefore, we propose the Nikkei annotator system, which is based on a model of the human brain, learns patterns from past keyword annotations, and automatically outputs keywords that users prefer. The purposes of the proposed method are to automate manual keyword annotation and to achieve high-speed, high-accuracy keyword annotation. Experimental results showed that the proposed method is more accurate than TF-IDF and Naive Bayes in terms of P@5 and P@10 (precision over the top 5 and top 10 output keywords). These results also showed that the proposed method could annotate about 19 times faster than Naive Bayes.
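To make the evaluation measure concrete, the following is a minimal sketch of how a TF-IDF baseline can be scored with P@k against manually assigned keywords. It is not the authors' implementation. The function names, the smoothed IDF formula, the pre-tokenized toy corpus, and the gold keyword set are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_ranking(doc_tokens, corpus):
    """Rank the distinct terms of one document by TF-IDF, highest first.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)), is used so that
    terms unseen in the corpus do not cause a division by zero.
    """
    n_docs = len(corpus)
    # Document frequency: number of corpus documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    tf = Counter(doc_tokens)
    scores = {
        term: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[term]))
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


def precision_at_k(ranked_terms, gold_keywords, k):
    """Fraction of the top-k ranked terms that are in the gold keyword set."""
    hits = sum(1 for term in ranked_terms[:k] if term in gold_keywords)
    return hits / k


# Hypothetical toy data: three tokenized articles and the keywords a human
# annotator is assumed to have assigned to the first one.
corpus = [
    ["bank", "rate", "cut", "bank", "market"],
    ["market", "rally", "stocks"],
    ["election", "vote", "market"],
]
gold = {"bank", "rate"}

ranked = tfidf_ranking(corpus[0], corpus)
print(ranked)
print(precision_at_k(ranked, gold, 2))
```

In this toy case "market" occurs in every document and therefore ranks last, while "bank" ranks first because it is frequent in the article and rare elsewhere. P@5 and P@10 are obtained the same way with k = 5 and k = 10 on longer ranked lists.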