Document Analysis System Based on Awareness Learning
Jie Ji*, Rung-Ching Chen**, and Qiangfu Zhao*
*System Intelligence Laboratory, The University of Aizu, Tsuruga, Ikki-machi, Aizu-Wakamatsu, Fukushima 965-8580, Japan
**College of Informatics, Chaoyang University of Technology, 168 Jifeng E. Rd., Wufeng District, Taichung City, Taiwan, R.O.C.
The rapid growth of the Internet has encouraged users to handle and process documents online rather than as paper hard copies. Dealing with large amounts of information efficiently requires classifying data into meaningful categories. Many machine-learning-based algorithms have been proposed for document classification, yielding a variety of applications such as spam filters, patent analyzers, and hot-topic retrieval systems. Different applications have different goals and therefore require different teacher signals, even for the same dataset, and providing suitable teacher signals for each application is not an easy task. In this study, we describe a human-behavior-inspired awareness system for analyzing documents. This system starts learning with few or even no teacher signals and comes to understand user intent through interaction with the user. We describe the structure of the proposed system and the basic steps required for analyzing documents.