Paper:
Applying Naive Bayes Classifier to Document Clustering
Jie Ji and Qiangfu Zhao
System Intelligence Lab., The University of Aizu, Tsuruga, Ikki-machi, Aizu-wakamatsu, Fukushima 965-8580, Japan
Document clustering partitions a set of unlabeled documents so that the documents in each cluster share common concepts. A Naive Bayes Classifier (BC) is a simple probabilistic classifier that applies Bayes’ theorem under strong (naive) independence assumptions. BC needs only a small amount of training data to estimate the parameters required for classification. Because that training data must be labeled, whereas the documents to be clustered are unlabeled, we propose an Iterative Bayes Clustering (IBC) algorithm. To improve IBC performance, we propose combining IBC with a Comparative Advantage-based (CA) initialization method. Experimental results show that the proposed method significantly outperforms classical clustering methods.
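The abstract does not spell out the IBC procedure, so the following is only a minimal sketch of one plausible reading: train a naive Bayes model on the current cluster labels, reassign every document to its most probable cluster, and repeat until the labels stop changing. The multinomial word-count model, the Laplace smoothing constant `alpha`, the hard (argmax) reassignment, and the function names are all assumptions made for illustration, and the hand-supplied `init_labels` merely stands in for the CA initialization, which is not described here.

```python
import numpy as np


def fit_multinomial_nb(X, labels, k, alpha=1.0):
    """Estimate log priors and log term likelihoods from hard cluster labels.

    Assumption: a multinomial naive Bayes model with Laplace smoothing.
    """
    n_docs, n_terms = X.shape
    log_prior = np.empty(k)
    log_lik = np.empty((k, n_terms))
    for c in range(k):
        members = X[labels == c]
        # Smoothing keeps empty clusters and unseen terms at finite log values.
        log_prior[c] = np.log((members.shape[0] + alpha) / (n_docs + k * alpha))
        counts = members.sum(axis=0) + alpha
        log_lik[c] = np.log(counts / counts.sum())
    return log_prior, log_lik


def iterative_bayes_clustering(X, init_labels, k, max_iter=50):
    """Alternate between fitting naive Bayes and relabeling documents."""
    labels = np.asarray(init_labels).copy()
    for _ in range(max_iter):
        log_prior, log_lik = fit_multinomial_nb(X, labels, k)
        # Naive independence: a document's log posterior (up to a constant)
        # is the log prior plus the count-weighted sum of term log likelihoods.
        scores = X @ log_lik.T + log_prior
        new_labels = scores.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels


# Toy term-count matrix (rows: documents, columns: terms), made up for the demo.
X_demo = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 0, 1, 4],
])
# Hypothetical starting labels with the third document deliberately misplaced.
labels_demo = iterative_bayes_clustering(X_demo, [0, 0, 1, 1, 1, 1], k=2)
print(labels_demo.tolist())  # prints [0, 0, 0, 1, 1, 1]
```

On this toy input the misplaced document moves to the cluster that shares its vocabulary after one pass. Like k-means, a loop of this kind can settle in a poor local optimum when the starting labels are bad, which is consistent with the abstract's emphasis on the initialization method.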
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.