JACIII Vol.14 No.6 pp. 624-630
doi: 10.20965/jaciii.2010.p0624


Applying Naive Bayes Classifier to Document Clustering

Jie Ji and Qiangfu Zhao

System Intelligence Lab., The University of Aizu, Tsuruga, Ikki-machi, Aizu-wakamatsu, Fukushima 965-8580, Japan

January 29, 2010
July 15, 2010
September 20, 2010
Keywords: document clustering, Naive Bayes Classifier, Iterative Bayes Clustering, k-means, comparative advantage

Document clustering partitions a set of unlabeled documents so that documents within the same cluster share common concepts. The Naive Bayes Classifier (BC) is a simple probabilistic classifier that applies Bayes’ theorem under strong (naive) independence assumptions, and it requires only a small amount of training data to estimate the parameters needed for classification. Since training data must be labeled while clustering deals with unlabeled documents, we propose an Iterative Bayes Clustering (IBC) algorithm. To further improve its performance, we propose combining IBC with a Comparative Advantage-based (CA) initialization method. Experimental results show that our proposal improves clustering performance significantly over classical clustering methods.
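The iterative scheme described above can be read as an EM-like loop: treat the current cluster assignment as pseudo-labels, fit a Naive Bayes model to them, reassign each document to its most probable cluster, and repeat. The sketch below illustrates this idea assuming a multinomial Naive Bayes over term-count vectors; the function name, the simple alternating initialization (the paper's CA-based initialization is not reproduced here), and the Laplace smoothing constants are this sketch's assumptions, not the paper's exact formulation.

```python
import numpy as np

def iterative_bayes_clustering(X, k, iters=20):
    """Illustrative sketch of Iterative Bayes Clustering (IBC).

    X: (n_docs, n_terms) matrix of term counts.
    Repeatedly: fit a multinomial Naive Bayes model to the current
    cluster assignment, then reassign every document to its most
    probable cluster, until the assignment stabilizes.
    """
    n, d = X.shape
    # Simple alternating initialization; the paper's Comparative
    # Advantage-based (CA) initialization is not reproduced here.
    labels = np.arange(n) % k
    for _ in range(iters):
        # "Training" step: estimate NB parameters from pseudo-labels,
        # with Laplace smoothing so empty clusters stay well-defined.
        counts = np.array([X[labels == c].sum(axis=0) for c in range(k)])
        sizes = np.array([(labels == c).sum() for c in range(k)])
        log_prior = np.log((sizes + 1) / (n + k))
        log_cond = np.log((counts + 1) / (counts.sum(axis=1, keepdims=True) + d))
        # "Classification" step: reassign every document at once.
        new_labels = np.argmax(X @ log_cond.T + log_prior, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy corpus: two documents dominated by terms 0-1, two dominated by
# terms 2-3; the loop separates them into two clusters.
X = np.array([[5, 4, 0, 0],
              [6, 3, 0, 1],
              [0, 1, 5, 6],
              [1, 0, 4, 5]])
print(iterative_bayes_clustering(X, k=2))
```

Because the reassignment step can only increase the likelihood of the fitted model, the loop typically converges in a few iterations, much like k-means with a probabilistic distance.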

Cite this article as:
Jie Ji and Qiangfu Zhao, “Applying Naive Bayes Classifier to Document Clustering,” J. Adv. Comput. Intell. Intell. Inform., Vol.14, No.6, pp. 624-630, 2010.
References:
  [1] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, Vol.29, Nos.2-3, pp. 103-137, 1997.
  [2] M. Mozina, J. Demsar, M. Kattan, and B. Zupan, “Nomograms for Visualization of Naive Bayesian Classifier,” In Proc. of PKDD-2004, pp. 337-348, 2004.
  [3] S. Kotsiantis and P. Pintelas, “Increasing the Classification Accuracy of Simple Bayesian Classifier,” Lecture Notes in Artificial Intelligence, AIMSA 2004, Springer-Verlag, Vol.3192, pp. 198-207, 2004.
  [4] L. Bauer, “Introducing Linguistic Morphology, 2nd Ed.,” Georgetown University Press, Washington, D.C., 2003.
  [5] W. B. Frakes and R. Baeza-Yates, “Information Retrieval: Data Structures and Algorithms,” Prentice Hall, Englewood Cliffs, New Jersey, 1992.
  [6] J. B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” Proc. of 5th Berkeley Symp. on Mathematical Statistics and Probability, University of California Press, Berkeley, pp. 281-297, 1967.
  [7] J. A. Hartigan, “Clustering Algorithms,” Wiley, 1975.
  [8] J. A. Hartigan and M. A. Wong, “A K-Means Clustering Algorithm,” Applied Statistics, Vol.28, No.1, pp. 100-108, 1979.
  [9] J. Ji and Q. Zhao, “Comparative Advantage Approach for Sparse Text Data Clustering,” Proc. of IEEE 9th Int. Conf. on Computer and Information Technology, Xiamen, China, pp. 3-8, 2009.
  [10] J. Ji, T. Y. T. Chan, and Q. Zhao, “Fast Document Clustering Based on Weighted Comparative Advantage,” Proc. of IEEE Int. Conf. on Systems, Man & Cybernetics, San Antonio, Texas, USA, pp. 541-546, 2009.
  [11] P. Hardwick, B. Khan, and J. Langmead, “An Introduction to Modern Economics, 5th Ed.,” Financial Times & Prentice Hall, 1999.
  [12] A. O’Sullivan and S. M. Sheffrin, “Economics: Principles & Tools, 3rd Ed.,” Prentice Hall, 2002.
  [13] S. P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. on Information Theory, Vol.28, No.2, pp. 129-137, 1982.
  [14] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing & Management, Vol.24, No.5, pp. 513-523, 1988.
  [15] I. S. Dhillon and D. S. Modha, “Concept Decompositions for Large Sparse Text Data Using Clustering,” Machine Learning, Vol.42, Nos.1-2, pp. 143-175, 2001.
  [16] G. Salton and M. J. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill Book Company, 1983.

