Applying Naive Bayes Classifier to Document Clustering

Jie Ji; Qiangfu Zhao

doi:10.20965/jaciii.2010.p0624

JACIII Vol.14 No.6 pp. 624-630

(2010)

Paper:

System Intelligence Lab., The University of Aizu, Tsuruga, Ikki-machi, Aizu-wakamatsu, Fukushima 965-8580, Japan

Received:

January 29, 2010

Accepted:

July 15, 2010

Published:

September 20, 2010

Keywords:

document clustering, Naive Bayes Classifier, Iterative Bayes Clustering, k-means, comparative advantage

Abstract

Document clustering partitions a set of unlabeled documents so that the documents in each cluster share common concepts. A Naive Bayes Classifier (BC) is a simple probabilistic classifier that applies Bayes’ theorem under strong (naive) independence assumptions. BC needs only a small amount of training data to estimate the parameters required for classification. That training data must be labeled, however, so BC cannot be applied directly to unlabeled documents; we therefore propose an Iterative Bayes Clustering (IBC) algorithm. To improve IBC performance, we propose combining IBC with a Comparative Advantage-based (CA) initialization method. Experimental results show that our proposal improves performance significantly over classical clustering methods.
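
The abstract does not spell out the IBC procedure, so the following is only a minimal sketch of one plausible reading: train a Naive Bayes model on the current cluster labels, reclassify every document, and repeat until the labels stop changing. It assumes a multinomial word model with add-one smoothing and hard reassignment, none of which is confirmed by the source. The function names (train_nb, classify, iterative_bayes_clustering) are hypothetical, and the CA initialization is not reproduced; the caller supplies the initial labels instead.

```python
import math
from collections import Counter


def train_nb(docs, labels, k, vocab):
    """Estimate log priors and add-one-smoothed log word likelihoods per cluster."""
    n = len(docs)
    log_prior, log_like = [], []
    for c in range(k):
        members = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in members for w in d)
        total = sum(counts.values())
        # Assumption: add-one smoothing, so empty clusters and unseen words stay finite.
        log_prior.append(math.log((len(members) + 1) / (n + k)))
        log_like.append(
            {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        )
    return log_prior, log_like


def classify(doc, log_prior, log_like):
    """Return the cluster index with the highest posterior log score."""
    scores = [lp + sum(ll[w] for w in doc) for lp, ll in zip(log_prior, log_like)]
    return max(range(len(scores)), key=scores.__getitem__)


def iterative_bayes_clustering(docs, init_labels, k, max_iter=20):
    """Alternate between training on current labels and relabeling all documents."""
    vocab = sorted({w for d in docs for w in d})
    labels = list(init_labels)
    for _ in range(max_iter):
        model = train_nb(docs, labels, k, vocab)
        new_labels = [classify(d, *model) for d in docs]
        if new_labels == labels:  # converged: no document changed cluster
            break
        labels = new_labels
    return labels


# Toy corpus: documents are lists of tokens (hypothetical data).
docs = [
    ["ball", "goal", "team"],
    ["team", "goal", "match"],
    ["stock", "market", "price"],
    ["price", "market", "trade"],
    ["goal", "match", "ball"],
    ["trade", "stock", "price"],
]
init = [0, 0, 1, 1, 1, 0]  # deliberately noisy starting labels
labels = iterative_bayes_clustering(docs, init, k=2)
print(labels)  # [0, 0, 1, 1, 0, 1]
```

In this toy run the two mislabeled documents move to the cluster whose vocabulary they share after one pass, and the second pass confirms that the labels are stable. The result depends on the starting labels, which is presumably why the abstract pairs IBC with a dedicated initialization method.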

Cite this article as:

J. Ji and Q. Zhao, “Applying Naive Bayes Classifier to Document Clustering,” J. Adv. Comput. Intell. Intell. Inform., Vol.14 No.6, pp. 624-630, 2010.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
