A Similarity Rough Set Model for Document Representation and Document Clustering

Nguyen Chi Thanh; Koichi Yamada; Muneyuki Unehara

doi:10.20965/jaciii.2011.p0125

JACIII Vol.15 No.2 pp. 125-133

(2011)

Paper:

Department of Management and Information System Science, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Niigata 940-2188, Japan

Received:

September 13, 2010

Accepted:

November 25, 2010

Published:

March 20, 2011

Keywords:

document clustering, document representation, rough sets, text mining

Abstract

Document clustering is a text mining technique for unsupervised document organization. It helps users browse and navigate large sets of documents. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] to improve the vector space model, which represents documents as vectors of terms, and applied it to document clustering. In this paper we analyze their model and propose a new model for efficient clustering of documents. We introduce the Similarity Rough Set Model (SRSM) as another model for representing documents in document clustering. The model is evaluated by experiments on test collections. The experimental results show that the SRSM document clustering method outperforms the one with TRSM, and that the results of SRSM are less affected by the value of the parameter than those of TRSM.

Cite this article as:

N. Thanh, K. Yamada, and M. Unehara, “A Similarity Rough Set Model for Document Representation and Document Clustering,” J. Adv. Comput. Intell. Intell. Inform., Vol.15 No.2, pp. 125-133, 2011.

References

[1] T. B. Ho and K. Funakoshi, “Information retrieval using rough sets,” J. of Japanese Society for Artificial Intelligence, Vol.13, No.3, pp. 424-433, 1997.
[2] Y. Zhao and G. Karypis, “Hierarchical clustering algorithms for document datasets,” Data Mining and Knowledge Discovery, Vol.10, No.2, pp. 141-168, 2005.
[3] I. S. Dhillon and D. S. Modha, “Concept decompositions for large sparse text data using clustering,” Machine Learning, Vol.42, No.1-2, pp. 143-175, 2001.
[4] M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” Proc. of the KDD Workshop on Text Mining, 2000.
[5] Y. Li, S. M. Chung, and J. D. Holt, “Text document clustering based on frequent word meaning sequences,” Data and Knowledge Engineering, Vol.64, No.1, pp. 381-404, 2008.
[6] M. Mahdavi and H. Abolhassani, “Harmony K-means algorithm for document clustering,” Data Mining and Knowledge Discovery, pp. 1-22, 2008.
[7] G. Karypis, “CLUTO – A Clustering Toolkit,” 2003.
http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
[8] T. B. Ho and N. B. Nguyen, “Nonhierarchical document clustering based on a tolerance rough set model,” Int. J. of Intelligent Systems, Vol.17, No.2, pp. 199-212, 2002.
[9] X.-J. Meng, Q.-C. Chen, and X.-L. Wang, “A tolerance rough set based semantic clustering method for web search results,” Information Technology J., Vol.8, No.4, pp. 453-464, 2009.
[10] Z. Pawlak, “Rough sets,” Int. J. of Information and Computer Sciences, Vol.11, No.5, pp. 341-356, 1982.
[11] Y. Y. Yao, S. K. M. Wong, and T. Y. Lin, “A review of rough set models,” Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 47-73, 1997.
[12] R. Slowinski and D. Vanderpooten, “A generalized definition of rough approximations based on similarity,” IEEE Trans. on Knowledge and Data Engineering, Vol.12, No.2, pp. 331-336, 2000.
[13] R. Slowinski and D. Vanderpooten, “Similarity relation as a basis for rough approximations,” Advances in Machine Intelligence and Soft Computing, Vol.4, pp. 17-33, 1997.
[14] J. Stefanowski and A. Tsoukias, “Incomplete Information Tables and Rough Classification,” Computational Intelligence, Vol.17, No.3, pp. 545-566, 2001.
[15] R. D. Luce, “Semiorders and a Theory of Utility Discrimination,” Econometrica, Vol.24, No.2, pp. 178-191, 1956.
[16] A. Strehl, J. Ghosh, and R. Mooney, “Impact of similarity measures on web-page clustering,” Proc. of the 17th National Conf. on Artificial Intelligence: Workshop of Artificial Intelligence for Web search (AAAI 2000), Austin, TX, pp. 58-64, July 2000.
[17] G. Salton and M. J. McGill, “Introduction to modern information retrieval,” McGraw-Hill Book Company, 1983.
[18] ftp://ftp.cs.cornell.edu/pub/smart

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
