
# A Similarity Rough Set Model for Document Representation and Document Clustering

## Nguyen Chi Thanh, Koichi Yamada, and Muneyuki Unehara

Department of Management and Information System Science, Nagaoka University of Technology, 1603-1 Kamitomioka, Nagaoka, Niigata 940-2188, Japan

Document clustering is a text mining technique for unsupervised document organization that helps users browse and navigate large document collections. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] to improve the vector space model, which represents documents as vectors of terms, and applied it to document clustering. In this paper we analyze their model and propose a new model for efficient document clustering. We introduce the Similarity Rough Set Model (SRSM) as an alternative model for representing documents in document clustering. The model is evaluated experimentally on test collections. The experimental results show that the SRSM document clustering method outperforms the TRSM-based method, and that its results are less affected by the parameter value than those of TRSM.
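As a rough illustration of the kind of term-vector enrichment referred to above, the sketch below follows the commonly cited TRSM formulation from [8], which the abstract itself does not spell out: two terms are treated as tolerant when they co-occur in at least `theta` documents, and a document is extended with every term whose tolerance class overlaps it (its upper approximation). The function names, the toy corpus, and the threshold value are illustrative assumptions. SRSM is not sketched, because the abstract does not define its similarity relation.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class: the term itself plus every
    term that co-occurs with it in at least `theta` documents."""
    cooc = defaultdict(int)
    for doc in docs:
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(set(doc)), 2):
            cooc[(a, b)] += 1
    # Reflexivity: every term belongs to its own class.
    classes = {t: {t} for doc in docs for t in doc}
    # Symmetry: a qualifying pair is added in both directions.
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose tolerance class shares a term with `doc`."""
    doc_terms = set(doc)
    return {t for t, cls in classes.items() if cls & doc_terms}


# Toy corpus (hypothetical): each document is a list of terms.
docs = [
    ["rough", "set", "cluster"],
    ["rough", "set", "document"],
    ["rough", "cluster"],
]
classes = tolerance_classes(docs, theta=2)
enriched = upper_approximation(docs[2], classes)
```

In this toy corpus the third document gains the term "set", because "set" co-occurs with "rough" in two documents. The result depends directly on `theta`, which is the kind of parameter sensitivity the abstract reports for TRSM.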

*J. Adv. Comput. Intell. Intell. Inform.*, Vol.15, No.2, pp. 125-133, 2011.

- [1] T. B. Ho and K. Funakoshi, “Information retrieval using rough sets,” J. of Japanese Society for Artificial Intelligence, Vol.13, No.3, pp. 424-433, 1997.
- [2] Y. Zhao and G. Karypis, “Hierarchical clustering algorithms for document datasets,” Data Mining and Knowledge Discovery, Vol.10, No.2, pp. 141-168, 2005.
- [3] I. S. Dhillon and D. S. Modha, “Concept decompositions for large sparse text data using clustering,” Machine Learning, Vol.42, No.1-2, pp. 143-175, 2001.
- [4] M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” Proc. of the KDD Workshop on Text Mining, 2000.
- [5] Y. Li, S. M. Chung, and J. D. Holt, “Text document clustering based on frequent word meaning sequences,” Data and Knowledge Engineering, Vol.64, No.1, pp. 381-404, 2008.
- [6] M. Mahdavi and H. Abolhassani, “Harmony K-means algorithm for document clustering,” Data Mining and Knowledge Discovery, pp. 1-22, 2008.
- [7] G. Karypis, “CLUTO – A Clustering Toolkit,” 2003. http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download
- [8] T. B. Ho and N. B. Nguyen, “Nonhierarchical document clustering based on a tolerance rough set model,” Int. J. of Intelligent Systems, Vol.17, No.2, pp. 199-212, 2002.
- [9] X.-J. Meng, Q.-C. Chen, and X.-L. Wang, “A tolerance rough set based semantic clustering method for web search results,” Information Technology J., Vol.8, No.4, pp. 453-464, 2009.
- [10] Z. Pawlak, “Rough sets,” Int. J. of Information and Computer Sciences, Vol.11, No.5, pp. 341-356, 1982.
- [11] Y. Y. Yao, S. K. M. Wong, and T. Y. Lin, “A review of rough set models,” Rough Sets and Data Mining: Analysis for Imprecise Data, pp. 47-73, 1997.
- [12] R. Slowinski and D. Vanderpooten, “A generalized definition of rough approximations based on similarity,” IEEE Trans. on Knowledge and Data Engineering, Vol.12, No.2, pp. 331-336, 2000.
- [13] R. Slowinski and D. Vanderpooten, “Similarity relation as a basis for rough approximations,” Advances in Machine Intelligence and Soft Computing, Vol.4, pp. 17-33, 1997.
- [14] J. Stefanowski and A. Tsoukias, “Incomplete Information Tables and Rough Classification,” Computational Intelligence, Vol.17, No.3, pp. 545-566, 2001.
- [15] R. D. Luce, “Semiorders and a Theory of Utility Discrimination,” Econometrica, Vol.24, No.2, pp. 178-191, 1956.
- [16] A. Strehl, J. Ghosh, and R. Mooney, “Impact of similarity measures on web-page clustering,” Proc. of the 17th National Conf. on Artificial Intelligence: Workshop of Artificial Intelligence for Web Search (AAAI 2000), Austin, TX, pp. 58-64, July 2000.
- [17] G. Salton and M. J. McGill, “Introduction to modern information retrieval,” McGraw-Hill Book Company, 1983.
- [18] ftp://ftp.cs.cornell.edu/pub/smart

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.