Paper:

# Landmark FN-DBSCAN: An Efficient Density-Based Clustering Algorithm with Fuzzy Neighborhood

## Hao Liu, Satoshi Oyama, Masahito Kurihara,

and Haruhiko Sato

Division of Synergetic Information Science in Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan

Clustering is an important tool for data analysis and many clustering techniques have been proposed over the past years. Among them are density-based clustering methods, which have several benefits such as the number of clusters is not required before carrying out clustering; the detected clusters can be represented in an arbitrary shape and outliers can be detected and removed. Recently, the density-based algorithms were extended with the fuzzy set theory, which has made these algorithm more robust. However, the density-based clustering algorithms usually require a time complexity of *O*(*n*^{2}) where *n* is the number of data in the data set, implying that they are not suitable to work with large scale data sets. In this paper, a novel clustering algorithm called landmark fuzzy neighborhood DBSCAN (landmark FN-DBSCAN) is proposed. The concept, landmark, is used to represent a subset of the input data set which makes the algorithm efficient on large scale data sets. We give a theoretical analysis on time complexity and space complexity, which shows both of them are linear to the size of the data set. The experiments show that the landmark FN-DBSCAN is much faster than FN-DBSCAN and provides a very good quality of clustering.

*J. Adv. Comput. Intell. Intell. Inform.*, Vol.17, No.1, pp. 60-73, 2013.

- [1] R. Xu and D. C.Wunsch II, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., Vol.16, No.3, pp. 645-678, 2005.
- [2] J. Moody and C. Darken, “Fast learning in networks of locallytuned processing units,” Neural Computation, Vol.1, No.2, pp. 281-294, 1989.
- [3] M.-C. Chiang, C.-W. Tsai, and C.-S. Yang, “A time-efficient pattern reduction algorithm for k-means clustering,” Information Sciences, Vol.181, No.4, pp. 716-731, 2011.
- [4] R. T. Ng and J. Han, “Clarans: A method for clustering objects for spatial data mining,” IEEE Trans. Knowl. Data Eng., Vol.14, No.5, pp. 1003-1016, 2002.
- [5] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol.1, pp. 281-297, 1967.
- [6] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern Recognition Letters, Vol.31, No.8, pp. 651-666, 2010.
- [7] H. Zhexue, “Extensions to the k-means algorithm for clustering large data sets with categorical values,” Data Mining and Knowledge Discovery, Vol.304, No.3, pp. 283-304, 1999.
- [8] C. H. Q. Ding and X. He, “K-nearest-neighbor consistency in data clustering: incorporating local information into global optimization,” in Proc. ACM Symp. on Applied Computing, pp. 584-589, 2004.
- [9] C.-R. Lin and M.-S. Chen, “Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging,” IEEE Trans. Knowl. Data Eng., Vol.17, No.2, pp. 145-159, 2005.
- [10] F. Murtagh and P. Contreras, “Algorithms for hierarchical clustering: an overview,” Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery, Vol.2, No.1, pp. 86-97, 2012.
- [11] K.-L. Du, “Clustering: A neural network approach,” Neural Networks, Vol.23, No.1, pp. 89-107, Jan. 2010.
- [12] S. Theodoridis and K. Koutroumbas, “Pattern Recognition,” Fourth Edition, Academic Press, 2009.
- [13] Y. Song, S. Jin, and J. Shen, “A unique property of single-link distance and its application in data clustering,” Data & Knowledge Engineering, Vol.70, No.11, pp. 984-1003, 2011.
- [14] S. Guha, R. Rastogi, and K. Shim, “CURE: an efficient clustering algorithm for large databases,” in Proc. the 1998 ACM SIGMOD Int. Conf. on Management of Data, pp. 73-84, 1998.
- [15] J. Sander, “Density-based clustering,” in Encyclopedia of Machine Learning, pp. 270-273, 2010.
- [16] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek, “Density-based clustering,” Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery, Vol.1, No.3, pp. 231-240, 2011.
- [17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, 1996.
- [18] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: Ordering points to identify the clustering structure,” in Proc. Int. Conf. on Management of Data and Symposium on Principles of Database Systems Philadelphia (SIGMOD/PODS 1999), New York, NY, pp. 49-60, 1999.
- [19] P. Viswanath and V. Suresh Babu, “Rough-dbscan: A fast hybrid density based clustering method for large data sets,” Pattern Recogn. Lett., Vol.30, No.16, pp. 1477-1488, Dec. 2009.
- [20] A. Hinneburg and H.-H. Gabriel, “Denclue 2.0: fast clustering based on kernel density estimation,” in Proc. the 7th Int. Conf. on Intelligent data analysis, ser. IDA’07, Berlin, Heidelberg: Springer-Verlag, pp. 70-80, 2007.
- [21] B. Borah and D. Bhattacharyya, “A clustering technique using density difference,” in Proc. Int. Conf. on Signal Processing, Communications and Networking 2007, pp. 585-588, Feb. 2007.
- [22] C. Fraley and A. E. Raftery, “Enhanced model-based clustering, density estimation, and discriminant analysis software: Mclust,” J. Classification, No.2, pp. 263-286, 2003.
- [23] G. Karypis, E.-H. S. Han, and V. Kumar, “Chameleon: A hierarchical clustering algorithm using dynamic modeling,” IEEE Trans. Comput., 1999.
- [24] A. Ng, M. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” Advances in Neural Information Processing Systems, Vol.14, No.2, pp. 849-856, 2001.
- [25] I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel k-means: spectral clustering and normalized cuts,” in Proc. of the tenth ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, pp. 551-556, 2004.
- [26] E. N. Nasibov, “A robust algorithm for solution of the fuzzy clustering problem on the basis of the fuzzy joint points method,” Cybernetics and Sys. Anal., Vol.44, No.1, pp. 7-17, Jan. 2008.
- [27] E. N. Nasibov and G. Ulutagay, “A new unsupervised approach for fuzzy clustering,” Fuzzy Sets and Systems, Vol.158, No.19, pp. 2118-2133, 2007.
- [28] E. N. Nasibov and G. Ulutagay, “Robustness of density-based clustering methods with various neighborhood relations,” Fuzzy Sets and Systems, Vol.160, No.24, pp. 3601-3615, 2009.
- [29] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods: Part I,” SIGMOD Record, Vol.31, No.2, pp. 40-45, 2002.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 Internationa License.