Active Sampling for Constrained Clustering
Masayuki Okabe* and Seiji Yamada**
*Information and Media Center, Toyohashi University of Technology, 1-1 Tempaku, Toyohashi, Aichi 441-8580, Japan
**National Institute of Informatics, SOKENDAI, 2-1-2 Chiyoda, Tokyko 101-8430, Japan
Constrained clustering is a framework for improving clustering performance by using constraints about data pairs. Since performance of constrained clustering depends on the set of constraints used, a method is needed to select good constraints that promote clustering performance. In this paper, we propose an active sampling method working with a constrained cluster ensemble algorithm that aggregates clustering results that a modified COP-Kmeans iteratively produces by changing the priorities of constraints. Our method follows the approach of uncertainty sampling and measures uncertainty using variations of clustering results where data pairs are clustered together in some results but not in others. It selects the data pair to be labeled that has the most variable result during cluster ensemble process. Experimental results show that our method outperforms random sampling. We further investigate the effect of important parameters.
-  Z. Wu and R. Leahy, “An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.15, pp. 1101-1113, 1993.
-  L. Dey, A. Mahajan, and S. M. Haque, “Document Clustering for Event Identification and Trend Analysis in Market News,” In Proc. of the 2009 Seventh Int. Conf. on Advances in Pattern Recognition, ICAPR’09, pp. 103-106, IEEE Computer Society, 2009.
-  S. Basu, I. Davidson, and K. L.Wagstaff (Eds.), “Constrained Clustering,” CRC Press, 2008.
-  O. Chapelle, B. Scholkopf, and A. Zien, “Semi-Supervised Learning,” The MIT Press, 2006.
-  K. Wagstaff and S. Roger, “Constrained K-means Clustering with Background Knowledge,” In Proc. of the 18th Int. Conf. onMachine Learning, pp. 577-584, 2001.
-  S. C. H. Hoi, R. Jin, and M. R. Lyu, “Learning nonparametric kernel matrices from pairwise constraints,” In Proc. of the 24th Int. Conf. on Machine learning, pp. 361-368, 2007.
-  P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman, “Online Metric Learning and Fast Similarity Search,” In Proc. of Neural Information Processing Systems, pp. 761-768, 2008.
-  D. Klein, S. D. Kamvar, and C. D. Manning, “From Instance-level Constraints to Space-level Constraints: Making the Most of Prior Knowledge in Data Clustering,” In Proc. of the 19th Int. Conf. on Machine Learnning, pp. 307-314, 2002.
-  Z. Li, J. Liu, and X. Tang, “Pairwise constraint propagation by semidefinite programming for semi-supervised classification,” In Proc. of the 25th Int. Conf. onMachine learning, pp. 576-583, 2008.
-  W. Tang, H. Xiong, S. Zhong, and J. Wu, “Enhancing Semi-Supervised Clustering: A Feature Projection Perspective,” In Proc. of the 13th Int. Conf. on Knowledge Discovery and Data Mining, pp. 707-716, 2007.
-  Y. Liu, R. Jin, and A. K. Jain, “BoostCluster: boosting clustering by pairwise constraints,” In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge discovery and data mining, KDD’07, pp. 450-459, New York, NY, USA, 2007.
-  S. Tong and D. Koller, “Support Vector Machine Active Learning with Applications to Text Classification,” J. of Machine Learning Research, Vol.2, pp. 45-66, March 2002.
-  L. Breiman and L. Breiman, “Bagging Predictors,” In Machine Learning, pp. 123-140, 1996.
-  D. Lewis and W. Gale, “A sequential algorithm for training text classifiers,” In Proc. of the 17th Annual ACM SIGIR Conf., pp. 3-12, 1994.
-  S. Basu, A. Banjeree, E. Mooney, A. Banerjee, and R. J. Mooney, “Active Semi-Supervision for Pairwise Constrained Clustering,” In In Proc. of the 2004 SIAM Int. Conf. on Data Mining (SDM-04), pp. 333-344, 2004.