Dependence-Maximization Clustering with Least-Squares Mutual Information
Manabu Kimura and Masashi Sugiyama
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, Japan
Recently, statistical dependence measures such as mutual information and kernelized covariance have been successfully applied to clustering. In this paper, we follow this line of research and propose a novel dependence-maximization clustering method based on least-squares mutual information, which is an estimator of a squared-loss variant of mutual information. A notable advantage of the proposed method over existing approaches is that hyperparameters such as kernel parameters and regularization parameters can be objectively optimized based on cross-validation. Thus, subjective manual-tuning of hyperparameters is not necessary in the proposed method, which is a highly useful property in unsupervised clustering scenarios. Through experiments, we illustrate the usefulness of the proposed approach.
-  J. B. MacQueen, “Some Methods for Classification and Analysis of Multivariate Observations,” In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol.1, pp. 281-297, 1967.
-  A. Y. Ng, M. I. Jordan, and Y. Weiss, “On Spectral Clustering: Analysis and An Algorithm,” In Advances in Neural Information Processing Systems, Vol.14 (NIPS 2001), pp. 849-856, 2002.
-  J. Shi and J. Malik, “Normalized Cuts and Image Segmentation,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.22, No.8, pp. 888-905, 2000.
-  M. Girolami, “Mercer Kernel-Based Clustering in Feature Space,” IEEE Trans. on Neural Networks, Vol.13, No.3, pp. 780-784, 2002.
-  I. S. Dhillon, Y. Guan, and B. Kulis, “Kernel K-means, Spectral Clustering and Normalized Cuts,” In Proc. of the 10th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2004), pp. 551-556, 2004.
-  F. Bach and Z. Harchaoui, “DIFFRAC: a discriminative and flexible framework for clustering,” In Advances in Neural Information Processing Systems, Vol.20 (NIPS 2007), pp. 49-56, 2008.
-  R. Gomes, A. Krause, and P. Perona, “Discriminative Clustering by Regularized Information Maximization,” In Advances in Neural Information Processing Systems, Vol.23 (NIPS 2010), pp. 766-774, 2010.
-  L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, “Maximum Margin Clustering,” In Advances in Neural Information Processing Systems, Vol.17 (NIPS 2004), pp. 1537-1544, 2005.
-  L. Faivishevsky and J. Goldberger, “A Nonparametric Information Theoretic Clustering Algorithm,” In Proc. of 27th Int. Conf. on Machine Learning (ICML 2010), pp. 351-358, 2010.
-  L. Song, A. Smola, A. Gretton, and K. Borgwardt, “A Dependence Maximization View of Clustering,” In Proc. of the 24th Annual Int. Conf. on Machine Learning (ICML 2007), pp. 815-822, 2007.
-  T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese, “Mutual Information Estimation Reveals Global Associations between Stimuli and Biological Processes,” BMC Bioinformatics, Vol.10, No.1, p. S52, 2009.
-  K. Pearson, “On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling,” Philosophical Magazine, Vol.50, pp. 157-175, 1900.
-  T. M. Cover and J. A. Thomas, “Elements of Information Theory,” John Wiley & Sons, Inc., Hoboken, NJ, USA, 2nd Edition, 2006.
-  S. Kullback and R. A. Leibler, “On Information and Sufficiency,” Annals of Mathematical Statistics, Vol.22, pp. 79-86, 1951.
-  S. M. Ali and S. D. Silvey, “A General Class of Coefficients of Divergence of One Distribution from Another,” J. of the Royal Statistical Society, Series B, Vol.28, No.1, pp. 131-142, 1966.
-  I. Csiszár, “Information-type Measures of Difference of Probability Distributions and Indirect Observation,” Studia Scientiarum Mathematicarum Hungarica, Vol.2, pp. 229-318, 1967.
-  A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring Statistical Dependence with Hilbert-Schmidt Norms,” In Proc. of the 16th Int. Conf. on Algorithmic Learning Theory (ALT 2005), Lecture Notes in Artificial Intelligence, pp. 63-77, 2005.
-  M. Collins and N. Duffy, “Convolution Kernels for Natural Language,” In Advances in Neural Information Processing Systems, Vol.14 (NIPS 2001), pp. 625-632, 2002.
-  T. Gärtner, “A Survey of Kernels for Structured Data,” SIGKDD Explorations, Vol.5, No.1, pp. S268-S275, 2003.
-  T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” In Proc. of the 16th Annual Conf. on Computational Learning Theory (COLT 2003), pp. 129-143, 2003.
-  H. Kashima and T. Koyanagi, “Kernels for Semi-Structured Data,” In Proc. of the 19th Int. Conf. on Machine Learning (ICML 2002), pp. 291-298, 2002.
-  H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels between Labeled Graphs,” In Proc. of the 20th Int. Conf. on Machine Learning (ICML 2003), pp. 321-328, 2003.
-  R. I. Kondor and J. Lafferty, “Diffusion Kernels on Graphs and Other Discrete Input Spaces,” In Proc. of the 19th Int. Conf. on Machine Learning (ICML 2002), pp. 315-322, 2002.
-  H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, “Text Classification Using String Kernels,” J. of Machine Learning Research, Vol.2, pp. 419-444, 2002.
-  L. F. Kozachenko and N. N. Leonenko, “Sample Estimate of Entropy of a Random Vector,” Problems of Information Transmission, Vol.23, No.9, pp. 95-101, 1987.
-  T. Hofmann, “Probabilistic Latent Semantic Indexing,” In Proc. of the 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR 1999), pp. 50-57, 1999.
-  J. Kazama and K. Torisawa, “Speeding up Training with Tree Kernels for Node Relation Labeling,” In Proc. of Human Language Technology Conf. and Conf. on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), pp. 137-144, 2005.