# A Probabilistic WKL Rule for Incremental Feature Learning and Pattern Recognition

## Jasmin Léveillé^{*}, Isao Hayashi^{**}, and Kunihiko Fukushima^{**,***}

^{*}Center of Excellence for Learning in Education, Science and Technology, Boston University, 677 Beacon Street, Boston, Massachusetts 02215, USA

^{**}Faculty of Informatics, Kansai University, 2-1-1 Ryozenji-cho, Takatsuki, Osaka 569-1095, Japan

^{***}Fuzzy Logic Systems Institute, 680-41 Kawazu, Iizuka, Fukuoka 820-0067, Japan

Recent advances in machine learning and computer vision have produced several sophisticated learning schemes for object recognition with convolutional networks. One relatively simple learning rule, Winner-Kill-Loser (WKL), was shown to learn higher-order features efficiently in the neocognitron model when applied to a handwritten digit classification task. The WKL rule is a variant of incremental clustering procedures that adapt the number of cluster components to the input data, and it seeks a complete yet minimally redundant covering of the input distribution. This approach is difficult to apply directly to high-dimensional spaces, because it leads to an explosion in the number of clustering components. In this work, a small generalization of the WKL rule is proposed for learning from high-dimensional data. We first show that the learning rule yields mostly V1-like oriented cells when applied to natural images, suggesting that it captures second-order image statistics, much as variants of Hebbian learning do. We then embed the proposed learning rule in a convolutional network, specifically the neocognitron, and show its usefulness on a standard handwritten digit recognition benchmark. Although the new learning rule slightly reduces overall accuracy, it also greatly reduces the number of coding nodes in the network. This in turn supports the view that, by learning statistical regularities rather than covering the entire input space, it may be possible to incrementally learn and retain most of the useful structure in the input distribution.
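The abstract does not give the update equations, so the following is only a minimal sketch of a winner-kill-loser style incremental clustering step, not the authors' formulation. It assumes cosine similarity as the response measure, a fixed threshold `theta`, and a simple move-toward-input update for the winner. The parameter `p_kill` is a hypothetical knob showing one way a probabilistic relaxation might look: with `p_kill=1.0` every responding loser is removed, and lower values leave some redundant cells in place.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def wkl_step(cells, x, theta=0.9, lr=0.1, p_kill=1.0, rng=random):
    """One winner-kill-loser style update (illustrative assumptions only).

    cells  : list of prototype vectors, modified in place
    theta  : similarity at or above which a cell counts as responding
    lr     : step size for moving the winner toward the input
    p_kill : hypothetical probability of removing each responding loser
    """
    sims = [cosine(c, x) for c in cells]
    active = [i for i, s in enumerate(sims) if s >= theta]
    if not active:
        # No existing cell covers this input: recruit a new cell at x.
        cells.append(list(x))
        return
    winner = max(active, key=lambda i: sims[i])
    # The winner moves a small step toward the input.
    cells[winner] = [c + lr * (a - c) for c, a in zip(cells[winner], x)]
    # Responding losers are redundant with the winner and may be removed.
    doomed = {i for i in active if i != winner and rng.random() < p_kill}
    cells[:] = [c for i, c in enumerate(cells) if i not in doomed]
```

Calling `wkl_step` repeatedly on a stream of inputs grows the list where no cell responds and prunes it where several cells respond to the same input, which is the "complete yet minimally redundant covering" behavior described above. The number of cells is therefore set by the data and the threshold rather than fixed in advance.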
