Data Cleaning for Classification Using Misclassification Analysis
Piyasak Jeatrakul, Kok Wai Wong, and Chun Che Fung
School of Information Technology, Murdoch University, South Street, Murdoch, Western Australia 6150, Australia
Data cleaning is often used as a preprocessing step in classification problems to achieve better results. Its purpose is to remove noise, inconsistent data and errors from the training data, so that a cleaner and more representative data set can be used to develop a reliable classification model. Unclean data can degrade the classification accuracy of a model. In this paper, we investigate the use of misclassification analysis for data cleaning. To demonstrate the concept, we use an Artificial Neural Network (ANN) as the core computational intelligence technique. We evaluate the proposed data cleaning technique on four benchmark data sets obtained from the University of California Irvine (UCI) machine learning repository. All four are binary classification problems: German credit data, BUPA liver disorders, Johns Hopkins Ionosphere and Pima Indians Diabetes. The results show that the proposed cleaning technique could be a good alternative that provides some confidence when constructing a classification model.
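The general idea of cleaning by misclassification analysis can be illustrated with a minimal sketch. The abstract does not spell out the procedure, so the code below assumes one common reading: train a classifier on the training data, flag the training instances it misclassifies, and drop them before building the final model. A single sigmoid neuron stands in for the ANN to keep the example dependency-free. The function names (predict_proba, train_neuron, flag_misclassified) and the toy data are hypothetical and are not taken from the paper or from the UCI data sets.

```python
import math

def predict_proba(w, b, x):
    """Sigmoid output of a single neuron for one feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(X, y, epochs=500, lr=0.5):
    """Fit weights by full-batch gradient descent on the log loss.

    Assumption: a single neuron replaces the paper's ANN for brevity.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errs, X)) / n for j in range(d)]
        grad_b = sum(errs) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def flag_misclassified(X, y):
    """Return indices of training instances the fitted model gets wrong."""
    w, b = train_neuron(X, y)
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if int(predict_proba(w, b, xi) >= 0.5) != yi]

# Toy one-feature data (invented for illustration): class 0 near -1,
# class 1 near +1, plus two deliberately mislabeled instances at the end.
X = [[-1.0], [-0.8], [-1.2], [-0.9], [-1.1],
     [1.0], [0.8], [1.2], [0.9], [1.1],
     [-1.0], [1.0]]
y = [0, 0, 0, 0, 0,
     1, 1, 1, 1, 1,
     1, 0]

flagged = flag_misclassified(X, y)
X_clean = [xi for i, xi in enumerate(X) if i not in flagged]
y_clean = [yi for i, yi in enumerate(y) if i not in flagged]
print("flagged indices:", flagged)
print("instances kept:", len(X_clean))
```

In this toy setting the flagged instances are the two mislabeled ones, and the cleaned set would then be used to train the final classifier. On real data, misclassified instances may also be legitimate hard cases, so removing them is a trade-off rather than a guaranteed improvement.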