# When Partly Missing Data Matters in Software Effort Development Prediction

## Bhekisipho Twala

Department of Electrical and Electronic Engineering Science, University of Johannesburg

P.O. Box 524, Auckland Park, Johannesburg 2006, South Africa

The main objective of this paper is to investigate a new probabilistic supervised learning approach that incorporates “missingness” into the splitting criterion of a decision tree classifier at each attribute node, and to assess it in terms of software effort development predictive accuracy. The proposed approach is compared empirically with ten supervised learning methods (classifiers) that have mechanisms for dealing with missing values. Ten industrial datasets are used for this task. Overall, missing incorporated in attributes 3 is the top-performing strategy, followed by C4.5, missing incorporated in attributes, missing incorporated in attributes 2, missing incorporated in attributes, linear discriminant analysis, and so on. Classification and regression trees and C4.5 performed well on data with high correlations among attributes, while *k*-nearest neighbour and support vector machines performed well on data with higher complexity (a limited number of instances). The worst-performing method is repeated incremental pruning to produce error reduction.
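
The idea of folding missingness into the splitting criterion can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the three "missing incorporated in attributes" variants, so the sketch assumes one common reading, in which each candidate threshold is scored twice (missing values sent left, then right) and a separate "missing versus observed" split is also scored. The function names, the use of entropy-based information gain, and the encoding of a missing value as `None` are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(parent, left, right):
    """Reduction in entropy obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def best_split_with_missing(values, labels):
    """Score splits on one numeric attribute, treating None as 'missing'.

    Returns (gain, threshold, rule), where rule is one of
    'missing_left', 'missing_right', or 'missing_vs_observed'.
    This is a hypothetical helper, not an API from any library.
    """
    pairs = list(zip(values, labels))
    best = (0.0, None, None)

    # Candidate thresholds: every distinct observed value except the largest,
    # so that the "x > t" side always contains at least one observed value.
    thresholds = sorted({v for v in values if v is not None})[:-1]
    for t in thresholds:
        for rule in ("missing_left", "missing_right"):
            left, right = [], []
            for v, y in pairs:
                if v is None:
                    goes_left = rule == "missing_left"
                else:
                    goes_left = v <= t
                (left if goes_left else right).append(y)
            gain = info_gain(labels, left, right)
            if gain > best[0]:
                best = (gain, t, rule)

    # Third option: split purely on whether the value is missing.
    missing = [y for v, y in pairs if v is None]
    observed = [y for v, y in pairs if v is not None]
    if missing and observed:
        gain = info_gain(labels, missing, observed)
        if gain > best[0]:
            best = (gain, None, "missing_vs_observed")

    return best


# Toy data (invented): effort class is 'high' exactly when the attribute is missing.
values = [1.0, 2.0, None, None, 3.0, 4.0]
labels = ["low", "low", "high", "high", "low", "low"]
gain, threshold, rule = best_split_with_missing(values, labels)
print(rule, threshold, round(gain, 3))
```

In this toy case the missingness pattern itself carries the class information, so the "missing versus observed" split wins. When missingness is unrelated to the class, one of the threshold-based rules would normally be selected instead.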

*J. Adv. Comput. Intell. Intell. Inform.*, Vol.21, No.5, pp. 803-812, 2017.
