Optimizing Prediction Accuracy in High-Dimensional Data: Comparative Analysis of Feature Selection Methods with Naive Bayes Algorithm

International Journal of Electronics and Communication Engineering
© 2024 by SSRG - IJECE Journal
Volume 11 Issue 3
Year of Publication: 2024
Authors : T. Rajendran, C. Balakrishnan, B. Yamini, CH. Srilakshmi, B. Maheswari, M. Nalini, R. Siva Subramanian
How to Cite?

T. Rajendran, C. Balakrishnan, B. Yamini, CH. Srilakshmi, B. Maheswari, M. Nalini, R. Siva Subramanian, "Optimizing Prediction Accuracy in High-Dimensional Data: Comparative Analysis of Feature Selection Methods with Naive Bayes Algorithm," SSRG International Journal of Electronics and Communication Engineering, vol. 11, no. 3, pp. 41-52, 2024. Crossref, https://doi.org/10.14445/23488549/IJECE-V11I3P105

Abstract:

High-dimensional data, characterized by datasets with a large number of features or dimensions, is becoming increasingly common in sectors such as the social sciences, biology, medicine, and finance. This study explores the concept of high-dimensional data, the challenges it presents, and its effect on prediction outcomes. Because such data raises many issues, including the curse of dimensionality, increased computational complexity, and poor interpretability, developing strategies to extract useful information from it is crucial. This article presents research that employs feature selection techniques to address these problems. Feature selection is the critical process of identifying and retaining the most relevant features while eliminating those that are irrelevant or redundant. By lowering the dimensionality of the data, this approach seeks to increase prediction accuracy, decrease overfitting, and improve model interpretability. The present study examines three feature selection approaches, namely Filter (Symmetrical Uncertainty & ReliefF), Wrapper (Genetic Algorithm & Sequential Forward Selection), and Hybrid (combining filter & wrapper), to choose relevant features from high-dimensional data. We use a real-world high-dimensional customer dataset from the UCI repository to illustrate how well the proposed feature selection methods work with the Naive Bayes machine learning algorithm. Through a series of experiments and evaluations, we demonstrate that applying feature selection leads to better prediction outcomes in high-dimensional settings than using no feature selection. The results show that feature selection enhances prediction accuracy, and that among the three FS models, the hybrid method selects the best feature set compared with the filter and wrapper techniques. These insights can help researchers and practitioners working with high-dimensional data make better decisions, ultimately improving the quality and applicability of their prediction models.
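The filter stage described above ranks features by a relevance score such as Symmetrical Uncertainty (SU) with the class label before a Naive Bayes model is trained on the retained subset. As a rough illustration of that pipeline only (not the paper's implementation, which uses the UCI customer dataset), the sketch below computes SU on a tiny invented categorical dataset, keeps the top-k features, and fits a simple categorical Naive Bayes with Laplace smoothing; all data, variable names, and the choice of k are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mi = hx + hy - entropy(list(zip(x, y)))  # I(X;Y) = H(X)+H(Y)-H(X,Y)
    return 2.0 * mi / (hx + hy)

# Toy categorical dataset: feature 0 tracks the label exactly,
# the other two are only weakly informative.
X = [
    ["yes", "high", "a"],
    ["yes", "high", "b"],
    ["no",  "low",  "a"],
    ["no",  "low",  "b"],
    ["yes", "low",  "a"],
    ["no",  "high", "b"],
]
y = ["buy", "buy", "skip", "skip", "buy", "skip"]

# Filter step: rank features by SU with the class, keep the top k.
k = 2
scores = [symmetrical_uncertainty([row[j] for row in X], y)
          for j in range(len(X[0]))]
selected = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def train_nb(X, y, feats):
    """Count class priors and per-class feature-value frequencies."""
    priors = Counter(y)
    likelihood = Counter()
    for row, label in zip(X, y):
        for j in feats:
            likelihood[(j, row[j], label)] += 1
    return priors, likelihood

def predict_nb(priors, likelihood, feats, row):
    """Argmax over classes of log prior + smoothed log likelihoods
    (Laplace smoothing; each toy feature here has 2 possible values)."""
    n = sum(priors.values())
    best, best_lp = None, float("-inf")
    for label, count in priors.items():
        lp = log2(count / n)
        for j in feats:
            lp += log2((likelihood[(j, row[j], label)] + 1) / (count + 2))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

priors, likelihood = train_nb(X, y, selected)
pred = predict_nb(priors, likelihood, selected, ["yes", "high", "a"])
```

A wrapper method would instead score candidate subsets by cross-validated Naive Bayes accuracy, and a hybrid method would apply the SU ranking first to shrink the search space the wrapper explores.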

Keywords:

Feature Selection, High dimensional data, Machine Learning, Naive Bayes, SFS.

References:

[1] Peter Bühlmann, and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer Science & Business Media, 2011.
[CrossRef] [Google Scholar] [Publisher Link]
[2] Marius Muja, and David G. Lowe, “Scalable Nearest Neighbor Algorithms for High Dimensional Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227-2240, 2014.
[CrossRef] [Google Scholar] [Publisher Link]
[3] Andrea Bommert et al., “Benchmark for Filter Methods for Feature Selection in High-Dimensional Classification Data,” Computational Statistics & Data Analysis, vol. 143, pp. 1-19, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[4] G. Thippa Reddy et al., “Analysis of Dimensionality Reduction Techniques on Big Data,” IEEE Access, vol. 8, pp. 54776-54788, 2020.
[CrossRef] [Google Scholar] [Publisher Link]
[5] Papia Ray, S. Surender Reddy, and Tuhina Banerjee, “Various Dimension Reduction Techniques for High Dimensional Data Analysis: A Review,” Artificial Intelligence Review, vol. 54, pp. 3473-3515, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[6] Jamshid Pirgazi et al., “An Efficient Hybrid Filter-Wrapper Metaheuristic-Based Gene Selection Method for High Dimensional Datasets,” Scientific Reports, vol. 9, pp. 1-15, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[7] Jie Cai et al., “Feature Selection in Machine Learning: A New Perspective,” Neurocomputing, vol. 300, pp. 70-79, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[8] Hongtao Shi et al., “An Efficient Feature Generation Approach Based on Deep Learning and Feature Selection Techniques for Traffic Classification,” Computer Networks, vol. 132, pp. 81-98, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[9] Yu Xue, Haokai Zhu, and Ferrante Neri, “A Feature Selection Approach Based on NSGA-II with ReliefF,” Applied Soft Computing, vol. 134, 2023.
[CrossRef] [Google Scholar] [Publisher Link]
[10] Baoshuang Zhang, Yanying Li, and Zheng Chai, “A Novel Random Multi-Subspace Based ReliefF for Feature Selection,” Knowledge-Based Systems, vol. 252, 2022.
[CrossRef] [Google Scholar] [Publisher Link]
[11] Annu Lambora, Kunal Gupta, and Kriti Chopra, “Genetic Algorithm-A Literature Review,” 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, pp. 380-384, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[12] Negar Maleki, Yasser Zeinali, and Seyed Taghi Akhavan Niaki, “A k-NN Method for Lung Cancer Prognosis with the Use of a Genetic Algorithm for Feature Selection,” Expert Systems with Applications, vol. 164, 2021.
[CrossRef] [Google Scholar] [Publisher Link]
[13] Knitchepon Chotchantarakun, “Optimizing Sequential Forward Selection on Classification Using Genetic Algorithm,” Informatica, vol. 47, no. 9, pp. 81-90, 2023.
[Google Scholar] [Publisher Link]
[14] A. Pasyuk, E. Semenov, and D. Tyuhtyaev, “Feature Selection in the Classification of Network Traffic Flows,” 2019 International MultiConference on Industrial Engineering and Modern Technologies (FarEastCon), Vladivostok, Russia, pp. 1-5, 2019.
[CrossRef] [Google Scholar] [Publisher Link]
[15] Feng-Jen Yang, “An Implementation of Naive Bayes Classifier,” 2018 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, USA, pp. 301-306, 2018.
[CrossRef] [Google Scholar] [Publisher Link]
[16] Mucahid Mustafa Saritas, and Ali Yasar, “Performance Analysis of ANN and Naive Bayes Classification Algorithm for Data Classification,” International Journal of Intelligent Systems and Applications in Engineering, vol. 7, no. 2, pp. 88-91, 2019.
[Google Scholar] [Publisher Link]