COMPARATIVE ANALYSIS OF MISSING-VALUE IMPUTATION STRATEGIES ON THE UCI HEPATITIS DATASET USING XGBOOST

Muhammad Mirza Kurniawan; Betha Nurina Sari

doi:10.23960/jitet.v14i2.9163

Muhammad Mirza Kurniawan
Universitas Singaperbangsa Karawang
Betha Nurina Sari
Universitas Singaperbangsa Karawang

Keywords: Missing Data Imputation, XGBoost, MICE, KNN Imputation

Abstract

Hepatitis remains a significant global health challenge, with the largest case burden found in developing regions. Although Electronic Health Records (EHR) are highly valuable for clinical research and predictive modeling, such data are often incomplete. Reports indicate that up to 71% of data entries can contain missing values, which poses a substantial challenge to the reliability of data analysis and model building. This study evaluates the effectiveness of several missing-data imputation strategies on the UCI Hepatitis dataset, a benchmark known for its high degree of incompleteness. We compare listwise deletion, mean imputation, K-Nearest Neighbors (KNN), and Multivariate Imputation by Chained Equations (MICE) along with its variants. Each strategy is evaluated with the XGBoost classification algorithm under Stratified 5-Fold Cross-Validation. The results show that listwise deletion not only achieved the highest mean performance, with an F1-score of 81.76%, but was also the most stable, with the lowest standard deviation (6.22%), whereas the more complex imputation techniques showed high variability.
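As a rough illustration of the two simplest strategies compared above, the following pure-Python sketch shows listwise deletion, mean imputation, a stratified fold assignment, and the mean and standard deviation used to summarize per-fold scores. It is not the authors' code: the toy records, the function names, and the round-robin fold assignment are all assumptions made for illustration, and no classifier is trained here.

```python
from statistics import mean, stdev

# Toy records (hypothetical values, NOT taken from the Hepatitis dataset).
# Each row is (feature_1, feature_2, label); None marks a missing value.
ROWS = [
    (1.0, 10.0, 1),
    (2.0, None, 1),
    (None, 30.0, 0),
    (4.0, 40.0, 0),
    (5.0, 50.0, 1),
    (6.0, None, 0),
]


def listwise_deletion(rows):
    """Keep only rows in which every feature is observed."""
    return [r for r in rows if all(v is not None for v in r[:-1])]


def mean_imputation(rows):
    """Replace each missing feature with the mean of that column's observed values."""
    n_features = len(rows[0]) - 1
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_features)
    ]
    return [
        tuple(col_means[j] if r[j] is None else r[j] for j in range(n_features))
        + (r[-1],)
        for r in rows
    ]


def stratified_folds(labels, k):
    """Assign sample indices to k folds, dealing each class round-robin so that
    every fold keeps roughly the overall class ratio (an assumed, simplified scheme)."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds


def summarize(fold_scores):
    """Mean and sample standard deviation of per-fold scores (e.g. F1 per fold)."""
    return mean(fold_scores), stdev(fold_scores)


complete = listwise_deletion(ROWS)   # drops every row that has a gap
imputed = mean_imputation(ROWS)      # keeps all rows, fills the gaps
folds = stratified_folds([r[-1] for r in imputed], k=3)
```

The trade-off shows up even on this toy input: listwise deletion shrinks the sample, while mean imputation keeps every row but fills gaps with a single constant per column. The source does not state which implementations the authors used. A full replication would more likely rely on library components such as scikit-learn's `SimpleImputer`, `KNNImputer`, `IterativeImputer`, and `StratifiedKFold` together with `xgboost.XGBClassifier`, fitting each imputer on the training folds only.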
