Predicting Obesity Levels with High Accuracy: Insights from a CatBoost Machine Learning Model

Authors

  • Aga Maulana, Department of Informatics, Faculty of Mathematics and Natural Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia
  • Razief Perucha Fauzie Afidh, Department of Informatics, Faculty of Mathematics and Natural Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia
  • Nur Balqis Maulydia, Graduate School of Mathematics and Applied Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia
  • Ghazi Mauer Idroes, Graduate School of Mathematics and Applied Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia; Department of Occupational Health and Safety, Faculty of Health Sciences, Universitas Abulyatama, Aceh Besar 23372, Indonesia
  • Souvia Rahimah, Department of Food Industrial Technology, Faculty of Agroindustrial Technology, Universitas Padjadjaran, Bandung, West Java, Indonesia

DOI:

https://doi.org/10.60084/ijds.v2i1.195

Keywords:

Gradient boosting, Obesity classification, Risk factors, Comparative analysis, Precision public health

Abstract

This study aims to develop a machine learning model using the CatBoost algorithm to predict obesity levels from demographic, lifestyle, and health-related features, and to compare its performance with other machine learning algorithms. A dataset of 2,111 individuals from Mexico, Peru, and Colombia was used to train and evaluate the CatBoost model. The dataset included gender, age, height, weight, eating habits, physical activity levels, and family history of obesity. The model's performance was assessed using accuracy, precision, recall, and F1-score and compared with logistic regression, K-nearest neighbors (KNN), random forest, and naive Bayes algorithms. Feature importance analysis was conducted to identify the most influential factors in predicting obesity levels. The results indicate that the CatBoost model achieved the highest accuracy at 95.98%, surpassing the other models. Furthermore, the CatBoost model demonstrated superior precision (96.08%), recall (95.98%), and F1-score (96.00%). The confusion matrix revealed that the model correctly predicted the majority of instances in each obesity level category. Feature importance analysis identified weight, height, and gender as the most influential factors in predicting obesity levels, followed by dietary habits, physical activity, and family history of overweight. The model's high accuracy, precision, recall, and F1-score, together with its ability to handle categorical variables effectively, make it a valuable tool for obesity risk assessment and classification. The insights gained from the feature importance analysis can guide the development of targeted obesity prevention and management strategies that focus on modifiable risk factors such as diet and physical activity. While further validation on diverse populations is necessary, the CatBoost model's results demonstrate its potential to support clinical decision-making and inform public health initiatives in the fight against the global obesity epidemic.
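
The four metrics named in the abstract can all be derived from a multiclass confusion matrix. The sketch below is illustrative only: it assumes support-weighted averaging across classes, which the abstract does not state (the reported recall of 95.98% equals the reported accuracy, as weighted averaging would produce, but that is an inference). The function name `weighted_metrics` and the 3-class matrix `toy_cm` are hypothetical and are not the study's code or data.

```python
from typing import Dict, List


def weighted_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy and support-weighted precision, recall, and F1.

    cm[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))

    precision = recall = f1 = 0.0
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                         # samples truly in class k
        predicted = sum(cm[i][k] for i in range(n))  # samples predicted as class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total                     # class share of the data
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": correct / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up 3-class confusion matrix, for illustration only (not the study's results).
toy_cm = [
    [48, 2, 0],
    [3, 44, 3],
    [0, 1, 49],
]

scores = weighted_metrics(toy_cm)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Under this averaging scheme, weighted recall is always identical to accuracy, because each class's recall is multiplied by its share of the samples and the per-class true positives then sum to the total number of correct predictions.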

Published

2024-05-22

How to Cite

Maulana, A., Afidh, R. P. F., Maulydia, N. B., Idroes, G. M., & Rahimah, S. (2024). Predicting Obesity Levels with High Accuracy: Insights from a CatBoost Machine Learning Model. Infolitika Journal of Data Science, 2(1), 17–27. https://doi.org/10.60084/ijds.v2i1.195

Section

Articles