Inductive Biases in Feature Reduction for QSAR: SHAP vs. Autoencoders

Teuku Rizky Noviandy; Ghifari Maulana Idroes; Andi Lala; Zuchra Helwani; Rinaldi Idroes

doi:10.60084/ijds.v3i1.306

Authors

Teuku Rizky Noviandy, Department of Information Systems, Faculty of Engineering, Universitas Abulyatama, Aceh Besar 23372, Indonesia
Ghifari Maulana Idroes, Department of Nuclear Engineering and Engineering Physics, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
Andi Lala, School of Mathematics and Applied Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia
Zuchra Helwani, Department of Chemical Engineering, Universitas Riau, Pekanbaru 28293, Indonesia
Rinaldi Idroes, School of Mathematics and Applied Sciences, Universitas Syiah Kuala, Banda Aceh 23111, Indonesia

Keywords:

Feature selection, Interpretability, Generalization, LightGBM, Autoencoder

Abstract

Machine learning models in drug discovery often depend on high-dimensional molecular descriptors, many of which may be redundant or irrelevant. Reducing these descriptors is essential for improving model performance, interpretability, and computational efficiency. This study compares two widely used reduction strategies, SHAP-based feature selection and autoencoder-based compression, in the context of Quantitative Structure-Activity Relationship (QSAR) classification. LightGBM serves as a consistent modeling framework for evaluating models trained on all descriptors, on the top 50 and top 100 SHAP-ranked descriptors, and on a 64-dimensional autoencoder embedding. The results show that SHAP-based selection produces interpretable and stable models with minimal performance loss, particularly when the top 100 descriptors are used. In contrast, the autoencoder achieves the highest test performance by capturing nonlinear patterns in a compact, low-dimensional representation, although this comes at the cost of interpretability and of consistency across data splits. These findings reflect the differing inductive biases of the two methods: SHAP prioritizes sparsity and attribution, while autoencoders focus on reconstruction and continuity. The analysis emphasizes that descriptor reduction strategies are not interchangeable. SHAP-based selection is suitable for applications where interpretability and reliability are essential, such as hypothesis-driven or regulatory settings. Autoencoders are more appropriate for performance-driven tasks, including virtual screening. The choice of reduction strategy should be guided not only by performance metrics but also by the specific modeling requirements and assumptions relevant to cheminformatics workflows.
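
The SHAP-based selection step described above can be illustrated with a minimal sketch. It assumes that descriptors are ranked by their mean absolute attribution across molecules, which is a common convention but is not confirmed by the abstract as the exact ranking rule used in the study. The attribution matrix and the descriptor names below are hypothetical toy values; in practice the attributions would come from an explainer applied to a fitted LightGBM model, and k would be 50 or 100 rather than 2.

```python
def rank_by_mean_abs(attributions, names):
    """Rank descriptors by mean absolute attribution, largest first.

    attributions: list of rows, one per molecule, one value per descriptor.
    names: descriptor names, in the same column order as the rows.
    """
    n_rows = len(attributions)
    scores = [
        sum(abs(row[j]) for row in attributions) / n_rows
        for j in range(len(names))
    ]
    # Sort column indices by score, highest importance first.
    order = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return [(names[j], scores[j]) for j in order]


def select_top_k(attributions, names, k):
    """Return the names of the k highest-ranked descriptors."""
    return [name for name, _ in rank_by_mean_abs(attributions, names)[:k]]


# Toy attribution matrix: 3 molecules x 4 hypothetical descriptors.
toy_attr = [
    [0.40, -0.05, 0.00, 0.10],
    [-0.30, 0.02, 0.01, -0.20],
    [0.50, -0.03, 0.00, 0.15],
]
toy_names = ["logP", "nRot", "const", "TPSA"]

top2 = select_top_k(toy_attr, toy_names, k=2)
print(top2)
```

Because the selected columns are original descriptors, the reduced model stays directly interpretable, in contrast to an autoencoder embedding whose 64 dimensions have no individual chemical meaning.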
