Publications

Comparative performance of machine learning vs classical formulas for LDL-cholesterol calculation (Spanish title: Rendimiento comparativo del aprendizaje automático frente a las fórmulas clásicas para el cálculo del colesterol LDL)

Authors

Martin-Perez, S.; Suppi, R.; Arrobas-Velilla, T.; Téllez Hernández, F.D.B.; León Justel, Antonio

External publication

No

Journal

Clin Investig Arterioscler

Scope

Article

Nature

Scientific

JCR quartile

SJR quartile

Publication date

01/01/2025

Scopus Id

2-s2.0-105025168245

Abstract

Introduction: Low-density lipoprotein cholesterol (LDL-C) is a significant cardiovascular risk factor, but its direct measurement is expensive and often unavailable in clinical laboratories. The Friedewald formula (FD), despite its widespread use since 1972, has notable limitations, especially at high triglyceride levels and low LDL-C concentrations. Machine learning (ML) techniques offer a promising alternative for accurate LDL-C estimation, potentially overcoming the limitations of traditional formulas by recognizing complex patterns in lipid profile data.

Material and methods: This retrospective study analyzed 34,678 lipid profiles from patients over 18 years of age attending Hospital Virgen Macarena, Seville (January 2021–December 2022). The study was approved by the Ethics Committee (CEI HVM-VR_03/2024). All lipid parameters (total cholesterol, triglycerides, HDL-C, LDL-C) were measured using a Cobas 6000 analyzer. Twenty-two machine learning models were developed using Python's PyCaret library with an 80/20 train-test split. Models included Linear Regression, Random Forest, XGBoost, LightGBM, and Gradient Boosting, among others. Performance was evaluated using the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE). Four triglyceride subgroups were analyzed: <150, 150–250, 250–400, and >400 mg/dL.

Results: The dataset comprised 34,678 individuals with the following mean values: total cholesterol 204.6 ± 73.36 mg/dL, triglycerides 203.95 ± 143.94 mg/dL, HDL-C 51.83 ± 18.45 mg/dL, and LDL-C 120.38 ± 62.29 mg/dL. LightGBM achieved the highest performance (R² = 0.965, RMSE = 11.35, MAE = 7.99), followed by Gradient Boosting (R² = 0.962, RMSE = 11.89, MAE = 7.87) and XGBoost (R² = 0.958, RMSE = 12.49, MAE = 8.3). Traditional formulas showed inferior performance: Martin–Hopkins (R² = 0.951, RMSE = 13.82, MAE = 9.3) and Friedewald (R² = 0.926, RMSE = 16.92, MAE = 11.97). Performance differences were more pronounced at triglyceride levels ≥ 250 mg/dL, with ML models maintaining R² > 0.92 while classical formulas deteriorated significantly, particularly Friedewald (R² = 0.34) at triglycerides > 400 mg/dL.

Conclusions: Machine learning models, particularly boosting algorithms (LightGBM, Gradient Boosting, XGBoost), significantly outperformed traditional LDL-C calculation formulas across all triglyceride ranges. These AI-based approaches yielded superior accuracy and robustness, especially in challenging clinical scenarios with elevated triglycerides, where conventional formulas fail. Implementing ML models in clinical laboratories could provide more reliable LDL-C estimates, contributing to improved cardiovascular risk stratification and patient management. This technological advance represents a promising transformation in laboratory medicine methodology. © 2025 Sociedad Española de Arteriosclerosis.
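
As a reading aid, the evaluation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it uses the standard Friedewald estimate (LDL-C = total cholesterol − HDL-C − triglycerides / 5, all in mg/dL) and the textbook definitions of R², MAE, and RMSE. The function names, the handling of subgroup boundaries (150 and 250 assigned to the higher bin, 400 to the 250–400 bin), and the four sample profiles are assumptions made up for this example; they are not study data.

```python
import math


def friedewald_ldl(total_chol, hdl, triglycerides):
    """Friedewald estimate of LDL-C; all inputs and the result in mg/dL."""
    return total_chol - hdl - triglycerides / 5.0


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def tg_subgroup(triglycerides):
    """Assign a triglyceride value (mg/dL) to one of the four subgroups.

    Boundary handling is an assumption; the abstract does not specify it.
    """
    if triglycerides < 150:
        return "<150"
    if triglycerides < 250:
        return "150-250"
    if triglycerides <= 400:
        return "250-400"
    return ">400"


# Made-up illustrative profiles (NOT study data):
# (total cholesterol, HDL-C, triglycerides, directly measured LDL-C)
profiles = [
    (200, 50, 100, 128),
    (240, 40, 300, 150),
    (180, 60, 150, 92),
    (260, 35, 450, 160),
]

measured = [p[3] for p in profiles]
estimated = [friedewald_ldl(tc, hdl, tg) for tc, hdl, tg, _ in profiles]

print("R2  :", round(r2(measured, estimated), 3))
print("MAE :", round(mae(measured, estimated), 2))
print("RMSE:", round(rmse(measured, estimated), 2))
for (tc, hdl, tg, ldl), est in zip(profiles, estimated):
    print(tg_subgroup(tg), "measured:", ldl, "Friedewald:", est)
```

Replacing `estimated` with the predictions of a trained regression model would give the corresponding metrics for that model, and grouping rows by `tg_subgroup` before computing the metrics would mirror the per-subgroup comparison reported in the results.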

Keywords

Cardiovascular risk; Clinical laboratory; Gradient boosting; LDL-cholesterol; Lipid profile; Machine learning

Members of Universidad Loyola