• Application of Machine Learning Algorithms for Predicting Missing Cost Data

      Rueda, Juan-David; Slejko, Julia; 0000-0002-0907-7106 (2019)
      Objective: To compare new machine-learning (ML) methods for estimating health care costs in the presence of missing data. Introduction: Costs must be correctly estimated for value assessment and budget calculations. Biased cost estimates can lead to wrong decisions that affect population health. Cost estimation is challenging, and more so in the presence of missing data. Methods: We used Surveillance, Epidemiology, and End Results (SEER)-Medicare data on patients newly diagnosed with multiple myeloma from 2007 to 2013. We explored the problem of missing data with several approaches, each based on artificially created missing data. We hypothesized that ML techniques improve the prediction of mean total medical costs in the presence of missingness. The ML methods were support vector machines, boosting, random forest, and classification and regression trees. First, we analyzed the problem in one dimension, with a single variable missing in a cross-sectional scenario, using generalized linear models as the comparator for ML. Then, we added time as a factor for missingness, comparing ML against reweighted estimators. Finally, we explored different levels of censoring and determined how each level affected our cost estimates. For this, we fitted multiple linear spline models to establish the effect of censoring on the bias of the estimator. Results: ML algorithms gave better predictions when data were missing completely at random or missing at random. All methods performed poorly in the missing not at random scenario. In the second aim, ML-based methods predicted the five-year total cost of a patient with multiple myeloma as well as reweighted estimators did. Lastly, ML methods were consistent and robust at low and moderate levels of censoring; however, we could not show that they outperform the reweighted estimators. Conclusions: ML-based methods are a good alternative for predicting missing cost data in both cross-sectional and longitudinal settings.
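
      The abstract does not say how the artificial missingness was generated. As a rough illustration only, the Python sketch below shows one common way to impose the three mechanisms it names (missing completely at random, missing at random, missing not at random) on a cost variable. The function name make_missing, the simulated age and cost variables, the rank-based missingness probabilities, and the 30% missingness rate are all assumptions made for this example; none of them come from the study or from SEER-Medicare.

```python
import numpy as np


def make_missing(cost, covariate, mechanism, rate=0.3, seed=0):
    """Return a float copy of `cost` with some entries set to NaN.

    Hypothetical helper for illustration only.
    mechanism: "MCAR", "MAR", or "MNAR". `rate` is the target share of
    missing values and should not exceed 0.5.
    """
    rng = np.random.default_rng(seed)
    n = cost.shape[0]

    if mechanism == "MCAR":
        # Missingness is unrelated to any variable.
        p = np.full(n, rate)
    elif mechanism in ("MAR", "MNAR"):
        # MAR: missingness depends on an observed covariate.
        # MNAR: missingness depends on the (unobserved) cost itself.
        driver = covariate if mechanism == "MAR" else cost
        ranks = np.argsort(np.argsort(driver)) / (n - 1)  # scaled to 0..1
        p = 2.0 * rate * ranks  # averages to `rate` across the sample
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    out = cost.astype(float).copy()
    out[rng.random(n) < p] = np.nan
    return out


# Simulated data (assumed): age drives cost through a log-linear model.
rng = np.random.default_rng(42)
n = 20_000
age = rng.normal(70, 8, n)
cost = np.exp(9 + 0.02 * (age - 70) + rng.normal(0, 0.5, n))

print(f"true mean cost: {cost.mean():.0f}")
for mech in ("MCAR", "MAR", "MNAR"):
    observed = make_missing(cost, age, mech)
    print(f"{mech}: share missing = {np.isnan(observed).mean():.2f}, "
          f"observed mean = {np.nanmean(observed):.0f}")
```

      In this simulation the complete-case mean stays close to the true mean under MCAR and falls below it under MNAR, because higher costs are more likely to be removed. The simulation illustrates the mechanisms only and says nothing about how the compared methods perform.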