• Application of Machine Learning Algorithms for Predicting Missing Cost Data

      Rueda, Juan-David; Slejko, Julia F; 0000-0002-0907-7106 (2019)
      Objective: To compare new alternatives to estimate health care costs in the presence of missing data using methods based on machine-learning (ML). Introduction: Costs must be correctly estimated for value assessment and budget calculations. Problems arise when they are not correctly estimated. Sometimes costs can be biased and lead to wrong decisions that affect population health. Cost estimation is a challenging task and it is more challenging in the presence of missing data. Methods: We used Surveillance, Epidemiology, and End Results program (SEER)-Medicare including patients with multiple myeloma newly diagnosed from 2007-2013. We explored the problem of missing data using different approaches creating artificial missing data. We hypothesized that the use of ML techniques improves the prediction of mean medical total costs in the presence of missingness. ML methods included support vector machines, boosting, random forest, and classification and regression trees. First, we analyzed the problem considering only one dimension, when one variable is missing in a cross-sectional scenario, using generalized linear models as a comparator against ML. Then, we added time as a factor for missingness, utilizing reweighted estimators against ML. Finally, we explored the different levels of censoring and determined how each censoring level affected our cost estimations. In this case, we created multiple linear spline models to establish the effect of censoring on the bias of the estimator. Results: We demonstrated that ML algorithms had better prediction when data were missing completely at random and missing at random. All the methods performed badly in the missing not at random scenario. In the second aim, we showed that ML-based methods predict just as well as reweighted estimators for the five-year total cost of a patient with multiple myeloma. 
Lastly, we found that ML methods are consistent and robust at low and moderate levels of censoring; however, we failed to prove that they are better than the reweighted estimators. Conclusions: ML-based methods are a good alternative for the prediction of missing cost data in the case of cross-sectional and longitudinal data.
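The three missingness mechanisms named in this abstract can be illustrated with a small simulation. The sketch below is not the dissertation's method or data: the cost model, the age cutoff, the missingness probabilities, and all function names are assumptions made up for illustration, and simple inverse probability weighting stands in for the reweighted estimators mentioned above.

```python
import random
import statistics

def simulate_patients(n, seed=0):
    """Return (age, cost) pairs. The cost model is invented for illustration."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        age = rng.uniform(65, 90)
        cost = 1000 * (age - 60) + rng.gauss(0, 5000)
        patients.append((age, cost))
    return patients

def missing_probability(age, cost, mechanism):
    """Probability that a patient's cost is missing (hypothetical values)."""
    if mechanism == "MCAR":   # unrelated to any variable
        return 0.3
    if mechanism == "MAR":    # depends only on the observed covariate (age)
        return 0.6 if age > 77.5 else 0.0
    if mechanism == "MNAR":   # depends on the unobserved cost itself
        return 0.6 if cost > 17500 else 0.0
    raise ValueError(f"unknown mechanism: {mechanism}")

def observe(patients, mechanism, seed=1):
    """Keep only the patients whose cost was not set to missing."""
    rng = random.Random(seed)
    return [(age, cost) for age, cost in patients
            if rng.random() >= missing_probability(age, cost, mechanism)]

def complete_case_mean(observed):
    """Mean cost among observed patients, ignoring the missingness."""
    return statistics.mean([cost for _, cost in observed])

def ipw_mean(observed, prob_missing_given_age):
    """Weighted mean with weights 1 / P(observed | age).

    Assumes the missingness model given age is known; in practice it
    would have to be estimated from the data.
    """
    weights = [1.0 / (1.0 - prob_missing_given_age(age)) for age, _ in observed]
    total = sum(w * cost for w, (_, cost) in zip(weights, observed))
    return total / sum(weights)

if __name__ == "__main__":
    patients = simulate_patients(20000)
    print("true mean:", round(statistics.mean([c for _, c in patients])))
    for mechanism in ("MCAR", "MAR", "MNAR"):
        print(mechanism, round(complete_case_mean(observe(patients, mechanism))))
    mar_observed = observe(patients, "MAR")
    print("MAR, reweighted:",
          round(ipw_mean(mar_observed, lambda age: 0.6 if age > 77.5 else 0.0)))
```

In this toy setup the complete-case mean stays close to the truth under MCAR, is biased under MAR until the observed covariate is used for reweighting, and stays biased under MNAR because the missingness depends on a value the analyst never sees.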
    • Biological Insight from Mass Spectrometry Through Novel Computational Approaches

      Fondrie, William Ellis; Strickland, Dudley K.; Goodlett, David Robinson, 1960-; 0000-0002-1554-3716 (2018)
      Mass spectrometry is a powerful tool to identify, quantify, and characterize diverse biological molecules. As a result, it has become a mainstay in modern biology enabling studies ranging from the structural characterization of a single lipid, to the identification and quantification of complete proteomes. With the increasing abundance of high-throughput mass spectrometry experiments, the major challenge has shifted from acquisition to interpretation. In most cases, these high-throughput experiments require inference beyond the directly measured characteristics of the analyte to draw insightful biological conclusions. However, this can be difficult, particularly when the outcomes of an experiment are not clearly defined. To address this challenge, we developed novel experimental and computational methods that utilize high-throughput mass spectrometry techniques to answer specific biological questions. We first describe a method called serial dilution-affinity purification-mass spectrometry (SDAP-MS) to identify and characterize protein-protein interactions. When investigating interactions to a protein of interest, researchers must decide between high-throughput screening methods that merely yield binary results, or low-throughput approaches that yield insight into the biochemical properties of these interactions. The SDAP-MS approach alleviates this burden by providing a high-throughput screening method capable of estimating equilibrium binding constants, which we demonstrate using two LDL receptor family members, LRP1 and LRP1B. We then describe a technique to discern the protein cargo of exosomes from contaminants co-isolated during purification. Exosomes are microvesicles and potential carriers of biomarkers and, as a result, it is vital to ensure that the proteins attributed to the exosome are not artifacts from the isolation strategy. 
To assess this, we present an approach that uses proteomics and machine learning to investigate the enrichment of proteins across multiple stages of exosome isolation, culminating in a score that indicates the confidence of each protein being of exosomal origin. Finally, we develop a machine learning approach to identify microbial pathogens from glycolipid mass spectra, enabling their rapid diagnosis. We show that Acinetobacter baumannii and Klebsiella pneumoniae can be identified from a library containing 48 additional organisms in both isolate and polymicrobial specimens. Together, these studies present methods that result in valuable insight beyond analyte identification and quantification by mass spectrometry.
    • The Creation of Objective Performance Criteria and Generation of Predictive Models among Medical Devices in a Vascular Space

      Gressler, Laura; Shaya, Fadia T.; 0000-0003-2042-2174 (2021)
      Background: Objective Performance Criteria (OPC) have been explored as a tool to address the growing pressures to expedite device approval and enhance active surveillance. Existing data infrastructures can be employed to develop OPC to evaluate the use of devices, and can be further leveraged to develop predictive models. The objectives of this dissertation were to: (1) Develop a framework for the creation of OPC, (2) Compare the use of stent, atherectomy, and combination of stent and atherectomy, and (3) Formulate a model to predict the probability of undergoing a major adverse limb event (MALE) or experiencing death following the aforementioned treatments. Methods: The framework was developed in 3 phases through (1) Review of the literature, (2) Engagement of key stakeholders, and (3) Feedback from an advisory committee. Retrospective cohort studies were conducted using the Vascular Quality Initiative (2010-2018). Logistic regression and the Fine-Gray subdistribution hazard model were used to compare short- and long-term MALE, respectively. A generalized linear model (GLM), a Least Absolute Shrinkage and Selection Operator (LASSO) regularized GLM, a gradient boosted decision tree, and a random forest model were compared when used to predict MALE and mortality. Results: The developed framework consisted of 5 elements: (1) Identification of Medical Devices, (2) Engagement of Key Stakeholders, (3) Selection of Data Source, (4) Performance of Appropriate Statistical Analyses, (5) Reporting of Findings. The odds of short-term MALE (0.94; 95% CI: 0.77-1.14) and hazards of long-term MALE (0.92; 95% CI: 0.82-1.04) were not significantly different in the combination stent and atherectomy group when compared to stent alone. The most effective predictive model was the gradient boosted decision tree (Area Under the Curve (AUC) = 0.7539) for MALE and the LASSO regularized GLM (AUC = 0.7930) for mortality.
Conclusions: The developed framework provides a guide and needed foundation for the continued generation of OPC. Applying the identified statistical steps in the framework to an existing data infrastructure showed that patients receiving combination stent and atherectomy do not experience significantly different rates of MALE compared to stent alone. Predictive models generated using the infrastructure demonstrated the ability of machine learning techniques to generate robust predictive models within the vascular space.
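For readers unfamiliar with the AUC values reported in this abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores. It uses the pairwise (Mann-Whitney) formulation; the function name and the toy labels and scores are illustrative assumptions and are unrelated to the Vascular Quality Initiative data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    AUC equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

if __name__ == "__main__":
    # Toy example: 1 = event occurred, 0 = no event.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking, so values near 0.75 to 0.79 indicate moderate discrimination.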
    • Follicular Lymphoma Stage at Diagnosis: Determinants, Prediction from Administrative Claims Data and Impact on Healthcare Costs

      Albarmawi, Husam; Onukwugha, Eberechukwu (2020)
      Introduction: Follicular lymphoma (FL) stage is an important determinant of survival, treatment options and treatment outcomes. However, the determinants of advanced FL, defined as Ann Arbor stages III and IV, and its impact on the economic burden of FL are unknown. Moreover, for studies that rely on administrative claims data, it is not clear if advanced FL can be accurately predicted from this data source. Methods: Using the linked Surveillance, Epidemiology, and End Results-Medicare database we identified patients newly diagnosed with FL. We estimated a modified Poisson regression to explore the effect of pre-diagnosis healthcare resource utilization patterns and baseline county-level factors on FL stage. We estimated the 1-year and 5-year incremental costs of stages II-IV compared to stage I using generalized linear models. To predict FL stage from claims data, we developed and tested two random forests models. Results: We identified 11,078 patients diagnosed in 2000-2013. Half of the sample had advanced FL. Higher counts of specialist physician visits in the 3 years pre-diagnosis were associated with lower risk of advanced FL (4th quartile vs. 1st quartile: Relative Risk [RR]=0.92; 95% CI=0.86–0.99). The risk of advanced FL was 8% lower among women receiving screening mammography compared to men (RR=0.92; 95% CI=0.88–0.97). Living in counties designated as health professional shortage areas (HPSA) was associated with 7% increased risk of advanced FL (RR=1.07; 95% CI=1.00–1.14, p=0.049). In 2004-2009, FL patients with stages II, III and IV had statistically higher 1-year ($14,911; $15,106; $24,639, respectively, p<0.01) and 5-year costs ($21,590; $23,599; $34,968, respectively, p<0.01) compared to stage I patients. The random forests models exhibited poor accuracy of classifying limited and advanced FL from Medicare claims data (accuracy: ≤64%; sensitivity: ≤72%; specificity: ≤57%). 
Conclusions: Higher frequencies of specialist physician visits and living in counties not designated as HPSAs were associated with a lower risk of presenting with advanced FL. Patients with stages II-IV incur significantly higher costs compared to stage I patients. The incremental cost increases with higher FL stage. Predicting advanced FL from claims data may not be feasible and researchers may need to rely on datasets with existing clinical information.
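The accuracy, sensitivity, and specificity figures quoted in this abstract are standard summaries of a binary classifier's confusion matrix. A minimal sketch of their definitions follows; the function name and the toy labels are hypothetical and do not reproduce the study's models or data. Here 1 is taken to mean advanced stage and 0 limited stage, which is an assumption about coding.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
    }

if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
    print(classification_metrics(y_true, y_pred))
```

With roughly half of a sample in each class, as reported above, an accuracy near 64% is only modestly better than always guessing one class.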
    • Medicare Disabled Patients with Hepatitis C: Determinants of Quality of Care Receipt, Peg-Interferon Treatment Initiation, and Risk of Metabolic and Vascular Disorders

      Chirikov, Viktor; Shaya, Fadia T.; 0000-0002-9480-0580 (2015)
      Background: Due to years of undetected hepatitis C virus (HCV) infection, the burden of liver and extrahepatic disorders will continue to increase in the US. HCV patients receiving Social Security Disability Benefits represent ~70% of HCV patients in Medicare and are an understudied population facing numerous barriers to HCV management. We explored pre-treatment quality of care (QC) patterns, determined the factors associated with differential QC receipt and peg-interferon treatment initiation, and examined the effectiveness of peg-interferon therapy for ≥24 weeks at reducing metabolic/vascular risk in Medicare disabled HCV patients. Methods: Medicare claims (2006-2009) linked to the Area Health Resource Files were used. We used a random forest model of conditional inference trees to aggregate QC indicators into high, good, fair, and low QC groupings. Ordinal partial proportional odds regression modelled the receipt of differential QC levels. Modified Poisson regressions, propensity-score weighted for the level of QC received, examined the association between treatment initiation and patient- and county-level characteristics. Poisson regression with weights for treatment selection, discontinuation, and informative censoring due to mortality quantified the effect of peg-interferon treatment on the risk of incident mild, severe, or mild and severe metabolic/vascular events, compared to the untreated. Results: We identified 1,936 patients with continuous enrolment, of whom 10.4% initiated peg-interferon. The five strongest QC metrics predicting treatment included "having received liver biopsy", "HCV genotype testing", "visit to specialist", "confirmation of HCV viremia", and "iron overload testing". While county of residence had no effect on QC receipt, residence in rural counties with high screening capacity was associated with higher prevalence ratio [PR] of treatment initiation (PR=1.42, p=0.09).
High QC (PR=5.61, p<0.01) and good QC (PR=2.46, p<0.01) were associated with higher treatment rates. Multiple comorbidities were associated with lower odds of QC receipt (OR=0.76, p=0.05) and treatment initiation (PR=0.27, p<0.01). Over two years of follow-up, there was no difference in metabolic/vascular risk between patients treated for ≥24 weeks (n=43) and untreated patients (n=879). Conclusion: As barriers to eradicating HCV infection will likely persist even with novel interferon-free regimens, future research should use our findings to better characterize and optimize treatment in HCV patients with disabilities.
    • Use of Machine Learning To Predict COPD Treatments and Exacerbations in Medicare Older Adults: A Comparison of Multiple Approaches

      Le, Tham Thi; Simoni-Wastila, Linda (2021)
      Background: Multiple comorbidities, suboptimal adherence to maintenance medications (MMs), and exacerbations remain clinically important problems among older adults with chronic obstructive pulmonary disease (COPD). To better understand comorbidity profiles and to facilitate risk-based strategies for disease management, this dissertation quantified the prevalence and newly diagnosed rates of comorbidities, and validated predictive models of COPD medication non-adherence and exacerbations in the older Medicare population. Methods: Comorbidities were quantified in COPD beneficiaries and compared with matched non-COPD individuals using multivariable logistic regression. In a cohort of COPD beneficiaries with prevalent and new MM use, logistic and LASSO regressions were used to cross-validate the prediction of one-year non-adherence to MMs using different sets of predictors. A time-varying design was applied to assess improvement in predicting COPD exacerbations of the super learner versus component approaches (logistic regression, elastic net regression, random forest, gradient boosting, and neural network). Results: COPD beneficiaries had significantly increased odds of 40 measured comorbidities relative to matched non-COPD controls. The best-performing models in predicting MM non-adherence were those including initial MM adherence as a predictor, with validated Area Under the ROC Curves (AUC: 0.871-0.881). In predicting COPD exacerbations there were time-varying estimates of predictive accuracy and associations between predictors and the exacerbation outcome. Super learner performed slightly better (AUC: 0.650-0.761) than individual machine learning methods. Conclusions: Comorbidity burden is substantial and increases over time among Medicare older adults with COPD. Generated models achieved good and average discrimination in predicting COPD medication non-adherence and exacerbations, respectively. 
COPD hospitalization, oxygen supplementation, COPD treatment adherence, and numbers of inpatient visits were the most important predictors of COPD medication non-adherence and exacerbations. Super learner demonstrates a slight improvement compared to component methods, suggesting potential usability in augmenting prediction. Validated models with good discrimination can be implemented in user-friendly tools to optimize resources for risk-based management of, and interventions for, COPD.