Data source and study population
MIMIC-IV v0.4 is a large, publicly available database containing hospitalization records for patients admitted to Beth Israel Deaconess Medical Center between 2008 and 2019. The database was approved by the Massachusetts Institute of Technology (Cambridge, MA) and Beth Israel Deaconess Medical Center (Boston, MA), and it provides a strong information base for clinical studies. Because the present study analyzed a third-party, anonymized, publicly available database with pre-existing institutional review board (IRB) approval, it was exempt from IRB approval at our institution. Patient identities in the database are hidden, so informed consent was not required. The author completed the required training course and obtained the certificate needed to access the database. All data were obtained from the official PhysioNet website (https://mimic.physionet.org/).
A total of 11,897 patients in the database were diagnosed with sepsis, of whom 6,567 were aged 65 years or older. Patients who died within 24 h of admission to the intensive care unit (ICU) were excluded. Finally, 6,503 patients were included in the study.
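The selection steps above can be sketched as a simple filter. This is a minimal illustration and not the query used in the study: the field names `age` and `hours_to_death` are hypothetical, and treating "within 24 h" as strictly less than 24 hours is an assumption.

```python
def select_cohort(patients, min_age=65, min_icu_hours=24):
    """Keep elderly sepsis patients who survived the first ICU day.

    `patients` is a list of dicts with hypothetical keys:
      "age"            - age in years at admission
      "hours_to_death" - hours from ICU admission to death, or None if
                         the patient did not die during follow-up
    """
    cohort = []
    for patient in patients:
        # Inclusion criterion: aged 65 years or older.
        if patient["age"] < min_age:
            continue
        # Exclusion criterion: death within 24 h of ICU admission
        # (assumed here to mean strictly fewer than 24 hours).
        hours = patient["hours_to_death"]
        if hours is not None and hours < min_icu_hours:
            continue
        cohort.append(patient)
    return cohort


if __name__ == "__main__":
    example = [
        {"id": 1, "age": 70, "hours_to_death": None},
        {"id": 2, "age": 64, "hours_to_death": None},
        {"id": 3, "age": 80, "hours_to_death": 10},
        {"id": 4, "age": 65, "hours_to_death": 30},
    ]
    print([p["id"] for p in select_cohort(example)])
```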
Data were extracted using Structured Query Language (SQL). The extracted variables included general patient information: ethnicity, sex, age, weight, ventilator use, vasopressor use, continuous renal replacement therapy (CRRT) use, and first care unit. Disease severity was assessed using SOFA, SAPS II, and APS III. Comorbidities were assessed using the Charlson comorbidity index and included the following: myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular disease, dementia, chronic pulmonary disease, rheumatic disease, peptic ulcer disease, mild liver disease, uncomplicated diabetes, complicated diabetes, paraplegia, renal disease, malignant cancer, severe liver disease, metastatic solid tumor, and AIDS. Results of the first laboratory examination after ICU admission included the following: white blood cells (WBC), red blood cells (RBC), hemoglobin, hematocrit, red cell distribution width (RDW), mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), mean corpuscular volume (MCV), platelet count (PLT), prothrombin time (PT), partial thromboplastin time (PTT), international normalized ratio (INR), lactate, calculated total CO2, PaCO2, pH, PaO2, alanine aminotransferase (ALT), aspartate aminotransferase (AST), albumin, alkaline phosphatase (AP), total bilirubin, urea nitrogen, creatinine, glucose, anion gap (AG), base excess, total calcium, chloride, magnesium, bicarbonate, phosphate, potassium, sodium, specific gravity, and urine output. Vital signs included the following: mean heart rate, mean systolic blood pressure, mean diastolic blood pressure, mean blood pressure, mean respiratory rate, mean temperature, and mean SpO2.
Variables with more than 20% missing values were excluded, and the remaining missing values were filled in by multiple imputation. The final complete dataset was generated from 10 imputed datasets produced by the "mice" package in R.
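The 20% missingness threshold can be illustrated with a short sketch. It covers only the variable-exclusion step; the multiple imputation itself was done with the R "mice" package and is not reproduced here. The function name and the use of `None` to mark a missing value are assumptions made for illustration.

```python
def variables_to_keep(rows, max_missing=0.20):
    """Return the variable names whose missing fraction is at most 20%.

    `rows` is a list of dicts that share the same keys; a value of None
    marks a missing measurement (an assumption of this sketch).
    """
    if not rows:
        return []
    n_rows = len(rows)
    kept = []
    for name in rows[0]:
        n_missing = sum(1 for row in rows if row[name] is None)
        # Variables with MORE than 20% missing values are excluded,
        # so a variable at exactly 20% is retained.
        if n_missing / n_rows <= max_missing:
            kept.append(name)
    return kept


if __name__ == "__main__":
    rows = [
        {"lactate": 1.8, "albumin": None},
        {"lactate": None, "albumin": None},
        {"lactate": 2.4, "albumin": 3.1},
        {"lactate": 3.0, "albumin": 2.9},
        {"lactate": 1.1, "albumin": 3.5},
    ]
    print(variables_to_keep(rows))
```

In this toy input, lactate is missing in 1 of 5 rows (20%) and is kept, while albumin is missing in 2 of 5 rows (40%) and is dropped.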
The elderly patients with sepsis were randomly assigned to the training cohort (80%) or the validation cohort (20%). The training cohort was used to construct the random survival forest (RSF) model and perform internal validation. The validation cohort was used to verify the performance of the model. Categorical variables were described by frequencies and percentages, and differences between cohorts were assessed with the chi-square test or Fisher's exact test. Some statistical guidelines recommend the median and quartiles over the mean and standard deviation for descriptive statistics. Therefore, continuous variables in this study were described by the median and quartiles.
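The random 80/20 assignment described at the start of this paragraph could be sketched as follows. The seed value and function name are illustrative assumptions; the study does not report how the random assignment was implemented.

```python
import random


def split_cohort(patient_ids, train_fraction=0.8, seed=42):
    """Randomly assign patients to a training and a validation cohort.

    The seed is an arbitrary illustrative value that makes the split
    reproducible; it is not taken from the study.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)  # random permutation of all patients
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    train_ids, validation_ids = split_cohort(range(100))
    print(len(train_ids), len(validation_ids))
```

Shuffling once and cutting the list guarantees that the two cohorts are disjoint and together cover every patient.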
RSF is an ensemble method. It first uses bootstrap sampling to draw N samples from the training cohort and grows one survival tree on each; at each node of a tree, it randomly selects a subset of the covariates as candidate variables for splitting. Each node is split so as to maximize the survival difference between the child nodes, which can be measured by four splitting rules: log-rank, conservation of events, log-rank score, and random. The log-rank rule was used in this study. On average, about 37% of the patients in the training cohort are not drawn into a given bootstrap sample; these are called out-of-bag (OOB) samples, and the error rate calculated on them is the OOB error rate. The OOB error rate and the prediction error rate on the validation cohort were used to evaluate the model's performance, with a lower error rate indicating better performance. The optimal parameter combination was determined by grid search: the OOB error rate in the training cohort was calculated for each parameter combination, and the combination that gave the lowest overall error rate of the RSF was selected.
The RSF model was built with the optimal parameters, and variables were screened according to variable importance (VIMP) [14]. VIMP measures the ability of a predictor variable to predict the outcome: the greater the VIMP, the stronger the predictive ability. A positive VIMP indicates that the variable contributes to prediction, whereas a VIMP of 0 or a negative value indicates that the variable is not a meaningful predictor. Variables were ranked from most to least important, the top 30 were selected, and the RSF was built again.
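Two of the steps above lend themselves to a brief illustration. The figure of about 37% follows from the fact that a given patient is missed by all n draws of a bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e, or roughly 0.368, for large n. The sketch below checks this by simulation and then ranks variables by VIMP. It is not the randomForestSRC implementation; the function names are hypothetical, and discarding non-positive VIMP values before taking the top k is an assumption based on the interpretation given above.

```python
import random


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of patients left out of a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Draw n_samples indices with replacement; the set holds the
        # distinct patients that ended up in the bag.
        in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
        total += (n_samples - len(in_bag)) / n_samples
    return total / n_trees


def select_top_variables(vimp, k=30):
    """Rank variables by VIMP (descending) and keep at most k of them.

    Variables with VIMP <= 0 are discarded first (an assumption of this
    sketch, since such variables are described as not meaningful).
    """
    positive = {name: score for name, score in vimp.items() if score > 0}
    ranked = sorted(positive, key=positive.get, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    print(round(mean_oob_fraction(1000, 50), 3))
    toy_vimp = {"lactate": 0.05, "age": 0.08, "sex": -0.01, "wbc": 0.0}
    print(select_top_variables(toy_vimp))
```

The VIMP values in the usage example are invented solely to show the ranking and are not results from the study.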
The concordance index (C-index) and calibration curves were used to evaluate the performance of the model.
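For reference, Harrell's C-index can be computed as the share of comparable patient pairs in which the patient who died earlier was assigned the higher predicted risk. The following is a minimal sketch, assuming that a higher score means higher risk and that ties in predicted risk count as half-concordant; it is a plain quadratic-time illustration, not the routine used by the R packages in this study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times       - follow-up time for each patient
    events      - 1 if death was observed, 0 if censored
    risk_scores - predicted risk (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            # The pair is comparable when patient i died strictly
            # before patient j's last observed time.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))
```

A value of 1.0 indicates perfect ranking, 0.5 is no better than chance, and 0.0 indicates a completely reversed ranking.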
Data analysis was performed using R 4.0.3 and Python 3.7; the packages used included randomForestSRC, survival, survivalROC, matplotlib, and scikit-learn.