Development and validation of a scoring system for mortality prediction and application of standardized W statistics to assess the performance of emergency departments

Background In-hospital mortality and short-term mortality are indicators that are commonly used to evaluate the outcome of emergency department (ED) treatment. Although several scoring systems and machine learning-based approaches have been suggested to grade the severity of the condition of ED patients, methods for comparing severity-adjusted mortality in general ED patients between different systems have yet to be developed. The aim of the present study was to develop a scoring system to predict mortality in ED patients using data collected at the initial evaluation and to validate the usefulness of the scoring system for comparing severity-adjusted mortality between institutions with different severity distributions. Methods The study was based on the registry of the National Emergency Department Information System, which is maintained by the National Emergency Medical Center of the Republic of Korea. Data from 2016 were used to construct the prediction model, and data from 2017 were used for validation. Logistic regression was used to build the mortality prediction model. Receiver operating characteristic curves were used to evaluate the performance of the prediction model. We calculated the standardized W statistic and its 95% confidence intervals using the newly developed mortality prediction model. Results The area under the receiver operating characteristic curve of the developed scoring system for the prediction of mortality was 0.883 (95% confidence interval [CI]: 0.882–0.884). The Ws score calculated from the 2016 dataset was 0.000 (95% CI: − 0.021 – 0.021). The Ws score calculated from the 2017 dataset was 0.049 (95% CI: 0.030–0.069). Conclusions The scoring system developed in the present study utilizing the parameters gathered in initial ED evaluations has acceptable performance for the prediction of in-hospital mortality. Standardized W statistics based on this scoring system can be used to compare the performance of an ED with the reference data or with the performance of other institutions.


Background
To improve the quality of emergency care, performance measurement is essential [1]. Performance measures include structure, process, and outcome indicators [1,2]. Outcome measures evaluate the effect of patient care and include readmission, mortality, or patient satisfaction [1]. Outcome indicators have many advantages, such as being valid, stable and concrete, so policymakers and patients have shown greater interest in outcome measures [1,2]. In-hospital mortality or short-term mortality are commonly used outcome indicators to evaluate the outcome of emergency department (ED) treatment [3,4]. However, crude mortality has limitations, as it is difficult to interpret the results when differences in severity are not considered. Therefore, scoring systems based on physiological variables were developed and have been used to measure the severity of illness or injury [5][6][7]. The impact of ED process indicators such as ED length of stay or leaving without being seen are evaluated against outcome indicators such as hospital mortality [3,8]. However, mortality without consideration of severity can lead to misleading or inconclusive results. For example, although Singer et al. found that the length of ED stay was associated with increased mortality, the finding is not conclusive because it is possible that more severely ill patients are likely to stay longer in the ED. [9] In the field of trauma care, methods of comparing performance between institutions or trauma systems have been developed and widely used, including the W statistic and standardized W (Ws) based on the Trauma and Injury Severity Score (TRISS) [6,7,[10][11][12][13][14]. Although several scoring systems and machine learningbased approaches have been suggested to grade the severity of illness in ED patients, methods of comparing the severity-adjusted mortality of general ED patients between different systems have yet to be developed [15,16].
We hypothesized that W statistics used in trauma systems could be used to assess ED performance when coupled with an appropriate scoring system that can predict mortality in general ED patients. The aim of the present study was to develop a scoring system for predicting the mortality of ED patients based on data collected at the initial evaluation and to validate the usefulness of the scoring system for comparing severity-adjusted mortality between institutions with different severity distributions by means of the W and Ws statistics.

Study setting and population
The study was based on the registry of the National Emergency Department Information System (NEDIS) maintained by the National Emergency Medical Center of the Republic of Korea. The NEDIS is a computerized system that prospectively collects demographic and clinical data pertaining to ED patients from all emergency medical facilities in Korea [17,18]. The Institutional Review Board of the Dong-A University Hospital determined the study to be exempt from the need for informed consent because the study involves a deidentified version of the preexisting national dataset. (DAUHIRB-EXP-20-062).
Data regarding consecutive emergency visits between January 1, 2016, and December 31, 2017, were extracted from the NEDIS registry, anonymized by the National Emergency Medical Center and provided for this research. Data from the designated regional emergency centers or local emergency centers were included for analysis, while cases from smaller EDs were excluded because of excessive missing data. Regional emergency centers are the highest-level emergency centers in Korea and have been designated as such by the Ministry of the Health and Welfare, while local emergency centers are designated by the governors or mayors and are intended to provide local people with access to emergency medical care [18]. Data from 2016 were used for the construction of the prediction model, and data from 2017 were used for validation. Children aged less than 15 years, patients who transferred from the ED, patients who were in cardiac arrest at the time of arrival, and patients with missing data or highly unreliable values (such as blood pressure above 300 mmHg or pulse oximetry above 100%) were excluded from analysis.

Measurements
Age, sex, cause of visit, level of consciousness measured with the alert, verbal, pain, unresponsive (AVPU) scale, systolic and diastolic blood pressures, heart rate, respiratory rate, body temperature, pulse oxygen saturation at the time of ED arrival, and treatment results at the time of ED discharge and at the time of hospital discharge were retrieved from the NEDIS database. The cause of visit was divided into disease and other.
The outcome variable was defined as mortality, which included mortality in the ED, hospital mortality after admission. Discharge with almost no chance of recovery and the expectation of death in the short term were also regarded as mortality [19,20].

Data analyses
Building the mortality prediction model Each measured variable was plotted against mortality, and score values were arbitrarily assigned to ranges of each variable because it could not be assumed that each measurement has a linear association with mortality. Logistic regression was performed with all the variables initially included. Variables with multicollinearity and negative effects were removed when the final model was built. To simplify the prediction system, the final score for each variable was calculated as two times the arbitrary score multiplied by the coefficient derived from the logistic regression, rounded to an integer value. Finally, the scores were summed to generate the mortality prediction score, which is referred to as the 'Emergency Department Initial Evaluation Score (EDIES)' in this article.

Validation of the mortality prediction score
Receiver operating characteristic (ROC) curves were drawn with the EDIES against mortality with data from 2016 and 2017, and the areas under the curves were calculated.

Standardized W statistic for comparing performance between systems
The W statistic is the difference between the actual number of survivors and the predicted number of survivors per 100 patients analyzed. The W statistic has been used to compare the performance of trauma care systems, with predictions based on the trauma and injury severity score (TRISS) [6,11]. However, the W statistic was criticized for being inadequate to compare systems with different severity distributions; therefore, the standardized W (Ws) statistic was proposed, where the predicted probability of survival is divided into a number of intervals, and the W statistic is calculated for each interval [13]. We calculated the Ws statistic and its 95% confidence intervals in the same manner as described in the literature [13], except that EDIES was used for prediction instead of the TRISS.

Validation of the Ws statistics
To evaluate whether the Ws statistics and their 95% confidence intervals can be used to compare performance between systems with different severity distributions, random, severe, and not severe cases were sampled from the 2017 database. Hypothetical severe cases were sampled according to the gamma distribution  with a shape parameter of 9 and a rate parameter of 1, while hypothetical nonsevere cases were sampled with a shape parameter of 3.4 and a rate parameter of 1.1. The parameters were determined based on trial-and-error to reflect extremely high or low levels of severity. (Fig. 1) For each severity category, 30,000 cases were sampled repeatedly 1000 times each. Actual Type I error rates of the calculated 95% intervals of the Ws statistic were calculated by the percentage of samples for which the population Ws statistic from the entire 2017 data was outside of the 95% confidence intervals of the sample Ws statistic [21,22].

Statistical software used
MedCalc Statistical Software version 19.0.7 (MedCalc Software Ltd., Ostend, Belgium) was used to determine the correlation of factors with univariate mortality and curve fitting with linear or quadratic functions. R version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria, 2019) was used for the other statistical analyses. The package 'stats' was used for logistic regression and sampling cases matched to different severity distributions. The package 'pROC' was used for receiver operating curve analysis.

Characteristics of study subjects
A total of 5,605,288 cases were extracted from the 2016 database. After applying the exclusion criteria, 1,836,577 cases were finally included for model building. From the 2017 database, 5,991,404 cases were extracted, and 2, 041,407 cases were included for further analysis (Fig. 2). The general characteristics of the study population are summarized in Table 1. P-values are not presented because the study was not intended to generate comparisons between the years.

EDIES scoring system
The results of logistic regression of the multivariate association of mortality and variable scores based on univariate mortality and epidemiologic parameters are summarized in Table 2. The final scoring system is presented in Table 3. Body temperature was removed from the final model because the positive univariate association with mortality was inversed in the multivariate regression. Diastolic blood pressure was removed because of multicollinearity with systolic blood pressure. EDIES can theoretically have values between 0 and 39, although with the 2016 dataset, the EDIES were distributed between 0 and 35. (Fig. 3).

Validation of EDIES with the 2017 dataset
When EDIES was applied to the dataset from the 2017 NEDIS registry, the area under the receiver operating characteristic curve for the prediction of mortality was 0.883 (95% confidence interval [CI]: 0.882-0.884, Fig. 4).

Ws statistics
To calculate the Ws statistic, the probability of survival predicted by the scoring system and the fraction for each severity group was derived from the 2016 dataset and are presented in Tables 4 and 5. The Ws statistic calculated from the 2016 dataset was 0.000 (95% CI: − 0.021 -0.021). The Ws statistic calculated from the 2017 dataset was 0.049 (95% CI: 0.030-0.069). From the 2017 dataset, 30,000 cases were sampled repeatedly 1000 times using the random, hypothetical nonsevere, and hypothetical severe categories, and the Ws statistic for each distribution was calculated.
The mean Ws statistic from 1000 random selections was 0.049, while those from nonsevere selections and severe selections were 0.059 and 0.048, respectively. The chances that the 95% CI of the Ws statistic from 30,000 samples included the population Ws statistic, which was 0.049, were 94.5% for random selection, 96.6% for nonsevere samples, and 96.9% for severe samples.

Discussion
We developed a severity score for the prediction of survival or mortality from data collected during the initial    ED evaluation. Although the predicted probabilities of survival might not be useful as prognostic indicators for individual patients, they allow comparisons of the performance of an institution with the original prediction database or with other institutions [14]. Prediction systems such as the Revised Trauma Score (RTS), the Abbreviated Injury Scale (AIS), the Injury Severity Score (ISS), and the TRISS have been used for prognostication for trauma victims and comparisons of performance between trauma systems for decades [6,7,13,14,23]. The area under the ROC curve (AUROC) is commonly used to measure the predictive performance of a scoring system, and the AUROCs of trauma scores including the TRISS are reported to be between 0.57 and 0.98 in the literature [7,23]. There have also been efforts to develop or apply severity scores to general ED patients. Although scoring systems such as the Acute Physiology and Chronic Health Evaluation (APACHE) score or Simplified Acute Physiology Score were developed to predict in-hospital mortality, the application of these systems in the general ED population is limited because they rely on laboratory values that are often not measured in ED patients with relatively mild complaints and are not included in national databases [24][25][26][27]. A recent multinational study found that the National Early Warning Score (NEWS) can predict mortality among adult medical patients in the ED with an AUROC of 0.73, and the combination of the NEWS and laboratory biomarkers yielded an AUROC of 0.82 [16]. Some investigators developed machine-learning models for the prediction of death or admission to intensive care units among ED patients and reported the AUROCs of their models to be between 0.84 and 0.87 [15,28]. The AUROC of the scoring system for survival prediction among general ED patients developed in this study was 0.883, which can be regarded as adequate, considering the fact that patients with diverse complaints were included, no laboratory data are required, and the calculation is much simpler than the calculations needed for trauma scores based on anatomical injuries or sophisticated machine-learning-based models. W statistics were introduced to compare the clinical performance of trauma care between institutions or trauma systems, and the W statistic represents the average increase in the number of survivors per 100 patients treated compared with reference expectations [14].  Although the W statistics are being utilized to compare treatment outcomes among trauma care systems, they have been criticized as inappropriate for comparing systems with different case severity distributions [10][11][12][13].
To address the severity distribution issue, the Ws statistic was proposed as an alternative. The Ws statistic is calculated by standardizing the W statistic with respect to injury severity ranges, in a similar manner to calculating age-standardized mortality rates. The range of severities should be divided into a number of intervals, and the W statistic of each interval is calculated and multiplied by the fraction in the reference population and then summarized into the Ws statistic [13].
Although the W and Ws statistics are based on the TRISS, which was developed for use in trauma settings, the principle underlying the W statistic could be applied to other areas of patient care if appropriate scoring systems are used instead of the TRISS. We developed the EDIES and calculated the Ws statistic based on the EDIES instead of the TRISS, and validated its use for comparing severity-adjusted mortality between institutions with different severity distributions. As the intervals of severity were defined by the authors and the interval definition would affect the validity of the Ws statistic in populations with different case severity distributions, the Ws statistic was validated by taking samples with different severity distributions and comparing the sample Ws statistic with the population Ws statistic. The sample Ws statistics were found to be similar to the population Ws, although the severity distributions of the hypothetical samples were extremely different from that of the population. To assess the statistically significant difference between Ws statistics, the 95% confidence intervals (CI) of the Ws statistic were used [13]. To assess whether the Ws statistics in different severity samples and their calculated 95% CIs can represent the population Ws statistic, samples were taken 1000 times from each severity classification, and the number in which the 95% CI for the sample Ws statistic included the population Ws statistic was counted. Theoretically, it is expected that the population Ws statistic is within the 95% CIs of the sample Ws statistics in 95% of cases, and the results showed that the chances were between 94.5 and 96.9%. Therefore, the Ws statistic and its 95% CI are stable, even with very different severity distributions.

Limitations
While the data were collected from all emergency centers nationally, data from small community EDs were not included; therefore, application of the system to those EDs might be inappropriate. The relatively high rate of exclusion because of missing values may have caused selection bias; more cases were likely included from centers with relatively high proportions of patients with severe cases, which were also likely to have policies in place regarding the recording of vital signs and detailed medical records. Although some existing scoring systems have been derived from populations with fewer missing values, they tend to have been derived from the data obtained in a single hospital, including the Rapid Emergency Medicine Score (REMS) and ViEWS, which later served as a template for NEWS [15,[29][30][31]. Our results are more reliable than the results in those studies because our data were derived from EDs at various levels and of different sizes; however, the amount of missing data was a limitation.

Conclusions
In conclusion, the scoring system developed in the present study utilizing the parameters gathered during the initial ED evaluation has acceptable performance for the prediction of in-hospital mortality. The Ws statistics based on the scoring system can be used to compare the performance of an ED with the reference data or with the performance of other institutions.