Study design and setting
This was a retrospective cross-sectional study using previously identified variables reflective of the presentation of septic patients arriving at the ED at Södersjukhuset. The hospital is located in Stockholm and has more than 120,000 annual ED visits. The study period was January 1 to December 31, 2013.
The inclusion criteria were age ≥18 years, admission to in-hospital care via the ED at Södersjukhuset, and discharge from in-hospital care with an International Classification of Diseases, Tenth Revision (ICD-10) code corresponding to sepsis (A02.1, A22.7, A26.7, A32.7, A39.2, A39.4, A40.0–A40.3, A48–A49, A41.0–A41.5, A41.8–A41.9, A42.7, B37.7, R57.2, R65.0–R65.1).
The exclusion criteria were healthcare-associated infection (HCAI), defined as sepsis onset more than 48 h after arrival at the ED; arrival by emergency medical services (EMS) with ongoing treatment for sepsis or other infectious diseases; unknown mode of arrival; and lack of a personal identification number and medical record.
Definitions and predictive variables
Sepsis was defined as discharge from in-hospital care with an ICD-10 code corresponding to sepsis, as specified above. Data were collected while the SEPSIS-2 criteria were in use. The study population included both EMS patients, who arrived by ambulance or helicopter, and non-EMS patients, who arrived at the ED by any other means. Severe sepsis was defined in accordance with a prior definition adapted for emergency care.
A total of 90 previously identified variables reflecting the clinical presentation of septic patients to the ED (i.e. vital signs, symptoms, observations and information from medical history, see Supplementary figure 1) were used, in addition to mode of arrival. Thus, a total of 91 variables were included and used as input for the machine learning methods, as described below.
Ethical approval and consent to participate
Ethical approval was obtained from the regional review board in Stockholm (“Regionala Etikprövningsnämnden i Stockholm”), reference numbers 2012/1288.31/3 and 2015/1019–32. All methods were carried out in accordance with relevant guidelines and regulations. The requirement for informed consent was waived by the regional review board in Stockholm, as the current study was retrospective and based on a review of medical records.
IBM SPSS Statistics for Macintosh, version 26.0 (IBM Corp., Armonk, NY, USA) was used for the descriptive analysis, i.e. calculating the mean, median, confidence interval and interquartile range for the characteristics of the study population. The Shapiro-Wilk test was used to test for normality.
Balanced random forests
The supervised machine learning models were developed using the Balanced Random Forest Classifier from the imbalanced-learn (imblearn) library. This method can be used to build prediction models and to identify associations between specific variables and the predicted outcome in unbalanced data. The Balanced Random Forest Classifier approaches the challenge of an imbalanced dataset by under-sampling the majority class (bootstrapping) and applying ensemble learning. Thus, the class distribution is changed so that the classes are represented equally in each tree; in this case “Patients who died within 7 or 30 days” and “Patients who survived”. A prior study has shown that under-sampling is a more effective method of balancing data than over-sampling, which explains the choice of under-sampling in the current study.
A 10-fold cross-validation was implemented with an 80:20 train:test distribution. In each fold, the Balanced Random Forest Classifier included 100 decision trees. Each decision tree was created from a randomly selected subset of the fold’s training set through bootstrapping and included an equal number of patients who died and patients who survived. The ensemble learning method called feature bagging was implemented in the development of each tree, randomly selecting a subset of variables, equal in size to the square root of the total number of variables, to be tested at each node split. When the model makes a prediction, it is based on the majority vote across the 100 decision trees.
The fold’s test set was used to determine predictive accuracy, described as the area under the ROC curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (LR) and negative LR. The mean value of each measure across all 10 folds is presented in the results. The SHapley Additive exPlanations (SHAP) interpreter was used to illustrate the relationship between specific predictive variables and the outcome.
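The per-tree mechanics described above can be illustrated with a minimal Python sketch using only the standard library. It is not the imblearn implementation and not the code used in the study. All function names are hypothetical. Drawing the balanced sample with replacement and rounding the square root down are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Pick row indices for one tree with equal numbers from each class.

    Assumption: rows are drawn with replacement, and every class is
    sampled down to the size of the smallest (minority) class.
    """
    rows_by_class = {}
    for index, label in enumerate(labels):
        rows_by_class.setdefault(label, []).append(index)
    n_minority = min(len(rows) for rows in rows_by_class.values())
    sample = []
    for rows in rows_by_class.values():
        sample.extend(rng.choices(rows, k=n_minority))
    return sample


def features_per_split(n_features):
    """Feature bagging: number of candidate variables tested per node split.

    Assumption: the square root is rounded down (91 variables give 9).
    """
    return max(1, int(math.sqrt(n_features)))


def majority_vote(tree_votes):
    """Return the class predicted by most trees.

    On an exact tie, the class that appears first in tree_votes wins.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def diagnostic_metrics(tp, fp, tn, fn):
    """Accuracy measures from one fold's confusion-matrix counts.

    Assumes no denominator is zero, i.e. both classes and both
    predictions occur in the test set and specificity is not 0 or 1.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }
```

In this sketch, `balanced_bootstrap` would be called once per tree on the fold's training labels, and `diagnostic_metrics` once per fold on the test-set counts, with the fold results then averaged.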
An exhaustive search was performed to determine the number of variables to include in the final model, describing how the mean AUC changed with the number of variables included. In each iteration of the search, the least important variable, as measured by Gini impurity, was excluded. See Supplementary figures 2 and 3 for details.
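The elimination procedure can be sketched as a short loop in Python. This is an illustration under stated assumptions rather than the study's code. The `evaluate` callback is hypothetical: it stands in for a cross-validated model fit that returns the mean AUC and a per-variable importance score (for example, Gini importance) for the variables it is given.

```python
def backward_elimination(variables, evaluate):
    """Drop the least important variable one at a time and track the mean AUC.

    evaluate(variables) is a hypothetical callback returning a tuple
    (mean_auc, importances), where importances maps each variable name
    to its importance score in the model fitted on those variables.

    Returns a list of (n_variables, mean_auc, variable_dropped_next).
    """
    remaining = list(variables)
    history = []
    while remaining:
        mean_auc, importances = evaluate(remaining)
        least_important = min(remaining, key=lambda name: importances[name])
        history.append((len(remaining), mean_auc, least_important))
        remaining.remove(least_important)
    return history
```

The returned history gives the mean AUC as a function of the number of variables, from which a final model size could be chosen.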