A unified machine learning approach to time series forecasting applied to demand at emergency departments

Background There were 25.6 million attendances at Emergency Departments (EDs) in England in 2019 corresponding to an increase of 12 million attendances over the past ten years. The steadily rising demand at EDs creates a constant challenge to provide adequate quality of care while maintaining standards and productivity. Managing hospital demand effectively requires an adequate knowledge of the future rate of admission. We develop a novel predictive framework to understand the temporal dynamics of hospital demand. Methods We compare and combine state-of-the-art forecasting methods to predict hospital demand 1, 3 or 7 days into the future. In particular, our analysis compares machine learning algorithms to more traditional linear models as measured in a mean absolute error (MAE) and we consider two different hyperparameter tuning methods, enabling a faster deployment of our models without compromising performance. We believe our framework can readily be used to forecast a wide range of policy relevant indicators. Results We find that linear models often outperform machine learning methods and that the quality of our predictions for any of the forecasting horizons of 1, 3 or 7 days are comparable as measured in MAE. Our approach is able to predict attendances at these emergency departments one day in advance up to a mean absolute error of ±14 and ±10 patients corresponding to a mean absolute percentage error of 6.8% and 8.6% respectively. Conclusions Simple linear methods like generalized linear models are often better or at least as good as ensemble learning methods like the gradient boosting or random forest algorithm. However, though sophisticated machine learning methods are not necessarily better than linear models, they improve the diversity of model predictions so that stacked predictions can be more robust than any single model including the best performing one.


Background
In 2019 there were 25.6 million attendances at emergency departments (EDs) in the UK, corresponding to 70,231 patients attending every day [1]. The National Health Service (NHS) trusts across England are under very high pressure to maintain current standards and quality of care [2]. In fact, the rate of attendance has grown by 4.8% since 2018 and by 10.6% over the last 5 years meaning that it is increasing at a rate faster than population growth thus putting high pressure on our health care system. Failure to make provisions for surges in demand can lead to overcrowding, which in turn has been linked to multiple adverse patient outcomes such as unfavourable patient satisfaction, poor quality of care and diseconomies of scale [3]. In order to decrease overcrowding, the NHS introduced a new operational standard in 2010, commonlyknown as the "four hour target", requiring that at least 95% of patients attending EDs should either be discharged, admitted or transferred within 4 hours of arrival. However, this target has not been met since 2014 and failure rates have reached a new high in January 2020 with 18% of patients at EDs waiting longer than four hours despite the fact that the overall number of attendances was lower in January 2020 than in January 2019. In fact 15.3% of patients spent more than 4 hours in EDs in 2019 compared to 11.9% last year and 5.5% five years ago, making 2019 the year with the worst annual performance on record [1]. A shortage of staff is not the predominant cause for long waiting times and low quality of care [4], rather, it is that the correct type of staffing is not matched with patient demand. This inefficiency in staffing can have substantial impacts such as the 86,264 of elective surgeries that had to be cancelled for non-clinical reasons on the day the patient was due to arrive in 2019. These cancellations leave NHS hospital trusts with lost costs of surgeons, anaesthetists and nurses as well as surgical session time and theatre capacity. Moreover, in the same year, the percentage of patients who had not been treated within 28 days of cancellation decreased from 9.8% in 2018 to 8.8% in 2019, however, still failing one of the NHS's improvement objectives. In addition the NHS England advises that a 85% bed occupancy rate is the maximum safe level of occupancy and it advises that trusts should try and keep bed occupancy below 92%. However, 126 out of 170 trusts recorded bed occupancy above 85%, 58 trusts rates above 92% in the third quarter of 2019, eight trusts had occupancy above 98% and one trust recorded 100% bed occupancy (NHS SitRep).
Overall the NHS in England has spent around £130 billion in 2018/19 on the delivery of health services [5]. A major fraction, 44.9% in the financial year 2016-2017 [6], of this spending is due to the 1.2 million people employed by NHS hospital and community health services. Between February 2018 and 2019 the number of doctors alone rose by 2.5% and by 10% during the past five years [7]. This emphasises the fact that long waiting times may not simply result from a shortage of staff. Nonetheless, every year high agency and staffing costs are required to cover for staffing shortages.
Given the importance of health care optimisation, procedures to help the hospitals balance the right staffing numbers deployed in the most effective way is paramount. This not only has implications for patient outcomes but for general efficiency within the NHS. This paper provides a means to improve health care delivery through a predictive understanding of the rates of admission and the duration of stay at EDs (collectively referred to as "demand"). In particular we provide statistical techniques that can help facilitate an operationally relevant prediction of demand for use in day to day planning. We also contribute to a better understanding of the reasons for demand and highlight why this understanding is fundamental to the delivery of high quality care.

Previous work
Despite its importance, methods for predicting rates of admission and understanding underlying dynamics have not been studied extensively in the literature. Existing methods have been limited to the application of classical time series forecasting methods. In an Australian study, Boyle et al. [8] predict monthly, daily and four hourly demand at 27 EDs in Queensland. Taking public holidays into account as predictors, the authors used an autoregressive integrated moving average (ARIMA) model as well as regression and exponential smoothing methods to predict demand up to a mean absolute percentage error (MAPE) of 11% for daily admissions. McCarthy et al. [9] predicted hourly presentations to American emergency departments while including several temporal, weather and patient factors on the number of hourly arrivals in order to characterise behaviour at EDs. Jones et al. [10] used seasonal autoregressive integrated moving average (SARIMA), time series regression, exponential smoothing and artificial neural network models to forecast demand. The authors had access to two years of data and made predictions ranging from 1 to 30 days in advance for three hospitals in the USA. The authors found seasonal and weekly patterns to be most important for an accurate prediction and obtained a Mean Absolute Predictive Error (MAPE) of 9 − 13% depending on the facility. Champion et al. [11] performed an analysis of monthly demand at an emergency department in regional Victoria. They applied exponential smoothing and Box-Jenkins methods to five years worth of admissions data. Hoot et al. [12] carried out a discrete event simulation in order to obtain forecasts for 2, 4, 6 and 8 hours into the future. The authors analysed waiting times, length of stay and bed occupancy.

Our contribution
We develop a novel, predictive framework to understand the temporal dynamics of hospital demand and we apply an exhaustive statistical analysis to daily presentations at EDs at St Mary's and Charing Cross Hospitals, evaluating a range of standard time series and machine learning approaches and ultimately developing our own unique approach. In contrast to existing studies, we do not only focus on the application of time series algorithms in order to characterise demand but develop a generic procedure that allows us to compare and combine both time series and machine learning algorithms in order to obtain an informative, more appropriate and consistently accurate approach to the prediction of demand. Our models have the ability to be retrained regularly and efficiently and are therefore a powerful tool for online platforms and near real time prediction. Using novel data from electronic logging systems from eight years of daily presentations to EDs at St Mary's and Charing Cross Hospitals in London, we construct a model that predicts the number of daily arrivals to both hospitals. Our analysis accounts for seasonal fluctuations, daily observed weather data and specific, pre-planned events indicated by staff at both EDs such as the yearly Notting Hill Carnival. Using our procedure can help the hospitals with the provision of the right staffing numbers and deploying resources in the most effective way.

Data sources
The St Mary's and Charing Cross Hospitals are part of the Imperial College Healthcare NHS Trust, one of 228 NHS hospital trusts in England. St Mary's Hospital is the major acute care hospital for North West London housing a major trauma centre. Its ED has faced an average demand of 208 patients (with a maximum of 289 and a minimum of 99) every day since 2011. Charing Cross Hospital includes the serious injuries centre for West London as well as a hyper acute stroke unit. On average there have been 106 (with a maximum of 177 and a minimum of 60) daily attendances at the ED since 2011. For our analysis, we had access to electronic data records of the number of daily attendances at the EDs for both hospitals from 2011 to 2018 (see Fig. 1). In order to investigate the demand dynamics, we also collected data on school [13] and bank holidays [14], as well as on the weather and Google search volume for the word "flu" [15,16] (see the Appendix for more details). Finally, experienced staff at the EDs of both hospitals provided us with a list of specific known events in the locality that cause surge in demand (e.g. the Notting Hill carnival -an annual festival taking place in the catchment area of the hospital).

Exploratory data analysis
A central aim of this paper is not only to predict hospital demand accurately but also attempt to understand the factors driving hospital demand. Daily data is driven by a complex web of exogenous variables, many of which are related to seasonal patterns or trends. Both time series depicted in Fig. 1 show a strong underlying trend. In the case of Charing Cross Hospital this trend is clearly upwards while it goes downwards first for St Mary's Hospital due to a change of the hospital's infrastructure, see Fig. 2. There is also clear seasonality, the monthly attendance at St Mary's Hospital shows a very clear monthly periodic pattern with troughs in January, April and August and a rise in attendance during the winter months (likely due to increases in acute respiratory infections), see Fig. 3. Figure 3 also shows indications that bank or school holidays have a strong influence on the number of ED admissions together with the flu season. It also shows that the flu seasons contribution to increased demand runs well into spring. While the monthly attendance at Charing Cross Hospital also shows some periodic behaviour, it is not as strong, see Fig. 4. It is therefore useful to note that dynamics differ even from geographically close hospitals with overlapping catchments. Both series show clear day-of-week patterns, characterised by a strong autocorrelation with respect to their lagged values of order 7, see Fig. 5. Mondays have the highest volume of attendances at both hospitals while attendance reaches its minimum during weekends. This finding validates and confirms other studies on hospital demand [17,18].

Methods
We focus on forecasting demand one, three, and seven days into the future. These particular forecasting intervals are relevant as they allow the hospitals to take action by using short term measures such as the cancellation of elective surgeries or the hiring of additional staff through agencies.
We use two different kinds of algorithms for our predictions: traditional time series and machine learning algorithms [19]. A discrete time series is a sequence of data points in chronological order divided into regular time intervals. The fundamental assumption behind both algorithms is that data points that are close to each other in time show a similar behaviour and that there is a dependency between data points at the same position of the time interval, e.g. same time of the year or same day of the week. Time series algorithms use both the chronology of the events and the specified interval in order to make inference and split the time series into different linear components such as seasonality, trend and a residual. The residual, which is assumed to contain some correlative structure, is usually modelled using an autoregressive stochastic process or exponential decay where future values are predicted based on past values [19]. In contrast, machine learning algorithms specify a broad function class (such as trees or smooth curves) with sufficient capacity to learn complex functions. These algorithms learn from data balancing function complexity with predictive accuracy. For both sets of algorithms we create a model (design) matrix containing explanatory variables. For the time series algorithms only lagged demand from previous time points were used. For the machine learning algorithm lagged demand values were used alongside other covariates (see Table 1). As predictors we have chosen past values such as demand on the previous day, last week and the average of the past week as well as indicators for bank holidays and school holidays. Moreover, we use data from some of the surrounding weather stations on precipitation, minimal and maximal temperature as covariates. Finally, we use search engine query data as a covariate, as it has proven to be a very efficient measure for the detection of influenza epidemics [20]. Table 1 shows a few rows of our model matrix.
Below, we summarise the time series and machine learning algorithms that we have considered (for detailed mathematical information we refer the reader to [21]:  space, an average of the labels of the k nearest neighbours is taken. The disadvantage of this algorithm is that it slows down quickly once the volume of data points increases.

Stacked regression
All the above algorithms have strengths and weaknesses. It is therefore challenging to choose a single model for prediction. We create a consensus model by adopting stacked regression, a particularly effective ensemble approach. The idea of stacked regression is to combine the diversity and strengths of multiple algorithms into a single model with a better overall ability to generalise. We choose a linear stacking model subject to convex combination constraints [27]. Following [27,28] we train a linear stacker on cross validation predictions of the individual time series and machine learning models. This procedure helps us avoid selecting models that overfit to the data. Stacking models is not only empirically motivated, but has a strong theoretical backing and has been proven to perform asymptotically exactly as well (by some loss metric) as the best possible choice for the given data set among the family of weighted combinations of the estimators. Stacking has also been showing to work in a variety of settings [29].

Validation and evaluation
Developing an algorithm with the best possible forecast accuracy is the main goal of this study. However, care must be taken to ensure that these forecasts are not simply overfitting to noise in the data but are accurate and can truly forecast to unseen data. A key innovation of this paper is the development of a novel general purpose time series cross validation procedure to ensure that: a all algorithms are evaluated fairly and equally, b the same data is used in all algorithms, and c forecast errors are completely blind to the held-out data (i.e. exactly as if the model was being used in a real forecasting setting).
Temporal or time series cross-validation [21] is a method to split the data into testing and training sets in order to account for temporal structure in the data. The main idea is that each test set only consists of a forecasting window of one day which lies one, three or seven days in the future while the corresponding training set consists of a number of observations prior to the forecasting window. The origin can either be fixed so that the length of the training window grows by one, three, or seven days for each new test set or it can move forward so that the  training window size remains fixed. We employ the latter method.
Using temporal cross-validation for time series algorithms only requires splitting the data into a training and test set. Adapting the data so that the last 730 days (24 months) of data are held out for testing, 2140 days (roughly six years) were available for training which corresponds to a ∼75%/25% split. We then applied temporal cross validation to each day in the test set (see Fig. 6) using a rolling window so that all training sets consist of the same number of days.
Given the fairness and robustness of our cross validation scheme, we are confident that our results are robust to data shift and not only valid for the data times we have collected.

Validation of hyperparameters
In order to allow for a fair comparison of the machine learning algorithms, the method for temporal crossvalidation has to be adapted as most machine learning algorithms require tuning of their hyperparameters to balance over and under fitting. Therefore we also have to split the training data into a training and a validation set in order to choose the best set of hyperparameters which minimizes the error on the validation set. All machine learning algorithms listed above (see "Machine learning Fig. 6 Splitting the data into training, (one day) validation and test set algorithms" section) were trained on 4 years worth of data and their hyperparameters were compared on a validation set consisting of 730 days. In our analysis we have used two different approaches to split the training set and to choose the hyperparameters. We will call these the batch and the online method.

The batch method
In order to choose the set of hyperparameters which minimizes the error on the validation set, we consider five different approaches. We choose the set of hyperparameters which minimize the error 1 on the previous day, 2 on the past n days, 3 over the whole validation period using an exponential moving average, 4 averaged over the whole validation period, or 5 according to caret's built in rules (see [30]).
For each of the five cases above, we choose the best set of hyperparameters for each day of the test set and refitted all models on a daily basis. Of course refitting every model for each day of the test set is computationally expensive. Therefore we develop an online method, described below.

The online method
In the batch method described above, all models are refitted daily, which takes significant computational power to run and might not be feasible for a deployed version of our methods in a hospital setting. Thus, we explore whether keeping the parameters fixed for longer periods, which yields significant savings in terms of computation, hurts performance. We refit each model over several testing periods of various sizes which are subsets of the test set of length 730 days with a rolling origin, see Fig. 7. Our chosen testing periods are 1 day, 7 days, 30 days, 60 days, 365 days and 730 days long. That means that we validated the best set of hyperparameters for each algorithm for every testing period, and then rolled forwards. The final error rates are the result of the overall error on all predictions on the whole test set of length 730.

Results
We compare the mean absolute error (MAE) and mean absolute percentage error (MAPE) rates for all time series algorithms as well as for the batch and the online method as outlined above (see "The batch method" and "The online method" sections). Finally, we compare our results with the stacked regression.

St Mary's hospital
The MAE of the time series algorithms range from 14.46 to 17 as shown in Table 2, with an MAPE ranging from 6.9% to 8.3%. Hence, although the time series algorithms are simpler and only based on the time series without using any other predictors, such as weather, bank or school holidays (see Table 1 for details on all predictors), they already give relatively accurate results. In case of the machine learning algorithms, independent of the type of hyperparameter tuning we use, most of the results for the batch methods range between an MAE of 14.3 to 15.2 and an MAPE of 6.8% − 7.4%, as shown in Table 3. Only the k-nearest neighbour algorithm performs worse despite different ways of tuning. Especially when using the online method, see Table 4a, the linear models produce the best results for St Mary's hospital with an MAPE of 6.8% in the case of daily retraining of the generalized linear model.

Charing cross hospital
The results for Charing Cross Hospital are similar although the error rates are slightly increased. As shown in Table 2, most time series algorithms yield MAE error rates between 10.5% and 14.5% while the corresponding MAPE error rates range from 8.5% to 12%. The best performance reached using the batch method is also around 10.5% although the overall performance is a little bit better, see Table 5. In case of the online method, the MAPE error rates are as low as 8.6% for gradient boosting machine, the generalized linear model or a simple linear model retrained on a daily basis, as shown in Table 4b.

Stacked regression
In order to make use of the strengths of all algorithms, we applied a generalised linear model as well as a penalized regression to create an effective ensemble approach, see Fig. 6 for details. The best performance was achieved

Interpretation of results
In our analysis we consider a variety of predictors ranging from past values of ED attendance, to the weather forecast and to school and bank holidays. The question left to answer is therefore which of the predictors are actually important for making accurate predictions? Could some of them actually be redundant? In machine learning, variable importance can be defined as the dependence between input and output variables and computed by permuting the values of a given pre-dictor and calculating error on a held out set. This measure has drawbacks in the case of multicollinearity, which could suppress the importance of certain variables in some models. As shown in Fig. 8, the most important variable predicting demand is the average demand the week before -except for glmnet -followed by specific days and months. The importance of these variables varies considerably between algorithms but largely is consistent with one another. To our knowledge, this is the first attempt to quantify the importance of variables in predicting demand. We believe that these importance percentages can be utilised as heuristics to help staff improve their prediction of demand in the absence of statistical analysis.

Discussion
The results of our analysis highlight that in our case a penalised linear method performs equally well or better when compared to more sophisticated machine learning methods such as the gradient boosting algorithm, the random forest algorithm or the stacked regression. However, each of these methods has desirable advantages over the others. While a linear model is much easier to apply in practice, machine learning algorithms can generalize better out of sample. Combining both the machine learning and linear modelling approaches improve the diversity of model predictions so that stacked predictions are more robust and accurate than any single model including the best performing one. This has been formally demonstrated [31] but is also intuitive -it follows from a 'wisdom of the crowd' rule for ensemble learning. However, using a stacked regression comes with the disadvantage that the valuable insight on the drivers of a time series may get lost. Hence the mechanism presented above can be understood as a road map to find the best method for a particular problem weighing its accuracy, simplicity and interpretability. If additional covariates are collected for our problem at hand or the data is more granular, e.g. hourly, this process needs to be reapplied. The online method that we have created has the ability to provide accurate results with a very quick turnaround time: an average model only takes a couple of minutes to train and forecast. Running it over longer periods of time without retraining of the model is much faster and at the same time not significantly worse than tuning the hyperparameters on a daily basis. These properties are highly desirable from an operational standpoint. Our model is easily deployed to perform near real time analysis to help staff and health care planners adjust existing rotas or prepare for excess demand, especially important when health systems are stressed by acute health events such as COVID-19 [32].

Conclusion
The framework, analysis and methodology proposed in this paper are highly relevant from an operational viewpoint. To the authors knowledge, the majority of EDs in the UK do try to account for demand, but do so using ad hoc heuristics. Our approach provides some scientifically backed information to improve these heuristics, but more importantly provides a framework that is quick and easy to implement. Estimates of 1, 3 and 7 day demand forecasts can be created at any time and updated easily allowing statistically backed estimates to be used to inform hospital policy and practice. Our hope is this paper will fill a knowledge gap and increase hospitals uptake in the use of these methods. Given epidemics and disease outbreaks that can strain health systems, our approach can provide added precision to help EDs operate as efficiently as possible. The challenge will then be for teams in hospitals to implement more insight-driven ways of working,