This study was approved by the Indiana State Department of Health (ISDH) data release committee and the Indiana University Institutional Review Board (USA).
Datasets
Data for this study were derived from the original Health Level-7 (HL7) version 2 registration transactions for ED encounters from 96 institutions participating in the Indiana Public Health Emergency Surveillance System (PHESS) between January 1, 2008 and December 31, 2010. The data are not publicly available but can be accessed through the Regenstrief Institute Data Core (https://www.regenstrief.org/hsr/research-programs/rcher/data-core/).
The process for preparing ED encounter data, including the details of each step, was presented in our previous paper [17]. Briefly, registration transactions were processed to ensure that each transaction was unique and contained valid ED encounter data according to PHESS requirements and a set of heuristics drawn from Regenstrief’s long-term real-world experience operating a health information exchange. Unique ED encounters were established using data elements describing person, place and time. The specific fields were (1) healthcare institution (HL7 MSH-4), (2) ED encounter date (HL7 PV1-44), and (3) medical record number (HL7 PID-3). Transactions missing any of these fields could not be definitively and uniquely identified as an encounter and were excluded from the analysis.
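The encounter-identification step can be illustrated with a minimal Python sketch. It is not the pipeline described in [17]: the `Transaction` type, its field names, and the "keep the first transaction seen" rule are assumptions made for illustration, and the PHESS validity checks and heuristics are omitted.

```python
from typing import Iterable, List, NamedTuple, Optional


class Transaction(NamedTuple):
    """Hypothetical, simplified view of one HL7 registration transaction."""
    institution: Optional[str]     # HL7 MSH-4
    encounter_date: Optional[str]  # HL7 PV1-44
    mrn: Optional[str]             # HL7 PID-3


def unique_encounters(transactions: Iterable[Transaction]) -> List[Transaction]:
    """Keep one transaction per (institution, encounter date, MRN) key.

    Transactions missing any of the three key fields are dropped, because
    they cannot be tied to a single encounter.
    """
    seen = set()
    kept = []
    for t in transactions:
        if not (t.institution and t.encounter_date and t.mrn):
            continue  # incomplete key: excluded from the analysis
        key = (t.institution, t.encounter_date, t.mrn)
        if key not in seen:
            seen.add(key)
            kept.append(t)  # assumption: the first transaction seen is retained
    return kept
```

In this sketch the key is an exact match on all three fields; any normalization of dates or identifiers would have to happen before the call.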
Unique patients were identified using various combinations of patient demographics, including social security number, last and first name, gender, date of birth, telephone number, and zip code, as determined by an open-source probabilistic record linkage software package [18]. In this manner all ED encounters belonging to the same patient were linked, forming a “patient group.” A unique global patient identifier was assigned to each patient group. In total, we identified 7,447,521 unique ED encounters. Data available for analysis include age, sex, chief complaints, ZIP codes of patients’ addresses, and hospital ZIP codes. Each patient’s global identifier was used to link visits across different hospital databases, including all ED visits regardless of disposition.
Predictive model
We developed multivariable logistic regression models. Patients with at least one ED visit in 2008 were used to predict ED visits in 2009 and 2010. Patients who died before January 1, 2009 or had missing values in one or more covariates were excluded (<4.30 %). The final sample size was 1,272,367 patients. All variables were summarized at the patient level for model development.
Covariates
All covariates were determined based on the ED utilization data in 2008.
Age: age was determined at the time of the first ED visit and divided into six subgroups: <5, 5–14, 15–24, 25–44, 45–64 and ≥65 years.
Sex: male and female.
Visits in 2008: the total number of ED visits made in 2008 by each patient.
Chief complaints: the chief complaint syndromes were grouped into 11 categories: respiratory, gastrointestinal (GI), undifferentiated infection (UDI), influenza-like illness (ILI), lymphatic, skin, neurological, pain, dental, alcohol and musculoskeletal syndromes. These categories have been used by other surveillance programs with slight modification [19–21]. Chief complaints that could not be grouped into the above 11 syndromes were assigned to “unclassified”. The categories were then reviewed by two physicians (Grannis S, Finnel JT) and an epidemiologist. For each patient, the proportion of each chief complaint syndrome was determined by dividing the number of ED visits with that syndrome by the total number of ED visits that the patient had in 2008. Since one ED visit may have more than one syndrome, these percentages do not add up to 100 %.
Zip code centroid straight-line distances: the Perl library Geo::Distance was used to calculate the straight-line distance from each patient’s home to the hospital, based on the zip code centroids of the patient’s home address and the hospital address. Distance was then grouped into three categories: ≤5 miles, >5–20 miles and >20 miles. Since one patient may have multiple ED visits at different distances, we determined the proportion of ED visits falling into each of the three categories by dividing the number of ED visits in a given distance category by the total number of ED visits that the patient made in 2008. Because the proportions for these three distance categories add up to 100 %, only two categories (≤5 miles and >20 miles) were included in the analytic model.
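The patient-level covariates described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the study code, which used Perl (Geo::Distance) and SAS: the haversine great-circle formula, the Earth-radius constant, the category labels, and all function and key names are hypothetical choices made for this example.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def age_group(age):
    """Map age in years at the first ED visit to one of six subgroups."""
    for upper, label in ((5, "<5"), (15, "5-14"), (25, "15-24"),
                         (45, "25-44"), (65, "45-64")):
        if age < upper:
            return label
    return ">=65"


def distance_category(miles):
    """Assign a home-to-hospital distance to one of three categories."""
    if miles <= 5:
        return "le5"
    if miles <= 20:
        return "5to20"
    return "gt20"


def patient_covariates(visits):
    """Summarize one patient's 2008 visits.

    `visits` is a non-empty list of dicts with hypothetical keys
    'syndromes' (set of syndrome names) and 'miles' (float).
    """
    n = len(visits)
    syndrome_counts = Counter(s for v in visits for s in v["syndromes"])
    distance_counts = Counter(distance_category(v["miles"]) for v in visits)
    covariates = {"visits_2008": n}
    # A visit may carry several syndromes, so these need not sum to 1.
    for syndrome, count in syndrome_counts.items():
        covariates["prop_" + syndrome] = count / n
    # The three distance proportions sum to 1, so the middle one is omitted.
    for cat in ("le5", "gt20"):
        covariates["prop_dist_" + cat] = distance_counts[cat] / n
    return covariates
```

Dropping the middle distance category avoids perfect collinearity among the three proportions, which is the reason given in the text for entering only two of them into the model.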
Study outcome
The outcome was measured as a dichotomized variable (frequent versus low ED user). Frequent ED use was investigated using visit cut-points ranging from 8 to 16 visits over a two-year period (2009 and 2010). One model was fit for each cut-point. Patients were defined as frequent ED users if their number of ED visits was equal to or higher than the visit cut-point, and were otherwise defined as low ED users.
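The outcome definition amounts to one binary label per patient per cut-point. A minimal Python sketch, in which the function names and the dictionary layout are hypothetical, could look like this:

```python
CUT_POINTS = range(8, 17)  # 8 through 16 visits, inclusive


def label_frequent_users(visit_counts, cut_point):
    """Return 1 (frequent ED user) or 0 (low ED user) for each patient.

    `visit_counts` maps a patient identifier to the number of ED visits
    in 2009 and 2010 combined.
    """
    return {pid: int(n >= cut_point) for pid, n in visit_counts.items()}


def labels_for_all_cut_points(visit_counts):
    """Build one outcome vector per cut-point, one model being fit to each."""
    return {c: label_frequent_users(visit_counts, c) for c in CUT_POINTS}
```

Because a patient who meets a higher cut-point also meets every lower one, the set of frequent users shrinks monotonically as the cut-point rises.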
Model performance evaluation
The models’ performance was assessed for discrimination using receiver operating characteristic (ROC) curves. We balanced the goal of identifying all frequent ED users against the intervention cost of incorrectly identifying patients as frequent ED users by selecting a fixed sensitivity of 25 % to minimize the false positive rate. We then evaluated the specificity and positive predictive value (PPV) of each model at this fixed sensitivity of 25 %. We also combined the false positive (FP) patients who had 8 or more visits with the true positive (TP) patients to obtain the “adjusted” positive cohort. The “adjusted” PPV was determined by dividing the size of the “adjusted” positive cohort by the sum of TP and FP. Statistical analyses were conducted using SAS version 9.3 (SAS Institute Inc., Cary, North Carolina).
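The evaluation at fixed sensitivity can be sketched in Python as below. The analysis itself was run in SAS, and the text does not state how the probability threshold was chosen, so the rule used here (the highest threshold that reaches the target sensitivity) is an assumption, as are all function and key names.

```python
import math


def evaluate_at_sensitivity(scores, visits, cut_point,
                            target_sensitivity=0.25, floor_cut_point=8):
    """Specificity, PPV and "adjusted" PPV at a fixed sensitivity.

    `scores` are predicted probabilities and `visits` the observed
    2009-2010 visit counts, in the same patient order.
    """
    actual = [v >= cut_point for v in visits]
    n_pos = sum(actual)
    if n_pos == 0 or n_pos == len(actual):
        raise ValueError("need both frequent and low ED users")
    # Number of true positives required to reach the target sensitivity.
    needed = max(1, math.ceil(target_sensitivity * n_pos))
    pos_scores = sorted((s for s, a in zip(scores, actual) if a), reverse=True)
    threshold = pos_scores[needed - 1]
    flagged = [s >= threshold for s in scores]

    tp = sum(f and a for f, a in zip(flagged, actual))
    fp = sum(f and not a for f, a in zip(flagged, actual))
    tn = sum(not f and not a for f, a in zip(flagged, actual))
    # False positives who still had at least `floor_cut_point` visits.
    fp_near = sum(f and not a and v >= floor_cut_point
                  for f, a, v in zip(flagged, actual, visits))
    return {
        "threshold": threshold,
        "sensitivity": tp / n_pos,
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "adjusted_ppv": (tp + fp_near) / (tp + fp),
    }
```

With tied scores the achieved sensitivity can exceed the target, since every patient at or above the threshold is flagged. At a cut-point of 8 no false positive can have 8 or more visits, so the adjusted PPV equals the PPV.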