With French GN or SN, we showed that the FRENCH scale gives better triage results than the US ESI scale, regardless of the nurses' experience. For unstable patients, the ESI scale performed less well: its higher rate of undertriage could potentially lead to adverse outcomes, which supports the use of the FRENCH scale. Inter-observer triage concordance showed no significant difference between the two scales at any level of experience, based on 8100 dual-triage results. Moreover, the practicality assessments of the two French user populations favor the FRENCH scale in terms of learning, ease of use, and tool safety.
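Comparing each nurse's decision with the gold standard amounts to classifying it as correct, undertriaged, or overtriaged. The Python sketch below illustrates that classification under assumptions that this section does not state: levels are coded as integers with 1 as the most urgent (any non-numeric sub-level would first have to be mapped to an ordered integer), and the unstable-patient subgroup is approximated by gold-standard levels 1–2. The function name `triage_outcomes` is hypothetical.

```python
from collections import Counter


def triage_outcomes(assigned, gold, max_gold_level=None):
    """Proportions of correct, under- and overtriaged decisions.

    Assumes levels are integers with 1 = most urgent, so an assigned level
    numerically higher than the gold standard counts as undertriage.
    max_gold_level restricts the analysis to cases whose gold-standard
    level is at most this value (e.g. 2 for levels 1-2).
    """
    if len(assigned) != len(gold):
        raise ValueError("assigned and gold must have the same length")
    pairs = list(zip(assigned, gold))
    if max_gold_level is not None:
        pairs = [(a, g) for a, g in pairs if g <= max_gold_level]
    if not pairs:
        raise ValueError("no cases to evaluate")
    counts = Counter()
    for a, g in pairs:
        if a == g:
            counts["correct"] += 1
        elif a > g:  # less urgent level than the gold standard
            counts["undertriage"] += 1
        else:        # more urgent level than the gold standard
            counts["overtriage"] += 1
    n = len(pairs)
    return {k: counts[k] / n for k in ("correct", "undertriage", "overtriage")}
```

For example, `triage_outcomes([1, 2, 3, 3, 4], [1, 3, 2, 3, 4])` returns 0.6 correct, 0.2 undertriage and 0.2 overtriage; passing `max_gold_level=2` keeps only the two high-acuity cases and returns 0.5 correct and 0.5 undertriage.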
We used the 60 reference cases developed by the designers of each of the two triage scales; their ratings defined the gold-standard triage for these fictional patients. The same nurses triaged the cases of both scales after training, but without prior experience of either scale. These paper-based clinical scenarios guarantee identical triage conditions: each nurse faces the same cases, with no subjectivity induced by a real patient, no need for simultaneous triage by two blinded judges, and no a posteriori triage bias. Performance is thus assessed independently of the nurses' triage experience and departmental habits [9]. These results are consistent with other studies using these same ESI cases (κPL = 0.84, CI: 0.77–0.91) [17].
For inter-observer concordance, we obtained satisfactory results with both scales, and our data are consistent with the literature (FRENCH κPL = 0.77, κ = 0.64; ESI v.3 κPL = 0.89) [9, 18]. Our concordances are lower than those reported in the reference articles, in which nurses used the evaluated scale in routine practice [9]. To allow a fair comparison, our nurses used neither scale in routine practice. We did not restrict the comparison to nurses with similar experience but included all levels of experience; such a restriction would have artificially increased both concordances. In addition, because they are designed to test knowledge of the scale thoroughly, the reference cases proposed by the developers of each scale appear more difficult to triage than the average real case. Finally, triage of paper cases usually yields lower agreement than triage of equivalent real cases [19].
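The concordance figures above are kappa statistics, which correct the observed agreement between two raters for the agreement expected by chance. As a sketch, Cohen's kappa with optional linear weighting can be computed as below. We assume here that κPL denotes a linearly weighted kappa, which this section does not define, and `cohen_kappa` is an illustrative helper, not the software used in the study.

```python
def cohen_kappa(rater1, rater2, categories, weighting=None):
    """Cohen's kappa for two raters over ordered categories.

    weighting=None gives the unweighted kappa; weighting="linear" penalises
    a disagreement between categories i and j by |i - j| / (k - 1).
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    if weighting not in (None, "linear"):
        raise ValueError("weighting must be None or 'linear'")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Contingency table of joint ratings (counts).
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        counts[index[a]][index[b]] += 1
    # Marginal proportions for each rater.
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance if linear.
        if weighting == "linear":
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = sum(weight(i, j) * counts[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1 - observed / expected
```

Writing kappa as one minus the ratio of observed to chance-expected disagreement covers both variants with a single formula: with 0/1 weights it reduces to the familiar (p_o − p_e) / (1 − p_e). For instance, `cohen_kappa([1, 1, 2, 2], [1, 2, 2, 2], [1, 2])` returns 0.5.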
This study has several limitations. Paper scenarios yield different triage results than real cases, although they allow better inter-individual comparability of triage. Paper cases may also not be representative of real clinical practice in the ED and leave room for imagination; cases simulated by actors would avoid this limitation. Furthermore, as the clinical scenarios differed between the two scales, the observed differences may reflect differences in scenario difficulty (level 1–2 scenarios: 13/60 for FRENCH and 26/60 for ESI). Using the same scenarios for both scales, rated by consensus of experts in both scales, would avoid this important limitation. However, evaluation of the clinical cases by the experts who constructed each scale seemed more robust than a comparative evaluation by independent experts. The same raters took part in the test of each scale. Despite a 15-day interval between the two tests for each nurse and reversal of the test order for half of the group, contamination between the two scales remained possible (Additional file 1: Figure S1). Because triage decisions vary considerably between judges using the same scale, it seemed important to use the same judges for both scales so that the rates of correct triage could be interpreted for each scale [2]. In a randomized trial with different judges in each group, a comparison of kappa values could not separate the influence of the scales from that of the judges, because the intrinsic kappa of a scale is low.
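The counterbalanced design described above (the same raters for both scales, a 15-day interval, and the test order reversed for half of the group) can be sketched as follows. Random allocation of nurses to the two halves, the start date, and the function name `counterbalance` are assumptions for illustration only; this section does not say how the halves were formed.

```python
import random
from datetime import date, timedelta


def counterbalance(nurse_ids, first_session, gap_days=15, seed=None):
    """Assign each nurse a scale order and two test dates.

    Hypothetical sketch: nurses are shuffled at random (an assumption),
    the first half takes FRENCH then ESI, the second half the reverse.
    With an odd number of nurses the extra one goes to the ESI-first half.
    """
    ids = list(nurse_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    second_session = first_session + timedelta(days=gap_days)
    schedule = {}
    for pos, nurse in enumerate(ids):
        order = ("FRENCH", "ESI") if pos < half else ("ESI", "FRENCH")
        schedule[nurse] = [(order[0], first_session), (order[1], second_session)]
    return schedule


if __name__ == "__main__":
    plan = counterbalance(range(6), date(2020, 1, 6), seed=1)
    for nurse, sessions in sorted(plan.items()):
        print(nurse, sessions)
```

Reversing the order between halves does not remove contamination; it only spreads any carry-over effect evenly across the two scales, which is why contamination remains a limitation.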
The experienced nurses worked in the same emergency department, but each had previous experience in other departments. The two populations differ in size, but each sample represents a large share of its group (69.6% of the experienced nurses and 90.0% of the nursing students, respectively). For both scales, the training received by the learners was short, and the time allowed to complete the questionnaires was limited. The response rate differed between the two groups, which is related to completion speed: unlike the experienced nurses, most nursing students were unable to fully complete the questionnaire. To limit measurement bias, nurses had to answer the questions in order. A post-training test spread over several days might have yielded different results. Subjectivity bias was limited by retaining the same nurses for both scales. Although the greatest care was taken at this stage, translating the ESI scale and its paper cases into French may have lost some nuances. Selection bias was limited by the absence of sampling, since all existing scenarios for each scale were included.
Our results appear to favor the FRENCH scale. This may reflect the fact that each scale was developed in a specific context and for a specific care organization. Although our nurses were naive to both scales and received identical training on each, they may have been influenced by French models of training, education, and organization. The French model defines the nurse as the implementer of medical decisions. In contrast, the American model gives greater weight to the nurse's own decision-making, and American nurses are more extensively trained and more highly qualified than nurses in our European system. Thus, for our French nurses, the FRENCH scale may appear better suited to their practice and safer. Indeed, the FRENCH scale leaves nurses less latitude than the ESI. So, in a French-style health care system, French nurses and nursing students seem more inclined to apply the FRENCH scale as it was designed than the American ESI scale.
Our study compares two cultures and two ways of thinking. The ESI leaves more room for the nurse's judgment, coupled with an assessment of the care needed. The FRENCH scale is closer to an electronic triage scale. A recent paper evaluated a machine-learning-based electronic triage system (e-triage) that predicts the likelihood of acute outcomes, enabling improved patient differentiation [20]. Like the FRENCH scale, e-triage applies a model, here a random forest, to triage data (vital signs, chief complaint) to determine a triage score, and it seems to reduce ESI undertriage. In both cases, the final adjustment of the triage score, depending on the clinical context and the patient's medical history, relies on experience: that of the triage nurse for FRENCH, and big data for e-triage. Given their similar results relative to the ESI, these two approaches to final triage adjustment still need to be compared with each other in a prospective study.
However, in the absence of a gold-standard scale, any comparison must pay close attention to the evaluation criteria. The ESI, for instance, takes into account the care to be provided, which influences its indirect validity. Similarly, an e-triage scale will naturally show good concordance results. Direct validity is not influenced by the scale, but it can only be assessed on paper cases. However, gold-standard scenarios are necessarily developed by the designers of the scale, and less difficult scenarios can yield better rates of correct triage (direct validity). Using identical scenarios for both scales, triaged by each scale's own experts, would limit this bias.
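The effect of scenario difficulty on direct validity can be made concrete: the overall rate of correct triage is a weighted average of the per-stratum rates, with weights given by the case mix. The sketch below uses the level 1–2 counts reported in the limitations (13/60 for FRENCH, 26/60 for ESI). The per-stratum accuracies are purely hypothetical inputs, not study data, and which stratum is harder to triage is itself an assumption; `overall_correct_rate` is an illustrative name.

```python
def overall_correct_rate(n_high, n_total, acc_high, acc_low):
    """Expected overall rate of correct triage for a scenario set.

    n_high   -- number of high-acuity (level 1-2) scenarios in the set
    n_total  -- total number of scenarios in the set
    acc_high -- hypothetical rate of correct triage on level 1-2 cases
    acc_low  -- hypothetical rate of correct triage on the other cases
    """
    if not 0 <= n_high <= n_total or n_total == 0:
        raise ValueError("need 0 <= n_high <= n_total and n_total > 0")
    share_high = n_high / n_total
    # Weighted average of the two strata, weighted by the case mix.
    return share_high * acc_high + (1 - share_high) * acc_low


# Same hypothetical nurse performance applied to both case mixes.
french_set = overall_correct_rate(13, 60, acc_high=0.6, acc_low=0.8)
esi_set = overall_correct_rate(26, 60, acc_high=0.6, acc_low=0.8)
print(f"FRENCH-like mix: {french_set:.3f}, ESI-like mix: {esi_set:.3f}")
```

With these made-up accuracies the script prints 0.757 for the FRENCH-like mix and 0.713 for the ESI-like mix: a gap of about 4 percentage points that arises from case mix alone, with identical nurse skill. This is the bias that identical scenarios for both scales would remove.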