Overall, internal consistency reliability (Cronbach’s alpha) was strong for both MDP domains (Immediate Perception and Emotional Response) across all three recall administrations. During the ED visit, test–retest reliability in recall MDP ratings for dyspnea at the time participants decided to seek care in the ED was strong for individual items and very strong for the two domains (Table 3). Within-subjects agreement (intra-rater reliability) was satisfactory for both domains (Additional file 1: Table A2). In contrast, for the much longer recall interval between the ED and follow-up visits, the test–retest reliability (Table 3) and within-subjects agreement (Additional file 1: Table A2) were poor for individual items and markedly attenuated for the two domains.
For the short recall interval during the ED visit, the median within-subjects difference in scores was 0 for individual items and from 0 to 0.2 for the mean domain scores (Additional file 1: Table A2). There was a small but consistent shift toward lower ratings on the second administration in the ED (Table 1). Assuming the earliest recall rating as the reference standard, the consistency and amount of shift indicate a systematic error or bias of approximately +0.3 points on average, as reflected in the positive mean within-subjects differences (Figure 1). This shift was also evident in the absolute values of within-subjects differences at the 75th, 90th, and 95th percentiles generally exceeding the corresponding absolute values at the 25th, 10th, and 5th percentiles, respectively (Additional file 1: Table A2).
For the much longer test–retest interval between the ED and follow-up visits, median within-subjects differences were 0 for individual items and −0.2 to +0.1 for the mean domain scores (Additional file 1: Table A2). There was a small but consistent shift toward higher recall ratings at the follow-up compared with the initial recall ratings in the ED. This was reflected in the negative mean within-subjects differences of approximately −0.3 points for the Immediate Perception items and −0.5 points for the Emotional Response items (Figure 2). This shift was also evident in absolute values of within-subjects differences at the 25th, 10th, and 5th percentiles generally exceeding the absolute values of differences at the 75th, 90th, and 95th percentiles (Additional file 1: Table A2).
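The percentile comparison described in the two preceding paragraphs can be illustrated with a minimal sketch. It is not the analysis code used in this study: the function names and the ratings are hypothetical, the sign convention (earlier rating minus later rating) is inferred from the text, and the linear-interpolation percentile rule is an assumption that may differ from the rule used for Table A2.

```python
from statistics import mean, median


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in 0..100) of an already sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def shift_summary(first, second):
    """Summarize within-subjects differences (first minus second rating)."""
    diffs = sorted(a - b for a, b in zip(first, second))
    # Compare each upper percentile with its mirror-image lower percentile.
    pairs = [(75, 25), (90, 10), (95, 5)]
    return {
        "mean": mean(diffs),
        "median": median(diffs),
        "upper_exceeds_lower": [
            abs(percentile(diffs, u)) > abs(percentile(diffs, l)) for u, l in pairs
        ],
    }


# Hypothetical ratings for ten participants (NOT study data).
first_recall = [5, 6, 4, 7, 3, 8, 5, 6, 7, 4]
second_recall = [5, 5, 4, 6, 3, 8, 4, 6, 7, 5]
print(shift_summary(first_recall, second_recall))
```

In this made-up example the median difference is 0 while the mean is positive and every upper percentile exceeds its lower counterpart in absolute value, which is the pattern described for the short recall interval. Reversing the two lists gives the mirror-image pattern described for the follow-up interval.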
The magnitude of these shifts was small across both test–retest intervals. In addition, the 95% CIs for differences for a majority of the individual items in Figure 1 (Time 0a–Time 0b) and for all individual items and domain scores in Figure 2 (Time 0a–Time 0c) are consistent with 0 difference, and the 95% CIs in Figure 2 are much wider than those in Figure 1. However, within each recall interval, the shifts were in the same direction throughout the percentile distributions of within-subjects differences for items and domains (Additional file 1: Table A2), suggesting that the shifts are not due to outliers. In Figure 1, it is noteworthy that the point estimates for mean paired differences are > 0 for each mean domain score and for 11 of 12 individual items, whereas in Figure 2, the point estimates for mean paired differences are < 0 for each mean domain score and for 11 of 12 individual items. The consistency of those shifts within each test–retest interval is unlikely under a null hypothesis of random error around 0 difference and, on that basis, we believe systematic error (bias) to be a more plausible explanation. However, these shifts were not anticipated findings and deserve further investigation before any firm conclusions can be drawn.
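The argument that 11 of 12 same-signed point estimates is unlikely under random error can be made concrete with a simple sign-test calculation. The sketch below is illustrative only and was not part of the reported analysis; the function name is hypothetical. It assumes that each item's sign is an independent fair coin flip under the null hypothesis. Because the 12 items were rated by the same participants and are correlated, that assumption is optimistic, and the resulting probability probably overstates the strength of the evidence.

```python
from math import comb


def sign_consistency_pvalue(n_items: int, n_same_sign: int) -> float:
    """Two-sided probability of a sign split at least this lopsided
    when each of n_items signs is an independent fair coin flip."""
    k = max(n_same_sign, n_items - n_same_sign)
    upper_tail = sum(comb(n_items, j) for j in range(k, n_items + 1)) / 2 ** n_items
    # Doubling the tail can exceed 1 for an even split, so cap at 1.
    return min(1.0, 2 * upper_tail)


# 11 of 12 item-level point estimates share a sign in each figure.
print(f"{sign_consistency_pvalue(12, 11):.4f}")  # 26/4096, about 0.0063
```

Under the independence assumption, a split as lopsided as 11 versus 1 would arise by chance in well under 1% of cases, which is in line with the interpretation above, subject to the caveat about correlated items.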
We found that test–retest reliability for the items and mean domain score for Immediate Perception was stronger than for the Emotional Response items and domain score. In several studies in the pain literature, recall was more reliable and accurate for sensory compared with affective ratings or pain descriptor choices. Although the component structure of the MDP recall ratings was similar across administrations, one notable difference was that Frustrated was the Emotional Response item with the strongest loading in both ED administrations, whereas Afraid was the strongest loading Emotional Response item during the follow-up visits (Additional file 1: Table A1).
In contrast to our findings, studies of neurological symptoms, specifically dizziness and headache, have found substantial imprecision or lack of concordance in response to the same questions on two occasions in the ED or to two semantically similar questions asked concurrently. However, in both of those studies, the recall or concordance task involved nominal categories (i.e., qualitative descriptor categories or dichotomous, yes/no type, choices), not rating scales (as in the present study). It may well be the case for self-reported symptoms that test–retest reliability (or the assessment thereof) is facilitated if numerical rating scales are used rather than nominal (unordered) categorical choices. Alternatively, it is conceivable that symptom recall in the ED may be more reliable for dyspnea than it is for dizziness or headache.
An important limitation of the study is that we were unable to measure pre-arrival dyspnea in real time. The use of recall ratings was necessitated by limitations on approaching patients for participation until after initial clinical evaluation. In addition, the protocol did not include objective measures related to dyspnea during the ED visit against which the recall ratings could be assessed. However, in a previous publication, MDP “now” ratings during the follow-up visit were significantly and positively correlated with other measures of functional limitation due to breathlessness or fatigue, somatization, depression, and anxiety.
Other study limitations included convenience sampling, exclusion of patients who were unstable, and practical and ethical constraints on when initial contacts with patients and enrollment could occur relative to arrival in the ED. In addition, there were several limitations to our statistical analysis. Convenience sampling is difficult to avoid in observational studies with acutely ill patients, and we necessarily had to exclude patients who were unstable or whose capacity to consent was adversely impacted by their condition. Although participation was limited to English-speaking patients, nearly all exclusions on that basis were of patients who were Spanish speaking. Nonetheless, more than a quarter of participants were Hispanic. With respect to statistical analysis, we used principal components analysis rather than factor analysis to assess domain structure of the recall ratings. Estimates for component loadings, communalities, and total explained variance tend to be somewhat inflated for principal components compared with factor analysis. However, they generally agree on the number of components or factors to keep and which items load primarily on which factors [62–64] (see Additional file 1: Principal components analysis and Table A1).
At the same time, several strengths of this study are notable. Apart from the limitations noted above, our inclusion criteria were broad, and our sample was diagnostically heterogeneous, suggesting that use of the MDP in the ED is not diagnosis-specific. We believe that enhances its potential usefulness in the ED. In conjunction with previous evidence of internal validity of the MDP (e.g., that items can discriminate between different dyspnea stimuli in controlled experiments and that “now” ratings are responsive to clinical change in the ED), results of the present study support its external validity. In addition, as recommended by Broderick and colleagues, we used a multiple-item instrument, gave clear and consistent instructions as to the rating task and dimensions to be rated, and referenced recall to a specific point in time, namely the decision to come to the ED. Our results demonstrate high reliability in dyspnea recall when using the MDP during an ED visit and a high degree of similarity in factorial structure to MDP “now” ratings obtained after initiation of treatment. However, we also found that test–retest reliability was poor for individual items and markedly decreased for domain scores over a 4- to 6-week recall interval between the ED and follow-up visits.