WHAT IS ALREADY KNOWN ON THIS TOPIC
- Minimally important change (MIC) thresholds determine the smallest amount of change captured by a patient-reported outcome measure (PROM) that is clinically meaningful, thus facilitating score interpretation and shared decision-making. At present, MIC thresholds are rarely reported in youth mental health trials.
WHAT THIS STUDY ADDS
- This is the first study to identify anchor-based MIC thresholds for two commonly used PROMs, the Columbia Impairment Scale (CIS) and the Strengths and Difficulties Questionnaire (SDQ), in Canadian youths who accessed mental healthcare. Score changes equivalent to 12% and 8% reductions in CIS and SDQ baseline scores, respectively, were perceived as meaningful by youths; conventional rules of thumb defining a response to treatment as a 50% reduction in baseline scores may thus underestimate impact.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
- The MIC provides a patient-centred outcome indicator and may bring attention to subtle but meaningful changes in outcomes. However, in light of measurement error, recall bias and imprecision, clinicians should apply caution when interpreting a youth’s score change against MIC thresholds, ideally using the MIC alongside other information, clinical judgement and in conversations with patients and families.
Background
Evidence-based youth mental healthcare requires research data that clinicians, patients and family members can easily interpret to make shared treatment decisions. Many patient-reported outcome measures (PROMs), however, generate scores that lack inherent meaning, especially when measuring change over time. Effect sizes are difficult to interpret without statistical training and may conceal heterogeneity in treatment effects. For clinical decision-making, clinicians may prefer binary outcome indicators like remission, recovery or clinically significant change, which indicate whether a youth has moved below a diagnostic threshold.1 Diagnostic thresholds are, however, often unavailable for PROMs assessing non-symptom constructs like functional impairment, which expert groups and funders have identified as a core outcome to measure.2–4 Another binary indicator, response, does not require diagnostic thresholds but is often defined crudely, for example, as a 50% reduction in the baseline score.5
The ‘minimally important change’ (MIC) threshold is designed to address these challenges and represents the smallest amount of change that is perceived to be clinically meaningful to those receiving treatment.6 7 There are two groups of methods for defining a PROM’s MIC. First, anchor-based methods relate the PROM change score to an external criterion of clinical relevance, typically established by patients (or caregivers) completing a self-report measure of perceived change. Anchor-based methods include means-based estimations, receiver operating characteristic (ROC) analysis and predictive modelling. Second, distribution-based methods define an MIC with reference to the scoring distribution or scale precision and have no inherent clinical meaning. Hence, only anchor-based MIC thresholds enable the interpretation of treatment benefits as defined by patients. Comparing MIC estimates obtained through different methods and examining convergence has been recommended to select the most trustworthy threshold.8
Despite its potential for enhanced interpretability, the MIC was not reported in any of the 98 trials for youth depression included in a 2021 systematic review.5 On 26 November 2024, we conducted a rapid search in PsycINFO and MEDLINE via OVID (see online supplemental appendix S1) to identify studies establishing ‘minimally important change’ or ‘minimally important difference’ thresholds for PROMs used with youths with mental health difficulties. This search yielded only two relevant studies: one examining paediatric measures of the Patient-Reported Outcomes Measurement Information System and another focusing on the Multidimensional Anxiety Scale for Children.9 10 One additional study reported using an MIC threshold for the Patient Health Questionnaire-8 in youths, but without providing any information on how this threshold was obtained.11
Objective
The primary aim of this study is to establish anchor-based MIC thresholds for two commonly used PROMs in an outpatient sample of youths seeking mental health services: the Columbia Impairment Scale (CIS), a measure of youth functional impairment,12 and the Strengths and Difficulties Questionnaire (SDQ), a screener for psychosocial difficulties in children and youth.13 The secondary aim of this study is to compare three anchor-based and three distribution-based estimation methods. We focus on MIC thresholds for improvement, though similar methods can be applied to detect MICs for deterioration if the subsample experiencing deterioration is sufficiently large.
Methods
General setting
This is a secondary analysis of data from the ‘YouthCan IMPACT’ pragmatic randomised controlled trial of an integrative youth mental health and substance use service in the Greater Toronto Area (Trial ID: NCT02836080).14 Between September 2016 and March 2020, the trial randomised 247 youths aged 14–17 years into a community-based Integrated Collaborative Care Team (ICCT) service at three sites, or into treatment as usual (TAU) at five outpatient mental health services. Within the ICCT arm, youths received needs-based or stratified care of flexible intensity (eg, solution-focused brief therapy, dialectical-behavioural therapy skills groups, psychiatric care) from a multidisciplinary team. TAU involved standard outpatient treatment spanning psychiatric assessment, psychotherapy and/or medication. The trial’s main results are reported elsewhere. For this study, we considered PROM and anchor scores collected at baseline (T1) and at 6-month follow-up (T2), pooling data across trial arms. As part of a youth engagement component, youth partners contributed to PROM selection and anchor question design.15
Participants and process
Youths met trial eligibility criteria if they experienced mental health or substance use problems and qualified for outpatient psychiatric care at the participating sites. Exclusion criteria were a primary referral for an eating disorder; autism without mental health or substance use problems; need for specialty forensic or fire setting treatment; or an imminent risk of self-harm or active psychosis. Participants had to be able to consent and to read and write in English. Completion of the CIS or SDQ at T1 and T2 was required for inclusion in this analysis.
Outcome measures
Columbia Impairment Scale
The 13-item CIS measures functional impairment in daily life, asking ‘how much of a problem’ youths have had with interpersonal relations, school/work and leisure activities. The 5-point response scale ranges from 0 (‘no problem’) to 4 (‘very bad problem’), with an additional option 5 (‘not applicable/don’t know’) that serves as a valid skip. Summing item scores yields a total score of up to 52 points, with higher scores indicating greater impairment.12 The CIS demonstrated a unidimensional factor structure using exploratory factor analysis and good internal consistency (α=0.84) in a sample of 134 Canadian youths who accessed substance use outpatient care,16 although a subsequent study in the YouthCan IMPACT trial sample identified a three-factor structure using exploratory structural equation modelling.17 In the present sample, the CIS had a Cronbach’s alpha of 0.69. In a Swiss community sample of 1239 youths, the CIS showed good concurrent validity with a clinician-rated functioning measure and discriminated well between youth who did and did not access mental healthcare.18
Strengths and Difficulties Questionnaire
The 25-item SDQ assesses emotional symptoms, conduct problems, hyperactivity/inattention, peer relationship problems and prosocial behaviour in children and youth, with five items per domain. Youths respond on a 3-point scale ranging from ‘not true’ to ‘certainly true’. A total difficulties score is obtained by summing item scores from all subscales except the prosocial subscale, yielding a maximum of 40 points, with higher scores indicating more difficulties. A review of 41 studies suggests good evidence supporting a five-factor measurement model, and good discriminative validity, as well as generally good convergent validity with scales measuring similar constructs.19 In the present sample, the SDQ total difficulties score had a Cronbach’s alpha of 0.58.
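For illustration, a minimal R scoring sketch of the total difficulties score; the column names are assumptions for this sketch, not the official SDQ item labels.

```r
# SDQ total difficulties: sum of the emotional, conduct, hyperactivity and
# peer-problems items (20 items scored 0-2), omitting the prosocial subscale.
domains   <- c("emo", "con", "hyp", "peer")            # prosocial omitted
item_cols <- as.vector(outer(domains, 1:5, paste0))    # emo1 ... peer5 (assumed names)
sdq       <- as.data.frame(matrix(sample(0:2, 20 * 5, replace = TRUE),
                                  nrow = 5, dimnames = list(NULL, item_cols)))
sdq$total_difficulties <- rowSums(sdq[, item_cols])    # range 0-40, higher = worse
```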
Anchor questions measuring perceived overall change
At T2, trial participants were asked two questions on completion of each PROM to determine perceived change in each outcome. First, youths were asked, ‘Compared to the last time we met for research, do you feel you have changed in the areas described on this page?’. Response options were: ‘Yes, a change for the better’; ‘Yes, a change for the worse’; ‘No, I haven’t changed’. If applicable, youths were then asked to qualify this as a small, medium or large change. For several estimations, the two anchor questions were collapsed into a single binary indicator of perceived improvement (improvement=a small, medium or large change for the better; no improvement=no change or a change for the worse).
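As an illustration of this recoding, a minimal R sketch; the variable names are assumptions, not the trial’s data dictionary.

```r
# Collapse the two anchor questions into the binary improvement indicator
# used for the ROC and predictive-modelling analyses.
anchor <- data.frame(
  direction = c("better", "worse", "no change", "better"),
  magnitude = c("small",  NA,      NA,          "large")
)
# Improvement = a small, medium or large change for the better;
# no improvement = no change or a change for the worse.
anchor$improved <- as.integer(anchor$direction == "better")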
Statistical analysis
We examined three anchor-based MIC estimation methods: (1) the crude but commonly used mean PROM score change within the subgroup of individuals who report a small improvement on the anchor (MICmean); (2) ROC analysis (MICROC, curves provided in online supplemental appendix S2); and (3) predictive modelling using logistic regression (MICpred). We examined three distribution-based estimates: (1) half an SD; (2) the SE of measurement; and (3) the smallest detectable change (SDC). Details can be found in table 1.
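For readers wanting to reproduce the general approach, the following R sketch illustrates how the six estimators can be computed under the conventional formulas (table 1 details the study’s actual definitions). The simulated data, variable names and reliability value are assumptions for illustration only, and the predictive-modelling estimate is shown in its unadjusted form; the study reports an adjusted MICpred, which additionally corrects for the proportion of improved participants.

```r
library(pROC)

# Illustrative data: 'change' = T2 - T1 PROM score (negative = improvement),
# 'improved' = binary anchor (1 = at least a small change for the better),
# 'baseline' = T1 PROM score. All names and values are assumptions.
set.seed(1)
n        <- 200
baseline <- rnorm(n, 25, 8)
improved <- rbinom(n, 1, 0.55)
change   <- rnorm(n, ifelse(improved == 1, -4, 0), 6)

# (1) MICmean: in the study, the mean change among youths reporting a *small*
#     improvement; approximated here by the improved group for illustration.
mic_mean <- mean(change[improved == 1])

# (2) MICroc: change-score cut-off maximising the Youden index.
roc_obj <- roc(response = improved, predictor = change,
               direction = ">")               # lower (more negative) = improved
mic_roc <- coords(roc_obj, x = "best", best.method = "youden")$threshold

# (3) MICpred (unadjusted): the change score at which the modelled odds of
#     being improved equal 1, ie, -intercept/slope of a logistic model.
fit      <- glm(improved ~ change, family = binomial)
mic_pred <- -coef(fit)[1] / coef(fit)[2]

# Distribution-based estimates (conventional formulas; conventions vary on
# whether the baseline or change-score SD is used -- baseline SD shown here):
alpha    <- 0.70                              # reliability proxy (assumed)
half_sd  <- 0.5 * sd(baseline)                # (1) half an SD
sem      <- sd(baseline) * sqrt(1 - alpha)    # (2) SE of measurement
sdc      <- 1.96 * sqrt(2) * sem              # (3) smallest detectable change
```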
For all six MIC threshold estimates, we computed (1) sensitivity and specificity, to understand how well each MIC threshold distinguished between youths who reported at least a small improvement via the anchor question and those who did not; (2) the Youden Index (sensitivity+specificity−1), which is maximised at the cut-off providing the optimal trade-off between sensitivity and specificity; and (3) CIs around the point estimates using 2000 bootstrap replications, to quantify their precision and uncertainty. We display the different MIC estimates and their CIs in forest plots. We also calculated the agreement between the anchor questions for the CIS and SDQ using Cohen’s weighted kappa.20
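A complementary sketch of these evaluation steps, reusing the objects from the previous block; the percentile bootstrap shown here is one plausible interval type, not necessarily the one used in the study.

```r
library(boot)
library(caret)

# Classify youths by a candidate MIC threshold (change <= threshold = 'MIC met')
# and compare against the binary anchor.
classify <- factor(change <= mic_pred, levels = c(TRUE, FALSE))
truth    <- factor(improved == 1,      levels = c(TRUE, FALSE))
sens     <- sensitivity(classify, truth)
spec     <- specificity(classify, truth)
youden   <- sens + spec - 1                    # Youden Index J

# Percentile bootstrap CI (2000 replications) around the MICpred estimate.
mic_pred_stat <- function(d, idx) {
  f <- glm(improved ~ change, family = binomial, data = d[idx, ])
  -coef(f)[1] / coef(f)[2]
}
b  <- boot(data.frame(improved, change), mic_pred_stat, R = 2000)
ci <- boot.ci(b, type = "perc")$percent[4:5]   # lower and upper bounds
```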
Analyses were conducted in R V.4.4.2 using the following packages: boot (95% CIs for all estimates excluding MICROC); pROC (MICROC estimates and 95% CIs); caret (sensitivity and specificity estimates); ltm (Cronbach’s alpha); koRpus (anchor readability, see below); and psych (weighted kappa estimate).
Credibility of the anchor questions used to assess change in the CIS and SDQ
A credibility checklist for anchor-based MIC estimates includes the following criteria: (1) patients (as opposed to clinicians) responded to the PROM and anchor; (2) the anchor is easily understandable; (3) the anchor shows at least a moderate correlation with the PROM (r≥0.5); (4) CIs are sufficiently narrow; and (5) the anchor criterion reflects a small but important difference.21
To meet the first requirement, we used patient report on the PROMs and anchors. To determine if we met the second requirement, we computed the reading age required to understand the anchor via five readability formulas (see online supplemental appendix S3). To determine if we met the third requirement, we assessed correlations between the anchors and the corresponding PROMs’ T1, T2 and change scores using Spearman’s rank correlation coefficients. To determine if we met the fourth requirement, we investigated the precision of threshold estimates by computing CIs using 2000 bootstrap replications. To meet the fifth requirement, we defined a small but important difference on the anchor in two ways. To estimate the MICmean, we analysed the average score change within the group reporting a small perceived positive change on the anchor21; to estimate the MICROC and MICpred, we designated youths who reported a small, medium or large improvement as improvers, and youths who reported no change or a change for the worse as non-improvers.22
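As a sketch of the third requirement, Spearman correlations could be computed as follows; the 7-point ordinal coding of the anchor (large deterioration to large improvement) and all variable names are assumptions for illustration.

```r
# Illustrative data: 'anchor_score' codes the anchor from -3 (large change
# for the worse) to +3 (large change for the better).
set.seed(2)
t1_score     <- rnorm(200, 25, 8)
t2_score     <- t1_score + rnorm(200, -3, 6)
anchor_score <- sample(-3:3, 200, replace = TRUE)

# Correlations between the anchor and the T1, T2 and change scores.
cor(anchor_score, t1_score,            method = "spearman")
cor(anchor_score, t2_score,            method = "spearman")
cor(anchor_score, t2_score - t1_score, method = "spearman")
```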
Missing data
For the CIS, there were three instances of true missingness due to item-level non-response. In addition, around 9% of participants selected the ‘not applicable/don’t know’ option for item 3 (‘getting along with a father figure’) and item 9 (‘getting along with siblings’) at T1 and T2, whereas the rate of ‘not applicable/don’t know’ responses ranged from 0% to 2% for other CIS items. For the SDQ, there were four instances of item-level non-response. Mean imputation was used to impute item-level missing data due to non-response and ‘not applicable/don’t know’ responses.
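A minimal sketch of item-level mean imputation and CIS scoring; whether person-level or sample-level means were used is not specified above, so person-mean imputation is shown as one common approach.

```r
# Person-mean imputation for the CIS, treating both item non-response and the
# 'not applicable/don't know' option (coded 5) as missing.
cis_total <- function(items) {                       # 13 item responses, 0-5
  items <- replace(items, items %in% 5, NA)          # skip option -> missing
  items <- replace(items, is.na(items), mean(items, na.rm = TRUE))
  sum(items)                                         # total score, 0-52
}
cis_total(c(2, 3, 5, 1, 0, 4, 2, NA, 3, 1, 2, 0, 1)) # one skip, one missing item
```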
Findings
Participants
Of all trial participants, 213 (86% of the youths randomised) completed the CIS and its anchor at T1 and T2, and 211 (85%) completed the SDQ and its anchor (all of these participants also completed the CIS). The mean age was 15.6 years (SD=1.1); 65% of youths identified as girl/woman. The majority (54%) of youths identified as white, followed by mixed heritage (9%), Latino (8%), South Asian (7%) and East Asian (6%). Another 19% either identified as part of another ethnic group or did not provide their ethnicity.
Descriptive data
Table 2 presents the mean PROM scores at T1 and T2 and the PROM change scores for the CIS and SDQ. It further shows the distribution of youths across the different response options of the anchor questions. The relationship between the PROM change scores and anchor question response options is presented in figure 1. There was substantial agreement between the anchor questions of the CIS and SDQ (weighted kappa=0.61; 95% CI 0.50 to 0.71).
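For illustration, the weighted kappa between the two anchors could be computed as follows; the ordinal codings and variable names are assumptions for this sketch.

```r
library(psych)

# Agreement between the CIS and SDQ anchor responses (ordinal categories).
set.seed(3)
cis_anchor <- sample(-3:3, 200, replace = TRUE)
sdq_anchor <- pmin(pmax(cis_anchor + sample(-1:1, 200, replace = TRUE), -3), 3)

kappa_out <- cohen.kappa(cbind(cis_anchor, sdq_anchor))
kappa_out$weighted.kappa   # point estimate
kappa_out$confid           # CIs for unweighted and weighted kappa
```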
MIC threshold estimates for the CIS and SDQ
The MIC threshold estimates for the CIS ranged from −2.6 (MICpred) to −4.0 (MICROC) score points for the anchor-based methods, and from −3.9 (ie, 0.5 SD) to −6.2 score points (SDC) for the distribution-based methods (figure 2). Among the anchor-based estimates, the MICpred provided the highest level of precision (ie, the narrowest CI). At the same time, its combined specificity and sensitivity, as measured by the Youden Index (J=0.33), was inferior to that of the MICROC (J=0.36) and the MICmean (J=0.35), suggesting that, while more precise, the MICpred provided a less optimal trade-off between specificity and sensitivity in detecting youths who perceived at least a small overall improvement.
The MIC threshold estimates for the SDQ ranged from −1.5 (MICROC) to −2.1 (MICmean) score points for the anchor-based methods, and from −1.8 (0.5 SD) to −3.3 (SDC) score points for the distribution-based methods (figure 2). As with the CIS, the MICpred provided the highest level of precision. In contrast to the CIS results, however, the combined specificity and sensitivity of the MICpred, as indicated by the Youden Index (J=0.24), was not inferior to that of any other MIC estimate for the SDQ.
For both the CIS and SDQ, distribution-based estimates had higher precision than anchor-based estimates but performed less well in maximising sensitivity and specificity. The SDC and the SE of measurement, in particular, showed low sensitivity for identifying youths who reported at least minimal improvement. None of the point estimates reached the magnitude of the SDC, making it difficult to determine whether score changes at the MIC reflect real change or measurement error in this sample.
Table 3 displays the number of youths who would be considered to have experienced an MIC according to the MIC point estimates and according to the score point thresholds associated with the lower and upper bounds of the 95% CIs. For the most precise estimate, the MICpred, the 95% CI indicates that the number of youths who experienced an MIC in functioning according to the CIS lies between 103 (48%) and 126 (59%), and that the number who experienced an MIC in psychosocial difficulties according to the SDQ lies between 83 (39%) and 103 (49%). Using the less precise MICROC, the corresponding ranges are 72 (34%) to 162 (76%) youths for the CIS and 56 (27%) to 151 (72%) for the SDQ.
Credibility of the anchors
The required reading age for the anchor questions was between 6 and 12 years (depending on the readability formula), suggesting they should be understandable to 14–17-year-olds (online supplemental appendix S3). For the CIS, the correlations between the anchor and the PROM scores were r=−0.03 for the T1 score, r=−0.37 for the T2 score and r=−0.40 for the change score. For the SDQ, the corresponding correlations were r=0.02, r=−0.23 and r=−0.26, respectively. All correlations were below the minimum threshold for high credibility, set at 0.5 by Devji and colleagues.21 CIs for the threshold estimates were wider for the MICmean and MICROC than for the adjusted MICpred.
Discussion
To our knowledge, this is the first study to identify MIC thresholds for the CIS and SDQ in youths receiving mental health and/or substance use care.
Different MIC estimation methods yielded different thresholds, consistent with MIC research in other areas of health.22–24 Predictive modelling yielded the most precise anchor-based MIC thresholds for both the CIS (−2.6 score points) and the SDQ (−1.7 score points). Given the MICpred CIs, the percentage of youths experiencing an MIC ranged from 48% to 59% for the CIS, and from 39% to 49% for the SDQ. These percentages slightly exceed pooled rates of clinically significant change and reliable improvement measured in youths with anxiety or depression in routine specialist mental healthcare.25 The wide CIs of the less precise MICROC suggest limited utility of this threshold for research or clinical use. Of note, we examined MIC thresholds for improvement. These may not symmetrically reflect MIC thresholds for deterioration, which would need to be established separately.
The MICpred estimates represent relatively small score changes on the target PROMs, corresponding to improvements of 12% and 8% relative to the mean T1 scores for the CIS and SDQ, respectively (see table 2). In contrast, treatment ‘response’ is often defined as a 50% reduction in baseline severity.1 Our findings suggest that smaller reductions in impairment and psychosocial difficulties may meaningfully indicate treatment response. Qualitative research should further explore the meaning that youths assign to changes of this magnitude.
A moderate to high correlation (r≥0.5) between the anchor and PROM change score is a sign of high anchor credibility,21 although correlations as low as 0.3 may be acceptable, given that relatively crude measures of perceived change cannot be expected to capture the same degree of variation and breadth as multi-item PROMs.8 We observed weak to moderate correlations for the CIS (|r|=0.40) and the SDQ (|r|=0.26), which suggests some discrepancy between youths’ perceptions of improvement and the change captured by the PROMs, despite the anchor questions referring directly to the PROM content. Correlations between the anchors and PROM T2 scores were of similar magnitude to the correlations between the anchors and PROM change scores, suggesting that the anchor ratings might be disproportionately influenced by the youth’s state at T2. In addition, we observed low correlations between PROM T1 scores and the anchors. Ideally, correlations between the anchor and PROM T1 and T2 scores should be of roughly equal magnitude but opposite signs, indicating that youths considered both their current and baseline state when rating improvement.26 In contrast, the observed pattern suggests recall bias, a phenomenon frequently noted in health research.26
It has been suggested that recall periods of not more than 4 weeks and use of salient reference points might help participants recall past feelings.21 26 In this study, T1 and T2 were 6 months apart. The anchor referred to T1 as ‘the last time we met for research’, which may or may not have been a significant moment for youths. These findings emphasise that anchor questions require careful calibration, piloting and validation in combination with their target PROM. Properties to assess include test–retest reliability, convergent validity with the PROM change score and divergent validity with PROMs capturing other constructs.6 Cognitive debriefing interviews, ‘think aloud’ exercises or open-ended questions could explore youths’ ease of remembering T1 health states.
MIC thresholds tend to vary across age groups, treatment groups and baseline severity.27 28 In youth mental health, more research is needed to understand how change perceptions vary across subgroups, especially in the context of ethnic, gender or socioeconomic diversity, and when working with different ages. Ideally, each clinical trial should calculate its own MIC thresholds, thus enabling meta-analyses of pooled estimates. Yet, a recent systematic review of 98 youth depression trials found that none reported an anchor-based MIC threshold.5 A guideline requirement to report anchor-based MICs in clinical trials could promote a shift in this regard.
None of our MIC thresholds exceeded the SDC, highlighting that the reliability and interpretability of the MIC critically depend on the reliability and validity of the PROMs used. The CIS and SDQ both have psychometric limitations, and measurement error, calculated using internal consistency as a proxy for reliability, was considerable in this sample. According to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines, sufficient internal consistency is defined as a Cronbach’s alpha of ≥0.70 combined with a unidimensional factor structure.29 While the CIS approached this threshold in our sample, previous findings have been inconsistent in relation to its unidimensionality.16 17 The SDQ’s Cronbach’s alpha was considerably lower than the COSMIN threshold. Although the SDQ is frequently used as a composite screening tool, it has shown multifactor structures in several studies (eg, refs 30 31). In our study, all five subscales were administered, and youths were asked to report their overall perceived change across these five domains. However, the prosocial score, which may represent more of a trait than a state, was not included in the SDQ summary change score, as per scoring instructions. This discrepancy may have led youths to indicate less perceived change overall, compared with the SDQ total change score.
When interpreting the MIC to assess differences in change scores between treatment groups in a trial, randomisation helps control for biases such as regression to the mean (the tendency for extreme baseline scores to move closer to the mean at follow-up), attenuation (a tendency to report fewer problems at follow-up compared with initial assessment) and the fluctuating course of paediatric mental health disorders.32 When interpreting routinely collected outcome data from clinical settings, where control groups are typically unavailable, observed change may reflect these biases or natural fluctuations rather than true treatment effects. An SDQ Added Value Score (AVS) has been developed to address this challenge by using a predictive model derived from a high-risk epidemiological sample to estimate expected follow-up scores in the absence of systematic treatment.32 33 The difference between observed and predicted scores is interpreted as the treatment effect. This could be further contextualised using the MIC to determine whether the differential change is minimally important. The AVS has, however, only been validated in a limited number of populations and is not appropriate for use with individual cases or very small case numbers.34
Strengths and limitations
Our study has several strengths. We codesigned anchor questions with youth partners, appraised anchor credibility, compared anchor-based and distribution-based MIC estimation methods and assessed their precision. Several limitations should be noted. First, as discussed above, several factors may have weakened the credibility and precision of the MIC estimates: suboptimal correlations between the PROM change scores and the anchors; low correlations between the anchors and T1 PROM scores, paired with anchor–T2 correlations of similar magnitude to the anchor–change-score correlations, suggesting recall bias; the 6-month interval between the T1 and T2 assessments, which may have compounded this bias; measurement error on the CIS and SDQ; suboptimal internal consistency of the SDQ total difficulties scale (indicating a multidimensional structure); and the fact that youths rated perceived change with reference to all five SDQ subscales, including the prosocial subscale, which is excluded from the SDQ total difficulties score (per scoring instructions).
Second, our sample was relatively small, with participants primarily identifying their ethnic group as ‘white’ and receiving hospital referrals within a single-payer health system, limiting generalisability. Third, while item-level missingness (ie, non-response) was low across CIS items, frequent ‘not applicable/don’t know’ responses for items 3 and 9 may have affected the total score under mean imputation.16 Nevertheless, mean imputation remains the recommended approach for the CIS until an alternative is developed. Lastly, the calculation of CIs differed between the MICROC and the other estimates, as we used different R packages. Although we used 2000 bootstrap replications across all computations, the estimation of CIs for the MICROC used stratified bootstrapping. Variability in this method and wide CIs suggest caution when interpreting the point estimates. Measurement error may also affect the consistency of estimates across different metrics.
Clinical implications
We found that even the comparatively precise MICpred was associated with some uncertainty about the percentage of youths who experienced an MIC. Of further note, the MIC represents an average across the individual MIC thresholds of study participants. While MIC thresholds can help with contextualising and interpreting target PROM change scores, clinicians should not apply them mechanically to individual patients when making clinical decisions, but should use the MIC in conjunction with their clinical judgement to interpret all available information on clinical progress, jointly with youths and their families. Clinicians may want to consider the full range of values within the upper and lower bounds of the MIC CIs as potentially indicative of an MIC.
Data availability statement
Data may be obtained from a third party and are not publicly available. A data dictionary can be shared upon reasonable request to the corresponding author. Permission must be obtained from each participating site of the YouthCan IMPACT trial to access trial data.
Ethics statements
Patient consent for publication
Ethics approval
The YouthCan IMPACT trial was approved by the Centre for Addiction and Mental Health (CAMH) Research Ethics Board (REB; approval number 2016-012) and via similar review processes at the participating clinical sites. All study participants provided informed consent using a consent form template that is mandated by the CAMH REB. Participants gave informed consent to participate in the study before taking part.
Acknowledgments
We thank the youths and families who participated in the YouthCan IMPACT trial for contributing their time and data. We thank all coinvestigators and youth partners for access to the data, including the coauthors of the present paper as well as Jacqueline Relihan, Mahalia Dixon, Darren Courtney, David O’Brien, Heather McDonald, Krista Lemke, Tony Pignatiello, Suneeta Monga, Nicole Kozloff, Leigh Solomon, Brendan F Andrade, Melanie Barwick, Alice Charach, Lynn Courey, Karleigh Darnay, Paul Kurdyak and Elizabeth Lin. Finally, we thank Darren Courtney, Martin Offringa and Nancy Butcher for conversations about the MIC that have informed the design of this secondary analysis.