This method plots differences between continuous outputs against their mean value, making it easy to identify the magnitude of disagreement in repeated measures.
Assumptions:
- Data are continuous
- Only 2 observers are being compared at a time, typically a trainee against the expert
Pros:
- Can identify systematic bias between 2 observers
- Not dependent on the range of values or measurements, unlike correlation analyses (e.g. ICC and CCC)
Cons:
- Diagnostic test, but not a statistical test when used on its own
How to use:
This approach produces a scatter plot of the differences between 2 observers (e.g. Observer 1's value - Observer 2's value; y-axis) against their mean value [(Observer 1's value + Observer 2's value) / 2; x-axis]. The plot can be used for visual inspection prior to running a statistical test. Additional measurements, such as the mean difference, the limits of agreement (LoA), and the percentage of points that fall between the LoA, provide useful context and complement the plot. When used for reliability, the plot and these additional measurements are considered together; this is not a statistical test, but it can be followed up with a more formal metric (e.g. ICC or CCC).
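As a minimal sketch of how the plotted coordinates are obtained, the Python below computes one (mean, difference) pair per item scored by both observers. The function name `bland_altman_points` and the scores are hypothetical and only serve to illustrate the two formulas; they are not data from this guide.

```python
def bland_altman_points(obs1, obs2):
    """Return one (mean, difference) pair per item scored by both observers.

    x-axis value: (Observer 1 + Observer 2) / 2
    y-axis value: Observer 1 - Observer 2
    """
    if len(obs1) != len(obs2):
        raise ValueError("Both observers must score the same set of items")
    return [((a + b) / 2, a - b) for a, b in zip(obs1, obs2)]


# Hypothetical scores for 5 items (illustrative only, not real data).
observer_1 = [10.0, 20.0, 30.0, 40.0, 50.0]
observer_2 = [11.0, 19.0, 33.0, 44.0, 57.0]

points = bland_altman_points(observer_1, observer_2)
for mean_value, difference in points:
    print(f"x (mean) = {mean_value:5.1f}   y (difference) = {difference:5.1f}")
```

Each pair is one point on the plot. With this subtraction order, a negative difference means Observer 2 scored higher than Observer 1.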
Systematic bias can be identified by consistent, non-random patterns in the distribution of points: for example, points that consistently fall above or below the line of perfect agreement, or differences that gradually increase or decrease as the average value increases. Precision is indicated by most data points falling in a narrow band around the line of perfect agreement, meaning the 2 observers were in close agreement with each other. In the absence of bias, data points should be distributed symmetrically around this line.
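Visual inspection remains the main tool here. One possible way to put a number on a gradual increase or decrease is to fit a least-squares line to the differences against the means. This is an illustrative sketch under that assumption, not a step prescribed by this guide; the function name `difference_trend` and the scores are hypothetical.

```python
def difference_trend(obs1, obs2):
    """Least-squares slope and intercept of differences regressed on means."""
    means = [(a + b) / 2 for a, b in zip(obs1, obs2)]
    diffs = [a - b for a, b in zip(obs1, obs2)]
    n = len(means)
    mean_x = sum(means) / n
    mean_y = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in means)
    if sxx == 0:
        raise ValueError("All mean values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(means, diffs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical scores in which Observer 2 is always 20% higher than Observer 1.
observer_1 = [10.0, 20.0, 30.0, 40.0, 50.0]
observer_2 = [12.0, 24.0, 36.0, 48.0, 60.0]

slope, intercept = difference_trend(observer_1, observer_2)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A slope near 0 is consistent with differences that do not depend on the size of the measurement. A clearly non-zero slope, as in this constructed example, suggests the disagreement grows or shrinks with the average value.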
When evaluating reliability with Bland-Altman, there is no agreed-upon standard for acceptable or good reliability. Typically, researchers visually assess the plot to look for any trends or patterns. The mean difference should not be significantly different from 0, and narrower LoA (mean difference ± 1.96 times the standard deviation of the differences) indicate better reliability than wider LoA. It has been suggested that 95% of the data points should fall within the LoA (Bland and Altman, 1999).
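The summary values described above can be sketched in a few lines of Python. The function name `limits_of_agreement` and the scores are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the guide does not specify which form to use.

```python
from statistics import mean, stdev


def limits_of_agreement(obs1, obs2):
    """Mean difference, LoA, and percentage of points within the LoA."""
    diffs = [a - b for a, b in zip(obs1, obs2)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample SD (n - 1 denominator); an assumption
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    n_within = sum(lower <= d <= upper for d in diffs)
    return {
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "lower_loa": lower,
        "upper_loa": upper,
        "pct_within": 100 * n_within / len(diffs),
    }


# Hypothetical scores for 5 items (illustrative only, not real data).
observer_1 = [10.0, 20.0, 30.0, 40.0, 50.0]
observer_2 = [11.0, 19.0, 33.0, 44.0, 57.0]

result = limits_of_agreement(observer_1, observer_2)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

These numbers complement the plot and do not replace it: all points can fall within the LoA even when the differences show a clear trend.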
Note: if the assumptions are violated because observer errors have non-constant variance and proportional biases, a more complex version of Bland-Altman may be used. In this case, each observer must have repeated measurements on the same technique/video/observation. See the packages "MethodCompare" in R (Taffé et al., 2019) or "biasplot" in Stata (Taffé et al., 2017).
Bland-Altman plots and their associated outputs can be calculated simply in Excel using the formulas described above. In R, the package "BlandAltmanLeh" from Lehnert (2015) can generate both plots and values using the function "bland.altman.plot".
Example:
Two observers scored eating behavior continuously from a set of 10 videos from a training subset (each point represents 1 video). Below, you can see their scores plotted against each other in a normal scatterplot (left; first shown and explained here) and a Bland-Altman plot (right).
In the scatterplot, the red line represents what perfect agreement would look like. At low average eating times, both observers were similar. At high average eating times though, Observer 2 overestimated values, and this pattern seemed to get increasingly worse at higher times.
In the Bland-Altman plot, which is generated from the same data as the scatterplot, the horizontal red line is y = 0 and represents perfect agreement. The dark gray line indicates the mean difference across all videos. The two blue lines are the limits of agreement, 1.96 standard deviations of the differences above and below the mean difference. 100% of the points fall between the LoA, and the mean difference (-3) is close to 0. However, visual inspection of the plot suggests the observers are not yet reliable, as the scatterplot also showed: the trainee struggles to score eating behavior when it happens for longer periods.