
Regression

Continuous data, statistical test
For regression analyses, the first step is to generate a scatterplot of one observer's scores against the other's to get a sense of the data and of any differences between observers.
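
To see what this looks like in practice, here is a minimal sketch in base R; the obs1 and obs2 vectors are made-up paired scores for illustration, not data from this project:

# Hypothetical paired scores: obs1 = expert, obs2 = trainee
obs1 <- c(10, 25, 32, 47, 51, 63, 78, 84, 96, 110)
obs2 <- c(12, 24, 35, 45, 55, 61, 80, 88, 95, 113)
plot(obs1, obs2, xlab = "Observer 1 (expert)", ylab = "Observer 2 (trainee)")
abline(0, 1, lty = 2)  # dashed reference line of perfect agreement (y = x)

Points falling tightly around the y = x line suggest good agreement before any formal test is run.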

One type of regression that is often used to assess agreement is a type 2 (mixed) linear regression. This estimates the best linear relationship between a response and a predictor variable, e.g., a trainer and a trainee, or a test and a re-test. Linear regression provides a richer summary than metrics like simple correlation, but it is most often used for the visual component of an assessment, as a complementary approach to a more robust statistical test such as the ICC.

Caution: Not all linear regression models are appropriate for assessing reliability. For example, ordinary least squares (OLS) regression, or type 1 regression, assumes independence of the variables, which does not hold when evaluating consistency across individuals.
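
As a hedged illustration of a type 2 fit, the "lmodel2" R package (one option for model II regression; it is not prescribed by this site) can estimate the line while accounting for error in both observers' scores. The obs1 and obs2 vectors below are made-up data:

library(lmodel2)  # fits OLS, major axis (MA), and standard major axis (SMA) lines
obs1 <- c(10, 25, 32, 47, 51, 63, 78, 84, 96, 110)  # hypothetical expert scores
obs2 <- c(12, 24, 35, 45, 55, 61, 80, 88, 95, 113)  # hypothetical trainee scores
fit2 <- lmodel2(obs2 ~ obs1)
fit2$regression.results    # slope and intercept estimates for each method
fit2$confidence.intervals  # check whether the CIs cover slope = 1 and intercept = 0
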
Assumptions:
  • The data are continuous 
  • The residuals are normal 
  • The residual variance is homogeneous
  • The 2 variables are dependent on one another
  • When used for reliability, data are used to compare observers (vs. post-hoc regression analyses to evaluate hypotheses)

Pros:
  • Provides a better summary of the relationship between 2 variables or observers than a correlation coefficient because it produces an equation (slope and intercept) that allows us to predict the response y from x. This can help us identify consistent differences between observers, e.g., if Observer 2 (y) consistently overscores an outcome compared to Observer 1 (x). Identifying these patterns can help us improve training.

Cons:
  • May produce biased results if used to analyze observer agreement for rare outcomes
  • A "substantial" number of examples is recommended to generate accurate reliability estimates, but there is limited guidance on how many this should be
  • Data that are very similar to one another may produce skewed results suggesting unreliable observers, even when visual inspection reveals that the observers are quite consistent with one another (see Example #4 below)
  • If multicollinearity is present, the results may be unreliable

How to use:
A type 2 regression analysis should produce a plot, an R² value, and a fitted line from which you can extract the slope and intercept. Cutoffs for what is deemed "acceptable" may vary by discipline, but are generally considered to be a slope that is not significantly different from 1, an intercept that is not significantly different from 0, and a high R² value. R² can range from 0 to 1, with a value of 1 demonstrating that all the variation in Observer 2's values is fully explained by Observer 1's values; in the plot, this would appear as all of the points falling on the regression line. Lower values correspond to points falling further from the regression line.

We usually aim for an R² value above 0.9. This cutoff is subjective; values such as 0.89 still represent a fairly strong association. We sometimes report R² > 0.9 as "strong" and R² > 0.8 as "moderate" agreement when considering repeatability.

The R package "modEval," while designed for model selection, can be used to calculate R², slope and intercept estimates, and P-values for each. These estimates can also be obtained by fitting a linear model ("lm" in base R) and then using the "linearHypothesis" function ("car" package; Fox and Weisberg, 2019) to obtain P-values. In SAS, the REG procedure (PROC REG) can be used for similar calculations.
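
As a minimal sketch of the base-R route described above (the obs1 and obs2 vectors are made-up scores; the hypothesis strings test the cutoffs discussed under "How to use"):

library(car)  # Fox & Weisberg (2019); provides linearHypothesis()
obs1 <- c(10, 25, 32, 47, 51, 63, 78, 84, 96, 110)  # hypothetical expert scores
obs2 <- c(12, 24, 35, 45, 55, 61, 80, 88, 95, 113)  # hypothetical trainee scores
fit <- lm(obs2 ~ obs1)                    # regress trainee on expert
summary(fit)$r.squared                    # R² value
coef(fit)                                 # intercept and slope estimates
linearHypothesis(fit, "obs1 = 1")         # P-value: is the slope different from 1?
linearHypothesis(fit, "(Intercept) = 0")  # P-value: is the intercept different from 0?

Non-significant P-values here would mean the slope and intercept do not differ detectably from 1 and 0, consistent with reliable observers.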

For an example of how regression analyses can be used, along with code for running them in a statistical program, see Downey et al. 2021 (analyses conducted in R) or Chen et al. 2016 (analyses conducted in SAS). In these papers, regression analyses are used to evaluate different sampling methods against one another, but the same approach can be used to evaluate the reliability of different observers.

Example #1

[Scatterplot: Observer 2's scores against Observer 1's, with a dotted regression line]
Given this scatterplot, first shown and explained earlier in this guide, the dotted line is the fitted regression line; an R², slope, and intercept are generated from it. In this example, R² > 0.9 and the slope was not significantly different from 1, but the intercept was significantly different from 0, indicating that Observer 2 overestimates the behavior and is not yet reliable.

Example #2

[Scatterplot: Observer 2 against Observer 1, with a regression slope steeper than 1]
In this scatterplot, R² > 0.9 and the intercept was not significantly different from 0, but the slope was significantly different from 1: the more often the outcome occurs, the more Observer 2 overestimates it. This would indicate Observer 2 is not yet reliable.

Example #3

[Scatterplot: trainee against expert, with points close to the line of equality]
In this scatterplot, the trainee consistently achieved scores similar to the expert's, with only minor deviations. In this case, R² > 0.9, the slope was not significantly different from 1, and the intercept was not significantly different from 0, suggesting the trainee is reliable.

Example #4

[Scatterplot: trainee against expert, with points in two clusters (near 0 and 120-170)]
In this scatterplot, the trainee consistently achieved scores very similar to the expert's. R² > 0.9 and the intercept (0.2) was not significantly different from 0, but the slope (1.02) was significantly different from 1. This is an example where the regression results suggest a problem, yet visual inspection of the data suggests strong reliability. The mismatch may be due to the gap between scores: many points cluster at 0 or below 50, while all remaining values fall between 120 and 170. This type of gap in the data spread can create issues when interpreting the metric output.

More resources

  • Statistical strategies to assess reliability in ophthalmology (2006)
  • Understanding Bland Altman analysis (2015)