Animal Behavior Reliability

Correlation: ICC

Continuous data, statistical test
The Intraclass Correlation Coefficient (ICC) is used to measure the strength of agreement between observers' measurements of a continuous variable, much as concordance calculations are used for categorical data. ICC is typically considered a robust statistic.
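For reference, one common form of ICC (the one-way, single-rater version described in Shrout and Fleiss, 1979, listed under More resources below) compares variability between subjects to variability within subjects:

ICC = (MS_between - MS_within) / (MS_between + (k - 1) * MS_within)

where MS_between is the between-subjects mean square, MS_within is the within-subjects mean square, and k is the number of observers. Which specific form applies depends on the study design.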

Assumptions:
  • The data are typically continuous; although ICC can also be applied to categorical data, this is risky (see Cons, below)
  • When used for reliability, the data are collected to compare observers (as opposed to post-hoc ICC analyses run to evaluate hypotheses)
  • Two or more observers are compared

Pros:
  • Takes both correlation and systematic differences between observers into account
  • Can be used with non-normal data

Cons:
  • Need to include a substantial number of examples to generate accurate reliability estimates; see Shoukri et al. (2004) for examples of how sample size influences reliability. On a personal note, we try to source at least 10-15 examples for each outcome, even if we can't meet the optimal sample size.
  • If used for categorical data, caution is warranted, as ICC is more forgiving of mismatches between the expert and trainee than concordance calculations are
  • Highly dependent on the range of measurements; measurements spread across a wide range can produce higher, more forgiving ICC scores than measurements with a similar level of disagreement confined to a tighter range. For example, if the reported values for a test fall between 1 and 5 (a tight range), ICC will be lower than for a test with values between 1 and 100, even when a similar number of mismatches between the expert and trainee is present (see the simulation sketch after this list).
  • May produce biased results if used to analyze observer agreement of rare outcomes
  • Influenced by the number of measurements and observers
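
To see the range-dependence issue in action, here is a minimal R simulation sketch (using the "irr" package discussed under How to use, below; the sample size, values, and variable names are purely illustrative) that applies the same absolute observer error to a tight and a wide range of scores:

library(irr)

set.seed(1)
n <- 30
observer_error <- rnorm(n, mean = 0, sd = 1)   # identical disagreement in both scenarios

# Tight range: expert scores between 1 and 5
tight_expert  <- runif(n, min = 1, max = 5)
tight_trainee <- tight_expert + observer_error

# Wide range: expert scores between 1 and 100
wide_expert  <- runif(n, min = 1, max = 100)
wide_trainee <- wide_expert + observer_error

icc(cbind(tight_expert, tight_trainee), model = "twoway", type = "agreement")$value
icc(cbind(wide_expert, wide_trainee), model = "twoway", type = "agreement")$value
# The wide-range ICC comes out much closer to 1, even though the
# absolute expert-trainee mismatches are identical in both scenarios.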

How to use:
ICC produces a value between 0 and 1, with values closer to 1 representing better agreement. Cutoffs for what is deemed "acceptable" vary by researcher and discipline. We often aim for ICC scores above 0.9, which is typically considered to indicate excellent reliability (Koo and Li, 2016). These cutoffs are typically higher than concordance cutoffs because ICC is much more forgiving of mismatches between observers in continuous data. However, an ICC above 0.75 may also be considered good reliability (Koo and Li, 2016).

When analyzing in R, the "irr" package by Gamer et al. (2019) can generate ICC scores using the function "icc" (a brief sketch follows below).
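
As a minimal sketch of what that call might look like (the data frame, column names, and scores below are purely hypothetical; one row per test example, one column per observer):

library(irr)

scores <- data.frame(
  expert  = c(12.1, 8.4, 15.0, 9.7, 11.3),
  trainee = c(11.8, 8.9, 14.2, 10.1, 11.0)
)

result <- icc(scores, model = "twoway", type = "agreement", unit = "single")
result$value          # the ICC estimate
result$lbound         # lower bound of the 95% confidence interval
result$ubound         # upper bound
result$value > 0.9    # does the estimate meet our "excellent reliability" target?

Which model, type, and unit arguments are appropriate depends on the study design; Koo and Li (2016), listed under More resources, walk through how to choose and report them.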


More resources

  • Intraclass correlations: Uses in assessing rater reliability (1979)
  • Statistical strategies to assess reliability in ophthalmology (2006)
  • Repeatability for Gaussian and non-Gaussian data: a practical guide for biologists (2010)
  • A guideline of selecting and reporting intraclass correlation coefficients for reliability research (2016)
  • Common pitfalls in statistical analysis: Measures of agreement (2017)
  • Assessing observer variability: A user's guide (2017)