Concordance.

Categorical data: statistical tests
There are a few measures of concordance, or consistency, for categorical variables; here, we describe commonly used scores: Cohen's Kappa, Weighted Kappa, Fleiss' Kappa, PABAK, and Kendall's W. These metrics are similar, but they apply to slightly different types of data and numbers of observers. See the summary table below for an overview, followed by more details about each metric. These measures are generally considered more robust than simple percent agreement.
[Summary table: overview of Cohen's Kappa, Weighted Kappa, Fleiss' Kappa, PABAK, and Kendall's W]

Cohen's Kappa

Cohen's kappa compares pairs of observations and is more nuanced than simple percent agreement because it accounts for agreement expected by chance. The metric credits observers for agreeing on both the occurrence of a behavior or parameter and on its absence.
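
As a rough illustration of the chance correction, kappa can be computed by hand as the observed agreement minus the agreement expected by chance, divided by one minus the chance agreement. The sketch below uses made-up "Yes"/"No" scores for ten observations:

# Two observers scoring the same 10 observations (made-up data)
expert  <- c("Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No",  "No", "Yes")
trainee <- c("Yes", "No",  "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")

tab <- table(expert, trainee)                   # 2 x 2 contingency table
n   <- sum(tab)
p_o <- sum(diag(tab)) / n                       # observed agreement (0.8 here)
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance (0.5 here)
(p_o - p_e) / (1 - p_e)                         # Cohen's kappa (0.6 here)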

Assumptions:
  • The data are nominal or ordinal
  • Only 2 observers are being compared at a time, typically a trainee against the expert

Pros:
  • Adjusts for chance agreement
​
Cons:
  • ​Need to include a substantial number of examples to generate accurate reliability due to the consideration of chance; kappa can behave erratically when sample sizes are small. A sample size calculator should be used when possible to determine the number of examples needed per category: use the Estimation pane here, or see Shoukri et al. (2004) for examples of how sample size influences reliability. On a personal note, we source at least 10-15 examples per category, even if we can't meet the optimal sample size.
  • Need to include a similar number of examples in each category to generate accurate reliability. It is easy to disagree on behaviors that are underrepresented yet still achieve a high kappa because of the consideration of random chance agreement
  • Kappa scores are influenced by outcome distributions, so are rarely comparable across studies, procedures, or populations
  • Weights cannot be applied to different outcomes; if using for ordinal data, this means that disagreements between categories that are close together (e.g. adjacent scores) are penalized just as heavily as disagreements between categories that are far apart
​
How to use:
Cohen's Kappa produces a value between -1 and +1, with +1 representing perfect agreement, i.e. both observers gave the same score for every observation. A kappa of 0 indicates that any agreement between the observers was likely obtained by chance alone. Values between 0 and 1 therefore represent agreement that is better than chance, while negative values indicate agreement that is worse than chance, i.e. systematic disagreement.

Cutoffs for what is deemed "acceptable" vary by researcher and discipline. Some researchers suggest values between 0.41 and 0.60 show "moderate" agreement (Landis and Koch, 1977), while others suggest scores below 0.6 should not be used, as they indicate poor agreement and suggest at least half of the data are untrustworthy (McHugh, 2012). Values above 0.65 are sometimes considered reliable (Eagan et al., 2020) and may indicate "substantial" agreement (Landis and Koch, 1977). Values above 0.8 are generally considered to indicate strong (McHugh, 2012) or near-perfect agreement (Cohen, 1960; Landis and Koch, 1977). We often aim for kappa scores above 0.8, though depending on the complexity of the metric and the training process, we have sometimes accepted lower values (e.g. 0.7) and are transparent about this decision in our manuscript.


When analyzing in R, the "irr" package by Gamer et al. (2019) can generate Cohen's Kappa scores using the function "kappa2". When analyzing in SAS, "PROC FREQ" with the "test kappa" statement can be used.
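
As a minimal sketch of the R call, assuming the expert's and trainee's nominal scores are stored as two columns of a data frame (the column names and behavior labels below are made up):

# install.packages("irr")   # if the package is not already installed
library(irr)

# Rows = observations; columns = the two observers being compared (made-up data)
ratings <- data.frame(
  expert  = c("Groom", "Rear", "Groom", "Rest", "Rear", "Rest"),
  trainee = c("Groom", "Rear", "Rest",  "Rest", "Rear", "Rest")
)

kappa2(ratings)   # unweighted Cohen's kappa for the two observers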

Fleiss' Kappa

Fleiss' Kappa is an extension of Cohen's Kappa that allows for assessment of >2 observers.

Assumptions:
  • The data are nominal or ordinal 

Pros:
  • Adjusts for chance agreement
  • ​Allows for more than 2 observers to be assessed at once

Cons:
  • ​Need to include a substantial number of examples to generate accurate reliability due to the consideration of chance; kappa can behave erratically when sample sizes are small. A sample size calculator should be used when possible to determine the number of examples needed per category: use the Estimation pane here, or see Shoukri et al. (2004) for examples of how sample size influences reliability. On a personal note, we source at least 10-15 examples per category, even if we can't meet the optimal sample size.​
  • Need to include a similar number of examples in each category to generate accurate reliability. It is easy to disagree on behaviors that are underrepresented yet still achieve a high kappa because of the consideration of random chance agreement
  • Kappa scores are influenced by outcome distributions, so are rarely comparable across studies, procedures, or populations​
  • Weights cannot be applied to different outcomes; if using for ordinal data, this means that disagreements between categories that are close together (e.g. adjacent scores) are penalized just as heavily as disagreements between categories that are far apart
​
How to use:
Fleiss' Kappa outputs and cutoffs are the same as Cohen's Kappa.

When analyzing in R, the "irr" package by Gamer et al. (2019) can generate kappa scores using the function "kappam.fleiss."
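
A minimal sketch, assuming each column of the data frame holds one observer's scores; the expert, the two trainees, and their scores below are all made up:

library(irr)

# Rows = observations; columns = observers (made-up nominal scores)
ratings <- data.frame(
  expert   = c("Yes", "No",  "Yes", "Yes", "No"),
  traineeA = c("Yes", "No",  "Yes", "No",  "No"),
  traineeB = c("Yes", "Yes", "Yes", "Yes", "No")
)

kappam.fleiss(ratings)   # Fleiss' kappa across all observers at once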

Weighted Kappa

Weighted Kappa is an extension of Cohen's Kappa that allows for assessment of data with 3 or more levels where magnitude is relevant. 

Assumptions:
  • The data are nominal or ordinal 
  • Only 2 observers are being compared at a time, typically a trainee against the expert

Pros:
  • Adjusts for chance agreement
  • Allows assessment of the magnitude of differences between observers
  • Allows assigning different weights to different outcomes; for example, if you are scoring wound severity on a scale from 0-5, with 0 being no wound and 5 being severe, a trainee that scored a true score 4 wound as a score 5 would be penalized less, and so receive more credit, than a trainee that scored it as a score 2. It also allows assigning different weights to different degrees of disagreement. For example, if you have a 5-point scale with A+, A, B, C, D, you could assign a lower weight between A+ and A, and a higher weight between A and D. Since the weights measure disagreement, differences in ratings that are further apart (i.e., A and D) will be penalized more than ratings that are closer together (i.e., A+ and A).
    ​
Cons:
  • ​Need to include a substantial number of examples to generate accurate reliability due to the consideration of chance; kappa can behave erratically when sample sizes are small. A sample size calculator should be used when possible to determine the number of examples needed per category: use the Estimation pane here, or see Shoukri et al. (2004) for examples of how sample size influences reliability. On a personal note, we source at least 10-15 examples per category, even if we can't meet the optimal sample size.​
  • Need to include a similar number of examples in each category to generate accurate reliability. It is easy to disagree on behaviors that are underrepresented yet still achieve a high kappa because of the consideration of random chance agreement
  • Kappa scores are influenced by outcome distributions, so are rarely comparable across studies, procedures, or populations
  • Weights are typically manually assigned, which adds subjectivity

How to use:
Weighted Kappa outputs and cutoffs are the same as Cohen's Kappa.

When analyzing in R, the "irr" package by Gamer et al. (2019) can generate kappa scores using the function "kappa2" and specifying the "weight" argument. There are popular weight choices depending on the package used; "quadratic" is often used for ordinal data. When using nominal data, choosing weights is even more subjective and requires domain expertise. When analyzing in SAS, kappa scores can be generated using "PROC FREQ" and indicating "test wtkap". The weights do not have to be manually assigned in SAS.
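
A minimal sketch in R, assuming ordinal wound-severity scores from the expert and one trainee (all values made up). In the "irr" package, quadratic weighting is requested with weight = "squared":

library(irr)

# Ordinal wound-severity scores on a 0-5 scale (made-up data)
ratings <- data.frame(
  expert  = c(0, 2, 4, 5, 1, 3, 4, 0),
  trainee = c(0, 2, 5, 5, 1, 2, 4, 1)
)

# Quadratic ("squared") weights penalize disagreements that are far apart
# more heavily than near-misses
kappa2(ratings, weight = "squared")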

PABAK 

*Note: this section is still a work-in-progress*
PABAK, or Prevalence-Adjusted and Bias-Adjusted Kappa, was designed to address limitations in Cohen's Kappa caused by imbalances in prevalence and rater bias. PABAK compares pairs of observations but adjusts for situations where one category is much more common than the other, or one observer is more conservative in the scores they report.

Assumptions:
  • The data are binary
  • Only 2 observers are being compared at a time, typically a trainee against the expert

Pros:
  • Adjusts for one outcome being more or less prevalent than the other (e.g. if the test subset includes more "Yes" examples than "No" examples) 
  • Can be used for more rare outcomes (e.g. outcomes that are unlikely to be well-represented in the test subset)
  • Adjusts for one observer having a bias (e.g. if one observer is more likely to assign a "Yes" outcome)

Cons:
  • Because it adjusts the data as if categories were equally represented and observers unbiased, PABAK can mask errors and lead to artificially high agreement

How to use:
PABAK produces a value between -1 and +1, the same as Cohen's Kappa, and is interpreted in the same manner. The calculation also produces a prevalence index (PI), which measures how unequally the two categories are represented, and a bias index (BI), which measures the degree of bias between the observers. The absolute values of PI and BI fall between 0 and 1. For PI, a score of 0 means the categories are perfectly balanced, while higher values indicate that one category is overrepresented relative to the other. For BI, a score of 0 means no bias, i.e. both observers report each response at the same rate; higher values indicate that one observer is more likely than the other to choose a particular response. Evaluating PI and BI alongside the PABAK score can provide insight into whether a high score was driven by category imbalance or by observer bias.

When analyzing in R, the "epiR" package by Stevenson and Sergeant (2024) can generate PABAK scores by applying the function "epi.kappa" to a contingency table (the counts of agreement and disagreement between the observers) with method = "cohen". This produces PABAK, PI, and BI, and also returns a Cohen's kappa score.
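
A minimal sketch, assuming the two observers' binary scores have already been cross-tabulated into a 2 x 2 table of counts (the counts below are made up); the by-hand formulas follow Byrt et al. (1993), listed in the resources below:

# install.packages("epiR")   # if the package is not already installed
library(epiR)

# 2 x 2 table of counts: rows = expert, columns = trainee (made-up data)
#                trainee "Yes"   trainee "No"
# expert "Yes"        40               5
# expert "No"          8              12
tab <- matrix(c(40, 5,
                 8, 12), nrow = 2, byrow = TRUE)

epi.kappa(tab, method = "cohen")   # reports Cohen's kappa, PABAK, PI, and BI

# The same indices by hand (Byrt et al., 1993)
n       <- sum(tab)
p_o     <- (tab[1, 1] + tab[2, 2]) / n   # observed agreement
pabak   <- 2 * p_o - 1                   # prevalence- and bias-adjusted kappa
p_index <- (tab[1, 1] - tab[2, 2]) / n   # prevalence index (PI)
b_index <- (tab[1, 2] - tab[2, 1]) / n   # bias index (BI)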

PABAK can also be calculated alongside Cohen's kappa to see how much the latter is affected by prevalence and bias. If PI or BI suggests that the data are imbalanced or an observer is biased, it may be better to report the PABAK value instead of Cohen's kappa; at the very least, caution should be used if continuing with Cohen's kappa in light of a high PI or BI.

Kendall's W

Kendall's W is used when 3 or more observers measure the same behavior, or when the measure has 3 or more levels where magnitude is relevant. This metric assesses the degree of association among sets of rankings. Unlike the other measures on this page, Kendall's W evaluates concordance, or consistency of rankings between observers, rather than true agreement. Caution should therefore be used when interpreting these values, as high values can be obtained even if the observers never agree on an observation's exact score.

Assumptions:
  • The data are ordinal
  • If comparing across more than 2 observers at a time, all are weighted as equally important

Pros:
  • Allows for ranking of behaviors; for example, if you are scoring wound severity on a scale from 0-5, with 0 being no wound and 5 being severe, a trainee that scored a true score 4 wound as a score 5 would get a higher score than a trainee that scored it as a score 2
  • Rankings are automatically assigned, reducing subjectivity
  • Allows for more than 2 observers to be assessed at once

Cons:
  • Relying on ranking may mask errors made by trainees; in the example above, the trainee that labeled a true score 4 wound as a score 5 and the trainee that labeled it as a score 2 are both wrong, but the one who scored it as a score 5 receives a higher score. Relying only on that higher score may lead to overlooking true errors simply because they were ranked closer to the correct score
  • Evaluates concordance, not agreement: a trainee can consistently over- or under-score relative to the expert and still obtain a high Kendall's W due to the consistency of the ranking, despite never agreeing (reporting the exact same score)
  • If there is excessive tying, a larger sample will be needed
  • Less rigid than a kappa score; if weights are needed for ordinal data, using a weighted kappa would be more robust

How to use:
Kendall's W produces a value between 0 and 1. A score of 0 represents no concordance, while 1 represents complete concordance. Caution should be used when interpreting these scores, however, as a score of 1 does not mean perfect agreement. If one observer consistently scores higher than the other but preserves the same rank order, the pair will show high or even perfect concordance, leading to a high Kendall's W score despite the observers never agreeing on an observation (i.e. reporting the exact same score).

Similar to kappa scores, cutoffs for what is deemed "acceptable" vary by researcher and discipline. To our knowledge, there is no reference for "good" or "acceptable" scores here like there is for kappa scores. We often aim for scores above 0.8, since Kendall's W, because it works on the rank ordering of outcomes rather than exact agreement, can produce higher values than a traditional Cohen's Kappa.


When analyzing in R, the "irr" package by Gamer et al. (2019) can generate Kendall's W scores using the function "kendall."
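
A minimal sketch, assuming each column holds one observer's ordinal scores (made-up 0-5 wound scores). The second trainee below always scores one point higher than the expert yet keeps the same rank order, which is exactly the situation where Kendall's W can be high despite a lack of exact agreement:

library(irr)

# Rows = observations; columns = observers; ordinal 0-5 scores (made-up data)
ratings <- data.frame(
  expert   = c(0, 1, 2, 3, 4),
  traineeA = c(0, 1, 2, 2, 4),
  traineeB = c(1, 2, 3, 4, 5)   # consistently over-scores, but same rank order
)

kendall(ratings, correct = TRUE)   # correct = TRUE applies the correction for tied ranks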

More resources

  • A coefficient of agreement for nominal scales (1960)
  • The measurement of observer agreement for categorical data (1977)
  • Bias, prevalence, and kappa (1993)
  • Sample size requirements for the design of reliability study: Review and new results (2004)
  • Statistical strategies to assess reliability in ophthalmology (2006)
  • Interrater reliability: The kappa statistic (2012)
  • Common pitfalls in statistical analysis: Measures of agreement (2017)
  • Testing the reliability of inter-rater reliability (2020)
  • Evaluation of inter-observer reliability of animal welfare indicators: Which is the best index to use? (2021)