Animal Behavior Reliability

Troubleshooting

Sometimes we run into challenges during reliability training: the calculated metric falls below the cutoff value (e.g. a kappa of 0.65 instead of 0.8), a trainee cannot yet perform the task correctly on their own, or the metric exceeds our cutoff but visual observation still reveals a clear mismatch, and so on. These challenges are normal, but moving past them so you can continue with data collection requires first identifying the exact problem, diagnosing it, and taking actionable steps to fix it. This process can be difficult, and it is a skill that we find grows with experience.
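As a concrete illustration of that cutoff comparison, here is a minimal Python sketch (not part of the original training materials) that computes an unweighted Cohen's kappa for two observers' categorical scores and flags when it falls below a chosen cutoff; the scores and the 0.8 cutoff are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two observers' paired categorical scores."""
    n = len(scores_a)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical scores from a 10-item test (letters stand in for behavior categories)
trained_observer = ["A", "A", "B", "B", "C", "A", "B", "C", "A", "B"]
trainee          = ["A", "A", "B", "C", "C", "A", "A", "C", "A", "B"]

CUTOFF = 0.8
kappa = cohens_kappa(trained_observer, trainee)
print(f"kappa = {kappa:.2f}")          # ~0.70 for these made-up scores
if kappa < CUTOFF:
    print("Below cutoff - revisit definitions or training before data collection.")
```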

Below, we provide examples of problems that we face in our reliability trainings and how we approach solving them.

Examples of identifying, diagnosing, and fixing problems:

General steps:
  • How widespread is the issue - is one trainee having problems, or all of them?
  • Look at the specific instances of disagreement between the trained observer and the trainee - can you identify any patterns? Is the issue confined to a single behavior, outcome, or example? (A small cross-tabulation sketch follows this list.)
  • Try asking the trainee to walk you through their process while completing the test - can you identify the issue?
  • Ask a trusted colleague or expert to take your training - are they able to complete it successfully? If not, the task or training process is likely too difficult.
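One quick way to look for the patterns mentioned above is to cross-tabulate the two observers' scores; disagreements show up in the off-diagonal cells, so a problem confined to a single behavior or category is immediately visible. A minimal sketch, assuming pandas is available and using made-up category labels:

```python
import pandas as pd

# Hypothetical paired scores for the same test items (category names are illustrative)
trained = ["rest", "rest", "feed", "feed", "social", "rest", "feed", "social", "rest", "feed"]
trainee = ["rest", "feed", "feed", "feed", "rest",   "rest", "feed", "social", "rest", "rest"]

# Rows = trained observer, columns = trainee; off-diagonal counts are disagreements
table = pd.crosstab(pd.Series(trained, name="trained observer"),
                    pd.Series(trainee, name="trainee"))
print(table)
```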
 
General solutions:
  • The definition is not operational and may be misinterpreted, or is difficult to apply consistently - refine the definition
  • The scoring process is too challenging - try to simplify the task (e.g. change the sampling method, the number of animals, or the number of behaviors)
  • The training videos are too confusing, or the scaffolded training has not yet prepared trainees to complete the task - try to obtain unambiguous training videos, or incorporate more layers of training to build trainees' confidence
  • There are not enough examples of a given behavior or outcome in the test, so scoring even one incorrectly drags the entire score down (a small numerical illustration follows this list) - confirm the trainee can apply the definition correctly, then add more examples to the training
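The last point is easy to see numerically: when a category is rare, a single error on that category drags a chance-corrected metric like kappa down much further than the same error on a common category. A small illustration with made-up data, assuming scikit-learn is available:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 30-item test in which behavior "B" occurs only twice (a rare outcome)
trainer = ["A"] * 28 + ["B"] * 2

# One single error each, but in different places:
miss_rare   = ["A"] * 28 + ["B", "A"]          # misses 1 of the 2 rare "B" items
miss_common = ["A"] * 27 + ["B"] + ["B"] * 2   # mislabels 1 of the 28 common "A" items

print(cohen_kappa_score(trainer, miss_rare))     # ~0.65: one error on the rare category
print(cohen_kappa_score(trainer, miss_common))   # ~0.78: one error on the common category
```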
Example #1
Problem:
"Kate" and "Miguel" were developing a scoring system to evaluate hocks in dairy cattle. They developed definitions for score 1 ("normal", undamaged), score 2 ("moderate" damage), and score 3 ("severe" wound or swelling) hocks, and both took a 30-question test to evaluate whether they could apply those definitions correctly and whether their scores agreed with one another. After taking the test, they found they had low agreement.
Diagnostics:
In looking at the specific instances of disagreement, they realized they only differed in how they evaluated score 2 hocks - Kate had scored many more 2's than Miguel. When they looked at the images together, they found that they had each been interpreting their own definition differently: a score 2 hock included hair loss > 1" in diameter, but Miguel considered "hair loss" to mean "completely bald", while Kate interpreted it to mean "some hair loss such that you could see bare skin beneath the hair".

While pink skin is visible in the center of this leg, it is still partially covered by hair. Since this spot of partial hair loss was >1" in diameter, Kate considered it a score 2.


In contrast, Miguel was looking for complete hair loss like this, with no hair covering the bald spot.

Solution:
They updated their definition so that "hair loss" explicitly meant "bald, or complete hair loss."
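Because hock scores 1-3 are ordinal, a weighted kappa is often used to re-check agreement after a revision like this, so that a 1-vs-3 disagreement is penalized more heavily than an adjacent 1-vs-2 one. A minimal sketch, assuming scikit-learn is available; the scores below are made up rather than Kate and Miguel's actual data:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical hock scores (1 = normal, 2 = moderate, 3 = severe) for a 30-image test
kate   = [1, 1, 2, 2, 3, 1, 2, 2, 1, 3, 2, 1, 2, 3, 1, 2, 2, 1, 3, 2, 1, 1, 2, 2, 3, 1, 2, 1, 2, 3]
miguel = [1, 1, 1, 2, 3, 1, 1, 2, 1, 3, 1, 1, 2, 3, 1, 1, 2, 1, 3, 1, 1, 1, 2, 1, 3, 1, 2, 1, 1, 3]

# Quadratic weights penalize large (1 vs 3) disagreements more than adjacent ones
kappa_w = cohen_kappa_score(kate, miguel, weights="quadratic")
print(f"weighted kappa = {kappa_w:.2f}")
```

In these illustrative scores, every disagreement is Kate's 2 against Miguel's 1, mirroring the pattern described in the diagnostics above.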
Using the flowchart from our categorical data iterative training process, we can see that our troubleshooting led us to revise our definition. After this revision, Kate and Miguel had acceptable agreement, and were able to proceed to data collection.
Example #2
Problem:
"Alex" was being trained to score intersucking (i.e. licking and sucking of another heifer/cow's udder or teats) in cattle. They needed to watch a video containing 2 pairs of animals and scan for any occurrences of intersucking. When their results were compared against those of a trained observer, "Carmen", the ICC was low (< 0.6).
Diagnostics:
Carmen reviewed the instances of disagreement and found that Alex was correctly identifying many occurrences of intersucking in pair #1, but was missing many that occurred in pair #2. When reviewing the video, Carmen saw that the missed occurrences were clear-cut instances of intersucking, so the reason for the mismatch was not obvious. Alex had previously been trained to score intersucking continuously in one animal at a time (ICC > 0.9), so the issue did not appear to be related to the definition. Carmen asked Alex to re-score the training video while narrating their thought process. During this, it became clear that Alex was unintentionally focusing mainly on pair #1 because these animals were much more active than pair #2.

While Alex initially tried to observe all 4 animals at once, the pair on the left was more active than the one on the right, leading them to accidentally focus mostly on the left side of the video.

Solution:
The scoring process was changed - Alex watched each video twice, first watching only pair #1, then again only watching pair #2. Under this new method, Alex's ICC values were >0.9.
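For continuous or count data like this, agreement is typically assessed with an intraclass correlation coefficient rather than kappa. A minimal sketch of one way to compute it, assuming the third-party pingouin library is available; the video segments and counts are illustrative, not Alex and Carmen's data:

```python
import pandas as pd
import pingouin as pg

# Hypothetical counts of intersucking bouts per video segment for two observers
data = pd.DataFrame({
    "segment":  list(range(1, 9)) * 2,
    "observer": ["trained"] * 8 + ["trainee"] * 8,
    "bouts":    [3, 0, 5, 2, 4, 1, 6, 2,    # trained observer
                 3, 0, 2, 1, 4, 0, 3, 2],   # trainee undercounts in some segments
})

# Long-format table of targets (segments), raters (observers), and ratings (counts)
icc = pg.intraclass_corr(data=data, targets="segment",
                         raters="observer", ratings="bouts")
print(icc[["Type", "Description", "ICC"]])
```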
Using the flowchart from our continuous data iterative training process, we can see that our troubleshooting led us to revise our data collection strategy. After this revision, Alex had acceptable agreement, and was able to proceed to data collection. 

Additional examples

There are many possible ways to resolve a reliability problem. We have provided 2 examples that both required revising an earlier stage of the process: changing a definition and changing a scoring strategy. Sometimes your data collection approach is very complex, and your team may simply require more orientation and re-training on your existing definitions and approach before they achieve acceptable reliability. Sometimes, despite many revisions, your approach remains too challenging, and you may need to convert a continuous variable into a categorical one, or drop the measure entirely. Troubleshooting takes many forms, and problems become easier to diagnose and anticipate with experience.