Troubleshooting.
Sometimes we run into challenges during reliability training: the calculated metric falls below the cutoff value (e.g. a kappa of 0.65 instead of 0.8), the task is not performed correctly when attempted independently, the metric exceeds our cutoff but visual observation reveals a clear mismatch or difficulty, and so on. These challenges are normal, but moving past them so you can continue with data collection requires first identifying the exact problem, diagnosing it, and taking actionable steps to fix it. This process can be difficult, and it is a skill that we find grows with experience.
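The agreement statistics referenced throughout this section can be computed with standard statistical libraries. As a minimal sketch (assuming the two sets of scores are simple Python lists, the scikit-learn package is available, and the values shown are made up for illustration), the kappa-versus-cutoff check might look like:

```python
from sklearn.metrics import cohen_kappa_score

# Scores from the trained observer and the trainee on the same test items
# (made-up values, for illustration only)
observer = [1, 2, 1, 3, 2, 1, 1, 2, 3, 1]
trainee  = [1, 2, 2, 3, 2, 1, 1, 1, 3, 1]

cutoff = 0.8  # agreement threshold chosen before training began
kappa = cohen_kappa_score(observer, trainee)

print(f"kappa = {kappa:.2f}")
if kappa < cutoff:
    print("Below cutoff - troubleshoot before moving on to data collection")
```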
Below, we provide examples of problems that we face in our reliability trainings and how we approach solving them.
Examples of identifying, diagnosing, and fixing problems:
General steps:
- How widespread is the issue - is one trainee having problems, or all of them?
- Look at the specific instances of disagreement between the trained observer and the trainee - can you identify any patterns? Is the issue limited to a single behavior, outcome, or example? (A cross-tabulation of the two scorers' results, as in the sketch after these lists, can make such patterns easy to spot.)
- Try asking the trainee to walk you through their process while completing the test - can you identify the issue?
- Ask a trusted colleague or expert to take your training - are they able to do it successfully? If not, the task or training process is likely too difficult
General solutions:
- The definition is not operational and may be misinterpreted or difficult to apply consistently - refine the definition
- The scoring process is too challenging - try to simplify the task (e.g. change the sampling method, number of animals, number of behaviors)
- The training videos are too confusing, or the scaffolded training has not yet prepared trainees for how to complete the task - try to obtain unambiguous training videos, or incorporate more layers of training to build more confidence in trainees
- There are not enough examples of a given behavior or outcome in the test, so scoring one incorrectly brings the entire score down - confirm the trainee can apply the definition correctly, then add more examples to the training
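To support the "look for patterns of disagreement" step above, a cross-tabulation of the two scorers' results is often the quickest diagnostic. A brief sketch, assuming pandas and hypothetical scores:

```python
import pandas as pd

# Side-by-side scores for each test item (made-up values, for illustration only)
scores = pd.DataFrame({
    "observer": [1, 2, 1, 3, 2, 1, 2, 2, 3, 1],
    "trainee":  [1, 1, 1, 3, 1, 1, 2, 1, 3, 1],
})

# Rows = observer's score, columns = trainee's score; off-diagonal cells show
# where (and in which direction) the two scorers disagree
print(pd.crosstab(scores["observer"], scores["trainee"]))

# Pull out the disagreeing items so they can be reviewed together
print(scores[scores["observer"] != scores["trainee"]])
```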
Example #1
Problem:
"Kate" and "Miguel" were developing a scoring system to evaluate hocks in dairy cattle. They developed definitions for score 1 ("normal" undamaged), score 2 ("moderate" damage), and score 3 ("severe" wound or swelling) hocks, and both took a 30 question test to evaluate whether they could both apply those definitions correctly, and whether their scores agreed with one another. After taking the test, they found they had low agreement.
Diagnostics:
In looking at the specific instances of disagreement, they realized they only differed in how they evaluated score 2 hocks - Kate had given many more score 2's than Miguel had. When they looked at the images together, they found that they had each been interpreting their own definition differently: a score 2 hock included hair loss > 1" in diameter, but Miguel took "hair loss" to mean "completely bald", while Kate interpreted it to mean "some hair loss such that you could see bare skin beneath the hair".
Solution:
They updated their definition of hair loss to mean "bald, or complete hair loss."
Using the flowchart from our categorical data iterative training process, we can see that our troubleshooting led us to revise our definition. After this revision, Kate and Miguel had acceptable agreement, and were able to proceed to data collection.
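Because hock scores like these are ordinal (a 1-vs-3 disagreement is worse than a 1-vs-2 disagreement), a weighted kappa is often a sensible choice for this kind of scale. A brief sketch with hypothetical scores for Kate and Miguel, assuming scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical hock scores (1 = normal, 2 = moderate, 3 = severe)
kate   = [1, 2, 2, 1, 3, 2, 1, 2, 2, 1]
miguel = [1, 1, 2, 1, 3, 1, 1, 2, 1, 1]

# Quadratic weights penalize a 1-vs-3 disagreement more heavily than a
# 1-vs-2 disagreement, which suits an ordinal scale like this one
kappa_w = cohen_kappa_score(kate, miguel, weights="quadratic")
print(f"weighted kappa = {kappa_w:.2f}")
```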
Example #2
Problem:
"Alex" was being trained to score intersucking (i.e. licking and sucking of another heifer/cow's udder or teats) in cattle. They needed to watch a video, which contained 2 pairs of animals, and scan for any occurrences of intersucking. When their results were compared against a trained observer, "Carmen", they had a low ICC (< 0.6).
Diagnostics:
Carmen reviewed the instances of disagreement and found that Alex was correctly identifying many occurrences of intersucking in pair #1, but was missing many that occurred in pair #2. When reviewing the video, Carmen saw that the missed occurrences were clear-cut instances of intersucking, so ambiguity in the footage did not explain the mismatch. Alex had previously been trained to score intersucking continuously in one animal at a time (ICC > 0.9), so the issue did not appear to be related to the definition. Carmen asked Alex to re-score the training video while narrating their thought process. During this, it became clear that Alex was unintentionally focusing mainly on pair #1 because these animals were much more active than pair #2.
Solution:
The scoring process was changed - Alex watched each video twice, first watching only pair #1, then again only watching pair #2. Under this new method, Alex's ICC values were >0.9.
Using the flowchart from our continuous data iterative training process, we can see that our troubleshooting led us to revise our data collection strategy. After this revision, Alex had acceptable agreement, and was able to proceed to data collection.
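For continuous measures like counts of intersucking events, the ICC can be computed from data in long format. A minimal sketch, assuming the pingouin package and hypothetical counts for Carmen and Alex:

```python
import pandas as pd
import pingouin as pg

# Intersucking counts per video segment for the trained observer (Carmen)
# and the trainee (Alex); values are hypothetical, for illustration only
data = pd.DataFrame({
    "segment": [1, 2, 3, 4, 5] * 2,
    "rater":   ["Carmen"] * 5 + ["Alex"] * 5,
    "count":   [4, 0, 2, 7, 1, 4, 1, 2, 6, 1],
})

icc = pg.intraclass_corr(data=data, targets="segment", raters="rater", ratings="count")
print(icc[["Type", "ICC", "CI95%"]])
```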
Additional examples
There are many different solutions when troubleshooting. We have provided 2 examples that both required revising an earlier stage of the process: changing definitions and changing scoring strategies. Sometimes your data collection approach is very complex, and your team may require more orientation and re-training on your existing definitions and approach before achieving acceptable reliability. Sometimes, despite many revisions, your approach remains too challenging, and you may need to change a continuous variable into a categorical one, or drop the measure entirely. Troubleshooting takes many forms, and becomes easier to diagnose and anticipate with experience.