Animal Behavior Reliability
  • Home
  • About
  • Foundations
    • Proposal
    • Measurements >
      • Definitions
    • Team makeup
    • Training >
      • Features of test subsets
      • Assessment
    • Metrics
  • Diving deeper
    • Iterative training processes >
      • Tasks and techniques
      • Categorical data
      • Continuous data
      • Rare outcomes
    • Timeline
    • Troubleshooting
    • Reporting
  • Checklist
  • Resources

Features of test subsets

Choosing the exercises, photos, and/or videos used in training often requires troubleshooting to ensure they achieve your goal of developing reliable trainees.
The first step in developing a test subset is to identify your goal: orientation or assessment.
If your goal is to orient trainees, familiarizing them with the outcomes of interest, your test subset should include clear-cut examples so as not to confuse them. At this stage, trainees do not need to identify every potential occurrence of your outcome, nor understand gray areas. The aim is to build confidence in the definitions and cultivate the ability to score successfully under ideal conditions. Depending on the complexity of the task and the skill level of the trainees, you may end up developing a more extensive orientation that involves short practice tests. In that case, keep the considerations for assessment test subsets (below) in mind.
If your goal is assessment, confirming reliability, there are several considerations regardless of the type of data being collected:
  • Variation
  • Duration
  • Modality
  • Sampling method
  • Randomization
Variation is key. Test subsets should capture the range of what trainees will encounter in independent work. This differs from the orientation stage, as trainees should now be comfortable with clear, straightforward examples, and should practice applying that knowledge under a range of conditions. There are a few types of variation that should be considered when developing a test subset:

First, outcomes with broad definitions should have examples that cover the range of ways the definition could be applied. For example, if self-grooming is defined as "touching hair with the tongue or mouth on the animal's own body; includes if the mouth is not visible but is directed toward the body and the head moves in a vertical motion", the test subset should include examples where the animal is self-grooming with the mouth visible and with it hidden, and examples where the animal is not self-grooming with the mouth visible and with it hidden. Ideally, each of these cases would be roughly equally represented in the test subset. Similarly, if a health scoring test includes a measure of nasal discharge, and a "score 1" outcome states that nasal discharge can be clear or cloudy, the test subset should include equal numbers of clear and cloudy examples.

Second, test subsets should capture diversity, usually by including examples of different individuals, as they may express a given behavior in unique ways, or outcomes may appear different depending on coat color, age, or size. For example, wounds may look different on black or white hair, as blood or scabbing may stand out more against a light background. If trainees will be expected to collect wound data on animals of many colors, then their training should include this same diversity.

Third, consider challenges associated with the tool used for observing the animals. For example, if animals are observed via a single camera, such that animals closer to the camera are bigger and easier to see, the test subset should include examples of animals close to the camera and farther away, as these may be more difficult to score. Other common challenges involve needing to track individuals across multiple camera angles, or dealing with variation in lighting when analyzing videos and photos. Providing practice and training with these expected challenges will help trainees.

Finally, test subsets should include numerous examples for each behavior or outcome, since reliability is evaluated for each outcome individually. Some reliability metrics, like kappa scores, have specific guidance about the number and variability of examples that should be included. 
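As a rough illustration of why the number of examples per outcome matters, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance; the scores below are hypothetical:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category independently.
    chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Hypothetical scores from an expert and a trainee on ten video clips.
expert  = ["groom", "rest", "groom", "rest", "rest", "groom", "rest", "groom", "rest", "rest"]
trainee = ["groom", "rest", "rest",  "rest", "rest", "groom", "rest", "groom", "groom", "rest"]
print(round(cohens_kappa(expert, trainee), 2))  # → 0.58
```

With too few examples of a rare category, the chance-agreement term is estimated poorly, which is one reason metric-specific guidance on subset size and variability exists.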
Test subsets should, ideally, cover a duration of training, observation, or measurement that is representative of real data collection. This decision should be balanced against practicality, as it should also be possible for trainees to complete the testing in a reasonable period of time.

For example, training observers on 10 minutes of video may not accurately reflect their reliability if data are ultimately scored across 24-hour periods. However, while it may be ideal for trainees to score 24 hours of video continuously to best match real data collection, this may not be practical if it takes 50 hours to complete, especially when potential re-scoring time is anticipated. Consider your training goals when making decisions about test duration and practicality.
When possible, match the modality of the test subset to the methodology used during data collection.

For example, if behavior or outcomes will be scored in person, training observers only on video and photos may not represent their ability to score the same outcomes in person. Time, efficiency, and efficacy can all be good reasons to deviate from this recommendation, particularly if certain outcomes (e.g., behaviors, wound types) are uncommon and unlikely to be observed in a live training session.


Test subsets should be scored in the same manner as they would be scored for true data collection. 

For example, training individuals to identify behavior using instantaneous sampling may not be appropriate if data are ultimately scored with a continuous approach. However, this will depend on practical considerations and your goals.
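To illustrate how the two sampling methods differ, the sketch below derives instantaneous (point) samples from a continuous behavior record; the bout times and behavior labels are invented for illustration:

```python
def instantaneous_sample(bouts, interval, total):
    """Instantaneous (point) sampling: record the ongoing behavior at fixed
    time points rather than scoring the full continuous record."""
    samples = []
    t = 0.0
    while t < total:
        # Find the bout (start, end, behavior) covering time t, if any.
        current = next((b for start, end, b in bouts if start <= t < end), None)
        samples.append((t, current))
        t += interval
    return samples

# Hypothetical continuous record (seconds): grooming 0-45, resting 45-120.
bouts = [(0, 45, "groom"), (45, 120, "rest")]
print(instantaneous_sample(bouts, 30, 120))
# → [(0.0, 'groom'), (30.0, 'groom'), (60.0, 'rest'), (90.0, 'rest')]
```

Note that the point samples miss exact bout boundaries, which is precisely why reliability established under one sampling method may not transfer to the other.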
When possible, the presentation order within test subsets should be shuffled or randomized between administrations.

When re-training or re-testing, caution is needed to prevent trainees from memorizing answers.
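As a minimal sketch (the clip names are hypothetical), each administration of a test subset can draw its own shuffled order:

```python
import random

# Hypothetical test items; file names are invented for illustration.
clips = [f"clip_{i:02d}.mp4" for i in range(1, 11)]

rng = random.Random(7)  # seed only so a session's order can be reproduced later
first_session = rng.sample(clips, k=len(clips))  # shuffled order for the first test
retest = rng.sample(clips, k=len(clips))         # an independent order for re-testing
```

Drawing a fresh order for each administration means trainees cannot rely on remembering the sequence of answers from a previous attempt.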