Metrics
Formally evaluating reliability allows others to assess our approach. Here, we describe the metrics we often use in our scientific papers and the rationale behind each.
Visualization and metrics
There are a few common metrics or strategies used to evaluate reliability. When deciding which metric to use, consider the type of data, the goals, and the pros and cons of each suitable method. In some cases, multiple metrics may be needed to provide robust and trustworthy information about reliability.
Step 1: Visualization
Visually checking data with graphical representation is an essential step. We recommend starting here and returning to these visualizations as you calculate metrics. Mismatches between metrics and the visual story can help identify problems.
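As a minimal sketch of this step, the snippet below plots two observers' scores for the same set of trials against a line of perfect agreement. The data are hypothetical stand-ins for your own coded observations; any plotting tool works equally well.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores from two observers coding the same 20 trials
rng = np.random.default_rng(42)
observer_a = rng.uniform(0, 10, size=20)
observer_b = observer_a + rng.normal(0, 0.8, size=20)  # observer B differs slightly

fig, ax = plt.subplots()
ax.scatter(observer_a, observer_b)
lims = [0, 12]
ax.plot(lims, lims, linestyle="--", color="gray", label="perfect agreement")
ax.set_xlabel("Observer A score")
ax.set_ylabel("Observer B score")
ax.legend()
plt.show()
```

Points hugging the dashed line suggest good agreement; systematic offsets or wide scatter flag problems worth investigating before trusting any single metric.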
Step 2: Identify an appropriate, robust metric for your data type
Reliability can be evaluated with descriptive tools and statistical tests. Using a combination may provide a more robust interpretation of observer consistency, but including at least one statistical test is often considered best practice. These metrics are typically calculated in statistical software such as R or SAS, though some can be generated by behavioral coding software such as BORIS or Noldus.
These tests should be evaluated for each outcome of interest individually, generating one reliability score per outcome.
If your data are categorical:
Descriptive statistics:
Statistical tests:
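As a minimal sketch for categorical data, the snippet below computes percent agreement as a descriptive measure and Cohen's kappa as a statistical test; these are common choices, though not necessarily the ones covered on the sub-pages, and the observer codes are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical behavior codes assigned by two observers to the same 12 events
observer_a = ["rest", "feed", "feed", "walk", "rest", "feed",
              "walk", "walk", "rest", "feed", "rest", "walk"]
observer_b = ["rest", "feed", "walk", "walk", "rest", "feed",
              "walk", "rest", "rest", "feed", "rest", "walk"]

# Descriptive: simple percent agreement
agreement = np.mean(np.array(observer_a) == np.array(observer_b))
print(f"Percent agreement: {agreement:.1%}")

# Statistical: Cohen's kappa, which corrects for chance agreement
print(f"Cohen's kappa: {cohen_kappa_score(observer_a, observer_b):.2f}")
```

Kappa corrects for the agreement expected by chance alone, which is why it is generally preferred over raw percent agreement for formal reporting.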
If your data are continuous:
Descriptive statistics:
Statistical tests:
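As a minimal sketch for continuous data, the snippet below computes the intraclass correlation coefficient (ICC), a common statistical test for this data type, using the third-party pingouin package (R's irr package offers an equivalent); the scores are hypothetical.

```python
import pandas as pd
import pingouin as pg

# Hypothetical continuous scores (e.g., durations in seconds) from two observers
scores = pd.DataFrame({
    "subject": list(range(8)) * 2,
    "observer": ["A"] * 8 + ["B"] * 8,
    "rating": [4.1, 5.3, 6.0, 3.2, 7.5, 5.8, 4.9, 6.4,
               4.3, 5.1, 6.2, 3.0, 7.8, 5.5, 5.0, 6.1],
})

# The returned table lists several ICC forms (single vs. averaged raters,
# consistency vs. absolute agreement); pick the one matching your design
icc = pg.intraclass_corr(data=scores, targets="subject",
                         raters="observer", ratings="rating")
print(icc[["Type", "ICC", "CI95%"]])
```

Which ICC form is appropriate depends on whether your raters are a fixed set or sampled from a larger population, and whether single or averaged ratings will be used downstream.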
Note: There are more methods and approaches than those listed here, such as including observer as a term in your statistical models. We endeavored to explain some of the most common metrics used for reliability testing, whether or not they are considered robust or best practice. We provide details, and warnings, about these approaches on each sub-page. If you have other approaches that you have used and think would be beneficial to include, please contact Dr. Blair Downey.
Step 3: Iterate as needed
Reliability training is an iterative process. Often, after calculating metrics, we find scores indicating that our team or method is not yet reliable. When this happens, we troubleshoot to identify the issue, which can be a tricky, nuanced step.
We gratefully acknowledge the review and valuable feedback provided by Dr. Andrew Blandino, Senior Statistician at UC Davis, on the metrics section of this website.