Reliability reference scripts

RABET 1.3.2 introduced the Reliability tab, which computes inter-rater and intra-rater agreement entirely inside the application using the pingouin Python package. This folder contains an independent R implementation that exists for two reasons:

Cross-language reproducibility. Researchers who run their statistics pipeline in R can verify the in-app numbers by re-computing the same agreement matrix with the canonical R packages (psych and optionally irr).
Reviewer transparency. When publishing reliability numbers from RABET, citing both the in-app pingouin computation and an independent R reference reassures reviewers that the agreement matrix is implementation-neutral.

What is provided

File	Purpose
`compute_agreement.R`	Stand-alone R script. Loads two `summary_table.csv` files, matches rows by `animal_id`, computes per-metric ICC(2,1), Pearson r, and mean absolute difference, then writes a results CSV. Mirrors RABET's Summary mode.

A Detailed-mode (time-window Cohen's kappa / Krippendorff's alpha) R reference will follow in a later release. The pingouin implementation inside RABET is the authoritative computation in the meantime.

Quick start

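# Install dependencies (once). A CRAN mirror is named explicitly because
# non-interactive Rscript sessions may not have one configured:
Rscript -e 'install.packages("psych", repos = "https://cloud.r-project.org")'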

# Reproduce the Summary-mode agreement matrix:
Rscript docs/reliability/compute_agreement.R \
        path/to/scorer_A_summary.csv \
        path/to/scorer_B_summary.csv \
        reliability_summary_R.csv

The script prints the per-metric agreement table and writes it to the output CSV. If no third argument is given, the output defaults to reliability_summary_R.csv in the current working directory.

Definitions

ICC(2,1) here refers to Pingouin's ICC2 output, corresponding to the ICC2 row returned by psych::ICC: a single-rater, absolute-agreement ICC.

Terminology differs across ICC conventions and software packages. In the Shrout and Fleiss / Pingouin convention, ICC2 treats raters as random and ICC3 treats raters as fixed. Some McGraw and Wong-style labels can map the same numerical form to two-way random or two-way mixed absolute-agreement interpretations. Therefore, RABET reports the software label (ICC2) and the form ICC(2,1), and users should interpret the fixed/random rater assumption according to their study design.
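To make the ICC(2,1) form concrete, the following is a minimal sketch of the textbook Shrout and Fleiss formula computed from two-way ANOVA mean squares. It is not the code used by RABET, pingouin, or psych::ICC; the function name `icc2_1` and the list-of-rows input layout are assumptions made for illustration. The example data are the six-subject, four-judge table from Shrout and Fleiss (1979).

```python
def icc2_1(ratings):
    """ICC(2,1): single-rater, absolute-agreement ICC (Shrout & Fleiss form).

    `ratings` holds one row per subject (animal) and one column per rater.
    Assumes a complete table: no missing scores, n >= 2 subjects, k >= 2 raters.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Example table from Shrout and Fleiss (1979): 6 subjects rated by 4 judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(shrout_fleiss), 2))  # prints 0.29
```

For a two-scorer comparison as in Summary mode, each row would hold the two scorers' values for one animal, so k = 2.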

Pearson r is the standard product-moment correlation across the matched animals.

Mean absolute difference is mean(abs(A - B)) over animals present in both summary files.
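The two simpler statistics can be sketched in a few lines. The example below assumes each scorer's values for one metric have already been read into a dictionary keyed by `animal_id`; the function name `paired_agreement`, the animal IDs, and the numbers are hypothetical. The `unmatched_a` and `unmatched_b` keys are named after the fields mentioned for RABET's status panel, but how RABET fills them is not shown here.

```python
import math


def paired_agreement(scorer_a, scorer_b):
    """Pearson r and mean absolute difference over animals present in both inputs.

    `scorer_a` and `scorer_b` map animal_id -> value for a single metric.
    Assumes at least one animal is shared between the two inputs.
    """
    shared = sorted(scorer_a.keys() & scorer_b.keys())
    xs = [scorer_a[i] for i in shared]
    ys = [scorer_b[i] for i in shared]
    n = len(shared)

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    # r is undefined when either scorer's values are constant.
    denom = math.sqrt(var_x * var_y)
    r = cov / denom if denom > 0 else float("nan")

    mad = sum(abs(x - y) for x, y in zip(xs, ys)) / n

    return {
        "n": n,
        "pearson_r": r,
        "mean_abs_diff": mad,
        "unmatched_a": sorted(scorer_a.keys() - scorer_b.keys()),
        "unmatched_b": sorted(scorer_b.keys() - scorer_a.keys()),
    }


# Hypothetical values for one metric from two scorers.
a = {"m01": 10.0, "m02": 20.0, "m03": 30.0, "m04": 5.0}
b = {"m01": 12.0, "m02": 18.0, "m03": 33.0, "m05": 7.0}
result = paired_agreement(a, b)
print(result)
```

Only m01, m02, and m03 enter the statistics here; m04 and m05 are reported as unmatched rather than silently dropped.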

Expected differences from RABET's in-app output

The two implementations should agree to within ~1e-6 for ICC and r, and exactly for mean absolute difference. Larger discrepancies usually mean:

One side dropped an animal that the other kept (check unmatched_a / unmatched_b in RABET's status panel and the R script's stdout).
Numerical precision differences in how the underlying linear-mixed model solver handles degenerate inputs (e.g. all-zero columns).

If the values diverge by more than ~1e-6 and neither cause above explains the difference, please open an issue at https://github.com/mi2e-K/RABET/issues with both CSVs attached.
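Before filing an issue, the comparison described in this section can be automated. The sketch below assumes the in-app and R results have been loaded into nested dictionaries of the form metric -> statistic -> value; the function name `find_discrepancies`, the statistic keys, and the metric names are hypothetical and do not reflect the actual column names in either output file.

```python
import math

# Tolerances taken from the text: ~1e-6 for ICC and r, exact for mean absolute difference.
TOLERANCES = {"icc": 1e-6, "pearson_r": 1e-6, "mean_abs_diff": 0.0}


def find_discrepancies(in_app, r_reference, tolerances=TOLERANCES):
    """Return (metric, statistic, in_app_value, r_value) tuples that exceed tolerance."""
    flagged = []
    for metric in sorted(in_app.keys() & r_reference.keys()):
        for stat, tol in tolerances.items():
            a = in_app[metric][stat]
            b = r_reference[metric][stat]
            # Treat "undefined in both implementations" as agreement.
            if math.isnan(a) and math.isnan(b):
                continue
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tol):
                flagged.append((metric, stat, a, b))
    return flagged


# Hypothetical results for two metrics.
in_app = {
    "attack_duration": {"icc": 0.9123457, "pearson_r": 0.95, "mean_abs_diff": 1.5},
    "latency": {"icc": 0.80, "pearson_r": 0.85, "mean_abs_diff": 2.0},
}
r_reference = {
    "attack_duration": {"icc": 0.9123459, "pearson_r": 0.95, "mean_abs_diff": 1.5},
    "latency": {"icc": 0.81, "pearson_r": 0.85, "mean_abs_diff": 2.0},
}
flagged = find_discrepancies(in_app, r_reference)
print(flagged)
```

An exact check on mean absolute difference can be too strict if either CSV rounds its output, so the tolerance table is passed as a parameter and can be loosened.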