# Artifact Layout

## Overview

| Aspect | Details |
| --- | --- |
| Purpose | Explain where evaluation outputs and reports live. |
| Audience | Operators archiving evidence and CI outputs. |
| Scope | runs/ scratch outputs and reports/ long-lived evidence. |
| Source of truth | src/invarlock/core/evaluate_contract.py, src/invarlock/core/evaluate_plan.py, src/invarlock/reporting/report_make.py, src/invarlock/reporting/report_bundle.py, src/invarlock/reporting/report_console.py, src/invarlock/reporting/report_files.py, src/invarlock/cli/commands/evaluate.py |
## Quick Start

```bash
# Compare baseline and subject on the default runtime-container path
invarlock evaluate --allow-network \
  --baseline gpt2 \
  --subject gpt2 \
  --report-out reports/eval

# Render HTML from the emitted evaluation bundle
invarlock report html -i reports/eval/evaluation.report.json -o reports/eval/evaluation.html
invarlock report explain --evaluation-report reports/eval/evaluation.report.json
```

Model-loading commands use the runtime container by default unless a
host-side invarlock evaluate --execution-mode host workflow explicitly
bypasses it.

Repo-owned presets under configs/ remain available for maintainers, but the
quick-start path above stays wheel-compatible by using direct flags only.
## Concepts

- runs/ is scratch space: evaluate emits baseline/subject working artifacts there.
- reports/ is evidence: archive evaluation.report.json and runtime.manifest.json
  for audit, plus any HTML or evidence-pack outputs you distribute.
- Evaluation bundles reference baseline/subject report artifacts; keep them
  together to preserve pairing and make later review easier (see the sketch
  after this list).
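
To see which sibling artifacts a bundle expects, a minimal sketch like the
following can help. It assumes only that the bundle is JSON and that referenced
artifacts appear somewhere in it as path strings relative to the report's
directory; no specific schema keys are assumed.

```python
# Minimal sketch: list files referenced by an evaluation bundle so the
# baseline/subject artifacts can be kept (and archived) next to it.
# Assumption: referenced artifacts are recorded as path strings relative to
# the report's directory; no particular JSON keys are assumed.
import json
from pathlib import Path

report_path = Path("reports/eval/evaluation.report.json")
bundle = json.loads(report_path.read_text())

def strings_in(node):
    """Yield every string value found anywhere in the parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from strings_in(value)
    elif isinstance(node, list):
        for value in node:
            yield from strings_in(value)

referenced = sorted(
    {s for s in strings_in(bundle) if (report_path.parent / s).is_file()}
)
for rel in referenced:
    print(rel)  # keep these alongside evaluation.report.json
```

Anything the sketch prints is a pairing dependency for later review and
report explain.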
## Command outputs

| Command | Writes | What to archive |
| --- | --- | --- |
| invarlock evaluate | runs/, reports/<name>/evaluation.report.json, runtime.manifest.json | Evaluation report bundle plus runtime provenance for container-backed runs. |
| invarlock report html | reports/<name>/evaluation.html | Optional (can be rebuilt). |
## Reference

### Evaluate scratch outputs (runs/)

```text
runs/
  baseline/
    ...
  subject/
    ...
```

### Evaluation reports (reports/)

```text
reports/
  eval/
    evaluation.report.json
    runtime.manifest.json
    evaluation.html
```
### Archive checklist

- Keep evaluation.report.json with runtime.manifest.json (a minimal copy sketch
  follows the table below).
- Retain HTML exports only when you need reviewer-friendly artifacts.
- Retain scratch runs/ only if debugging or rebuilding derived artifacts.
- Prune timestamped runs/ once evidence is archived.
| Artifact | Why archive | Required for verify |
| --- | --- | --- |
| evaluation.report.json | Evaluation report snapshot | Yes |
| runtime.manifest.json | Runtime provenance for container-backed outputs | Yes |
| events.jsonl | Debugging timeline | No |
| evaluation.html | Human review | No |
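
The checklist above can be scripted. The sketch below follows the reports/
layout shown earlier and treats only the artifacts marked Yes as required;
the archive destination is a hypothetical example.

```python
# Minimal sketch: copy the must-keep evidence into a retention directory.
# The destination path is hypothetical; adjust to your archive convention.
import shutil
from pathlib import Path

eval_dir = Path("reports/eval")
archive_dir = Path("archive/eval-2024-01-01")  # hypothetical retention location
archive_dir.mkdir(parents=True, exist_ok=True)

required = ["evaluation.report.json", "runtime.manifest.json"]  # Required for verify: Yes
optional = ["evaluation.html", "events.jsonl"]                  # Required for verify: No

for name in required:
    shutil.copy2(eval_dir / name, archive_dir / name)

for name in optional:
    src = eval_dir / name
    if src.is_file():  # copied only if present at this location
        shutil.copy2(src, archive_dir / name)
```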
### Seeds, hashes, and policy digests

- report.meta.seeds includes Python/NumPy/Torch seeds.
- report.meta.tokenizer_hash and dataset digests support pairing verification.
- Reports record policy_digest and resolved tier policy snapshots; a read-back
  sketch follows this list.
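
A minimal read-back sketch for these fields is shown below. The nesting
(meta.seeds, meta.tokenizer_hash, and a top- or meta-level policy_digest)
mirrors the names above; treat the exact key paths as an assumption about the
bundle version you have.

```python
# Minimal sketch: surface the reproducibility fields named above.
# Assumption: seeds and tokenizer_hash live under report["meta"]; the exact
# location of policy_digest may differ between bundle versions.
import json
from pathlib import Path

report = json.loads(Path("reports/eval/evaluation.report.json").read_text())
meta = report.get("meta", {})

print("seeds:", meta.get("seeds"))                    # Python/NumPy/Torch seeds
print("tokenizer_hash:", meta.get("tokenizer_hash"))  # supports pairing verification
print("policy_digest:", report.get("policy_digest", meta.get("policy_digest")))
```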
### Cleanup checklist

- Copy evaluation.report.json and runtime.manifest.json into reports/ for retention.
- Keep any referenced baseline/subject artifacts alongside derived reports for
  pairing checks and report explain.
- Remove stale timestamped runs once evidence is archived (see the pruning
  sketch after this list).
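
Pruning can be scripted once the evidence is archived. A minimal sketch,
assuming each timestamped run is a top-level directory under runs/ and a
fourteen-day retention window; both assumptions are illustrative.

```python
# Minimal sketch: prune stale scratch runs after evidence has been archived.
# Assumptions (illustrative): each run is a top-level directory under runs/,
# and anything older than the retention window can be deleted.
import shutil
import time
from pathlib import Path

runs_dir = Path("runs")
retention_days = 14  # hypothetical retention window
cutoff = time.time() - retention_days * 86400

if runs_dir.is_dir():
    for run in runs_dir.iterdir():
        if run.is_dir() and run.stat().st_mtime < cutoff:
            shutil.rmtree(run)  # irreversible: only run after archiving reports/
```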
## Troubleshooting

- Missing pairing artifacts: report explain and some advanced workflows need
  the baseline/subject artifacts referenced by the evaluation bundle.
- Large run dirs: prune old timestamped runs after archiving reports.
## Observability

- evaluation.report.json is the canonical distribution artifact.
- Scratch run artifacts provide per-phase logs for debugging when needed (see
  the timeline-reading sketch below).
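
When debugging, the per-phase timeline in events.jsonl can be skimmed
directly. A minimal sketch, assuming JSON Lines format; the path below is
hypothetical, since the exact location of events.jsonl inside a run directory
is not fixed here.

```python
# Minimal sketch: print each event from a run's events.jsonl (JSON Lines
# assumed). The path is hypothetical; locate events.jsonl in your run dir.
import json
from pathlib import Path

events_path = Path("runs/baseline/events.jsonl")  # hypothetical location
for line in events_path.read_text().splitlines():
    if line.strip():
        print(json.loads(line))
```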