Evidence Pack Internals¶
This guide explains how the evidence pack suite is wired internally: entrypoints, task graph, scheduling, and artifact generation. It complements Evidence Packs, which focuses on how to run a suite.
Scope note: in this guide, `CALIBRATION_RUN -> GENERATE_PRESET` is called Preset Derivation. It produces run-scoped `calibrated_preset_<model>.yaml/json` files and does not directly modify global `runtime/tiers.yaml`.
Overview¶
| Aspect | Details |
|---|---|
| Purpose | Hardware-agnostic Phase 0 validation harness for edit detection |
| Version | evidence-packs-v1 |
| Hardware | NVIDIA GPUs where models fit VRAM; multi-GPU recommended for full |
| Models | subset (1 model), showcase/workshop3 (3 models), or full (6 models); all ungated public |
| Edits | Scenario-driven; default suites use 4 clean + 4 stress edit scenarios per model, and filtered manifests may select any subset |
| Preset Derivation | CALIBRATION_RUN + GENERATE_PRESET create run-scoped calibrated presets |
| Scheduling | Dynamic work-stealing, small_first priority strategy |
| Multi-GPU | Profile-based; required_gpus grows only when memory requires it |
| Output | Evidence pack with manifest.json, checksums.sha256, and report bundles (--layout v2 nests results + metadata) |
| Source of truth | scripts/evidence_packs/run_suite.sh, scripts/evidence_packs/run_pack.sh, src/invarlock/evidence_pack.py, src/invarlock/cli/commands/evidence_pack.py |
Quick Start (Context)¶
# Run the subset suite (offline by default)
./scripts/evidence_packs/run_suite.sh --suite subset
# Run the full suite and build an evidence pack
./scripts/evidence_packs/run_pack.sh --suite full --net 1
# Verify an existing evidence pack
invarlock advanced evidence-pack verify ./evidence_pack_runs/subset_20250101_000000/evidence_pack --strict
Hardware Target¶
- Hardware-agnostic by design; run on any NVIDIA GPU topology where the models fit in VRAM.
- Multi-GPU scheduling is enabled automatically when a task’s memory plan exceeds per-device capacity.
- Set `GPU_MEMORY_GB` or `GPU_MEMORY_PER_DEVICE` to match your hardware when running on GPUs with unusual memory sizes.
Entrypoints and modules¶
Entrypoints¶
- `scripts/evidence_packs/run_suite.sh` runs a suite and sets `PACK_*` runtime flags before calling the main orchestrator.
- `scripts/evidence_packs/run_pack.sh` runs a suite, then packages artifacts into a portable evidence pack (manifest + checksums + reports).
- `scripts/evidence_packs/verify_pack.sh` validates an evidence pack in repo workflows.
- `invarlock advanced evidence-pack verify` provides the package-native verifier path for installed wheels.
- `scripts/evidence_packs/suites.sh` defines the model suites and allows `MODEL_1`–`MODEL_8` overrides.
- `scripts/evidence_packs/lib/validation_suite.sh` orchestrates the run: preflight, queue creation, worker launch, and monitoring.
Library modules¶
- `lib/task_serialization.sh`: task schema, JSON helpers, GPU planning.
- `lib/queue_manager.sh`: queue states, dependency resolution, task generation.
- `lib/scheduler.sh`: dynamic priority, memory gating, reservations.
- `lib/gpu_worker.sh`: worker loop, heartbeats, task execution glue.
- `lib/task_functions.sh`: implementations for each task type.
- `lib/model_creation.sh`: edit and error-model creation helpers (`create_model_variant` dispatcher).
- `lib/config_generator.sh`: InvarLock config generation and wrapper helpers.
- `lib/result_compiler.sh`: analysis and verdict compilation.
- `lib/fault_tolerance.sh`: error classification and retry/backoff logic.
- `scripts/evidence_packs/python/manifest_writer.py`: evidence pack `manifest.json` writer.
- `scripts/evidence_packs/python/preset_generator.py`: preset derivation + edit-type variants.
Module dependency graph¶
┌───────────────────────────────────────────────────────────────────────┐
│ MODULE DEPENDENCY GRAPH │
├───────────────────────────────────────────────────────────────────────┤
│ ENTRYPOINTS │
│ run_pack.sh | run_suite.sh | verify_pack.sh │
│ (pack+run) | (run only) | (checksums+reports verify) │
│ │ │
│ ▼ │
│ ORCHESTRATION LAYER │
│ lib/validation_suite.sh (main_dynamic) │
│ Phase 0: setup + preflight │
│ Phase 1: queue init -> Phase 2: worker launch -> Phase 3: monitor │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ TASK EXECUTION CORE SERVICES │
│ lib/gpu_worker.sh queue_manager │
│ task claim -> precheck -> execute -> cleanup scheduler │
│ task_serialization │
│ fault_tolerance │
│ │ │
│ ▼ │
│ TASK FUNCTIONS │
│ SETUP_BASELINE, CALIBRATION_RUN, GENERATE_PRESET │
│ CREATE_EDITS(_BATCH), CREATE_ERROR, evaluate_* │
└───────────────────────────────────────────────────────────────────────┘
Troubleshooting decision tree¶
Evidence pack issues?
│
├─ Missing manifest.json/checksums.sha256?
│ └─ Used run_suite.sh instead of run_pack.sh
│ → Run: ./scripts/evidence_packs/run_pack.sh --suite ... --net ...
│
├─ Spectral guard failing “clean” quantization edits?
│ ├─ Check: caps_exceeded in report spectral.summary
│ │ └─ Use edit-type presets (generated from preset derivation) or increase max_caps
│ └─ Check: high z-scores in attention layers
│ └─ Expected for quantization; tune thresholds if needed
│
├─ OOM errors?
│ ├─ Lower GPU_MEMORY_PER_DEVICE / GPU_MEMORY_GB
│ ├─ Disable batching: PACK_USE_BATCH_EDITS=false
│ └─ Reduce InvarLock batch/seq_len (INVARLOCK_EVAL_BATCH, INVARLOCK_SEQ_LEN)
│
└─ Disk pressure / ENOSPC?
├─ Check OUTPUT_DIR filesystem free space
└─ Use a larger volume and rerun (suite writes caches under OUTPUT_DIR/.hf)
Model Suite¶
Model suites are defined in scripts/evidence_packs/suites.sh and applied by
run_suite.sh.
| Suite | Models | Notes |
|---|---|---|
| `subset` | `mistralai/Mistral-7B-v0.1` | Single-GPU friendly |
| `showcase` | `mistralai/Mistral-7B-v0.1`, `Qwen/Qwen2.5-14B`, `Qwen/Qwen2.5-32B` | Multi-GPU recommended; guard-focused scenarios |
| `workshop3` | `mistralai/Mistral-7B-v0.1`, `mistralai/Mixtral-8x7B-v0.1`, `01-ai/Yi-34B` | Workshop-friendly 3-model suite (architecture diversity) |
| `full` | `mistralai/Mistral-7B-v0.1`, `Qwen/Qwen2.5-14B`, `Qwen/Qwen2.5-32B`, `01-ai/Yi-34B`, `mistralai/Mixtral-8x7B-v0.1`, `Qwen/Qwen1.5-72B` | Multi-GPU recommended |
Default full-suite model sizes (weights-only, approximate):
| Model | VRAM | Category | Notes |
|---|---|---|---|
| `mistralai/Mistral-7B-v0.1` | ~14 GB | Small | Flash Attention 2 compatible |
| `Qwen/Qwen2.5-14B` | ~28 GB | Small | Flash Attention 2 compatible |
| `Qwen/Qwen2.5-32B` | ~64 GB | Medium | Flash Attention 2 compatible |
| `01-ai/Yi-34B` | ~68 GB | Medium | Flash Attention 2 compatible |
| `mistralai/Mixtral-8x7B-v0.1` | ~90 GB | MoE | MoE architecture |
| `Qwen/Qwen1.5-72B` | ~144 GB | Large | Flash Attention 2 compatible |
Notes:
- Override models via `MODEL_1`–`MODEL_8`; set an empty string to disable a slot.
- `validation_suite.sh` includes a fallback list of large causal models if it is run directly without `suites.sh`.
Edit Types¶
Each model runs 8 edit experiments (4 types × 2 versions) plus optional error injection tests.
Clean edits (tuned)¶
Clean edits use tuned parameters supplied via `PACK_TUNED_EDIT_PARAMS_FILE`. The suite uses `:clean:` as a sentinel in the edit spec and resolves concrete parameters at runtime.
| Edit Type | Parameters | Scope |
|---|---|---|
| Quantization RTN | tuned (`bitwidth`, `group_size`) from tuned params file | FFN only |
| FP8 Quantization | tuned (`format`) from tuned params file | FFN only |
| Magnitude Pruning | tuned (`prune_level`) from tuned params file | FFN only |
| Low-Rank SVD | tuned (`rank`) from tuned params file | FFN only |
Stress edits¶
Stress edits are split into required-fail (catastrophic) and informational scenarios. Required-fail scenarios are gating in the final verdict; informational scenarios are tracked as detection-quality signals and are validated by a minimum signal-fraction criterion.
Important nuance: some guards remediate without flipping a boolean validation gate. For
example, Spectral can remain validation.spectral_stable=true while applying caps
(spectral.caps_applied > 0). Informational stress scenarios treat both hard gate flips
and remediation events (caps applied) as a “signal” so the suite measures guard activity
without manufacturing clean false positives.
| Edit Type | Parameters | Scope |
|---|---|---|
| Quantization RTN | `quant_rtn:8:all` (8-bit) | All layers |
| FP8 Quantization | `fp8_quant:e5m2:all` | All layers |
| Magnitude Pruning | `magnitude_prune:0.5:all` (50% sparsity) | All layers |
| Low-Rank SVD | `lowrank_svd:32:all` (rank 32) | All layers |
Error injection tests¶
Enabled when RUN_ERROR_INJECTION=true (default):
- Required detection (`must_detect`): `nan_injection`, `inf_injection`, `shape_mismatch`, `missing_tensors`, `extreme_quant`, `scale_explosion`, `rank_collapse`, `norm_collapse`, `weight_tying_break`
- Informational detection: `rmt_norm_noise`, `spectral_moderate_scale`, `ve_mlp_scale_skew`
rmt_norm_noise additionally emits an rmt_probe.json sidecar next to the error report.
This runs an explicit cross-model RMT probe on shared calibration windows (stored in the
baseline report) so the evidence pack can demonstrate RMT’s delta policy even when compare-mode
evaluation keeps validation.rmt_stable=true.
ve_mlp_scale_skew additionally emits a ve_probe.json sidecar next to the error report.
Variance (DD-VE) is a remediation guard and compare-mode evaluation runs the subject model
with a no-op edit, which can mute VE’s in-report evidence. The VE probe runs VE calibration
directly on shared windows and records whether VE proposes scales and produces a meaningful
primary-metric improvement.
Source of truth: `scripts/evidence_packs/scenarios.json` (strictness, intent, and `primary_guard` metadata).
Scheduling¶
The suite uses dynamic work-stealing scheduling with a file-backed task queue.
validation_suite.sh seeds the queue and launches one worker per GPU; workers
claim tasks under a scheduler lock with GPU reservation files.
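The file-backed claim step can be sketched with plain filesystem primitives. This is an illustrative toy, not the real `lib/queue_manager.sh` logic: it relies on `mv` within one filesystem being atomic, so at most one worker wins a given task file.

```shell
#!/usr/bin/env bash
# Toy sketch of a file-backed queue claim. Directory names mirror the
# suite's queue layout; everything else is illustrative.
q=$(mktemp -d)
mkdir -p "$q/ready" "$q/running"
echo '{"task_id":"demo_001"}' > "$q/ready/demo_001.json"

claim_task() {
  local f
  for f in "$q"/ready/*.json; do
    [ -e "$f" ] || return 1          # glob matched nothing: queue is empty
    if mv "$f" "$q/running/" 2>/dev/null; then
      echo "claimed $(basename "$f")"  # atomic rename: only one worker wins
      return 0
    fi
  done
  return 1
}

claim_task                           # first call claims the task
claim_task || echo "queue empty"     # second call finds nothing
```

The real scheduler layers a `queue/scheduler.lock` and GPU reservation files on top of this, so a claim also reserves the memory the task plans to use.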
small_first priority strategy¶
Base task priorities (queue manager) are combined with dynamic boosts in
scheduler.sh (model size, blocked dependents, age, and fairness penalties).
Priority (base) Task type
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
90 ┤ SETUP_BASELINE
85 ┤ CALIBRATION_RUN
75 ┤ GENERATE_PRESET
70 ┤ CREATE_EDITS_BATCH / CREATE_EDIT
65 ┤ evaluate_EDIT
60 ┤ CREATE_ERROR
55 ┤ evaluate_ERROR
Dynamic boosts (scheduler):
- Model size boosts: <30GB (+30), <70GB (+20), <100GB (+10).
- Critical tasks: `SETUP_BASELINE` (+50), `CALIBRATION_RUN` (+20).
- Unblock boost: +2 per dependent task (capped).
- Age boost: +1 per 5 minutes in the queue (capped).
- Fairness penalty: -3 per running task for the same model (capped).
- Work-stealing boost: raises priority for lagging models.
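The combined arithmetic can be sketched as follows. The function name, the cap values, and the exact combination order are assumptions for illustration; the real logic lives in `lib/scheduler.sh`.

```shell
#!/usr/bin/env bash
# Illustrative sketch of base priority + dynamic boosts.
# Cap values (10/10/9) are assumed, not taken from scheduler.sh.
effective_priority() {
  local base=$1 model_gb=$2 dependents=$3 age_min=$4 peers=$5
  local p=$base
  if   (( model_gb < 30 ));  then p=$(( p + 30 ))   # small-model boost
  elif (( model_gb < 70 ));  then p=$(( p + 20 ))
  elif (( model_gb < 100 )); then p=$(( p + 10 ))
  fi
  local unblock=$(( dependents * 2 )); if (( unblock > 10 )); then unblock=10; fi   # assumed cap
  local age=$(( age_min / 5 ));        if (( age > 10 )); then age=10; fi           # assumed cap
  local fairness=$(( peers * 3 ));     if (( fairness > 9 )); then fairness=9; fi   # assumed cap
  echo $(( p + unblock + age - fairness ))
}

# CALIBRATION_RUN (base 85) for a 14 GB model with 3 blocked dependents,
# queued 10 minutes, and 1 other running task for the same model:
effective_priority 85 14 3 10 1   # 85 + 30 + 6 + 2 - 3 = 123... prints 120
```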
Dynamic scheduling diagram¶
run_pack.sh (optional)
-> run_suite.sh
-> validation_suite.sh (main_dynamic)
-> init_queue + generate_all_tasks
-> start gpu_worker per GPU
-> monitor loop (resolve deps, progress, restarts)
Work-stealing timeline (illustrative)¶
Time→ T=0 T=50% T=100%
GPU 0 ████ small ████ small ████ large (helping) ████░░░░░░
GPU 1 ████ small ████ medium ████ large (helping) ███░░░░░░
GPU 2 ████ small ████ medium ████ large ████░░░░░░░░░░░░░░░
GPU 3 ████ medium ████ medium ████ large ████░░░░░░░░░░░░░░
GPU 4 ████ medium ████ large ████████████████░░░░░░░░░░░░░░
GPU 5 ████ MoE ████████ large ████████████████░░░░░░░░░░░░░
Illustrative only; actual scheduling depends on queue state and memory.
Multi-GPU Model Distribution¶
After baseline setup, the suite writes model_profile.json and updates per-task
memory estimates. task_serialization.sh calculates required_gpus based on
GPU_MEMORY_PER_DEVICE and NUM_GPUS:
- Tasks reserve multiple GPUs only when memory exceeds per-device capacity.
- Adaptive under-allocation is disabled by default (`get_minimum_gpus` matches `required_gpus`) to avoid OOM.
- Set `GPU_MEMORY_PER_DEVICE` explicitly for non-80/180GB hardware.
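The per-task GPU count is a ceiling division of the planned memory by per-device capacity. A minimal sketch (the helper name is illustrative; see `lib/task_serialization.sh` for the real planner, which also accounts for overhead knobs):

```shell
#!/usr/bin/env bash
# Sketch of the required_gpus calculation: ceil(task_mem / per_device).
required_gpus() {
  local task_mem_gb=$1 per_device_gb=$2
  echo $(( (task_mem_gb + per_device_gb - 1) / per_device_gb ))  # integer ceiling
}

required_gpus 28 80    # a 14B-class model fits on one 80 GB device
required_gpus 144 80   # 72B-class weights span two devices
```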
Memory-aware selection example¶
GPU 2: 80GB total, 28GB free
Ready queue scan (highest-priority fit):
qwen-14b_CALIBRATION_RUN_002 req=24GB pri=85 FITS ✓
mixtral_CREATE_EDITS_BATCH_001 req=92GB pri=70 SKIP ✗
yi-34b_evaluate_EDIT_001 req=72GB pri=65 SKIP ✗
GPU reservation protection¶
Reservations are stored under OUTPUT_DIR/workers/gpu_reservations/ and guarded
by a queue/scheduler.lock (mkdir-based). The scheduler also expires stale
reservations by TTL (GPU_RESERVATION_TTL).
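The mkdir-based lock pattern can be sketched as below. This simplified toy assumes GNU `stat` and omits the reservation metadata files; the real implementation in the scheduler library also records which GPUs each task holds.

```shell
#!/usr/bin/env bash
# Minimal sketch of an mkdir-based lock with TTL expiry, mirroring the
# queue/scheduler.lock pattern. mkdir is atomic: it fails if the
# directory already exists, so only one process can hold the lock.
LOCK_DIR="$(mktemp -d)/scheduler.lock"
TTL=60

acquire_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    return 0                                  # we own the lock
  fi
  # Expire a stale lock older than TTL seconds, then retry once.
  local mtime
  mtime=$(stat -c %Y "$LOCK_DIR" 2>/dev/null || echo 0)
  local age=$(( $(date +%s) - mtime ))
  if (( age > TTL )); then
    rmdir "$LOCK_DIR" 2>/dev/null && mkdir "$LOCK_DIR" 2>/dev/null
    return $?
  fi
  return 1                                    # lock is held and fresh
}

release_lock() { rmdir "$LOCK_DIR"; }

acquire_lock && echo "lock acquired"
release_lock
```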
Reservation state example¶
GPU 0 GPU 1 GPU 2 GPU 3
FREE RSVD FREE RSVD
^ ^
| |
task_a task_b (multi-GPU: 1,3)
queue/scheduler.lock
workers/gpu_reservations/
├── gpu_1.lock
├── task_<task_id>.gpus
└── task_<task_id>.meta
Task lifecycle¶
┌─────────┐ ┌───────┐ ┌─────────┐ ┌───────────┐
│ PENDING │───▶│ READY │───▶│ RUNNING │───▶│ COMPLETED │
└─────────┘ └───────┘ └─────────┘ └───────────┘
│
▼
┌────────┐
│ FAILED │
└────────┘
GPU worker loop¶
START gpu_worker
│
├─ check shutdown? ── yes → exit
│
├─ query GPU memory
├─ find_and_claim_task (scheduler lock + reservation)
│ ├─ none → sleep → loop
│ └─ task → execute_task → complete/fail → release_gpus
└─ update heartbeat/status → loop
Batch optimizations¶
Small/medium models default to batch edit creation:
- Batch edit creation: `CREATE_EDITS_BATCH` loads a model once and creates all 8 edits (cuts repeated model loads).
Large or MoE models disable batch edits automatically (or via
PACK_USE_BATCH_EDITS=false) and fall back to per-edit tasks
(CREATE_EDIT → evaluate_EDIT).
Task dependency graphs¶
Batch (default):
SETUP_BASELINE
├─ CALIBRATION_RUN × N ──> GENERATE_PRESET ──┐
├─ CREATE_EDITS_BATCH ------------------------┴─> evaluate_EDIT × runs
└─ CREATE_ERROR × types ----------------------┴─> evaluate_ERROR × types
Notes:
- Error injection tasks (`CREATE_ERROR` → `evaluate_ERROR`) branch off `SETUP_BASELINE` and require the preset for evaluation.
Per-edit path (large/MoE or PACK_USE_BATCH_EDITS=false):
SETUP_BASELINE
├─ CALIBRATION_RUN × N ──> GENERATE_PRESET ──┐
├─ CREATE_EDIT × edits -----------------------┴─> evaluate_EDIT × runs
└─ CREATE_ERROR × types ----------------------┴─> evaluate_ERROR × types
Task breakdown per model (defaults)¶
Defaults: DRIFT_CALIBRATION_RUNS=5, CLEAN_EDIT_RUNS=3,
STRESS_EDIT_RUNS=2, RUN_ERROR_INJECTION=true.
Batch path (default for small/medium):
- Setup baseline: 1 task
- Preset-derivation runs + preset generation: 6 tasks
- Batch edits: 1 task
- evaluate edits: 20 tasks
- Error injection: 10 tasks
Total: ~38 tasks/model (varies with overrides).
Per-edit path (large/MoE or PACK_USE_BATCH_EDITS=false):
- Setup baseline: 1 task
- Preset-derivation runs + preset generation: 6 tasks
- Create edits: 8 tasks
- evaluate edits: 20 tasks
- Error injection: 10 tasks
Total: ~45 tasks/model (varies with overrides).
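The totals above follow directly from the defaults; a quick arithmetic check (the error-injection count of 10 is taken from the breakdown, not derived here):

```shell
#!/usr/bin/env bash
# Reproduce the per-model task totals from the documented defaults.
DRIFT_CALIBRATION_RUNS=5; CLEAN_EDIT_RUNS=3; STRESS_EDIT_RUNS=2

baseline=1
preset=$(( DRIFT_CALIBRATION_RUNS + 1 ))                        # runs + GENERATE_PRESET
evaluations=$(( 4 * CLEAN_EDIT_RUNS + 4 * STRESS_EDIT_RUNS ))   # 4 clean + 4 stress types
errors=10                                                       # error-injection tasks

echo "batch:    $(( baseline + preset + 1 + evaluations + errors ))"   # 1 CREATE_EDITS_BATCH task
echo "per-edit: $(( baseline + preset + 8 + evaluations + errors ))"   # 8 CREATE_EDIT tasks
```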
Execution phases¶
PHASE 0: Environment setup
- Dependency checks, GPU pool configuration, disk preflight
PHASE 1: Task queue initialization
- Generate tasks for all models, resolve initial dependencies
PHASE 2: GPU worker launch
- Spawn one worker per GPU, dynamic scheduling in loop
PHASE 3: Reports + verdict
- Compile reports into final verdict reports
Run directory layout¶
OUTPUT_DIR/
analysis/
determinism_repeats.json # optional (when --repeats is used)
reports/
final_verdict.txt
final_verdict.json
category_summary.json
guard_signal_summary.json
guard_intervention_summary.json
scenario_signal_summary.json
presets/
state/
model_revisions.json # pinned HF revisions (when --net 1)
progress.json
disk_pressure.json
tuned_edit_params.json # copy of PACK_TUNED_EDIT_PARAMS_FILE
queue/
pending/ ready/ running/ completed/ failed/
queue.lock
scheduler.lock
logs/
gpu_<id>.log
tasks/<task_id>.log
workers/
gpu_<id>.pid
gpu_<id>.heartbeat
gpu_<id>.status
gpu_reservations/
SHUTDOWN
<model_name>/
models/
baseline/
<edit_name>/
error_<type>/
reports/
calibration/
<edit_name>/run_<n>/
errors/<type>/
Some scenarios emit additional sidecar artifacts alongside evaluation.report.json
(for example reports/errors/rmt_norm_noise/rmt_probe.json or
reports/errors/ve_mlp_scale_skew/ve_probe.json). When present, run_pack.sh copies
these sidecars into the packaged evidence pack under reports/**/.
Run modes¶
- `--calibrate-only` / `PACK_SUITE_MODE=calibrate-only`
    - Preset derivation only mode.
    - Only promotes `SETUP_BASELINE`, `CALIBRATION_RUN`, and `GENERATE_PRESET` tasks.
    - The monitor exits after all `GENERATE_PRESET` tasks complete.
- `--run-only`
    - Continue a prior run after preset derivation. This is effectively `--resume` with `PACK_SUITE_MODE=full`.
- `--resume`
    - Reuses an existing queue and continues from where the run stopped.
Determinism vs throughput¶
PACK_DETERMINISM controls harness-level determinism:
# Throughput (default)
PACK_DETERMINISM=throughput ./scripts/evidence_packs/run_suite.sh --suite subset
# Strict
PACK_DETERMINISM=strict ./scripts/evidence_packs/run_suite.sh --suite subset
- Throughput: `NVIDIA_TF32_OVERRIDE=1`, `CUDNN_BENCHMARK=1`.
- Strict: `NVIDIA_TF32_OVERRIDE=0`, `CUDNN_BENCHMARK=0`, `CUBLAS_WORKSPACE_CONFIG=:4096:8`.
Network mode and model revisions¶
Evidence packs are offline by default:
- `PACK_NET=0` sets `INVARLOCK_ALLOW_NETWORK=0` and enables HF offline modes.
- `PACK_NET=1` enables downloads and writes `state/model_revisions.json` (ungated models only).
- Offline runs require `model_revisions.json`; missing revisions trigger a hard error during `SETUP_BASELINE`.
Use PACK_MODEL_REVISIONS_FILE to override the revisions path.
Disk and cache behavior¶
Large runs can be storage-heavy (baseline + edits + error models):
- Disk preflight estimates required storage and aborts early when insufficient.
- Override with `PACK_SKIP_DISK_PREFLIGHT=1` (not recommended).
- The minimum free space guard is `MIN_FREE_DISK_GB` (default 200).
- `PACK_BASELINE_STORAGE_MODE=snapshot_symlink` builds a local symlink tree that points into the Hugging Face cache snapshot. This avoids a second baseline copy under `OUTPUT_DIR`, but it requires one full model copy in `HF_HUB_CACHE` when that cache shares the output filesystem.
- `PACK_BASELINE_STORAGE_MODE=snapshot_copy` materializes a full baseline copy under `OUTPUT_DIR/models/<model>/baseline`.
- Baseline downloads prefer one weight format only. When both `.safetensors` and `.bin` weights are published, evidence packs download the safetensors set and ignore the `.bin` copy.
- HF caches default to `OUTPUT_DIR/.hf` (override with `HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE`).
For the default subset suite (mistralai/Mistral-7B-v0.1), the model-weight
budget is roughly:
- ~42 GB on the output filesystem with `snapshot_symlink` when `HF_HUB_CACHE` lives on the same filesystem as `OUTPUT_DIR` (one cached baseline + one clean edit peak + one error-model peak under cleanup mode).
- ~28 GB on the output filesystem with `snapshot_symlink` when `HF_HUB_CACHE` is on a separate volume.
- ~56 GB on the output filesystem with `snapshot_copy` on the same filesystem.
Those figures are for model weights only; the default preflight also requires
MIN_FREE_DISK_GB=200 headroom.
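The preflight decision itself is simple threshold logic. A minimal sketch (the function name is illustrative; the real preflight also estimates per-model storage before comparing):

```shell
#!/usr/bin/env bash
# Toy version of the free-space gate: compare available GB on the output
# filesystem against the MIN_FREE_DISK_GB threshold.
preflight() {
  local need_gb=$1 avail_gb=$2
  if (( avail_gb >= need_gb )); then
    echo "ok: ${avail_gb}GB free >= ${need_gb}GB required"
  else
    echo "insufficient: ${avail_gb}GB free < ${need_gb}GB required"
    return 1
  fi
}

# On a real host, avail_gb would come from the output filesystem, e.g.:
#   df -BG --output=avail "$OUTPUT_DIR" | tail -1 | tr -dc '0-9'   (GNU df)
preflight 200 500
preflight 200 150 || true
```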
Evidence pack packaging and verification¶
run_pack.sh builds a portable pack:
- Copies `reports/final_verdict.{txt,json}` plus verdict sidecars (`category_summary`, `guard_signal_summary`, `scenario_signal_summary`) and key `analysis/*` artifacts.
- Collects all reports into `evidence_pack/reports/...`.
- Generates `manifest.json`, `checksums.sha256`, optional `manifest.signature.json`.
- Writes pack-contained provenance metadata such as `metadata/source_repo.json` and `metadata/environment.json` before sealing the pack.
- Stages the pack in a hidden sibling temporary directory and renames it into place only after sealing succeeds, so failed builds do not leave partial `evidence_pack/` output behind.
- Optional HTML export can be disabled with `PACK_SKIP_HTML=1`.
Packaging flow¶
run_pack.sh
├─ run_suite.sh → OUTPUT_DIR
├─ collect reports + sidecars
├─ write manifest + checksums
└─ optional HTML + package-native signature
invarlock advanced evidence-pack verify checks the pack:
- Verifies `manifest.json` binds `checksums.sha256` via `checksums_sha256_digest`.
- Verifies digest-backed manifest references (`subject`, `invocation.config_source`, `environment`, and `materials`) against on-pack files.
- Verifies `checksums.sha256` (and thus all hashed artifacts).
- Verifies the package-native Ed25519 signature bundle when present; `--strict` requires it.
- Enforces "no extra files" semantics in `--strict` mode.
- Runs `invarlock verify` across all bundled reports (JSON output optional) with runtime-manifest enforcement on; each packaged `evaluation.report.json` carries an adjacent `runtime.manifest.json`.
- Returns structured exit codes so callers can distinguish usage, missing-file, manifest-format, signature, integrity, and report-verification failures.
The installed-wheel package-native CLI is self-contained:
- `invarlock advanced evidence-pack keygen` generates Ed25519 signing keys.
- `invarlock advanced evidence-pack build --signing-key ...` emits `manifest.signature.json`.
- `invarlock advanced evidence-pack verify` validates the signature bundle in-process and does not depend on external signature binaries.
The repo shell harness remains a separate maintainer path, but it uses the same package-native Ed25519 manifest-signature format as the installed CLI.
Maintainer evidence-pack packaging also treats source provenance as fail-closed:
- `run_pack.sh` writes `metadata/source_repo.json` from the active Git checkout.
- If `git` is unavailable or the repository metadata cannot be collected, pack creation stops instead of silently emitting partial provenance.
- If you need to package from a detached artifact tree, write a complete `metadata/source_repo.json` first rather than relying on fallback inference.
Remote setup helper¶
scripts/evidence_packs/lib/setup_remote.sh is an optional bootstrap script for
fresh GPU hosts. It clones the repo, creates a venv, installs PyTorch and
InvarLock, and leaves the host ready to run run_pack.sh.
Operational guidance for remote evidence-pack work:
- Prefer a fresh clone or work tree per campaign instead of reusing an older editable-install checkout.
- If you intentionally run from a work tree that is not the editable install behind `.venv`, either reinstall that work tree or export `PYTHONPATH=src` so `invarlock` resolves to the intended source tree.
- `run_suite.sh` and `run_pack.sh` default to `SKIP_FLASH_ATTN=true` and `PACK_BASELINE_STORAGE_MODE=snapshot_copy` for bulk default runtime-container runs.
- Bulk evidence-pack runs fail fast unless `INVARLOCK_ALLOW_REMOTE_CODE=1` is set.
- Export non-default runtime roots before launching the suite when you expect them inside delegated container jobs: `INVARLOCK_CONFIG_ROOT`, `HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE`, `TRANSFORMERS_CACHE`, `TMPDIR`, `TMP`.
- If a staged preset or profile uses `!include` outside its config directory, set `INVARLOCK_ALLOW_CONFIG_INCLUDE_OUTSIDE=1` on the remote host before the evidence-pack entrypoint; the default runtime-container launcher rejects that config graph before container start when the override is missing.
- After Qwen2.5-14B campaigns, run `scripts/evidence_packs/run_qwen14_sentinels.sh` from the same fresh work tree to validate saved-model direct evaluate and the public quant smoke.
Recommended remote validation checklist after security-default changes:
- Run an evidence-pack subset lane with explicit external `HF_HOME` and `INVARLOCK_CONFIG_ROOT` overrides.
- Run one delegated `invarlock evaluate` with external `--edit-config`, `TMPDIR`, and `INVARLOCK_EXPORT_DIR` roots.
- Run one `scripts/model_evidence_sweep.py --execution-mode container` lane with an external output root and confirm the published report path is populated.
Common knobs for the setup script:
- `REPO_DIR`, `REPO_URL`, `BRANCH`, `PYTHON_BIN`, `VENV_DIR`.
- `TORCH_INDEX_URL`, `TORCH_PACKAGES`, `PACK_SKIP_TORCH_CHECK`.
- `HF_HOME`, `HF_HUB_CACHE`, `HF_DATASETS_CACHE`.
Tuning reference¶
Core configuration¶
| Variable | Default | Description |
|---|---|---|
| `PACK_SUITE` | `subset` | Suite name (`subset`, `showcase`, `workshop3`, or `full`) |
| `PACK_NET` | `0` | Enable network preflight/downloads |
| `PACK_OUTPUT_DIR` | unset | Sets `OUTPUT_DIR` when provided |
| `OUTPUT_DIR` | auto | `./evidence_pack_runs/<suite>_<timestamp>` via entrypoint |
| `PACK_OUTPUT_DIR_ABSOLUTE` | `false` | Normalize `OUTPUT_DIR` to absolute path |
| `PACK_SUITE_MODE` | `full` | `full`, `calibrate-only`, or `run-only` |
| `PACK_DETERMINISM` | `throughput` | Harness determinism mode |
| `PACK_REPEATS` | `0` | Determinism repeat metadata |
| `PACK_MODEL_REVISIONS_FILE` | `OUTPUT_DIR/state/model_revisions.json` | Revisions path |
| `PACK_USE_BATCH_EDITS` | `auto` | Force/disable batch edit creation |
| `RESUME_MODE` | `true` | Skip completed steps when outputs exist |
Hardware selection¶
| Variable | Default | Description |
|---|---|---|
| `CUDA_VISIBLE_DEVICES` | unset | Explicit GPU pool (comma-separated) |
| `GPU_ID_LIST` | unset | Alternate GPU pool list |
| `NUM_GPUS` | auto | Number of GPUs to use (clamped to pool) |
| `GPU_MEMORY_GB` | auto | Per-GPU memory hint for planning |
| `GPU_MEMORY_PER_DEVICE` | `GPU_MEMORY_GB` | Per-device memory for `required_gpus` |
| `GPU_MIN_FREE_GB` | `10` | Minimum free VRAM for eligibility |
| `GPU_REQUIRE_IDLE` | `true` | Require GPUs with no compute processes |
| `GPU_CACHE_TTL` | `5` | GPU cache TTL (seconds) |
| `GPU_RESERVATION_TTL` | `60` | Reservation TTL (seconds) |
| `GPU_RESERVATION_LOCK_TIMEOUT` | `5` | Reservation lock timeout (seconds) |
Model overrides¶
| Variable | Default | Description |
|---|---|---|
| `MODEL_1`–`MODEL_8` | suite-defined | Override model slots; empty disables |
InvarLock settings¶
| Variable | Default | Description |
|---|---|---|
| `INVARLOCK_DATASET` | `wikitext2` | Dataset provider |
| `INVARLOCK_DATASET_PROVIDER_YAML` | unset | Raw YAML mapping for `dataset.provider` (advanced; overrides provider kind + args) |
| `INVARLOCK_DATASET_PROVIDER_JSON` | unset | Raw JSON object for `dataset.provider` (advanced; overrides provider kind + args) |
| `INVARLOCK_HF_DATASET_NAME` | `allenai/c4` | HF dataset name when `INVARLOCK_DATASET=hf_text` |
| `INVARLOCK_HF_CONFIG_NAME` | `en` (for `allenai/c4`) | HF dataset config when `INVARLOCK_DATASET=hf_text` |
| `INVARLOCK_HF_TEXT_FIELD` | `text` | Text field when `INVARLOCK_DATASET=hf_text` |
| `INVARLOCK_HF_MAX_SAMPLES` | `2000` | Max rows consumed when `INVARLOCK_DATASET=hf_text` |
| `INVARLOCK_HF_TRUST_REMOTE_CODE` | unset | Pass `trust_remote_code` to HF `load_dataset` (not needed for `allenai/c4` Parquet) |
| `INVARLOCK_HF_CACHE_DIR` | unset | `datasets` `cache_dir` override when `INVARLOCK_DATASET=hf_text` |
| `INVARLOCK_LOCAL_JSONL_FILE` | unset | JSONL file path when `INVARLOCK_DATASET=local_jsonl` |
| `INVARLOCK_LOCAL_JSONL_PATH` | unset | JSONL file/dir path when `INVARLOCK_DATASET=local_jsonl` |
| `INVARLOCK_LOCAL_JSONL_DATA_FILES` | unset | JSONL glob/list when `INVARLOCK_DATASET=local_jsonl` |
| `INVARLOCK_LOCAL_JSONL_TEXT_FIELD` | `text` | Text field when `INVARLOCK_DATASET=local_jsonl` |
| `INVARLOCK_LOCAL_JSONL_MAX_SAMPLES` | `2000` | Max rows consumed when `INVARLOCK_DATASET=local_jsonl` |
| `INVARLOCK_TIER` | `balanced` | Guard tier preset |
| `INVARLOCK_PREVIEW_WINDOWS` | `32` | Preview windows |
| `INVARLOCK_FINAL_WINDOWS` | `32` | Final windows |
| `INVARLOCK_SEQ_LEN` | `512` | Sequence length |
| `INVARLOCK_STRIDE` | `256` | Stride |
| `INVARLOCK_EVAL_BATCH` | `32` | InvarLock batch size |
| `PACK_GUARDS_ORDER` | `invariants,spectral,rmt,variance,invariants` | Guards included in preset derivation and generated presets |
Primary metric acceptance/drift gates should be configured via profile/config (`primary_metric.acceptance_range`, `primary_metric.drift_band`), not env vars.
Tuned edit presets¶
| Variable | Default | Description |
|---|---|---|
| `PACK_TUNED_EDIT_PARAMS_FILE` | unset | JSON file with tuned clean edit params (required when `CLEAN_EDIT_RUNS>0`) |
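The schema of the tuned-params file is not documented in this guide. As a purely hypothetical sketch, with the layout and field names inferred from the clean-edit parameter table above (they may not match the real schema):

```shell
#!/usr/bin/env bash
# Hypothetical tuned-params file. The top-level per-model layout and the
# field names (bitwidth, group_size, format, prune_level, rank) are
# assumptions inferred from the clean-edit table, not the real schema.
cat > tuned_edit_params.example.json <<'EOF'
{
  "mistralai/Mistral-7B-v0.1": {
    "quant_rtn":       {"bitwidth": 8, "group_size": 128},
    "fp8_quant":       {"format": "e4m3"},
    "magnitude_prune": {"prune_level": 0.1},
    "lowrank_svd":     {"rank": 256}
  }
}
EOF
# Sanity-check that the file parses as JSON.
python3 -m json.tool tuned_edit_params.example.json > /dev/null && echo "valid JSON"
```

Consult an existing tuned-params file (or the suite's validation errors) for the authoritative shape before relying on this layout.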
Preset derivation reuse¶
| Variable | Default | Description |
|---|---|---|
| `PACK_CALIBRATION_PRESET_DIR` | unset | Directory containing `calibrated_preset_<model>.yaml/json` to reuse; skips preset-derivation runs |
| `PACK_CALIBRATION_PRESET_FILE` | unset | Single preset file applied to all models (advanced) |
Experiment controls¶
| Variable | Default | Description |
|---|---|---|
| `DRIFT_CALIBRATION_RUNS` | `5` | Preset-derivation run count |
| `CLEAN_EDIT_RUNS` | `3` | Clean edit evaluate runs |
| `STRESS_EDIT_RUNS` | `2` | Stress edit evaluate runs |
| `RUN_ERROR_INJECTION` | `true` | Enable error injection |
Storage and memory planning¶
| Variable | Default | Description |
|---|---|---|
| `PACK_BASELINE_STORAGE_MODE` | `snapshot_symlink` | Baseline storage mode (`snapshot_symlink`, `snapshot_copy`, or `save_pretrained`) |
| `MIN_FREE_DISK_GB` | `200` | Disk pressure threshold |
| `PACK_SKIP_DISK_PREFLIGHT` | `0` | Skip storage preflight |
| `CUDA_MEMORY_FRACTION` | `0.92` | Target GPU memory fraction |
| `MODEL_LOAD_OVERHEAD_GB` | `4` | Load overhead for planning |
| `EDIT_OVERHEAD_GB` | `8` | Per-edit overhead for planning |
| `BATCH_EDIT_OVERHEAD_GB` | `8` | Batch edit overhead |
| `INVARLOCK_OVERHEAD_GB` | `6` | InvarLock overhead |
Worker + reliability controls¶
| Variable | Default | Description |
|---|---|---|
| `WORKER_HEARTBEAT_INTERVAL` | `30` | Heartbeat interval (seconds) |
| `WORKER_IDLE_SLEEP` | `5` | Sleep when idle (seconds) |
| `WORKER_MAX_FAILURES` | `10` | Stop worker after N failures |
| `WORKER_TIMEOUT` | `2700` | Worker heartbeat timeout (seconds) |
| `CANCEL_BLOCKED_TASKS_GRACE_SECONDS` | `90` | Fail blocked tasks after grace |
| `TASK_TIMEOUT_DEFAULT` | `21600` | Default task timeout (seconds) |
| `TASK_TIMEOUT_<TASKTYPE>` | unset | Per-task timeout override |
Packaging and verification¶
| Variable | Default | Description |
|---|---|---|
| `PACK_DIR` | `OUTPUT_DIR/evidence_pack` | Evidence pack output dir |
| `PACK_SIGN_MANIFEST` | `1` | Sign `manifest.json` with a package-native Ed25519 key (auto-generated if `PACK_SIGNING_KEY` is unset) |
| `PACK_SIGNING_KEY` | unset | Optional Ed25519 private key PEM for deterministic signer identity |
| `PACK_SKIP_HTML` | `0` | Skip HTML rendering |
| `PACK_VERIFY_PROFILE` | `dev` | Profile for `invarlock verify` |
Troubleshooting¶
Missing model revisions (offline)¶
If offline runs fail with “requires model revisions”, run a preflight:
./scripts/evidence_packs/run_suite.sh --suite subset --net 1
Or point to an existing revisions file with PACK_MODEL_REVISIONS_FILE.
OOM on large models¶
- Lower `GPU_MEMORY_PER_DEVICE` so the planner requests more GPUs.
- Disable batch edits: `PACK_USE_BATCH_EDITS=false`.
- Reduce InvarLock batch/seq_len (e.g., `INVARLOCK_EVAL_BATCH=16 INVARLOCK_SEQ_LEN=256`).
- Increase memory overhead knobs (`MODEL_LOAD_OVERHEAD_GB`, `EDIT_OVERHEAD_GB`).
Disk pressure / preflight failures¶
Check state/disk_pressure.json and ensure the output filesystem has headroom.
Use `MIN_FREE_DISK_GB=0` or `PACK_SKIP_DISK_PREFLIGHT=1` only if you accept the risk of partial artifacts.
Task timeouts¶
Increase the default or per-task timeout:
TASK_TIMEOUT_DEFAULT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset
TASK_TIMEOUT_CREATE_EDIT=28800 ./scripts/evidence_packs/run_suite.sh --suite subset
Stuck queues or dead workers¶
- Inspect `state/progress.json` and `workers/gpu_<id>.status`.
- Check worker logs: `logs/gpu_<id>.log` and `logs/tasks/<task_id>.log`.
- Re-run with `--resume` to recover from a crash.