Skip to content

Observability

Overview

Aspect Details
Purpose Runtime monitoring, health checking, and telemetry collection for InvarLock operations.
Audience Operators running production workloads and developers debugging performance issues.
Supported surface MonitoringManager, HealthChecker, TelemetryCollector, MetricsRegistry.
Requires psutil (included in base install); torch for GPU monitoring.
Network Fully offline; no external telemetry is sent.
Source of truth src/invarlock/observability/*.py.

Quick Start

from invarlock.observability import MonitoringManager, MonitoringConfig

# Start monitoring with default config
config = MonitoringConfig(
    metrics_interval=10.0,
    health_check_interval=30.0,
)
monitor = MonitoringManager(config)
monitor.start()

# Record an operation
monitor.record_operation("evaluate", duration=45.2, model_id="gpt2")

# Get status
status = monitor.get_status()
print(status["health_status"])

# Stop when done
monitor.stop()

Concepts

  • Metrics: counters, gauges, and histograms tracked during runs.
  • Health checks: periodic checks for system resources (CPU, memory, disk, GPU).
  • Telemetry: operation tracking with start/end times and metadata.
  • Alerting: configurable thresholds for warnings (not external notifications).

Component Architecture

Component Responsibility
MonitoringManager Central coordinator; starts/stops monitoring threads.
MetricsRegistry Stores counters, gauges, and histograms by name.
HealthChecker Runs periodic health checks on system components.
TelemetryCollector Tracks operation lifecycles (start → end).
ResourceMonitor Collects CPU, memory, disk, and GPU usage.
PerformanceMonitor Tracks operation durations and percentiles.
AlertManager Evaluates threshold rules and records violations.

Reference

MonitoringConfig

from invarlock.observability import MonitoringConfig

config = MonitoringConfig(
    # Collection intervals (seconds)
    metrics_interval=10.0,
    health_check_interval=30.0,
    resource_check_interval=5.0,

    # Data retention
    metrics_retention_hours=24,
    max_events=10000,

    # Alerting
    enable_alerting=True,
    alert_channels=[],

    # Export settings
    prometheus_enabled=False,
    prometheus_port=9090,
    json_export_enabled=True,
    json_export_path="./monitoring",

    # Resource thresholds (percent)
    cpu_threshold=80.0,
    memory_threshold=85.0,
    gpu_memory_threshold=90.0,

    # Performance monitoring
    latency_percentiles=[50, 90, 95, 99],
    slow_request_threshold=30.0,
)

Default Metrics

InvarLock registers these metrics automatically:

Metric Type Description
'invarlock.operations.total' Counter Total operations by type and status.
'invarlock.errors.total' Counter Total errors by type.
'invarlock.edits.applied' Counter Total edits applied.
'invarlock.guards.triggered' Counter Guard triggers.
'invarlock.operation.duration' Histogram Operation duration distribution.
'invarlock.edit.duration' Histogram Edit operation duration.
'invarlock.guard.duration' Histogram Guard execution duration.
'invarlock.memory.usage' Gauge Memory usage.
'invarlock.gpu.memory.usage' Gauge GPU memory usage.
'invarlock.cpu.usage' Gauge CPU usage.
'invarlock.model.parameters' Gauge Model parameter count.
'invarlock.model.size_mb' Gauge Model size in MB.
'invarlock.model.loads' Counter Model loads.

Health Checks

The HealthChecker runs these checks by default:

Check Status Thresholds Details
memory WARNING > 80%, CRITICAL > 90% System RAM usage.
cpu WARNING > 85%, CRITICAL > 95% CPU utilization.
disk WARNING > 85%, CRITICAL > 95% Disk space on /.
gpu WARNING > 85%, CRITICAL > 95% GPU memory (if CUDA available).
pytorch CRITICAL on failure PyTorch functionality test.

InvarLock-specific checks (via InvarLockHealthChecker):

Check Description
adapters Verifies adapter classes can be instantiated.
guards Verifies guard classes can be instantiated.
dependencies Checks for torch, transformers, numpy, psutil.

Health Status API

from invarlock.observability import InvarLockHealthChecker

checker = InvarLockHealthChecker()

# Check all components
results = checker.check_all()
for name, health in results.items():
    print(f"{name}: {health.status.value} - {health.message}")

# Get overall status
overall = checker.get_overall_status()
print(f"Overall: {overall.value}")

# Get summary
summary = checker.get_summary()
print(summary["status_counts"])

TelemetryCollector

Track operation lifecycles:

from invarlock.observability import MonitoringManager, TelemetryCollector

monitor = MonitoringManager()
telemetry = TelemetryCollector(monitor)

# Start tracking
op_id = telemetry.start_operation(
    "op-123",
    "evaluate",
    model_id="gpt2",
    profile="ci",
)

# ... perform operation ...

# End tracking
telemetry.end_operation(
    op_id,
    status="success",
    ratio_vs_baseline=1.02,
)

# Get stats
stats = telemetry.get_operation_stats()
print(f"Success rate: {stats['success_rate']:.2%}")

ResourceMonitor

from invarlock.observability.core import ResourceMonitor, MonitoringConfig
from invarlock.observability import MetricsRegistry

config = MonitoringConfig()
metrics = MetricsRegistry()
resource_monitor = ResourceMonitor(metrics, config)

# Get usage
usage = resource_monitor.get_current_usage()
print(f"CPU: {usage['cpu_percent']:.1f}%")
print(f"Memory: {usage['memory_percent']:.1f}%")
print(f"Disk free: {usage['disk_free_gb']:.1f} GB")

# Check thresholds
warnings = resource_monitor.check_thresholds()
for warning in warnings:
    print(f"WARNING: {warning}")

JSON Export

When json_export_enabled=True, metrics are periodically exported to json_export_path:

ls ./monitoring/
# metrics_20260115_120000.json
# metrics_20260115_121000.json

Health HTTP Endpoint

For containerized deployments:

from invarlock.observability.health import create_health_endpoint

HTTPServer, HealthHandler = create_health_endpoint()
if HTTPServer:
    server = HTTPServer(("0.0.0.0", 8080), HealthHandler)
    server.serve_forever()

Access at http://localhost:8080/health:

{
  "overall_status": "healthy",
  "total_components": 8,
  "status_counts": {
    "healthy": 8,
    "warning": 0,
    "critical": 0,
    "unknown": 0
  },
  "components": {
    "memory": { "status": "healthy", "message": "Memory usage normal: 45.2%" },
    "cpu": { "status": "healthy", "message": "CPU usage normal: 12.3%" }
  }
}

Troubleshooting

  • High memory warnings: reduce batch size or use --device cpu for smaller footprint.
  • GPU memory critical: clear CUDA cache between runs or use chunked snapshots.
  • Health check failures: run invarlock doctor for detailed diagnostics.
  • Missing metrics: ensure monitoring is started before operations.

Observability in CLI

The CLI doesn't start full monitoring by default, but you can enable telemetry:

# Enable single-line telemetry summary
INVARLOCK_TELEMETRY=1 invarlock evaluate --baseline gpt2 --subject gpt2

This uses the runtime container by default unless a host-side workflow adds --execution-mode host.

Reports include telemetry under report.metrics:

  • latency_ms_per_tok — mean latency per token
  • memory_mb_peak — peak memory during run
  • throughput_tok_per_s — average throughput

reports copy these to the telemetry block.