Recorder Docs

Proof recording and tamper-evident evidence documentation.

Recorder System-Test and Validation Architecture

Audience: Engineers maintaining release confidence, deterministic validation, and audit-ready test artifacts.


Table of Contents

  1. Executive Overview
  2. System-Test Contract
  3. Registry-Driven Inventory
  4. Coverage Matrix and Gap Model
  5. Artifact and Transcript Contract
  6. Execution Tooling
  7. Suite Structure
  8. File-by-File Cross Reference

Executive Overview

Recorder system-tests validate behavior through real production boundaries (CLI and adapter ingest), emit structured artifacts, and track coverage using registry + gaps TOML files. The harness is designed to keep runs deterministic and inspectable.

F:system-tests/README.md L12-L39 F:system-tests/AGENTS.md L12-L35


System-Test Contract

System-tests require:

  • fail-closed assertions,
  • no sleep-based correctness,
  • production command and data surfaces,
  • mandatory per-test artifact emission.

F:system-tests/AGENTS.md L14-L37

Feature gating keeps system-tests explicit in CI and local runs. F:system-tests/README.md L34-L45
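
As an illustration of the fail-closed rule above, the following Python sketch treats every condition it cannot positively confirm (unparsable input, wrong shape, missing or unknown status) as a failure instead of defaulting to a pass. The `status` field and the `"pass"` value are hypothetical names chosen for the example; they are not taken from the harness.

```python
import json


class FailClosedError(AssertionError):
    """Raised when a result cannot be positively confirmed as passing."""


def assert_passed(summary_text: str) -> None:
    """Pass only on an explicit, well-formed success; everything else fails."""
    try:
        summary = json.loads(summary_text)
    except json.JSONDecodeError as exc:
        raise FailClosedError(f"summary is not valid JSON: {exc}") from exc
    if not isinstance(summary, dict):
        raise FailClosedError("summary must be a JSON object")
    # Hypothetical field name: a missing or unknown status counts as failure.
    status = summary.get("status")
    if status != "pass":
        raise FailClosedError(f"status is {status!r}, expected 'pass'")
```

The design point is that there is no branch in which an unexpected input falls through to success: the only way out without an exception is the explicit match.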


Registry-Driven Inventory

system-tests/test_registry.toml is authoritative for:

  • categories,
  • per-test metadata,
  • command entrypoints,
  • required artifacts,
  • estimated runtime.

F:system-tests/test_registry.toml L5-L14 F:system-tests/test_registry.toml L15-L480


Coverage Matrix and Gap Model

system-tests/TEST_MATRIX.md defines P0/P1/P2 objective coverage snapshots. F:system-tests/TEST_MATRIX.md L12-L42

system-tests/test_gaps.toml tracks open/closed gaps with explicit acceptance criteria and category/priority mapping. F:system-tests/test_gaps.toml L4-L74 F:system-tests/test_gaps.toml L75-L190
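
The open/closed gap model with priority mapping might be represented as below. The `Gap` type and its field names are hypothetical; the sketch only shows how already-parsed gap records could be validated and grouped, not how gap_tracker.py actually does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gap:
    gap_id: str
    priority: str  # "P0", "P1", or "P2"
    status: str  # "open" or "closed"
    acceptance_criteria: tuple[str, ...]


def open_gaps_by_priority(gaps: list[Gap]) -> dict[str, list[str]]:
    """Group the ids of open gaps by priority, rejecting malformed records."""
    grouped: dict[str, list[str]] = {"P0": [], "P1": [], "P2": []}
    for gap in gaps:
        if gap.priority not in grouped:
            raise ValueError(f"{gap.gap_id}: unknown priority {gap.priority!r}")
        if gap.status not in ("open", "closed"):
            raise ValueError(f"{gap.gap_id}: unknown status {gap.status!r}")
        if not gap.acceptance_criteria:
            raise ValueError(f"{gap.gap_id}: acceptance criteria are required")
        if gap.status == "open":
            grouped[gap.priority].append(gap.gap_id)
    return grouped
```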

As of 2026-02-07, all tracked P1 core gaps are closed (composite selector dedup, partial-segment proof anchors, cross-segment mismatch fail-closed, attachment closure fail-closed, SQLite/in-memory parity).

As of 2026-02-14, P2 stress/perf coverage is also closed with deterministic core and sidecar performance artifacts and profile-matrix CI gating.

As of the same date, OSS Launch 0 security findings are system-test gated for CLI boundary limits/policies, manifest structural tamper paths, SQLite corruption materialization, and single-open-segment enforcement.

As of the same date, sidecar HTTP lifecycle and restart idempotency persistence are system-test gated with real sidecar subprocess + TCP transport workflows.

As of 2026-02-08, the CLI OSS world-class expansion follow-up is fully mirrored in system-tests: recorder-id shape validation parity, attachment-recording hostile-input fail-closed checks, auto-seal duration/combined lifecycle lanes, query JSON pagination + over-limit guardrails, and Decision Gate CLI ingest-fixture strict-fail/success command paths.

As of 2026-02-08, sidecar container packaging is also system-test gated via the sidecar_docker suite (asset hardening checks + Docker Compose e2e lane with explicit skip/fail policy via RECORDER_REQUIRE_DOCKER).

As of 2026-02-08, the Docker Compose lane additionally validates containerized startup/readiness probe behavior (/startup, /ready) before and after segment-open lifecycle transitions.

As of 2026-02-13, the Docker Compose lane validates secret bootstrap hardening: token source material is provided via Docker secret input, then materialized into a tmpfs token file path inside the container before sidecar startup.

As of 2026-02-10, operations coverage includes a contract-generation lane that asserts typed SDK/OpenAPI projection artifacts (sdk/types.json, sdk/methods.json, strict Bundle/VerificationVerdict OpenAPI components) for downstream SDK generator readiness.

As of 2026-02-11, the operations suite is module-split (operations/{query,config,locale,contract}.rs) to keep test files scoped and reviewable while preserving deterministic artifact/reporting behavior.

As of 2026-02-11, the security and integration suites are also file-split (security_{support,limits,signer,contract_bundle}.rs, integration_openclaw_{support,core,additional}.rs, integration_decision_gate_{support,contract_shape,ingest}.rs) to keep test surfaces reviewable while preserving deterministic behavior and run metadata.

As of the same date, security coverage includes SQLite operation queue saturation fail-closed validation with deterministic saturation summary artifacts (sqlite_operation_queue_saturation_fails_closed).

As of 2026-02-14, performance coverage additionally includes:

  • performance::stress_concurrent_append_query_sqlite,
  • performance::performance_bundle_materialization_smoke,
  • sidecar_perf::sidecar_perf_smoke_generates_gate_report,
  • sidecar_capacity::sidecar_capacity_sweep_generates_report.

These tests are accompanied by profile-matrix orchestration (scripts/ci/perf_matrix.py) and release gating (scripts/ci/perf_gate.py) against committed baseline policy files (system-tests/perf_baselines/*.json).

The sidecar perf model is split into two explicit lanes:

  • regression lane (sidecar_perf): fixed deterministic workload for CI anti-regression safety;
  • capacity lane (sidecar_capacity): release-only saturation sweep used to characterize max sustainable ingest throughput and enforce pinned-runner floors.

The sidecar regression/capacity lanes are aligned with the current single-writer v1 contract: stream-scoped query requests include required tenant_id and recorder_id, and envelope writes include canonical stream identity fields.

As of 2026-02-14 (perf architecture split), both lanes emit explicit workload.* metadata and lane identity (regression/capacity) in artifacts; capacity reports additionally emit max_sustainable_rps, knee_rps, and error/latency-at-knee fields.

As of 2026-02-15, benchmark orchestration is standardized through scripts/perf/run.py presets (smoke, regression, capacity) with schema-v2 normalized report emission under target/perf/v2/*.json, triage output under target/perf/triage/latest.md, and capability-aware gate evaluation via the same baseline policy surface.

As of the same date, core macro performance reports are emitted both at the legacy path target/performance/latest.json and at the per-suite paths target/performance/stress/latest.json and target/performance/bundle/latest.json to prevent suite overwrite ambiguity.

As of the same date (Phase 3 expansion), the sidecar regression lane executes multi-trial timed windows and emits percentile confidence-spread metadata, the core performance suite emits query-cardinality and SQLite durability sweeps plus bundle scale-curve and deterministic replay hash-equality metrics, and the capacity lane emits explicit authoritative vs non-authoritative gate labeling.

As of 2026-02-16, sidecar regression memory metrics are split into hot-path high-water (rss_growth_bytes) and quiesced retained-growth (rss_growth_quiesced_bytes) signals, with rss_reclaim_ratio emitted for allocator high-water context.

As of 2026-02-16, sidecar durability characterization coverage includes post-saturation graceful-stop restart timing (restart_ready_ms) and restart-correctness hardening for rolled parallel load paths; integration coverage now explicitly guards against multi-open-segment persistence across restart boundaries.

As of the same date, durability measurement strictness captures live wal_size_bytes plus coupled PRAGMA wal_checkpoint(NOOP) state per case and emits durability_wal_ratio_vs_1000_comparable so WAL-ratio decisions can fail closed when baseline WAL sampling is non-comparable.

As of the same date, sidecar perf coverage also includes an informational retained-memory attribution lane (sidecar_perf_rss_retained_memory_attribution_characterization, ignored by default due to runtime) that emits target/sidecar-perf/rss_attribution_latest.json with sequential/parallel quiesce-window RSS checkpoints and smaps_rollup class deltas.

As of the same date, sidecar perf coverage also includes an ignored ingest-tuning characterization lane (sidecar_perf_rss_ingest_tuning_characterization) that sweeps ingest queue and batch knobs across sequential/parallel workers, emits target/sidecar-perf/rss_ingest_tuning_latest.json, and records a data-derived mechanism classification (queue_backlog_dominated, batch_churn_dominated, or mixed) plus mitigation recommendation metadata.
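
Release gating against a committed baseline policy, as described above, can be pictured with the following sketch. The policy shape (`min`/`max` bounds per metric) and the numeric thresholds are assumptions for illustration; only the metric names max_sustainable_rps and rss_growth_quiesced_bytes come from this document, and the real logic in scripts/ci/perf_gate.py may differ.

```python
import math


def evaluate_gate(report: dict, policy: dict) -> list[str]:
    """Return gate violations; an empty list means the gate passes."""
    violations = []
    if not policy:
        # Fail closed: an empty policy must not silently approve a report.
        violations.append("policy is empty")
    for metric, bounds in sorted(policy.items()):
        value = report.get(metric)
        numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not numeric or not math.isfinite(value):
            # Missing, non-numeric, NaN, or infinite values cannot pass.
            violations.append(f"{metric}: missing or non-numeric value")
            continue
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{metric}: {value} is below floor {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{metric}: {value} is above ceiling {bounds['max']}")
    return violations


# Hypothetical policy: a throughput floor and a retained-memory ceiling.
EXAMPLE_POLICY = {
    "max_sustainable_rps": {"min": 1000},
    "rss_growth_quiesced_bytes": {"max": 5_000_000},
}
```

A metric that is absent from the report is reported as a violation, so a lane that stops emitting a gated field cannot pass by omission.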

Artifact and Transcript Contract

Each test run emits at minimum:

  • summary.json,
  • summary.md,
  • tool_transcript.json.

TestReporter and TestArtifacts create deterministic run roots, enforce run-root reuse policy, and produce standardized summary documents. CliSession now resolves the recorder binary across both direct profile and deps binary layouts and prefers workspace target/debug resolution before build fallback, which keeps operation-lane command execution deterministic across cargo test and cargo nextest process layouts on Windows and Unix.

F:system-tests/tests/helpers/artifacts.rs L65-L131 F:system-tests/tests/helpers/artifacts.rs L133-L214 F:system-tests/tests/helpers/cli.rs L19-L107 F:system-tests/tests/helpers/cli.rs L109-L240
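
The minimum artifact set listed above lends itself to a simple post-run check. The sketch below is an illustration only: the `missing_artifacts` helper is hypothetical and is not part of the Rust helpers referenced here. It treats an absent, empty, or unparsable artifact as a problem.

```python
import json
from pathlib import Path

# The three artifacts every test run is required to emit.
REQUIRED_ARTIFACTS = ("summary.json", "summary.md", "tool_transcript.json")


def missing_artifacts(run_root: Path) -> list[str]:
    """List required artifacts that are absent, empty, or invalid JSON."""
    problems = []
    for name in REQUIRED_ARTIFACTS:
        path = run_root / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
            continue
        if path.suffix == ".json":
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except ValueError:  # covers JSON and text-decoding errors
                problems.append(name)
    return problems
```

A caller would fail the run whenever the returned list is non-empty, for example `assert not missing_artifacts(Path(run_root))`.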


Execution Tooling

Python helpers:

  • test_runner.py: registry-based execution with optional parallelism, per-test artifact roots, and manifest generation.
  • coverage_report.py: generates docs from registry + gaps.
  • gap_tracker.py: lists/shows/closes gaps and generates implementation prompts.

CI also enforces source-level coverage gates through scripts/ci/coverage_gate.sh, which runs cargo llvm-cov --workspace --all-features --all-targets --summary-only and fails closed when workspace or sidecar line coverage falls below policy thresholds, or when sidecar handler/middleware/server files drop below a minimum per-file line floor. The gate runs in both PR and main workflows after test lanes.

F:scripts/system_tests/test_runner.py L64-L112 F:scripts/system_tests/test_runner.py L119-L199 F:scripts/system_tests/coverage_report.py L43-L101 F:scripts/system_tests/gap_tracker.py L92-L140 F:scripts/ci/coverage_gate.sh F:.github/workflows/ci_pr.yml F:.github/workflows/ci_main.yml
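
Registry-based execution with optional parallelism, per-test artifact roots, and manifest generation can be sketched as follows. This is not the code in test_runner.py: the entry shape (`name`, `command` as an argument list), the `manifest.json` file name, and the manifest fields are all assumptions for illustration.

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_one(entry: dict, out_root: Path) -> dict:
    """Run one registry entry inside its own fresh artifact root."""
    test_root = out_root / entry["name"]
    test_root.mkdir(parents=True, exist_ok=False)  # refuse to reuse a root
    proc = subprocess.run(entry["command"], cwd=test_root, capture_output=True, text=True)
    return {"name": entry["name"], "exit_code": proc.returncode, "root": str(test_root)}


def run_all(entries: list[dict], out_root: Path, jobs: int = 1) -> dict:
    """Run entries with up to `jobs` workers and write a manifest."""
    with ThreadPoolExecutor(max_workers=max(1, jobs)) as pool:
        # pool.map yields results in input order, so the manifest order
        # does not depend on which test finishes first.
        results = list(pool.map(lambda e: run_one(e, out_root), entries))
    manifest = {
        "results": results,
        # An empty run is not a pass.
        "passed": bool(results) and all(r["exit_code"] == 0 for r in results),
    }
    (out_root / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Keeping result order tied to registry order rather than completion order is one way to make a parallel run's manifest stable between runs.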


Suite Structure

Suite modules cover:

  • smoke: CLI startup and help/version checks,
  • bundle: build/verify/inspect and tamper detection,
  • persistence: restart, determinism, and SQLite/in-memory parity checks,
  • operations: query ordering/cursor plus JSON pagination/limit guardrails and recorder-id + auto-seal config validation parity checks, plus typed contract-generation SDK projection checks,
  • security: bounded CLI input surfaces, malformed-identifier rejection, secure signer-file policy, signer-rotation recovery/corruption behavior, contract path safety, hostile bundle parse-boundary checks, and hostile record-with-attachments boundary checks, plus SQLite operation queue saturation fail-closed behavior with deterministic summary artifacts,
  • recorder: lifecycle plus auto-seal count/duration/combined behavior and attachment-recording persistence checks over the real CLI boundary,
  • sidecar: real sidecar process lifecycle over HTTP (record/query/build/verify) and restart-boundary idempotency replay/conflict persistence checks,
  • sidecar_perf: deterministic regression-lane sidecar perf artifact emission for CI gating,
  • sidecar_capacity: release-only capacity-lane sidecar saturation sweep for pinned-runner throughput characterization/gating,
  • sidecar_docker: Dockerfile/Compose/config hardening checks and Docker Compose build/up/down with containerized sidecar startup/readiness probe checks plus record/query workflow, including secret-source + tmpfs token bootstrap parity,
  • integration_openclaw: fixture-driven OpenClaw gateway/CLI ingest, signed/unsigned verification lanes, sequence-gap policy checks, sensitive field redaction, and bounded payload handling checks,
  • integration_decision_gate: fixture-driven Decision Gate MCP runpack flow ingest through the production recorder-decision-gate-adapter crate, signed/unsigned verification lanes, runpack-integrity strict-vs-anomaly policy checks (including manifest self-integrity recomputation), sensitive transcript-field redaction, bounded transcript payload handling checks, CLI decision-gate ingest-fixture command-path validation, and a fixture conformance gate that enforces canonical Decision Gate tool request/response shapes (including export-vs-verify checked_files semantics).

F:system-tests/tests/suites/smoke.rs L15-L43 F:system-tests/tests/suites/recorder.rs L20-L678 F:system-tests/tests/suites/bundle.rs L64-L684 F:system-tests/tests/suites/persistence.rs L24-L468 F:system-tests/tests/suites/operations/mod.rs F:system-tests/tests/suites/operations/query.rs F:system-tests/tests/suites/operations/config.rs F:system-tests/tests/suites/operations/locale.rs F:system-tests/tests/suites/operations/contract.rs F:system-tests/tests/suites/security.rs F:system-tests/tests/suites/security_support.rs F:system-tests/tests/suites/security_limits.rs F:system-tests/tests/suites/security_signer.rs F:system-tests/tests/suites/security_contract_bundle.rs F:system-tests/tests/suites/sidecar.rs F:system-tests/tests/suites/sidecar_docker.rs F:system-tests/tests/suites/integration_openclaw.rs F:system-tests/tests/suites/integration_openclaw_support.rs F:system-tests/tests/suites/integration_openclaw_core.rs F:system-tests/tests/suites/integration_openclaw_additional.rs F:system-tests/tests/suites/integration_decision_gate.rs F:system-tests/tests/suites/integration_decision_gate_support.rs F:system-tests/tests/suites/integration_decision_gate_contract_shape.rs F:system-tests/tests/suites/integration_decision_gate_ingest.rs F:Docs/architecture/recorder_openclaw_integration_architecture.md L1-L160 F:Docs/architecture/recorder_decision_gate_integration_architecture.md L1-L170
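
The sidecar_docker suite's explicit skip/fail policy via RECORDER_REQUIRE_DOCKER, combined with strict environment parsing, could look like the sketch below. The accepted values (`"1"` and `"0"`) and the default when the variable is unset are assumptions; the real parsing lives in system-tests/src/config/env.rs and may accept different values.

```python
from enum import Enum


class Outcome(Enum):
    RUN = "run"
    SKIP = "skip"
    FAIL = "fail"


def parse_require_docker(env: dict[str, str]) -> bool:
    """Strictly parse RECORDER_REQUIRE_DOCKER; unknown values are errors."""
    raw = env.get("RECORDER_REQUIRE_DOCKER")
    if raw is None:
        return False  # assumed default: Docker is optional when unset
    if raw == "1":
        return True
    if raw == "0":
        return False
    raise ValueError(f"RECORDER_REQUIRE_DOCKER must be '0' or '1', got {raw!r}")


def docker_lane_outcome(docker_available: bool, env: dict[str, str]) -> Outcome:
    """Decide whether the Docker lane runs, is skipped, or fails the run."""
    required = parse_require_docker(env)  # parse first so bad input always errors
    if docker_available:
        return Outcome.RUN
    return Outcome.FAIL if required else Outcome.SKIP
```

With this policy a missing Docker daemon produces a visible skip on a developer machine, while a CI runner that sets the variable turns the same condition into a failure.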


File-by-File Cross Reference

Area | File | Notes
Contract and standards | system-tests/AGENTS.md | Behavioral and artifact requirements for system-tests.
Execution overview | system-tests/README.md | How to run and extend suites.
Coverage snapshot | system-tests/TEST_MATRIX.md | P0/P1/P2 matrix.
Test registry | system-tests/test_registry.toml | Authoritative inventory and run commands.
Gap tracker data | system-tests/test_gaps.toml | Coverage gaps and acceptance criteria.
Artifact helper | system-tests/tests/helpers/artifacts.rs | Run-root and summary generation contract.
CLI helper | system-tests/tests/helpers/cli.rs | Real CLI command execution and transcript capture.
Sidecar helper | system-tests/tests/helpers/sidecar.rs | Real sidecar process start/stop and HTTP transcript capture.
Sidecar perf helper | system-tests/tests/helpers/perf_sidecar.rs | Shared keep-alive client, deterministic payload generation, and lane metadata/report utilities for sidecar perf suites.
Docker helper | system-tests/tests/helpers/docker.rs | Docker daemon/compose probes and command helpers for containerized lanes.
Sidecar suite | system-tests/tests/suites/sidecar.rs | Sidecar HTTP lifecycle (record/query/build/verify) and restart-idempotency validation.
Sidecar perf suites | system-tests/tests/suites/sidecar_perf.rs, system-tests/tests/suites/sidecar_capacity.rs | Regression lane and capacity lane artifact generation for sidecar ingest performance policy, plus ignored informational retained-memory attribution lane (target/sidecar-perf/rss_attribution_latest.json) and ingest-tuning characterization lane (target/sidecar-perf/rss_ingest_tuning_latest.json).
Perf orchestration | scripts/perf/run.py, scripts/perf/report_schema_v2.json, scripts/perf/collect_runner_fingerprint.py, scripts/perf/update_baseline.py | Canonical preset entrypoint, schema-v2 report contract, runner fingerprint capture, and governed baseline refresh workflow.
Perf profiling helper | scripts/perf/profile_hotspot.sh | Fast failing-suite-to-profiler workflow for hotspot triage.
Sidecar Docker suite | system-tests/tests/suites/sidecar_docker.rs | Sidecar container packaging hardening and Docker Compose workflow validation, including startup/readiness probes and secret-source tmpfs token bootstrap behavior.
OpenClaw integration suite | system-tests/tests/suites/integration_openclaw.rs, system-tests/tests/suites/integration_openclaw_support.rs, system-tests/tests/suites/integration_openclaw_core.rs, system-tests/tests/suites/integration_openclaw_additional.rs | Fixture-driven adapter ingest validation for gateway + CLI mock flows.
OpenClaw fixtures | system-tests/tests/fixtures/openclaw_gateway_mock_events.json | Gateway mock flow event fixture aligned to OpenClaw event schema.
OpenClaw fixtures | system-tests/tests/fixtures/openclaw_cli_mock_events.json | CLI fallback-style flow event fixture aligned to OpenClaw event schema.
OpenClaw integration architecture | Docs/architecture/recorder_openclaw_integration_architecture.md | Versioned mapping, redaction, and bounded payload policy contract.
Decision Gate production adapter | crates/recorder-decision-gate-adapter/src/adapter.rs | Canonical Decision Gate-to-Recorder mapping implementation exercised by system-tests.
Decision Gate integration suite | system-tests/tests/suites/integration_decision_gate.rs, system-tests/tests/suites/integration_decision_gate_support.rs, system-tests/tests/suites/integration_decision_gate_contract_shape.rs, system-tests/tests/suites/integration_decision_gate_ingest.rs | Fixture-driven MCP runpack flow validation for control-plane coupling.
Decision Gate fixture | system-tests/tests/fixtures/decision_gate_runpack_mock_flow.json | Mock runpack MCP flow fixture aligned to Decision Gate transcript and runpack manifest layout.
Decision Gate integration architecture | Docs/architecture/recorder_decision_gate_integration_architecture.md | Versioned MCP flow mapping, runpack integrity policy, and transcript redaction/bounds contract.
Env parsing | system-tests/src/config/env.rs | Strict environment parsing for test config.
Runner script | scripts/system_tests/test_runner.py | Registry-driven execution engine.
Coverage docs generator | scripts/system_tests/coverage_report.py | Generated testing docs pipeline.
Gap management script | scripts/system_tests/gap_tracker.py | Gap lifecycle tooling.