Recorder System-Test and Validation Architecture

Audience: Engineers maintaining release confidence, deterministic validation, and audit-ready test artifacts.

Executive Overview
System-Test Contract
Registry-Driven Inventory
Coverage Matrix and Gap Model
Artifact and Transcript Contract
Execution Tooling
Suite Structure
File-by-File Cross Reference

Executive Overview

Recorder system-tests validate behavior through real production boundaries (CLI and adapter ingest), emit structured artifacts, and track coverage using registry + gaps TOML files. The harness is designed to keep runs deterministic and inspectable.

F:system-tests/README.md L12-L39 F:system-tests/AGENTS.md L12-L35

System-Test Contract

System-tests require:

fail-closed assertions,
no sleep-based correctness,
production command and data surfaces,
mandatory per-test artifact emission.

F:system-tests/AGENTS.md L14-L37
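One way to read "no sleep-based correctness" together with "fail-closed assertions" is that a test waits on an observable condition under a hard deadline instead of sleeping for a fixed time and hoping the system is ready. The sketch below is a minimal illustration of that reading, not the harness's implementation; `wait_until` and its parameters are hypothetical names.

```python
import time


def wait_until(probe, timeout_secs: float, poll_secs: float = 0.01) -> None:
    """Poll `probe` until it returns True, or raise once the deadline passes.

    The short sleep only paces the polling loop. Correctness depends on the
    probe result and the deadline, never on a fixed delay being long enough.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if probe():
            return
        if time.monotonic() >= deadline:
            # Fail closed: an unmet condition is an error, not a silent pass.
            raise TimeoutError(f"condition not met within {timeout_secs}s")
        time.sleep(poll_secs)
```

A caller would pass a probe that checks a real readiness signal (for example, a file appearing or an endpoint answering) and let the `TimeoutError` fail the test.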

Feature gating keeps system-tests explicit in CI and local runs. F:system-tests/README.md L34-L45

Registry-Driven Inventory

system-tests/test_registry.toml is authoritative for:

categories,
per-test metadata,
command entrypoints,
required artifacts,
estimated runtime.

F:system-tests/test_registry.toml L5-L14 F:system-tests/test_registry.toml L15-L480
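To make the "authoritative inventory" idea concrete, the sketch below validates an already-parsed registry mapping (for example, the dict that `tomllib.load` returns on Python 3.11+) and fails closed on malformed entries. The key names (`categories`, `tests`, `name`, `category`, `command`, `required_artifacts`, `estimated_runtime_secs`) and the sample entries are assumptions for illustration; the real schema is whatever system-tests/test_registry.toml defines.

```python
# Hypothetical schema: these key names are assumed, not taken from the real file.
REQUIRED_TEST_KEYS = (
    "name",
    "category",
    "command",
    "required_artifacts",
    "estimated_runtime_secs",
)


def validate_registry(registry: dict) -> list:
    """Return the sorted test names, raising ValueError on any malformed entry."""
    categories = set(registry.get("categories", []))
    tests = registry.get("tests", [])
    if not tests:
        raise ValueError("registry declares no tests")
    seen = set()
    for entry in tests:
        missing = [key for key in REQUIRED_TEST_KEYS if key not in entry]
        if missing:
            label = entry.get("name", "<unnamed>")
            raise ValueError(f"test {label!r} is missing keys: {missing}")
        if entry["category"] not in categories:
            raise ValueError(f"test {entry['name']!r} uses an undeclared category")
        if entry["name"] in seen:
            raise ValueError(f"duplicate test name: {entry['name']!r}")
        seen.add(entry["name"])
    return sorted(seen)


# Illustrative data only; test names and commands are made up.
EXAMPLE_REGISTRY = {
    "categories": ["smoke", "bundle"],
    "tests": [
        {
            "name": "smoke_help",
            "category": "smoke",
            "command": "cargo test smoke_help",
            "required_artifacts": ["summary.json", "summary.md", "tool_transcript.json"],
            "estimated_runtime_secs": 5,
        },
        {
            "name": "bundle_verify",
            "category": "bundle",
            "command": "cargo test bundle_verify",
            "required_artifacts": ["summary.json", "summary.md", "tool_transcript.json"],
            "estimated_runtime_secs": 20,
        },
    ],
}
```

Rejecting unknown categories and duplicate names up front keeps the registry usable as a single source of truth for runners and report generators.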

Coverage Matrix and Gap Model

system-tests/TEST_MATRIX.md defines P0/P1/P2 objective coverage snapshots. F:system-tests/TEST_MATRIX.md L12-L42

system-tests/test_gaps.toml tracks open/closed gaps with explicit acceptance criteria and category/priority mapping. F:system-tests/test_gaps.toml L4-L74 F:system-tests/test_gaps.toml L75-L190
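The gap model can be pictured as a list of records carrying a status, a priority, and acceptance criteria. The sketch below groups open gaps by priority and rejects records without acceptance criteria. The field names (`id`, `status`, `priority`, `acceptance_criteria`) and the gap identifiers in the usage note are hypothetical; the actual layout is defined by system-tests/test_gaps.toml.

```python
from collections import defaultdict

VALID_STATUSES = ("open", "closed")


def open_gaps_by_priority(gaps: list) -> dict:
    """Group the ids of open gaps by priority, failing closed on bad records."""
    grouped = defaultdict(list)
    for gap in gaps:
        if not gap.get("acceptance_criteria"):
            raise ValueError(f"gap {gap.get('id')!r} has no acceptance criteria")
        if gap.get("status") not in VALID_STATUSES:
            raise ValueError(f"gap {gap.get('id')!r} has an unknown status")
        if gap["status"] == "open":
            grouped[gap["priority"]].append(gap["id"])
    # Sort both priorities and ids so the output is deterministic.
    return {priority: sorted(ids) for priority, ids in sorted(grouped.items())}
```

For example, given one closed P1 gap and two open P2 gaps with made-up ids `gap-b` and `gap-a`, the function returns `{"P2": ["gap-a", "gap-b"]}`.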

As of 2026-02-07, all tracked P1 core gaps are closed (composite selector dedup, partial-segment proof anchors, cross-segment mismatch fail-closed, attachment closure fail-closed, SQLite/in-memory parity). As of 2026-02-14, P2 stress/perf coverage is also closed with deterministic core and sidecar performance artifacts and profile-matrix CI gating. As of the same date, OSS Launch 0 security findings are system-test gated for CLI boundary limits/policies, manifest structural tamper paths, SQLite corruption materialization, and single-open-segment enforcement. As of the same date, sidecar HTTP lifecycle and restart idempotency persistence are system-test gated with real sidecar subprocess + TCP transport workflows.

As of 2026-02-08, the CLI OSS world-class expansion follow-up is fully mirrored in system-tests: recorder-id shape validation parity, attachment-recording hostile-input fail-closed checks, auto-seal duration/combined lifecycle lanes, query JSON pagination + over-limit guardrails, and Decision Gate CLI ingest-fixture strict-fail/success command paths. As of 2026-02-08, sidecar container packaging is also system-test gated via the sidecar_docker suite (asset hardening checks + Docker Compose e2e lane with explicit skip/fail policy via RECORDER_REQUIRE_DOCKER). As of 2026-02-08, the Docker Compose lane additionally validates containerized startup/readiness probe behavior (/startup, /ready) before and after segment-open lifecycle transitions. As of 2026-02-13, the Docker Compose lane validates secret bootstrap hardening: token source material is provided via Docker secret input, then materialized into a tmpfs token file path inside the container before sidecar startup.

As of 2026-02-10, operations coverage includes a contract-generation lane that asserts typed SDK/OpenAPI projection artifacts (sdk/types.json, sdk/methods.json, strict Bundle/VerificationVerdict OpenAPI components) for downstream SDK generator readiness. As of 2026-02-11, the operations suite is module-split (operations/{query,config,locale,contract}.rs) to keep test files scoped and reviewable while preserving deterministic artifact/reporting behavior. As of 2026-02-11, the security and integration suites are also file-split (security_{support,limits,signer,contract_bundle}.rs, integration_openclaw_{support,core,additional}.rs, integration_decision_gate_{support,contract_shape,ingest}.rs) to keep test surfaces reviewable while preserving deterministic behavior and run metadata. As of the same date, security coverage includes SQLite operation queue saturation fail-closed validation with deterministic saturation summary artifacts (sqlite_operation_queue_saturation_fails_closed).

As of 2026-02-14, performance coverage additionally includes:

performance::stress_concurrent_append_query_sqlite,
performance::performance_bundle_materialization_smoke,
sidecar_perf::sidecar_perf_smoke_generates_gate_report,
sidecar_capacity::sidecar_capacity_sweep_generates_report.

This coverage comes with profile-matrix orchestration (scripts/ci/perf_matrix.py) and release gating (scripts/ci/perf_gate.py) against committed baseline policy files (system-tests/perf_baselines/*.json).

The sidecar perf model is split into two explicit lanes:

regression lane (sidecar_perf): fixed deterministic workload for CI anti-regression safety;
capacity lane (sidecar_capacity): release-only saturation sweep used to characterize max sustainable ingest throughput and enforce pinned-runner floors.

The sidecar regression/capacity lanes are aligned with the current single-writer v1 contract: stream-scoped query requests include required tenant_id and recorder_id, and envelope writes include canonical stream identity fields. As of 2026-02-14 (perf architecture split), both lanes emit explicit workload.* metadata and lane identity (regression/capacity) in artifacts; capacity reports additionally emit max_sustainable_rps, knee_rps, and error/latency-at-knee fields.

As of 2026-02-15, benchmark orchestration is standardized through scripts/perf/run.py presets (smoke, regression, capacity) with schema-v2 normalized report emission under target/perf/v2/*.json, triage output under target/perf/triage/latest.md, and capability-aware gate evaluation via the same baseline policy surface. As of the same date, core macro performance reports are emitted both at the legacy path target/performance/latest.json and at the per-suite paths target/performance/stress/latest.json and target/performance/bundle/latest.json to prevent suite overwrite ambiguity. As of the same date (Phase 3 expansion), the sidecar regression lane executes multi-trial timed windows and emits percentile confidence-spread metadata, the core performance suite emits query-cardinality and SQLite durability sweeps plus bundle scale-curve and deterministic replay hash-equality metrics, and the capacity lane emits explicit authoritative vs non-authoritative gate labeling.

As of 2026-02-16, sidecar regression memory metrics are split into hot-path high-water (rss_growth_bytes) and quiesced retained-growth (rss_growth_quiesced_bytes) signals, with rss_reclaim_ratio emitted for allocator high-water context. As of 2026-02-16, sidecar durability characterization coverage includes post-saturation graceful-stop restart timing (restart_ready_ms) and restart-correctness hardening for rolled parallel load paths; integration coverage now explicitly guards against multi-open-segment persistence across restart boundaries. As of the same date, durability measurement strictness captures live wal_size_bytes plus coupled PRAGMA wal_checkpoint(NOOP) state per case and emits durability_wal_ratio_vs_1000_comparable so WAL-ratio decisions can fail closed when baseline WAL sampling is non-comparable.

As of the same date, sidecar perf coverage also includes an informational retained-memory attribution lane (sidecar_perf_rss_retained_memory_attribution_characterization, ignored by default due to runtime) that emits target/sidecar-perf/rss_attribution_latest.json with sequential/parallel quiesce-window RSS checkpoints and smaps_rollup class deltas. As of the same date, sidecar perf coverage also includes an ignored ingest-tuning characterization lane (sidecar_perf_rss_ingest_tuning_characterization) that sweeps ingest queue and batch knobs across sequential/parallel workers, emits target/sidecar-perf/rss_ingest_tuning_latest.json, and records a data-derived mechanism classification (queue_backlog_dominated, batch_churn_dominated, or mixed) plus mitigation recommendation metadata.
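The gating idea described above (compare an emitted report against a committed baseline policy, and fail closed when a metric is missing or not comparable) can be sketched as follows. This is an illustration only, not the logic of scripts/ci/perf_gate.py: the policy shape (`min`/`max` bounds per metric) is an assumption, and the numbers in the usage note are made up rather than measured.

```python
def evaluate_gate(report: dict, policy: dict) -> list:
    """Return a list of failure messages; an empty list means the gate passes.

    `policy` maps a metric name to optional "min" and "max" bounds
    (hypothetical shape). A metric named in the policy but absent or
    non-numeric in the report counts as a failure.
    """
    failures = []
    for metric, bounds in sorted(policy.items()):
        value = report.get(metric)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number:
            failures.append(f"{metric}: missing or non-numeric (fail closed)")
            continue
        if "min" in bounds and value < bounds["min"]:
            failures.append(f"{metric}: {value} is below floor {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            failures.append(f"{metric}: {value} is above ceiling {bounds['max']}")
    return failures
```

With a policy of `{"max_sustainable_rps": {"min": 500}, "restart_ready_ms": {"max": 2000}}` and a report of `{"max_sustainable_rps": 650}`, the gate reports one failure because `restart_ready_ms` is absent, even though the throughput floor is met.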

Artifact and Transcript Contract

Each test run emits at minimum:

summary.json,
summary.md,
tool_transcript.json.

TestReporter and TestArtifacts create deterministic run roots, enforce the run-root reuse policy, and produce standardized summary documents. CliSession now resolves the recorder binary across both direct profile and deps binary layouts and prefers workspace target/debug resolution before falling back to a build, which keeps operation-lane command execution deterministic across cargo test and cargo nextest process layouts on Windows and Unix.

F:system-tests/tests/helpers/artifacts.rs L65-L131 F:system-tests/tests/helpers/artifacts.rs L133-L214 F:system-tests/tests/helpers/cli.rs L19-L107 F:system-tests/tests/helpers/cli.rs L109-L240
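A post-run check for the minimum artifact set could look like the sketch below. It is not the TestArtifacts implementation (which is Rust); `check_run_root` is a hypothetical helper, and the only facts taken from the contract are the three required file names.

```python
import json
from pathlib import Path

REQUIRED_ARTIFACTS = ("summary.json", "summary.md", "tool_transcript.json")


def check_run_root(run_root: Path) -> None:
    """Raise if any required artifact is missing or a JSON artifact is invalid."""
    missing = [name for name in REQUIRED_ARTIFACTS if not (run_root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{run_root} is missing artifacts: {missing}")
    for name in REQUIRED_ARTIFACTS:
        if name.endswith(".json"):
            # json.loads raises json.JSONDecodeError (a ValueError) on bad input.
            json.loads((run_root / name).read_text(encoding="utf-8"))
```

Running such a check after every test keeps artifact emission mandatory in practice: a test that passes its assertions but leaves an incomplete run root still fails.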

Execution Tooling

Python helpers:

test_runner.py: registry-based execution with optional parallelism, per-test artifact roots, and manifest generation.
coverage_report.py: generates docs from registry + gaps.
gap_tracker.py: lists/shows/closes gaps and generates implementation prompts.

CI also enforces source-level coverage gates through scripts/ci/coverage_gate.sh, which runs cargo llvm-cov --workspace --all-features --all-targets --summary-only and fails closed when workspace or sidecar line coverage falls below policy thresholds, or when sidecar handler/middleware/server files drop below a minimum per-file line floor. The gate runs in both PR and main workflows after the test lanes.
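The threshold logic of such a gate can be sketched in Python, assuming per-file line counts have already been extracted from the coverage summary. This is an illustration of the three checks described above, not a port of scripts/ci/coverage_gate.sh; the policy keys, path prefix, and threshold values are all hypothetical.

```python
def line_pct(files: dict) -> float:
    """Line coverage percentage over {path: (covered_lines, total_lines)}."""
    covered = sum(c for c, _ in files.values())
    total = sum(t for _, t in files.values())
    # An empty selection yields 0.0 so that a missing input fails the gate.
    return 100.0 * covered / total if total else 0.0


def coverage_failures(files: dict, policy: dict) -> list:
    """Return failure messages for workspace, sidecar, and per-file floors."""
    failures = []
    sidecar = {p: v for p, v in files.items() if p.startswith(policy["sidecar_prefix"])}
    if line_pct(files) < policy["workspace_min"]:
        failures.append("workspace line coverage below threshold")
    if line_pct(sidecar) < policy["sidecar_min"]:
        failures.append("sidecar line coverage below threshold")
    for path, counts in sorted(sidecar.items()):
        guarded = any(tag in path for tag in policy["per_file_tags"])
        if guarded and line_pct({path: counts}) < policy["per_file_min"]:
            failures.append(f"{path} below per-file line floor")
    return failures


# Placeholder policy; real thresholds live with the CI gate script.
EXAMPLE_POLICY = {
    "workspace_min": 65.0,
    "sidecar_min": 55.0,
    "per_file_min": 50.0,
    "sidecar_prefix": "crates/sidecar/",
    "per_file_tags": ("handler", "middleware", "server"),
}
```

The per-file floor matters because an aggregate percentage can stay healthy while a single request-handling file loses most of its coverage.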

F:scripts/system_tests/test_runner.py L64-L112 F:scripts/system_tests/test_runner.py L119-L199 F:scripts/system_tests/coverage_report.py L43-L101 F:scripts/system_tests/gap_tracker.py L92-L140 F:scripts/ci/coverage_gate.sh F:.github/workflows/ci_pr.yml F:.github/workflows/ci_main.yml

Suite Structure

Suite modules cover:

smoke: CLI startup and help/version checks,
bundle: build/verify/inspect and tamper detection,
persistence: restart, determinism, and SQLite/in-memory parity checks,
operations: query ordering/cursor plus JSON pagination/limit guardrails and recorder-id + auto-seal config validation parity checks, plus typed contract-generation SDK projection checks,
security: bounded CLI input surfaces, malformed-identifier rejection, secure signer-file policy, signer-rotation recovery/corruption behavior, contract path safety, hostile bundle parse-boundary checks, and hostile record-with-attachments boundary checks, plus SQLite operation queue saturation fail-closed behavior with deterministic summary artifacts,
recorder: lifecycle plus auto-seal count/duration/combined behavior and attachment-recording persistence checks over the real CLI boundary,
sidecar: real sidecar process lifecycle over HTTP (record/query/build/verify) and restart-boundary idempotency replay/conflict persistence checks,
sidecar_perf: deterministic regression-lane sidecar perf artifact emission for CI gating,
sidecar_capacity: release-only capacity-lane sidecar saturation sweep for pinned-runner throughput characterization/gating,
sidecar_docker: Dockerfile/Compose/config hardening checks and Docker Compose build/up/down with containerized sidecar startup/readiness probe checks plus record/query workflow, including secret-source + tmpfs token bootstrap parity,
integration_openclaw: fixture-driven OpenClaw gateway/CLI ingest, signed/unsigned verification lanes, sequence-gap policy checks, sensitive field redaction, and bounded payload handling checks,
integration_decision_gate: fixture-driven Decision Gate MCP runpack flow ingest through the production recorder-decision-gate-adapter crate, signed/unsigned verification lanes, runpack-integrity strict-vs-anomaly policy checks (including manifest self-integrity recomputation), sensitive transcript-field redaction, bounded transcript payload handling checks, CLI decision-gate ingest-fixture command-path validation, and a fixture conformance gate that enforces canonical Decision Gate tool request/response shapes (including export-vs-verify checked_files semantics).
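The explicit skip/fail policy that the sidecar_docker lane applies via RECORDER_REQUIRE_DOCKER can be pictured as a small decision function. The sketch below is a guess at the shape of that policy, not the suite's code: the accepted values (`"1"` to require Docker, unset or `"0"` to allow a skip) and the function name are assumptions, and unknown values are rejected in the spirit of strict environment parsing.

```python
import os


def docker_lane_action(docker_available: bool, env=None) -> str:
    """Return "run", "skip", or "fail" for a Docker-dependent lane."""
    env = os.environ if env is None else env
    raw = env.get("RECORDER_REQUIRE_DOCKER", "")
    if raw not in ("", "0", "1"):
        # Assumed strict parsing: ambiguous values are errors, not defaults.
        raise ValueError(f"unsupported RECORDER_REQUIRE_DOCKER value: {raw!r}")
    if docker_available:
        return "run"
    # Without Docker, the lane is skipped unless the environment demands it.
    return "fail" if raw == "1" else "skip"
```

Under these assumptions a developer laptop without Docker skips the lane visibly, while a CI runner that sets the variable to `1` turns the same situation into a hard failure.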

File-by-File Cross Reference

Area	File	Notes
Contract and standards	`system-tests/AGENTS.md`	Behavioral and artifact requirements for system-tests.
Execution overview	`system-tests/README.md`	How to run and extend suites.
Coverage snapshot	`system-tests/TEST_MATRIX.md`	P0/P1/P2 matrix.
Test registry	`system-tests/test_registry.toml`	Authoritative inventory and run commands.
Gap tracker data	`system-tests/test_gaps.toml`	Coverage gaps and acceptance criteria.
Artifact helper	`system-tests/tests/helpers/artifacts.rs`	Run-root and summary generation contract.
CLI helper	`system-tests/tests/helpers/cli.rs`	Real CLI command execution and transcript capture.
Sidecar helper	`system-tests/tests/helpers/sidecar.rs`	Real sidecar process start/stop and HTTP transcript capture.
Sidecar perf helper	`system-tests/tests/helpers/perf_sidecar.rs`	Shared keep-alive client, deterministic payload generation, and lane metadata/report utilities for sidecar perf suites.
Docker helper	`system-tests/tests/helpers/docker.rs`	Docker daemon/compose probes and command helpers for containerized lanes.
Sidecar suite	`system-tests/tests/suites/sidecar.rs`	Sidecar HTTP lifecycle (record/query/build/verify) and restart-idempotency validation.
Sidecar perf suites	`system-tests/tests/suites/sidecar_perf.rs`, `system-tests/tests/suites/sidecar_capacity.rs`	Regression lane and capacity lane artifact generation for sidecar ingest performance policy, plus ignored informational retained-memory attribution lane (`target/sidecar-perf/rss_attribution_latest.json`) and ingest-tuning characterization lane (`target/sidecar-perf/rss_ingest_tuning_latest.json`).
Perf orchestration	`scripts/perf/run.py`, `scripts/perf/report_schema_v2.json`, `scripts/perf/collect_runner_fingerprint.py`, `scripts/perf/update_baseline.py`	Canonical preset entrypoint, schema-v2 report contract, runner fingerprint capture, and governed baseline refresh workflow.
Perf profiling helper	`scripts/perf/profile_hotspot.sh`	Fast failing-suite-to-profiler workflow for hotspot triage.
Sidecar Docker suite	`system-tests/tests/suites/sidecar_docker.rs`	Sidecar container packaging hardening and Docker Compose workflow validation, including startup/readiness probes and secret-source tmpfs token bootstrap behavior.
OpenClaw integration suite	`system-tests/tests/suites/integration_openclaw.rs`, `system-tests/tests/suites/integration_openclaw_support.rs`, `system-tests/tests/suites/integration_openclaw_core.rs`, `system-tests/tests/suites/integration_openclaw_additional.rs`	Fixture-driven adapter ingest validation for gateway + CLI mock flows.
OpenClaw fixtures	`system-tests/tests/fixtures/openclaw_gateway_mock_events.json`	Gateway mock flow event fixture aligned to OpenClaw event schema.
OpenClaw fixtures	`system-tests/tests/fixtures/openclaw_cli_mock_events.json`	CLI fallback-style flow event fixture aligned to OpenClaw event schema.
OpenClaw integration architecture	`Docs/architecture/recorder_openclaw_integration_architecture.md`	Versioned mapping, redaction, and bounded payload policy contract.
Decision Gate production adapter	`crates/recorder-decision-gate-adapter/src/adapter.rs`	Canonical Decision Gate-to-Recorder mapping implementation exercised by system-tests.
Decision Gate integration suite	`system-tests/tests/suites/integration_decision_gate.rs`, `system-tests/tests/suites/integration_decision_gate_support.rs`, `system-tests/tests/suites/integration_decision_gate_contract_shape.rs`, `system-tests/tests/suites/integration_decision_gate_ingest.rs`	Fixture-driven MCP runpack flow validation for control-plane coupling.
Decision Gate fixture	`system-tests/tests/fixtures/decision_gate_runpack_mock_flow.json`	Mock runpack MCP flow fixture aligned to Decision Gate transcript and runpack manifest layout.
Decision Gate integration architecture	`Docs/architecture/recorder_decision_gate_integration_architecture.md`	Versioned MCP flow mapping, runpack integrity policy, and transcript redaction/bounds contract.
Env parsing	`system-tests/src/config/env.rs`	Strict environment parsing for test config.
Runner script	`scripts/system_tests/test_runner.py`	Registry-driven execution engine.
Coverage docs generator	`scripts/system_tests/coverage_report.py`	Generated testing docs pipeline.
Gap management script	`scripts/system_tests/gap_tracker.py`	Gap lifecycle tooling.