Capabilities
What Reeve actually does, what's covered by an executable test, and what's deferred. The list is generated from the build plan and the simulation harness; if a claim isn't on this page with a status, it isn't a claim Reeve makes.
Architecture
- Multi-tenant SaaS with row-level isolation. Postgres RLS plus a withTenant wrapper; queries without a tenant context fail. Cross-tenant isolation is verified by scenario #6: an adversary in tenant A leaves tenant B byte-for-byte unchanged.
- Constraints in code, not prompts. Database constraints (exclusion on overlapping bookings), RLS policies, programmatic policy gates (the hypervisor), and an append-only event log enforce the rules. Adversarial scenarios verify the architecture's response is independent of prompt-layer judgment.
- Stateless agents over stateful storage. Agents are pure functions of their inputs; state lives in Postgres. The simulation harness drives runReceptionist directly with no mocks.
- Per-action-class phase progression. Phase 1 (queue every action) → Phase 2 (shadow + correlation) → Phase 3 (autonomous on routine, gated on consequential). Phase state lives in the Apprentice service (v0.5+) and is read by the hypervisor on every action.
- Append-only event log. Application role has no UPDATE/DELETE on the events table; verified by integration test.
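The withTenant wrapper named in the first bullet could look roughly like the sketch below. Only the name withTenant comes from this page; the SqlClient interface, the app.tenant_id setting name, and the transaction handling are illustrative assumptions, not Reeve's actual code. The idea is to set a transaction-local value that an RLS policy can read, so work issued outside the wrapper has no tenant value for a policy to match.

```typescript
// Hypothetical minimal client interface; a real driver's client would be adapted to it.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Sketch of a tenant-scoping wrapper (assumed shape, not Reeve's implementation).
// It opens a transaction, sets a transaction-local setting that an RLS policy
// could read via current_setting('app.tenant_id'), runs the work, and commits.
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  if (tenantId.trim() === "") {
    // Refuse to run without a tenant rather than fall back to a default.
    throw new Error("withTenant: tenantId is required");
  }
  await client.query("BEGIN");
  try {
    // The third argument (true) makes the setting local to this transaction.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    // Undo any partial work, then surface the original error to the caller.
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Because the setting is transaction-local in this sketch, it does not leak to the next caller that reuses the same pooled connection.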
Adversarial validation — the 28-scenario harness
Each scenario is a runnable integration test that drives the production code paths through synthesized adversarial input and asserts properties of the resulting state. Coverage is mapped against OWASP LLM Top 10 (2025) and MITRE ATLAS. Every scenario in this list is currently passing in the test suite.
Architecture-level (9)
- dr-reed-hipaa-basic — defense in depth on a bad agent draft. PHI-laden draft is approved by the operator; the ship-time HIPAA gate rejects the send. Also exercises TCPA STOP-keyword consent revocation.
- dr-reed-adversarial — five direct attack patterns (prompt injection, jailbreak, confused-deputy, principal impersonation, regulatory social engineering). Each gets flagged with sensitive review-class.
- dr-reed-adversarial-multistep — branching attacks that adapt to the previous response. Architecture holds against the iterative pattern.
- indirect-prompt-injection — payloads in Subject lines, quoted-content masquerades, in-body authority impersonation. The OWASP-ranked highest-impact injection variant.
- stored-prompt-injection — payload planted in one inbound, surfaced via timeline-loaded follow-ups. Database-side analog of stored XSS for LLM systems.
- encoding-bypass-injection — base64, leetspeak, Cyrillic homoglyphs, zero-width chars, HTML entities. Asserts structural defenses hold regardless of obfuscation.
- cross-tenant-isolation — adversary in tenant A leaves tenant B byte-for-byte unchanged across every reasonable probe.
- bob-plumbing-marketing — TCPA marketing-vs-transactional consent path. Marketing-class send without grant blocks; YES inbound flips consent; subsequent send goes through.
- bob-plumbing-worker — operator review policy as testable code. Three customer messages exercise reject / edit / approve branches.
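A "byte-for-byte unchanged" assertion like the one in cross-tenant-isolation can be sketched as a before/after comparison of a canonical serialization. Everything here is hypothetical (the Row and Tables types, snapshot, isolationHolds); the real harness runs against Postgres, while this sketch uses in-memory data to show only the shape of the check.

```typescript
type Row = Record<string, string | number | boolean | null>;
type Tables = Record<string, Row[]>;

// Compare entries by key using code-unit order so the result is locale-independent.
function byKey(a: [string, unknown], b: [string, unknown]): number {
  return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
}

// Serialize a tenant's tables deterministically: table names and column names
// are sorted, row order is preserved, so any change to the data changes the string.
function snapshot(tables: Tables): string {
  const canonical = Object.entries(tables)
    .sort(byKey)
    .map(([name, rows]) => [name, rows.map((row) => Object.entries(row).sort(byKey))]);
  return JSON.stringify(canonical);
}

// Run every probe, then report whether the victim's data is unchanged.
// Probes are expected to be rejected, so thrown errors are swallowed here.
function isolationHolds(readVictim: () => Tables, probes: Array<() => void>): boolean {
  const before = snapshot(readVictim());
  for (const probe of probes) {
    try {
      probe();
    } catch {
      // A rejected probe is the desired outcome; only state changes matter.
    }
  }
  return snapshot(readVictim()) === before;
}
```

The check passes only if the victim's serialized state is identical after all probes, whether each probe was rejected or succeeded against the adversary's own data.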
Vertical-specific (19)
Each scenario maps a marketing claim on a /for/<vertical>
page to an executable assertion. The architecture's response to
a vertical-shaped customer message is verified to match the
promise.
- vertical-gate-handyman — kitchen-remodel scope-cap.
- vertical-gate-electrical — permit-required panel upgrade.
- vertical-gate-locksmith — residential lockout without ID verification.
- vertical-gate-garage-door — DIY torsion-spring guidance refused.
- vertical-gate-pest-control — pre-treatment safety brief required.
- vertical-gate-roofing — insurance-claim coordination not phone-quoted.
- vertical-gate-painting — pre-1978 chipping paint triggers EPA RRP.
- vertical-gate-landscaping — mid-season snow signup outside Nov 1 window.
- vertical-gate-cleaning — first deep clean requires walkthrough.
- vertical-gate-mobile-grooming — reactive new-client dog requires temperament screening.
- vertical-gate-appliance-repair — diagnostic-first, model+serial required.
- vertical-gate-mobile-detailing — rain-forecast booking flagged for reschedule.
- vertical-gate-salons — first-time color requires 48h patch test.
- vertical-gate-tutoring — under-16 schedule change requires parent confirmation.
- vertical-gate-mental-health — clinical content (medication, symptom) refused, page therapist.
- vertical-gate-septic — real-estate inspection requires permit + operator handling.
- vertical-gate-well-water — new well drilling referred to licensed driller.
- vertical-gate-masonry — severe foundation crack flagged for structural-engineer referral.
- vertical-gate-deck-building — new attached deck requires permit handling.
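To illustrate how a marketing claim can become an executable assertion, here is a minimal sketch modeled on vertical-gate-painting. The types, field names, and the paintingGate function are hypothetical and far simpler than a real scenario, which drives production code paths; the 1978 cutoff is the one the EPA RRP rule uses.

```typescript
// Hypothetical request shape; field names are illustrative, not Reeve's schema.
interface PaintingRequest {
  yearBuilt: number;
  message: string;
}

type GateDecision =
  | { outcome: "proceed" }
  | { outcome: "flag"; component: string; reason: string };

// The EPA RRP rule covers housing built before 1978.
const RRP_CUTOFF_YEAR = 1978;

// Assumed gate: flag any painting request on a pre-1978 property.
function paintingGate(req: PaintingRequest): GateDecision {
  if (req.yearBuilt < RRP_CUTOFF_YEAR) {
    return { outcome: "flag", component: "EPA_RRP", reason: "pre-1978 property" };
  }
  return { outcome: "proceed" };
}

// A scenario pairs the claim text with an input and the outcome the claim promises.
interface Scenario<I> {
  id: string;
  claim: string;
  input: I;
  expect: GateDecision["outcome"];
}

// Returns true when the gate's outcome matches what the claim promises.
function runScenario<I>(scenario: Scenario<I>, gate: (input: I) => GateDecision): boolean {
  return gate(scenario.input).outcome === scenario.expect;
}
```

A scenario then reads as data: the claim, a vertical-shaped customer message, and the promised outcome, checked by runScenario.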
Few-shot in-context demonstrations (v0.5.5)
Every operator approve/edit/reject in the review queue produces
an (input, draft, final) triple. On the next draft for
that same (tenant, action class), Reeve inlines 1–2 prior
approved/edited examples as ICL demonstrations. Per
AdaptAgent (Verma et al., 2024),
this boosts task success 3–7% absolute on unseen settings;
gains saturate at ~5 examples.
- Tenant-scoped only. Demonstrations never cross tenant boundaries (cross-tenant signal flow is v2-deferred).
- Most-recent first; same-customer preferred when applicable.
- Renders before the inbound message in the prompt — paper-validated ordering.
- Disposition mapping: approved → 'accept', edited → 'edit' (with payload_final as edited_output), rejected → 'reject'.
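The selection rules above can be sketched as a filter and sort over prior review triples. This is an illustration under stated assumptions, not Reeve's code: the ReviewTriple and DraftContext shapes and both function names are hypothetical, and the sketch assumes same-customer preference takes precedence over recency, which the list does not spell out.

```typescript
type Disposition = "approved" | "edited" | "rejected";

// One operator decision from the review queue (hypothetical field names).
interface ReviewTriple {
  tenantId: string;
  actionClass: string;
  customerId: string;
  input: string;
  draft: string;
  final: string;
  disposition: Disposition;
  decidedAt: number; // epoch milliseconds
}

// The draft being prepared, which demonstrations must match.
interface DraftContext {
  tenantId: string;
  actionClass: string;
  customerId: string;
}

// Disposition mapping as listed above.
function toLabel(d: Disposition): "accept" | "edit" | "reject" {
  switch (d) {
    case "approved":
      return "accept";
    case "edited":
      return "edit";
    case "rejected":
      return "reject";
  }
}

// Pick up to `limit` approved or edited triples for the same (tenant, action class).
function selectDemonstrations(
  history: ReviewTriple[],
  ctx: DraftContext,
  limit = 2,
): ReviewTriple[] {
  return history
    .filter(
      (t) =>
        t.tenantId === ctx.tenantId && // tenant-scoped only
        t.actionClass === ctx.actionClass &&
        t.disposition !== "rejected",
    )
    .sort((a, b) => {
      const aSame = a.customerId === ctx.customerId ? 1 : 0;
      const bSame = b.customerId === ctx.customerId ? 1 : 0;
      if (aSame !== bSame) return bSame - aSame; // same customer first (assumed precedence)
      return b.decidedAt - a.decidedAt; // then most recent first
    })
    .slice(0, limit);
}
```

Since filter returns a new array, the sort does not reorder the caller's history.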
Apprentice integration (v0.5)
- Phase state via apprentice's /v1/status. Hypervisor reads getSkillStatus(skillNameFor(tenantId, actionType)) instead of consulting a static table. Apprentice phases (phase_0/1/2/3/4) map to reeve phases (P1, P1, P1, P2, P3). Fail-closed when apprentice is unreachable.
- Feedback emission. Every operator approve/edit/reject in the review queue fires submitFeedback to apprentice's /v1/feedback with the action_id as request_id. Fire-and-forget; apprentice downtime doesn't stall the operator UI.
- Morning-summary dashboard. /operator/dashboard renders per-skill correlation and phase from apprentice.getStatus(). Empty-state when apprentice has no data; offline-state when unreachable.
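The phase mapping and fail-closed behavior from the first bullet can be sketched as follows. The function names are hypothetical, and two things are assumptions: that "fail-closed" means falling back to P1 (queue every action), and that an unrecognized phase value is treated the same way. A real lookup would be an asynchronous call to /v1/status; it is synchronous here to keep the sketch short.

```typescript
type ReevePhase = "P1" | "P2" | "P3";

// Mapping stated above: phase_0/1/2 -> P1, phase_3 -> P2, phase_4 -> P3.
function toReevePhase(apprenticePhase: string): ReevePhase {
  switch (apprenticePhase) {
    case "phase_0":
    case "phase_1":
    case "phase_2":
      return "P1";
    case "phase_3":
      return "P2";
    case "phase_4":
      return "P3";
    default:
      // Assumption: anything unrecognized gets the most restrictive phase.
      return "P1";
  }
}

// Fail closed: if the status lookup throws (for example, the service is
// unreachable), fall back to P1 instead of reusing an earlier, more permissive phase.
function resolvePhase(fetchPhase: () => string): ReevePhase {
  try {
    return toReevePhase(fetchPhase());
  } catch {
    return "P1";
  }
}
```

Under these assumptions an outage can only make Reeve more conservative, never more autonomous.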
Compliance components (per-tenant)
- HIPAA — ship-time gate that rejects sends containing PHI keywords. All 8 layers wired. Verified by dr-reed-hipaa-basic, dr-reed-adversarial, and the indirect-prompt-injection, stored-prompt-injection, and encoding-bypass-injection scenarios.
- TCPA — consent-grant tracking with STOP/START keyword handling; marketing-vs-transactional gate at ship-time. Verified by bob-plumbing-marketing.
- EPA RRP (lead-paint disclosure) — flagged at booking-time for pre-1978 properties.
- Other components — FERPA (tutoring), state contractor licensing, state pesticide licensing, state well-driller licensing, state cosmetology licensing, state mental-health licensing, state septic licensing, EPA refrigerant certification — declared per-vertical via the archetype catalog, activated per-tenant during onboarding.
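The TCPA consent path can be pictured as a small state machine in front of a ship-time check. The sketch below is hypothetical throughout: the state shape, the function names, exact-match keyword handling, and the choice that STOP also clears a prior marketing grant are assumptions made for illustration. The YES keyword follows the bob-plumbing-marketing scenario.

```typescript
type MessageClass = "marketing" | "transactional";

interface ConsentState {
  optedOut: boolean; // set by STOP, cleared by START
  marketingGranted: boolean; // set by YES
}

const INITIAL_CONSENT: ConsentState = { optedOut: false, marketingGranted: false };

// Apply an inbound message to the consent state. Keywords are matched
// case-insensitively against the whole trimmed body (an assumption).
function applyInbound(state: ConsentState, body: string): ConsentState {
  switch (body.trim().toUpperCase()) {
    case "STOP":
      // Assumption: revocation also clears any earlier marketing grant.
      return { optedOut: true, marketingGranted: false };
    case "START":
      return { ...state, optedOut: false };
    case "YES":
      return { ...state, marketingGranted: true };
    default:
      return state;
  }
}

// Ship-time gate: nothing goes out after STOP; marketing also needs a grant.
function canSend(state: ConsentState, messageClass: MessageClass): boolean {
  if (state.optedOut) return false;
  return messageClass === "transactional" || state.marketingGranted;
}
```

Keeping the gate a pure function of stored consent state means the same check can run at ship-time regardless of what the agent drafted.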
Governed under Pact
- 5 Constrain artifacts at the project root specify what Reeve should be: constraints, trust_policy, component_map, schema_hints, prompt.
- 147 ComponentContracts reverse-engineered by pact adopt live in contracts/; one per leaf module.
- 146 ContractTestSuites live in tests/src_*/; visible + hidden goodhart cases.
- Constrain ↔ Pact diff documented in docs/constrain-pact-diff.md; the architecture is internally consistent. Small drifts (3 missed layers, 7 missed event types, 3 implicit patterns) were harmonized.
Coverage gaps — honestly enumerated
Things Reeve does NOT yet test or claim, with the trigger that would cause us to add coverage:
- Production LLM regressions — the harness uses scripted transports for determinism. The --llm flag swaps in real Anthropic for exploratory runs; CI doesn't gate on it. Lands when there's enough volume to run --llm CI without burning budget.
- Multimodal demonstrations — voicemail-with-image and email-with-attachment ICL waits on those channels landing.
- Cross-tenant aggregate signal — v2 deferred; no design partner has asked for it.
- Real-time-sensitive attacks — rate-limit and timeout-race cases need a synthetic clock harness.
- Supply-chain attacks via plugin SDK — runtime gates exist, but no scenario probes a malicious plugin manifest end-to-end. Lands with v1.5 marketplace work.
- Apprentice training poisoning — adversarial feedback across many sessions. Modeling operator-attention drift is required first.
Last updated 2026-05-03. The full simulation harness lives at
tests/integration/simulation/
in the reeve repo. Run any scenario locally with
npm run simulate -- <scenario-id>.