Phase 1b Child Spec: Staging Proof
Created: 2026-05-29
Status: active draft
Parent spec: docs/phase1b-agent-routing-spec.md
Product Outcome Contract
Phase 1b must be tested without mutating the production VPS or production decision-engine PRs. A staging clone or disposable remote test box must prove routing, verdict posting, and merge or feedback behavior against a sandbox target before production cutover.
Goal
Define the staging proof path for Phase 1b: provision an isolated production-like runtime, disable production authority, run six single-domain PR cycles plus one cross-domain PR cycle, save a machine-readable proof artifact, then destroy or shut down the staging environment.
Non-Goals
- Do not mutate production PRs.
- Do not use production GitHub tokens in staging.
- Do not prove 24-hour production stability.
- Do not promote a mutated staging server as production.
- Do not test payment, wallet, Twitter, or mainnet flows.
Current Implementation Audit
Known repo truth:
- systemd/teleo-pipeline.service defines the production-style pipeline service.
- deploy/ contains deployment and mirror scripts.
- docs/ARCHITECTURE.md documents VPS path assumptions and SQLite state.
- docs/INFRASTRUCTURE.md documents production as Hetzner 77.42.65.182, root path /opt/teleo-eval, diagnostics on port 8081, and health on port 8080.
- deploy/auto-deploy.sh pulls from /opt/teleo-eval/workspaces/deploy-infra, syncs code into runtime paths, restarts changed Python services, and updates /opt/teleo-eval/.last-deploy-sha after smoke checks.
- systemd/teleo-pipeline.service expects /opt/teleo-eval/pipeline/fix-ownership.sh, while this repo stores that script under deploy/fix-ownership.sh; staging bootstrap must verify the live runtime path before assuming the unit works.
- handoff/phase1-step3-script-migration.md documents GitHub migration posture and decision-engine target for scripts.
- handoff/deprecated/eval-scripts.md confirms old eval scripts are dead.
- Fwaz described the current production update path as pull -> services recognize pull -> edit on VPS -> PR to Leo; staging proof must treat that as an unsafe legacy behavior to replace, not as a release gate.
- Fwaz approved Crabbox as the long-term disposable staging/test-box direction.
Unknown production truth:
- Exact current deployed SHA.
- Whether production service files match this repo.
- Whether production still points at Forgejo in the live daemon.
- Exact restart/deploy commands used by Fwaz or agents.
- Current secrets layout.
- Current systemctl cat output for teleo-pipeline, teleo-diagnostics, auto-deploy timers, cron-like research jobs, Telegram-related services, and any agent daemons.
- Whether production has uncommitted hotfixes, generated scripts, or local service patches under /opt/teleo-eval.
- Read-only live access is not available in this workspace; the infrastructure audit attempted SSH readback and hit authentication denial, so no production SHA or service state should be claimed from this spec.
Existing-Spec Inventory
| Existing doc | Relevance | Decision |
|---|---|---|
| docs/phase1b-agent-routing-spec.md | Parent proof requirements. | Reuse. |
| docs/ARCHITECTURE.md | VPS topology and service assumptions. | Reuse with current-readback requirement. |
| systemd/teleo-pipeline.service | Service command template. | Reuse as staging baseline. |
| handoff/phase1-step3-script-migration.md | GitHub decision-engine target context. | Reuse. |
Goal-Vs-Repo-Truth Diff
Goal:
- Staging proof runs against sandbox decision-engine.
- Production services and secrets are disabled before test daemon starts.
- Proof artifact captures routes, verdicts, final PR states, SHAs, DB schema, feature flags, and logs.
Repo truth:
- Staging automation does not exist.
- No proof script exists for seven PR cases.
- No machine-readable Phase 1b proof schema exists outside the umbrella spec.
Completion Percent And Remaining Delta
Current completion: 0 percent.
Remaining delta:
- Choose staging substrate: Hetzner snapshot clone, Crabbox, or another disposable test box.
- Define sandbox repo.
- Define staging secrets.
- Write or run proof sequence.
- Retain proof artifact.
- Confirm staging cannot mutate production.
Closure, Endpoint, And Deployment Truth
Staging closure means:
- Staging environment is isolated.
- Sandbox PRs are created and processed.
- Required reviewer verdicts appear in PR comments.
- Pipeline state transitions match expected behavior.
- Proof artifact exists.
Production closure is separate and requires exact reviewed SHA deployment plus 24-hour readback.
Critical Assumptions And Invalidators
Assumptions:
- A VPS snapshot or disposable equivalent can run the pipeline.
- Production secrets can be removed or replaced before daemon start.
- A sandbox GitHub repo can be used.
- The proof can run without real production inference spend, or spend is explicitly approved.
Invalidators:
- Clone boots production services before quarantine.
- Sandbox target cannot receive PRs/comments.
- No operator has cloud or VPS access.
- Secrets cannot be separated from production.
- Service paths on production are materially different from repo docs.
State And Truth Contract
The proof artifact should be written on the staging host, then copied back into the PR or a retained artifact location. Suggested filename:
proof/phase1b-staging-proof-YYYYMMDD-HHMMSS.json
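A minimal sketch of how the suggested filename could be generated. The function name proof_artifact_path and the choice of UTC are assumptions; the spec names only the pattern, not a timezone or a helper.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def proof_artifact_path(now: Optional[datetime] = None, root: str = "proof") -> PurePosixPath:
    """Return <root>/phase1b-staging-proof-YYYYMMDD-HHMMSS.json for the given time."""
    # Assumption: timestamps are UTC; the spec does not name a timezone.
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%d-%H%M%S")
    return PurePosixPath(root) / f"phase1b-staging-proof-{stamp}.json"


if __name__ == "__main__":
    # Prints a path that follows the suggested pattern for the current time.
    print(proof_artifact_path())
```

Passing an explicit datetime keeps the name reproducible when the artifact is regenerated from saved run data.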
Required JSON fields:
{
  "phase": "1b",
  "schema_version": 1,
  "environment": {
    "kind": "hetzner_snapshot|crabbox|disposable_remote",
    "host": "...",
    "snapshot_id": "...",
    "created_from_prod_host": "77.42.65.182"
  },
  "teleo_infrastructure_sha": "...",
  "decision_engine_target": "living-ip/decision-engine-sandbox",
  "pipeline_db_schema": 26,
  "feature_flags": {"PHASE1B_AGENT_ROUTING_ENABLED": "true"},
  "safety": {
    "prod_services_disabled": true,
    "prod_timers_disabled": true,
    "prod_crons_disabled": true,
    "prod_secrets_removed": true,
    "auto_merge_constrained": true
  },
  "test_cases": [],
  "verification_outputs": {
    "service_status_path": "...",
    "journal_excerpt_path": "...",
    "db_snapshot_path": "...",
    "github_comments_path": "..."
  },
  "rollback": {
    "production_sha_before": "...",
    "candidate_sha": "...",
    "rollback_command": "..."
  },
  "created_at": "..."
}
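The required fields can be checked mechanically before an artifact is accepted. The sketch below is one possible validator, not an existing repo script: the name artifact_problems is hypothetical, and treating every safety flag as mandatory true is an assumption drawn from the example values.

```python
from typing import Dict, List

# Top-level keys taken from the required JSON fields listed in this spec.
REQUIRED_TOP_LEVEL = (
    "phase", "schema_version", "environment", "teleo_infrastructure_sha",
    "decision_engine_target", "pipeline_db_schema", "feature_flags",
    "safety", "test_cases", "verification_outputs", "rollback", "created_at",
)

# Assumption: a valid staging proof needs every safety flag set to true.
SAFETY_FLAGS = (
    "prod_services_disabled", "prod_timers_disabled", "prod_crons_disabled",
    "prod_secrets_removed", "auto_merge_constrained",
)


def artifact_problems(artifact: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks valid."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in artifact]
    safety = artifact.get("safety")
    if not isinstance(safety, dict):
        safety = {}
    problems += [f"safety flag not true: {flag}" for flag in SAFETY_FLAGS if safety.get(flag) is not True]
    if artifact.get("phase") != "1b":
        problems.append("phase must be '1b'")
    return problems
```

A caller would load the JSON file with json.load and refuse to retain the artifact if the returned list is non-empty.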
Each test case:
{
  "case": "internet-finance",
  "pr": 12,
  "required_agents": ["Rio"],
  "posted_verdicts": {"Rio": "approve"},
  "final_state": "approved",
  "route_kind": "single"
}
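A per-case consistency check follows directly from this record shape: every agent in required_agents should have an entry in posted_verdicts. The helper name missing_verdicts is hypothetical, and the second record is invented sample data for illustration only.

```python
from typing import Dict, List


def missing_verdicts(case: Dict) -> List[str]:
    """Return required agents that have no posted verdict in this case record."""
    posted = case.get("posted_verdicts") or {}
    return sorted(agent for agent in case.get("required_agents", []) if not posted.get(agent))


if __name__ == "__main__":
    single = {
        "case": "internet-finance",
        "pr": 12,
        "required_agents": ["Rio"],
        "posted_verdicts": {"Rio": "approve"},
    }
    # Illustrative record only: a cross-domain case where one verdict is still pending.
    pending = {
        "required_agents": ["Theseus", "Rio"],
        "posted_verdicts": {"Theseus": "approve"},
    }
    print(missing_verdicts(single))
    print(missing_verdicts(pending))
```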
Measurement Contract
Minimum staging cases:
- grand strategy -> Leo
- ai systems or ai alignment -> Theseus
- internet finance -> Rio
- health -> Vida
- entertainment -> Clay
- space, robotics, energy, or advanced manufacturing -> Astra
- cross-domain ai plus x402 -> Theseus and Rio
Pass criteria:
- 7 of 7 route decisions match expected required agents.
- 7 of 7 PRs receive parseable verdict comments.
- No production repo receives comments.
- No production service remains enabled during staging run.
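The matrix and pass criteria above can be expressed as data plus one check. This is a sketch under stated assumptions: only the label internet-finance appears in this spec, so the other case labels are hypothetical; a verdict counts as parseable here when the artifact holds a non-empty string for each required agent; and the production comment count and enabled-service list are inputs the operator gathers separately.

```python
from typing import Dict, Iterable, List

# Expected required agents per staging case. Labels other than
# "internet-finance" are hypothetical names chosen for this sketch.
EXPECTED_AGENTS = {
    "grand-strategy": {"Leo"},
    "ai-systems": {"Theseus"},
    "internet-finance": {"Rio"},
    "health": {"Vida"},
    "entertainment": {"Clay"},
    "space": {"Astra"},
    "cross-domain-ai-x402": {"Theseus", "Rio"},
}


def _has_verdicts(case: Dict) -> bool:
    """True when every required agent has a non-empty string verdict."""
    posted = case.get("posted_verdicts") or {}
    agents = case.get("required_agents") or []
    return bool(agents) and all(isinstance(posted.get(a), str) and posted.get(a) != "" for a in agents)


def pass_criteria(
    test_cases: List[Dict],
    production_comment_count: int,
    enabled_production_services: Iterable[str],
) -> Dict[str, bool]:
    """Evaluate the four pass criteria against recorded test cases."""
    by_case = {case.get("case"): case for case in test_cases}
    total = len(EXPECTED_AGENTS)
    routes_ok = sum(
        1 for label, agents in EXPECTED_AGENTS.items()
        if set((by_case.get(label) or {}).get("required_agents") or []) == agents
    )
    verdicts_ok = sum(
        1 for label in EXPECTED_AGENTS
        if label in by_case and _has_verdicts(by_case[label])
    )
    return {
        "routes_match": routes_ok == total,
        "verdicts_parseable": verdicts_ok == total,
        "no_production_comments": production_comment_count == 0,
        "no_production_services_enabled": len(list(enabled_production_services)) == 0,
    }
```

The staging run passes only if every value in the returned dictionary is true.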
Backend Work Required
Owned surfaces:
- Staging host.
- Sandbox repo.
- Staging env/config.
- Proof artifact generator or manual proof script.
Implementation steps:
- Snapshot or provision staging environment.
- Block public/prod access.
- Disable production services.
- Remove production secrets.
- Set hostname to staging.
- Configure sandbox target.
- Deploy exact implementation SHA.
- Enable Phase 1b feature flag.
- Create seven sandbox PRs.
- Run pipeline until verdicts and states are visible.
- Save proof artifact.
- Shut down or destroy staging.
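The ordering matters more than the individual steps: quarantine must finish before the pipeline daemon runs. A minimal sketch of that gate, assuming the operator keeps an ordered log of completed step names; the step identifiers below are hypothetical shorthand for the steps listed above.

```python
from typing import Iterable, List

# Hypothetical identifiers for the quarantine steps in the list above.
QUARANTINE_STEPS = (
    "block_public_prod_access",
    "disable_production_services",
    "remove_production_secrets",
    "set_staging_hostname",
    "configure_sandbox_target",
)

# Hypothetical identifier for the step that starts the pipeline daemon.
DAEMON_START = "run_pipeline"


def quarantine_gaps(executed: Iterable[str]) -> List[str]:
    """Return quarantine steps not completed before the first DAEMON_START entry.

    If DAEMON_START never appears, the result lists the quarantine steps
    that have not been completed yet.
    """
    done = set()
    for step in executed:
        if step == DAEMON_START:
            break
        done.add(step)
    return [step for step in QUARANTINE_STEPS if step not in done]
```

An empty result means the daemon started, or may start, only after every quarantine step was logged.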
Frontend Work Required
None.
Expected Runtime And User-Visible Behavior
Operator sees:
- Staging service status.
- Sandbox PR comments with agent verdict tags.
- SQLite rows or logs showing route decisions.
- Proof artifact summarizing pass/fail.
No production user-visible behavior should change during staging.
Validation And Test Matrix
Commands will vary by staging substrate. Baseline readback:
hostname
git -C /opt/teleo-eval/workspaces/deploy-infra rev-parse HEAD
cat /opt/teleo-eval/.last-deploy-sha
systemctl is-active teleo-pipeline teleo-diagnostics teleo-auto-deploy.timer
systemctl list-timers | grep -E 'teleo|sync|extract|research' || true
curl -s localhost:8080/health | python3 -m json.tool
journalctl -u teleo-pipeline --since "1 hour ago" --no-pager
sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select max(version) from schema_version;"
sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select number,status,domain,domain_agent,leo_verdict,domain_verdict,auto_merge,github_pr from prs order by number desc limit 20;"
gh pr list --repo living-ip/decision-engine-sandbox --state all
gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments
Safety checks:
systemctl is-enabled teleo-pipeline
systemctl cat teleo-pipeline
systemctl cat teleo-diagnostics
grep -R "github-admin-token" /opt/teleo-eval/secrets 2>/dev/null
git -C /opt/teleo-eval/workspaces/main remote -v
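The remote check can be automated by parsing the captured output of git remote -v. This sketch assumes the intent is that every remote in the workspace points at the sandbox target named in this spec; the function name non_sandbox_remotes is hypothetical.

```python
from typing import List, Tuple

SANDBOX_SLUG = "living-ip/decision-engine-sandbox"


def non_sandbox_remotes(remote_v_output: str, sandbox_slug: str = SANDBOX_SLUG) -> List[Tuple[str, str]]:
    """Return (name, url) pairs from `git remote -v` output that are not the sandbox repo."""
    offenders = set()
    for line in remote_v_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, url = parts[0], parts[1]
        normalized = url.rstrip("/")
        if normalized.endswith(".git"):
            normalized = normalized[: -len(".git")]
        # Accept both https (".../owner/repo") and ssh ("host:owner/repo") forms.
        is_sandbox = normalized.endswith("/" + sandbox_slug) or normalized.endswith(":" + sandbox_slug)
        if not is_sandbox:
            offenders.add((name, url))
    return sorted(offenders)
```

Each remote appears twice in the output (fetch and push), so results are de-duplicated. Any non-empty result should block the staging daemon from starting.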
CI/CD, Release, And Pre-Push Gate Contract
Before staging:
- Code PR has passed local tests.
- Sandbox target selected.
- Staging secrets prepared.
Before production:
- Staging proof artifact exists.
- Exact SHA to deploy is recorded.
- Rollback path is recorded.
- Leo approval/signoff for the exact reviewed SHA is recorded.
- The production cutover avoids direct agent self-edits on the VPS.
Independent CLI Audit Contract
Auditor should verify:
- Staging host is not production.
- Production services were disabled before test daemon start.
- GitHub target is sandbox.
- Proof artifact PR IDs belong to sandbox repo.
- Logs show no production mutation.
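Some of these audit points can be checked from the proof artifact alone. The sketch below covers three of them, assuming the auditor has separately collected the sandbox PR numbers (for example from gh pr list with --state all and --json number); the name audit_findings is hypothetical. Service ordering and log review still need manual readback.

```python
from typing import Dict, Iterable, List

SANDBOX_REPO = "living-ip/decision-engine-sandbox"


def audit_findings(
    artifact: Dict,
    sandbox_pr_numbers: Iterable[int],
    sandbox_repo: str = SANDBOX_REPO,
) -> List[str]:
    """Return audit findings derivable from the artifact; empty means none found."""
    findings = []
    env = artifact.get("environment") or {}
    # The staging host must not be the production host it was cloned from.
    if env.get("host") and env.get("host") == env.get("created_from_prod_host"):
        findings.append("staging host equals production host")
    if artifact.get("decision_engine_target") != sandbox_repo:
        findings.append("decision_engine_target is not the sandbox repo")
    # Every PR recorded in the artifact must exist in the sandbox repo.
    known = set(sandbox_pr_numbers)
    for case in artifact.get("test_cases") or []:
        if case.get("pr") not in known:
            findings.append(f"PR {case.get('pr')} for case {case.get('case')} not found in sandbox repo")
    return findings
```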
Outside-The-Box Fix Paths
If Hetzner snapshot clone is too risky:
- Use Crabbox with a synced checkout and fake/sandbox services.
- Use a fresh Hetzner server and repo checkout instead of disk clone.
- Use local fake GitHub/Forgejo API for pure pipeline proof.
Substrate guidance:
- Prefer a Hetzner snapshot clone for canonical staging proof because it exercises /opt/teleo-eval, systemd units, timers, runtime user ownership, SQLite path assumptions, and deploy scripts.
- Crabbox is acceptable and preferred long-term as disposable_remote proof for command/test execution, but it does not count as VPS-clone fidelity unless it recreates the same unit files, runtime paths, service user, database path, and deploy flow.
- A local fake GitHub/Forgejo API can prove parser and state logic, but it cannot close the staging acceptance gate for real GitHub comments.
If inference spend is a concern:
- Mock agent review responses in staging.
- Use a staging-specific cheap model.
- Run only one real model call after mocked proof passes.
Maintenance Capture
Beneficial now:
- Add a reusable proof/phase1b script later if manual staging repeats.
- Record exact service and config readback.
Avoid now:
- Building a full deployment platform.
- Giving Crabbox or staging production secrets.
- Replacing production with staging server.
Parallelization And Fanout
Classification: draft_gated.
This can be delegated to Fwaz or the infrastructure owner once the code PR exists.
Worker-ready prompt:
run phase 1b staging proof without mutating production. provision or clone a staging box, disable production services and secrets before starting the daemon, point the runtime at a sandbox decision-engine repo, enable phase 1b routing, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not touch production prs or production secrets.
Acceptance Criteria
- Staging is isolated.
- Seven sandbox PR cases run.
- Required agents match expected matrix.
- Verdicts are parseable.
- Proof artifact exists.
- Staging is stopped or destroyed after proof.
Readiness And Claim Boundaries
Allowed claim:
- "Phase 1b passed staging proof."
Forbidden claim:
- "Production Phase 1b is complete."
Spec Quality Self-Audit
All required execution-grade headings are present. Exact production commands remain unknown until VPS truth is read back.
Assistant-Added Caveats
Crabbox is useful here only as a disposable staging/test substrate. It should not receive production secrets until there is a deliberate security review.