twentyOne2x 7390e1e843 Implement phase 1b agent routing

2026-05-29 14:00:13 +02:00


Phase 1b Child Spec: Staging Proof

Created: 2026-05-29
Status: active draft
Parent spec: docs/phase1b-agent-routing-spec.md

Product Outcome Contract

Phase 1b must be tested without mutating the production VPS or production decision-engine PRs. A staging clone or disposable remote test box must prove routing, verdict posting, and merge or feedback behavior against a sandbox target before production cutover.

Goal

Define the staging proof path for Phase 1b: provision an isolated production-like runtime, disable production authority, run six single-domain PR cycles plus one cross-domain PR cycle, save a machine-readable proof artifact, then destroy or shut down the staging environment.

Non-Goals

Do not mutate production PRs.
Do not use production GitHub tokens in staging.
Do not prove 24-hour production stability.
Do not promote a mutated staging server as production.
Do not test payment, wallet, Twitter, or mainnet flows.

Current Implementation Audit

Known repo truth:

systemd/teleo-pipeline.service defines the production-style pipeline service.
deploy/ contains deployment and mirror scripts.
docs/ARCHITECTURE.md documents VPS path assumptions and SQLite state.
docs/INFRASTRUCTURE.md documents production as Hetzner 77.42.65.182, root path /opt/teleo-eval, diagnostics on port 8081, and health on port 8080.
deploy/auto-deploy.sh pulls from /opt/teleo-eval/workspaces/deploy-infra, syncs code into runtime paths, restarts changed Python services, and updates /opt/teleo-eval/.last-deploy-sha after smoke checks.
systemd/teleo-pipeline.service expects /opt/teleo-eval/pipeline/fix-ownership.sh, while this repo stores that script under deploy/fix-ownership.sh; staging bootstrap must verify the live runtime path before assuming the unit works.
handoff/phase1-step3-script-migration.md documents GitHub migration posture and decision-engine target for scripts.
handoff/deprecated/eval-scripts.md confirms old eval scripts are dead.
Fwaz described the current production update path as pull -> services recognize pull -> edit on VPS -> PR to Leo; staging proof must treat that as an unsafe legacy behavior to replace, not as a release gate.
Fwaz approved Crabbox as the long-term disposable staging/test-box direction.

Unknown production truth:

Exact current deployed SHA.
Whether production service files match this repo.
Whether production still points at Forgejo in the live daemon.
Exact restart/deploy commands used by Fwaz or agents.
Current secrets layout.
Current systemctl cat output for teleo-pipeline, teleo-diagnostics, auto-deploy timers, cron-like research jobs, Telegram-related services, and any agent daemons.
Whether production has uncommitted hotfixes, generated scripts, or local service patches under /opt/teleo-eval.
Read-only live access is not available in this workspace; the infrastructure audit attempted SSH readback and hit authentication denial, so no production SHA or service state should be claimed from this spec.

Existing-Spec Inventory

| Existing doc | Relevance | Decision |
| --- | --- | --- |
| `docs/phase1b-agent-routing-spec.md` | Parent proof requirements. | Reuse. |
| `docs/ARCHITECTURE.md` | VPS topology and service assumptions. | Reuse with current-readback requirement. |
| `systemd/teleo-pipeline.service` | Service command template. | Reuse as staging baseline. |
| `handoff/phase1-step3-script-migration.md` | GitHub `decision-engine` target context. | Reuse. |

Goal-Vs-Repo-Truth Diff

Goal:

Staging proof runs against sandbox decision-engine.
Production services and secrets are disabled before test daemon starts.
Proof artifact captures routes, verdicts, final PR states, SHAs, DB schema, feature flags, and logs.

Repo truth:

Staging automation does not exist.
No proof script exists for seven PR cases.
No machine-readable Phase 1b proof schema exists outside the umbrella spec.

Completion Percent And Remaining Delta

Current completion: 0 percent.

Remaining delta:

Choose staging substrate: Hetzner snapshot clone, Crabbox, or another disposable test box.
Define sandbox repo.
Define staging secrets.
Write or run proof sequence.
Retain proof artifact.
Confirm staging cannot mutate production.

Closure, Endpoint, And Deployment Truth

Staging closure means:

Staging environment is isolated.
Sandbox PRs are created and processed.
Required reviewer verdicts appear in PR comments.
Pipeline state transitions match expected behavior.
Proof artifact exists.

Production closure is separate and requires exact reviewed SHA deployment plus 24-hour readback.

Critical Assumptions And Invalidators

Assumptions:

A VPS snapshot or disposable equivalent can run the pipeline.
Production secrets can be removed or replaced before daemon start.
A sandbox GitHub repo can be used.
The proof can run without real production inference spend, or spend is explicitly approved.

Invalidators:

Clone boots production services before quarantine.
Sandbox target cannot receive PRs/comments.
No operator has cloud or VPS access.
Secrets cannot be separated from production.
Service paths on production are materially different from repo docs.

State And Truth Contract

The proof artifact should be written on the staging host first, then copied back into the PR or a retained artifact location. Suggested filename:

proof/phase1b-staging-proof-YYYYMMDD-HHMMSS.json
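
A minimal sketch of how the suggested filename could be generated. The function name `proof_artifact_path` is hypothetical, and the use of UTC is an assumption, since this spec does not name a timezone for the timestamp.

```python
from datetime import datetime, timezone
from typing import Optional


def proof_artifact_path(now: Optional[datetime] = None) -> str:
    """Build proof/phase1b-staging-proof-YYYYMMDD-HHMMSS.json from a timestamp."""
    # Assumption: timestamps are taken in UTC; the spec does not specify one.
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%d-%H%M%S")
    return f"proof/phase1b-staging-proof-{stamp}.json"


if __name__ == "__main__":
    print(proof_artifact_path())
```

Passing an explicit `now` keeps the name reproducible when the artifact is regenerated from saved readback.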

Required JSON fields:

{
  "phase": "1b",
  "schema_version": 1,
  "environment": {
    "kind": "hetzner_snapshot|crabbox|disposable_remote",
    "host": "...",
    "snapshot_id": "...",
    "created_from_prod_host": "77.42.65.182"
  },
  "teleo_infrastructure_sha": "...",
  "decision_engine_target": "living-ip/decision-engine-sandbox",
  "pipeline_db_schema": 26,
  "feature_flags": {"PHASE1B_AGENT_ROUTING_ENABLED": "true"},
  "safety": {
    "prod_services_disabled": true,
    "prod_timers_disabled": true,
    "prod_crons_disabled": true,
    "prod_secrets_removed": true,
    "auto_merge_constrained": true
  },
  "test_cases": [],
  "verification_outputs": {
    "service_status_path": "...",
    "journal_excerpt_path": "...",
    "db_snapshot_path": "...",
    "github_comments_path": "..."
  },
  "rollback": {
    "production_sha_before": "...",
    "candidate_sha": "...",
    "rollback_command": "..."
  },
  "created_at": "..."
}

Each test case:

{
  "case": "internet-finance",
  "pr": 12,
  "required_agents": ["Rio"],
  "posted_verdicts": {"Rio": "approve"},
  "final_state": "approved",
  "route_kind": "single"
}
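
The field lists above could be checked mechanically before an artifact is accepted. The following is a sketch only: the key sets are copied from the two JSON examples, while the function name `validate_artifact` and the rule that every `safety` flag must be literally `true` are assumptions, not an existing schema in the repo.

```python
# Key sets copied from the JSON examples in this spec.
REQUIRED_TOP_LEVEL = {
    "phase", "schema_version", "environment", "teleo_infrastructure_sha",
    "decision_engine_target", "pipeline_db_schema", "feature_flags",
    "safety", "test_cases", "verification_outputs", "rollback", "created_at",
}
REQUIRED_SAFETY = {
    "prod_services_disabled", "prod_timers_disabled", "prod_crons_disabled",
    "prod_secrets_removed", "auto_merge_constrained",
}
REQUIRED_CASE = {
    "case", "pr", "required_agents", "posted_verdicts", "final_state", "route_kind",
}


def validate_artifact(artifact):
    """Return a list of human-readable problems; an empty list means no problems found."""
    errors = []
    for key in sorted(REQUIRED_TOP_LEVEL - artifact.keys()):
        errors.append(f"missing top-level field: {key}")

    # Assumption: a valid proof requires every safety flag to be exactly True.
    safety = artifact.get("safety", {})
    for key in sorted(REQUIRED_SAFETY):
        if safety.get(key) is not True:
            errors.append(f"safety.{key} is not true")

    for index, case in enumerate(artifact.get("test_cases", [])):
        for key in sorted(REQUIRED_CASE - case.keys()):
            errors.append(f"test_cases[{index}] missing field: {key}")
    return errors
```

Returning a list of problems rather than raising on the first one lets the operator see every gap in a single pass over the artifact.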

Measurement Contract

Minimum staging cases:

grand strategy -> Leo
ai systems or ai alignment -> Theseus
internet finance -> Rio
health -> Vida
entertainment -> Clay
space, robotics, energy, or advanced manufacturing -> Astra
cross-domain ai plus x402 -> Theseus and Rio
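
The matrix above can be written down as data so that a proof script compares routes against one source of truth. This is a sketch under assumptions: only the `internet-finance` slug appears in this spec, the other case slugs are hypothetical, and `route_matches` is an illustrative name rather than a function in the pipeline.

```python
# Expected required agents per staging case.
# Assumption: every slug except "internet-finance" is hypothetical.
EXPECTED_ROUTES = {
    "grand-strategy": {"Leo"},
    "ai-systems": {"Theseus"},
    "internet-finance": {"Rio"},
    "health": {"Vida"},
    "entertainment": {"Clay"},
    "space": {"Astra"},
    "cross-domain-ai-x402": {"Theseus", "Rio"},
}


def route_matches(case, required_agents):
    """True when the routed agents equal the expected set, ignoring order."""
    expected = EXPECTED_ROUTES.get(case)
    # Unknown cases never match, so a typo in a slug fails loudly.
    return expected is not None and expected == set(required_agents)


if __name__ == "__main__":
    print(route_matches("cross-domain-ai-x402", ["Rio", "Theseus"]))
```

Comparing sets rather than lists means the cross-domain case passes regardless of the order in which the two agents are recorded.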

Pass criteria:

7 of 7 route decisions match expected required agents.
7 of 7 PRs receive parseable verdict comments.
No production repo receives comments.
No production service remains enabled during staging run.
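
The four pass criteria could be evaluated from the recorded test cases roughly as follows. This is a sketch, not the pipeline's own logic: `evaluate_pass_criteria` and its parameters are hypothetical, and the set of verdict strings treated as parseable is an assumption, because this spec only shows `approve`.

```python
# Assumption: which verdict strings count as parseable; only "approve"
# appears in this spec.
ALLOWED_VERDICTS = {"approve", "request_changes"}


def evaluate_pass_criteria(test_cases, expected_agents,
                           prod_comment_count, prod_services_enabled):
    """Summarise the four pass criteria for a staging run.

    expected_agents maps case name -> iterable of agent names.
    prod_services_enabled is a list of production units still enabled.
    """
    routes_ok = 0
    verdicts_ok = 0
    for case in test_cases:
        required = set(case["required_agents"])
        expected = expected_agents.get(case["case"])
        if expected is not None and required == set(expected):
            routes_ok += 1
        verdicts = case.get("posted_verdicts", {})
        # Every required agent must have posted a recognised verdict.
        if required and all(verdicts.get(a) in ALLOWED_VERDICTS for a in required):
            verdicts_ok += 1

    result = {
        "routes_matched": routes_ok,
        "verdicts_parseable": verdicts_ok,
        "no_production_comments": prod_comment_count == 0,
        "no_production_services_enabled": not prod_services_enabled,
    }
    result["passed"] = (
        len(test_cases) == 7
        and routes_ok == 7
        and verdicts_ok == 7
        and result["no_production_comments"]
        and result["no_production_services_enabled"]
    )
    return result
```

Reporting the two counts alongside the overall flag makes a 6 of 7 run easy to diagnose without rereading logs.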

Backend Work Required

Owned surfaces:

Staging host.
Sandbox repo.
Staging env/config.
Proof artifact generator or manual proof script.

Implementation steps:

Snapshot or provision staging environment.
Block public/prod access.
Disable production services.
Remove production secrets.
Set hostname to staging.
Configure sandbox target.
Deploy exact implementation SHA.
Enable Phase 1b feature flag.
Create seven sandbox PRs.
Run pipeline until verdicts and states are visible.
Save proof artifact.
Shut down or destroy staging.

Frontend Work Required

None.

Expected Runtime And User-Visible Behavior

Operator sees:

Staging service status.
Sandbox PR comments with agent verdict tags.
SQLite rows or logs showing route decisions.
Proof artifact summarizing pass/fail.

No production user-visible behavior should change during staging.

Validation And Test Matrix

Commands will vary by staging substrate. Baseline readback:

hostname
git -C /opt/teleo-eval/workspaces/deploy-infra rev-parse HEAD
cat /opt/teleo-eval/.last-deploy-sha
systemctl is-active teleo-pipeline teleo-diagnostics teleo-auto-deploy.timer
systemctl list-timers | grep -E 'teleo|sync|extract|research' || true
curl -s localhost:8080/health | python3 -m json.tool
journalctl -u teleo-pipeline --since "1 hour ago" --no-pager
sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select max(version) from schema_version;"
sqlite3 /opt/teleo-eval/pipeline/pipeline.db "select number,status,domain,domain_agent,leo_verdict,domain_verdict,auto_merge,github_pr from prs order by number desc limit 20;"
gh pr list --repo living-ip/decision-engine-sandbox --state all
gh pr view --repo living-ip/decision-engine-sandbox PR_NUMBER --comments

Safety checks:

systemctl is-enabled teleo-pipeline
systemctl cat teleo-pipeline
systemctl cat teleo-diagnostics
grep -R "github-admin-token" /opt/teleo-eval/secrets 2>/dev/null
git -C /opt/teleo-eval/workspaces/main remote -v
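
The `git remote -v` readback can be checked automatically instead of by eye. The sketch below parses that output and confirms every remote resolves to the sandbox slug used elsewhere in this spec. The function names are hypothetical, and it is an assumption that the staging workspace should have no remotes other than the sandbox.

```python
def remote_slugs(remote_v_output):
    """Extract owner/repo slugs from `git remote -v` output."""
    slugs = set()
    for line in remote_v_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        url = parts[1]
        if url.endswith(".git"):
            url = url[:-4]
        # Normalise scp-style URLs (git@host:owner/repo) to slash-separated.
        tail = url.replace(":", "/").rstrip("/").split("/")[-2:]
        slugs.add("/".join(tail))
    return slugs


def only_sandbox_remotes(remote_v_output,
                         sandbox_slug="living-ip/decision-engine-sandbox"):
    """True only when at least one remote exists and all point at the sandbox."""
    slugs = remote_slugs(remote_v_output)
    return bool(slugs) and slugs == {sandbox_slug}
```

An empty remote list is treated as a failure, so a workspace with no configured remote cannot pass the check by accident.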

CI/CD, Release, And Pre-Push Gate Contract

Before staging:

Code PR has passed local tests.
Sandbox target selected.
Staging secrets prepared.

Before production:

Staging proof artifact exists.
Exact SHA to deploy is recorded.
Rollback path is recorded.
Leo approval/signoff for the exact reviewed SHA is recorded.
The production cutover avoids direct agent self-edits on the VPS.

Independent CLI Audit Contract

Auditor should verify:

Staging host is not production.
Production services were disabled before test daemon start.
GitHub target is sandbox.
Proof artifact PR IDs belong to sandbox repo.
Logs show no production mutation.
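
For the PR-ID check, an auditor could compare the artifact against the output of `gh pr list --repo living-ip/decision-engine-sandbox --state all --json number`. The sketch below assumes that JSON has been saved as a string; `foreign_prs` is a hypothetical helper name.

```python
import json


def foreign_prs(artifact, gh_pr_list_json):
    """Return artifact PR numbers that are absent from the sandbox PR list."""
    # `gh pr list --json number` emits a JSON array of {"number": N} objects.
    sandbox_numbers = {pr["number"] for pr in json.loads(gh_pr_list_json)}
    return sorted(
        case["pr"]
        for case in artifact["test_cases"]
        if case["pr"] not in sandbox_numbers
    )


if __name__ == "__main__":
    artifact = {"test_cases": [{"case": "internet-finance", "pr": 12}]}
    print(foreign_prs(artifact, '[{"number": 12}, {"number": 13}]'))
```

An empty result is necessary but not sufficient: PR numbers can coincide across repositories, so this check complements the comment and log readback rather than replacing it.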

Outside-The-Box Fix Paths

If Hetzner snapshot clone is too risky:

Use Crabbox with a synced checkout and fake/sandbox services.
Use a fresh Hetzner server and repo checkout instead of disk clone.
Use local fake GitHub/Forgejo API for pure pipeline proof.

Substrate guidance:

Prefer a Hetzner snapshot clone for canonical staging proof because it exercises /opt/teleo-eval, systemd units, timers, runtime user ownership, SQLite path assumptions, and deploy scripts.
Crabbox is acceptable, and preferred long-term, as a disposable_remote substrate for command/test execution, but it does not provide VPS-clone fidelity unless it recreates the same unit files, runtime paths, service user, database path, and deploy flow.
A local fake GitHub/Forgejo API can prove parser and state logic, but it cannot close the staging acceptance gate for real GitHub comments.

If inference spend is a concern:

Mock agent review responses in staging.
Use a staging-specific cheap model.
Run only one real model call after mocked proof passes.

Maintenance Capture

Beneficial now:

Add a reusable proof/phase1b script later if manual staging repeats.
Record exact service and config readback.

Avoid now:

Building a full deployment platform.
Giving Crabbox or staging production secrets.
Replacing production with the staging server.

Parallelization And Fanout

Classification: draft_gated.

This can be delegated to Fwaz or the infrastructure owner after code PR exists.

Worker-ready prompt:

run phase 1b staging proof without mutating production. provision or clone a staging box, disable production services and secrets before starting the daemon, point the runtime at a sandbox decision-engine repo, enable phase 1b routing, run six single-domain prs plus one cross-domain pr, and save a machine-readable proof artifact. do not touch production prs or production secrets.

Acceptance Criteria

Staging is isolated.
Seven sandbox PR cases run.
Required agents match expected matrix.
Verdicts are parseable.
Proof artifact exists.
Staging is stopped or destroyed after proof.

Readiness And Claim Boundaries

Allowed claim:

"Phase 1b passed staging proof."

Forbidden claim:

"Production Phase 1b is complete."

Spec Quality Self-Audit

All required execution-grade headings are present. Exact production commands remain unknown until VPS truth is read back.

Assistant-Added Caveats

Crabbox is useful here only as a disposable staging/test substrate. It should not receive production secrets until there is a deliberate security review.
