teleo-infrastructure/schemas/teleo-agent-research-eval-v1.md
twentyOne2x 1a71efcde2
Add Teleo research eval schema
Adds graph schema prerequisite plus research-eval schema/docs/tests for Leo tool-use benchmarks and x402 research telemetry. Validated by full local pytest and green CI.
2026-06-24 14:21:03 +02:00

# Teleo Agent Research Eval Schema v1
Apply this schema after `teleo-agent-graph-v1.sql`.
This schema records how Leo and other agents answer research requests, which
tools they choose, what sources they cite, and whether benchmark cases passed.
It is operational/economic telemetry, not the claim/evidence graph itself.
## Design Commitments
- The graph schema remains the knowledge spine: persona, strategy, beliefs,
claims, evidence, graph evals, and cascades.
- Research-eval rows explain how a request was handled and whether the route was
good enough to trust or ship.
- Payment funds work. It does not directly mutate claims, confidence, beliefs,
or rewards.
- Tool-use benchmarking must distinguish candidates, selected tools, executed
tools, skipped tools, and rejected tools.
- Secrets and private payloads are never stored. Tables store hashes, redacted
excerpts, proof references, source metadata, and receipt ids.
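
The last commitment can be sketched in Python, assuming SHA-256 for hashing and a caller-supplied list of known secret strings. The helper names and the `prompt_hash`, `answer_hash`, and `prompt_excerpt` field names are hypothetical; only `secret_values_included` is named in this document.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted_excerpt(text: str, secrets: list[str], max_chars: int = 80) -> str:
    """Mask known secret strings first, then truncate to a short excerpt."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            text = text.replace(secret, "[REDACTED]")
    return text[:max_chars]


def build_run_fields(prompt: str, answer: str, secrets: list[str]) -> dict:
    """Build the storable fields for a run (field names are assumptions)."""
    return {
        "prompt_hash": sha256_hex(prompt),
        "answer_hash": sha256_hex(answer),
        "prompt_excerpt": redacted_excerpt(prompt, secrets),
        "secret_values_included": 0,  # raw secrets never reach the row
    }
```

Redaction runs before truncation so that a secret straddling the excerpt boundary cannot leak a partial value.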
## Main Tables
| Table | Purpose |
| --- | --- |
| `agent_research_runs` | One row per research request from Telegram, API, checkout, CLI, or benchmark. |
| `agent_tool_invocations` | One row per candidate, selected, executed, skipped, rejected, fallback, or failed tool decision. |
| `agent_research_sources` | Retrieved or cited source rows tied to a run and optionally a tool invocation. |
| `agent_eval_cases` | Versioned benchmark prompts, expected routes/providers, tool constraints, tags, and rubrics. |
| `agent_eval_results` | Per-case result, routing correctness, tool score, source quality, groundedness, cost, and safety scores. |
| `work_order_graph_links` | Links sponsored work orders to research runs, tool traces, graph evals, evidence, claims, and outcomes. |
## Leo x402 Research Flow
```text
Telegram/API question
  -> agent_research_runs
  -> agent_tool_invocations
  -> agent_research_sources
  -> agent_eval_results when a benchmark case applies
  -> work_order_graph_links when a paid work order or graph artifact is involved
```
For paid research, `agent_research_runs.sponsored_work_order_id` and
`payment_receipt_id` carry the external work-order/payment anchors. The payment
receipt table is still owned by the economic/payment layer; this schema only
keeps references.
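
The flow for a paid request might look like the following sketch, which uses an in-memory SQLite database. The table definitions are heavily reduced stand-ins. Apart from `sponsored_work_order_id` and `payment_receipt_id`, every column name, the `market_data` tool name, and the identifier values are assumptions made for illustration.

```python
import sqlite3

# Hypothetical, reduced table definitions; the real schema has more columns.
SCHEMA = """
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,
    prompt_hash TEXT NOT NULL,
    sponsored_work_order_id TEXT,   -- external anchor, reference only
    payment_receipt_id TEXT         -- receipt row lives in the payment layer
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    sequence_no INTEGER NOT NULL,
    tool_name TEXT NOT NULL,
    status TEXT NOT NULL,
    UNIQUE (run_id, sequence_no)    -- sequence numbers unique per run
);
CREATE TABLE agent_research_sources (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_invocation_id INTEGER REFERENCES agent_tool_invocations(id),
    url TEXT NOT NULL
);
"""


def record_paid_run(conn, prompt_hash, work_order_id, receipt_id):
    """Insert one run, two tool decisions, and one cited source."""
    run_id = conn.execute(
        "INSERT INTO agent_research_runs "
        "(channel, prompt_hash, sponsored_work_order_id, payment_receipt_id) "
        "VALUES (?, ?, ?, ?)",
        ("telegram", prompt_hash, work_order_id, receipt_id),
    ).lastrowid
    conn.execute(
        "INSERT INTO agent_tool_invocations "
        "(run_id, sequence_no, tool_name, status) "
        "VALUES (?, 1, 'market_data', 'rejected')",
        (run_id,),
    )
    search_id = conn.execute(
        "INSERT INTO agent_tool_invocations "
        "(run_id, sequence_no, tool_name, status) "
        "VALUES (?, 2, 'web_search', 'executed')",
        (run_id,),
    ).lastrowid
    conn.execute(
        "INSERT INTO agent_research_sources (run_id, tool_invocation_id, url) "
        "VALUES (?, ?, ?)",
        (run_id, search_id, "https://example.com/source"),
    )
    return run_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
run_id = record_paid_run(conn, "0" * 64, "wo_demo_1", "rcpt_demo_1")
```

The run row carries only the two identifiers; nothing about the payment itself is copied into this schema.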
## Ranger Liquidation Guard
The Ranger benchmark class should be represented as:
- `agent_eval_cases.expected_route = 'web_search'`
- `agent_eval_cases.tags_json` includes `ranger_liquidated`
- `agent_eval_cases.must_not_use_tools_json` includes market-data-only routes
- `agent_tool_invocations` records market data as `rejected` or `skipped` when
it is not the right tool
- `agent_eval_results.routing_correct = 1` only if Leo routed to source-backed
research instead of live-token valuation

This guard ensures that "Ranger is liquidated/gone" is verified before any
valuation framing, and that the question is never silently treated as an
ordinary fair-value question about a live token.
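
A minimal sketch of how `routing_correct` could be derived for this benchmark class. It assumes that `expected_route` matches a tool name in the invocation trace, that the JSON columns hold arrays of strings, and that `market_data` is the name of a market-data-only route. All three are assumptions, as is the `routing_correct` helper itself.

```python
import json


def routing_correct(case: dict, invocations: list[dict]) -> int:
    """Return 1 only if the expected route ran and no forbidden tool ran."""
    forbidden = set(json.loads(case["must_not_use_tools_json"]))
    executed = {
        inv["tool_name"] for inv in invocations if inv["status"] == "executed"
    }
    ran_expected = case["expected_route"] in executed
    ran_forbidden = bool(executed & forbidden)
    return 1 if ran_expected and not ran_forbidden else 0


ranger_case = {
    "expected_route": "web_search",
    "tags_json": json.dumps(["ranger_liquidated"]),
    "must_not_use_tools_json": json.dumps(["market_data"]),  # hypothetical name
}

# Market data was considered but rejected; web search actually ran.
good_trace = [
    {"tool_name": "market_data", "status": "rejected"},
    {"tool_name": "web_search", "status": "executed"},
]

# The agent priced the token as if it were live.
bad_trace = [
    {"tool_name": "market_data", "status": "executed"},
]
```

Under these assumptions `good_trace` scores 1 and `bad_trace` scores 0. A trace that executes both tools also scores 0, because a forbidden route ran.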
## Minimum Invariants
- No row may set `secret_values_included = 1`.
- A benchmark result must link to both an eval case and a research run.
- Tool invocation sequence numbers are unique per research run.
- Scores are bounded between `0` and `1`.
- Research runs store prompt and answer hashes plus optional redacted excerpts,
not raw private prompts.
- `outcome_observations` remains the downstream business-value layer; raw tool
traces belong in this schema, not there.
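
Several of these invariants can be enforced declaratively. The sketch below uses SQLite `CHECK` and foreign-key constraints on reduced stand-in tables; the `eval_case_id`, `run_id`, `tool_score`, and `prompt_hash` column names are assumptions, and the score bounds are assumed to be inclusive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    prompt_hash TEXT NOT NULL,
    secret_values_included INTEGER NOT NULL DEFAULT 0
        CHECK (secret_values_included = 0)
);
CREATE TABLE agent_eval_cases (
    id INTEGER PRIMARY KEY,
    expected_route TEXT NOT NULL
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    eval_case_id INTEGER NOT NULL REFERENCES agent_eval_cases(id),
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_score REAL NOT NULL CHECK (tool_score BETWEEN 0 AND 1)
);
""")


def is_rejected(sql: str, params: tuple = ()) -> bool:
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


# Seed one valid run and one valid eval case for results to reference.
conn.execute(
    "INSERT INTO agent_research_runs (id, prompt_hash) VALUES (1, ?)",
    ("0" * 64,),
)
conn.execute(
    "INSERT INTO agent_eval_cases (id, expected_route) VALUES (1, 'web_search')"
)
```

With this setup, a run with `secret_values_included = 1`, a result with a score of `1.5`, and a result pointing at a missing run should all be rejected, while a result with a score of `0.5` that links to the seeded case and run should be accepted.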