Adds graph schema prerequisite plus research-eval schema/docs/tests for Leo tool-use benchmarks and x402 research telemetry. Validated by full local pytest and green CI.
# Teleo Agent Research Eval Schema v1

Apply this schema after `teleo-agent-graph-v1.sql`.

This schema records how Leo and other agents answer research requests, which
tools they choose, what sources they cite, and whether benchmark cases passed.
It is operational/economic telemetry, not the claim/evidence graph itself.

## Design Commitments

- The graph schema remains the knowledge spine: persona, strategy, beliefs,
  claims, evidence, graph evals, and cascades.
- Research-eval rows explain how a request was handled and whether the route was
  good enough to trust or ship.
- Payment funds work. It does not directly mutate claims, confidence, beliefs,
  or rewards.
- Tool-use benchmarking must distinguish candidates, selected tools, executed
  tools, skipped tools, and rejected tools.
- Secrets and private payloads are never stored. Tables store hashes, redacted
  excerpts, proof references, source metadata, and receipt ids.
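
The "hashes plus redacted excerpts" commitment can be illustrated with a minimal Python sketch. The helper names `prompt_hash` and `redacted_excerpt`, the choice of SHA-256, and the `SECRET_PATTERN` regex are all assumptions made for illustration; the schema itself does not prescribe a hash function or a redaction rule.

```python
import hashlib
import re

# Hypothetical pattern: masks simple key=value style secrets only.
# Real secret detection would need to be considerably broader.
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|secret)\s*[:=]\s*\S+")


def prompt_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted_excerpt(text: str, limit: int = 80) -> str:
    """Mask key=value style secrets, then truncate to `limit` characters."""
    masked = SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return masked[:limit]


raw_prompt = "api_key=sk-123 Is Ranger still trading?"
stored = {
    "prompt_hash": prompt_hash(raw_prompt),        # stable fingerprint
    "prompt_excerpt": redacted_excerpt(raw_prompt),  # safe to persist
}
```

Only the digest and the masked excerpt would be written to a row under this sketch; the raw prompt never leaves process memory.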

## Main Tables

| Table | Purpose |
| --- | --- |
| `agent_research_runs` | One row per research request from Telegram, API, checkout, CLI, or benchmark. |
| `agent_tool_invocations` | One row per candidate, selected, executed, skipped, rejected, fallback, or failed tool decision. |
| `agent_research_sources` | Retrieved or cited source rows tied to a run and optionally a tool invocation. |
| `agent_eval_cases` | Versioned benchmark prompts, expected routes/providers, tool constraints, tags, and rubrics. |
| `agent_eval_results` | Per-case result, routing correctness, tool score, source quality, groundedness, cost, and safety scores. |
| `work_order_graph_links` | Links sponsored work orders to research runs, tool traces, graph evals, evidence, claims, and outcomes. |

## Leo x402 Research Flow

```text
Telegram/API question
-> agent_research_runs
-> agent_tool_invocations
-> agent_research_sources
-> agent_eval_results when a benchmark case applies
-> work_order_graph_links when a paid work order or graph artifact is involved
```

For paid research, `agent_research_runs.sponsored_work_order_id` and
`payment_receipt_id` carry the external work-order/payment anchors. The payment
receipt table is still owned by the economic/payment layer; this schema only
keeps references.
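
A paid Telegram request moving through the first three tables might be recorded as in the sketch below, using an in-memory SQLite database. The table names, `sponsored_work_order_id`, and `payment_receipt_id` come from this document; every other column (`channel`, `prompt_hash`, `seq`, `tool_name`, `decision`, `url`), the helper functions, and the sample ids are assumptions, not the real DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,            -- hypothetical column
    prompt_hash TEXT NOT NULL,        -- hypothetical column
    sponsored_work_order_id TEXT,     -- reference only, no payment data
    payment_receipt_id TEXT           -- reference only, table owned elsewhere
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    seq INTEGER NOT NULL,             -- hypothetical column
    tool_name TEXT NOT NULL,          -- hypothetical column
    decision TEXT NOT NULL            -- hypothetical column
);
CREATE TABLE agent_research_sources (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_invocation_id INTEGER REFERENCES agent_tool_invocations(id),
    url TEXT NOT NULL                 -- hypothetical column
);
""")


def record_run(conn, channel, prompt_hash, work_order_id=None, receipt_id=None):
    """Insert one research run and return its row id."""
    cur = conn.execute(
        "INSERT INTO agent_research_runs "
        "(channel, prompt_hash, sponsored_work_order_id, payment_receipt_id) "
        "VALUES (?, ?, ?, ?)",
        (channel, prompt_hash, work_order_id, receipt_id),
    )
    return cur.lastrowid


def record_tool(conn, run_id, seq, tool_name, decision):
    """Insert one tool decision for a run and return its row id."""
    cur = conn.execute(
        "INSERT INTO agent_tool_invocations (run_id, seq, tool_name, decision) "
        "VALUES (?, ?, ?, ?)",
        (run_id, seq, tool_name, decision),
    )
    return cur.lastrowid


def record_source(conn, run_id, url, tool_invocation_id=None):
    """Insert one source row, optionally tied to the tool that produced it."""
    cur = conn.execute(
        "INSERT INTO agent_research_sources (run_id, tool_invocation_id, url) "
        "VALUES (?, ?, ?)",
        (run_id, tool_invocation_id, url),
    )
    return cur.lastrowid


# Sample ids below are placeholders, not real work orders or receipts.
run_id = record_run(conn, "telegram", "sha256:placeholder", "wo-123", "rcpt-456")
search_id = record_tool(conn, run_id, 1, "web_search", "executed")
record_tool(conn, run_id, 2, "market_data", "rejected")
record_source(conn, run_id, "https://example.com/report", search_id)
conn.commit()
```

The run row holds only the two external ids; nothing about the payment itself is copied into this schema.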

## Ranger Liquidation Guard

The Ranger benchmark class should be represented as:

- `agent_eval_cases.expected_route = 'web_search'`
- `agent_eval_cases.tags_json` includes `ranger_liquidated`
- `agent_eval_cases.must_not_use_tools_json` includes market-data-only routes
- `agent_tool_invocations` records market data as `rejected` or `skipped` when
  it is not the right tool
- `agent_eval_results.routing_correct = 1` only if Leo routed to source-backed
  research instead of live-token valuation

This ensures "Ranger is liquidated/gone" is verified before any valuation
framing and never silently treated as a normal live fair-value token question.
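
One way to read the guard as a scoring rule is sketched below. The dict shapes, the `market_data` tool name, and the `routing_correct` function are hypothetical; in particular, treating "a forbidden tool was executed" as an automatic failure is an interpretation of the bullets above, not a rule confirmed by the schema.

```python
import json

# Eval case row, using the column names from this section.
case = {
    "expected_route": "web_search",
    "tags_json": json.dumps(["ranger_liquidated"]),
    "must_not_use_tools_json": json.dumps(["market_data"]),  # assumed tool name
}


def routing_correct(case, selected_route, invocations):
    """Return 1 if the route matches and no forbidden tool executed, else 0."""
    forbidden = set(json.loads(case["must_not_use_tools_json"]))
    executed = {i["tool_name"] for i in invocations if i["decision"] == "executed"}
    ok = selected_route == case["expected_route"] and not (forbidden & executed)
    return 1 if ok else 0


# Market data was considered but rejected: the guard passes.
good_trace = [
    {"tool_name": "market_data", "decision": "rejected"},
    {"tool_name": "web_search", "decision": "executed"},
]
# Market data was executed: the guard fails even though search also ran.
bad_trace = [
    {"tool_name": "market_data", "decision": "executed"},
    {"tool_name": "web_search", "decision": "executed"},
]

good_score = routing_correct(case, "web_search", good_trace)
bad_score = routing_correct(case, "web_search", bad_trace)
```

Recording the rejected `market_data` row, rather than omitting it, is what lets a reviewer see that the wrong tool was considered and deliberately turned down.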

## Minimum Invariants

- No row may set `secret_values_included = 1`.
- A benchmark result must link to both an eval case and a research run.
- Tool invocation sequence numbers are unique per research run.
- Scores are bounded between `0` and `1`.
- Research runs store prompt and answer hashes plus optional redacted excerpts,
  not raw private prompts.
- `outcome_observations` remain the downstream business-value layer; raw tool
  traces belong here, not there.
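
Several of these invariants map naturally onto SQLite constraints. The sketch below is illustrative only: apart from the table names and `secret_values_included`, the columns (`prompt_hash`, `seq`, `case_id`, `run_id`, `tool_score`) and the `rejected` helper are assumptions, and the real schema may enforce the same rules differently (for example with triggers or application checks).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    prompt_hash TEXT NOT NULL,
    secret_values_included INTEGER NOT NULL DEFAULT 0
        CHECK (secret_values_included = 0)          -- never 1
);
CREATE TABLE agent_eval_cases (
    id INTEGER PRIMARY KEY,
    expected_route TEXT NOT NULL
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    seq INTEGER NOT NULL,
    UNIQUE (run_id, seq)                            -- unique per run
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    case_id INTEGER NOT NULL REFERENCES agent_eval_cases(id),
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_score REAL CHECK (tool_score BETWEEN 0 AND 1)
);
""")


def rejected(sql, params=()):
    """Return True if the statement violates a constraint, False if it succeeds."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


# Seed one valid run and one valid eval case for the checks to reference.
conn.execute("INSERT INTO agent_research_runs (prompt_hash) VALUES ('h1')")
conn.execute("INSERT INTO agent_eval_cases (expected_route) VALUES ('web_search')")

secret_blocked = rejected(
    "INSERT INTO agent_research_runs (prompt_hash, secret_values_included) "
    "VALUES ('h2', 1)"
)
first_seq_ok = not rejected(
    "INSERT INTO agent_tool_invocations (run_id, seq) VALUES (1, 1)"
)
duplicate_seq_blocked = rejected(
    "INSERT INTO agent_tool_invocations (run_id, seq) VALUES (1, 1)"
)
score_blocked = rejected(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 1, 1.5)"
)
missing_run_blocked = rejected(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, NULL, 0.5)"
)
```

Pushing the invariants into the database means a misbehaving writer fails loudly at insert time instead of leaving rows that later audits have to detect.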