teleo-infrastructure/schemas/teleo-agent-research-eval-v1.md

# Teleo Agent Research Eval Schema v1

Apply this schema after `teleo-agent-graph-v1.sql`.

This schema records how Leo and other agents answer research requests: which
tools they choose, which sources they cite, and whether benchmark cases pass.
It is operational/economic telemetry, not the claim/evidence graph itself.

## Design Commitments

- The graph schema remains the knowledge spine: persona, strategy, beliefs,
  claims, evidence, graph evals, and cascades.
- Research-eval rows explain how a request was handled and whether the route was
  good enough to trust or ship.
- Payment funds work. It does not directly mutate claims, confidence, beliefs,
  or rewards.
- Tool-use benchmarking must distinguish candidates, selected tools, executed
  tools, skipped tools, and rejected tools.
- Secrets and private payloads are never stored. Tables store hashes, redacted
  excerpts, proof references, source metadata, and receipt ids.
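
The last commitment can be sketched as follows. This is a minimal illustration, not the project's actual helper code: the function names `content_hash` and `redacted_excerpt`, the column names `prompt_hash` and `prompt_excerpt`, the SHA-256 choice, and the secret-matching pattern are all assumptions. Only `secret_values_included` comes from this schema.

```python
import hashlib
import re

# Hypothetical pattern for strings that look like API keys or bearer tokens.
SECRET_PATTERN = re.compile(r"(sk-[A-Za-z0-9]{8,}|Bearer\s+\S+)")


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted_excerpt(text: str, max_len: int = 80) -> str:
    """Mask secret-looking substrings, then truncate to max_len characters."""
    masked = SECRET_PATTERN.sub("[REDACTED]", text)
    return masked[:max_len]


prompt = "Use key sk-abcdef1234567890 to look up Ranger status"

# Only the hash and the redacted excerpt are kept; the raw prompt is not stored.
row = {
    "prompt_hash": content_hash(prompt),        # assumed column name
    "prompt_excerpt": redacted_excerpt(prompt), # assumed column name
    "secret_values_included": 0,
}
```

A pattern-based mask like this is only a first line of defence; a real deployment would likely need a broader redaction policy.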

## Main Tables

| Table | Purpose |
| --- | --- |
| `agent_research_runs` | One row per research request from Telegram, API, checkout, CLI, or benchmark. |
| `agent_tool_invocations` | One row per candidate, selected, executed, skipped, rejected, fallback, or failed tool decision. |
| `agent_research_sources` | Retrieved or cited source rows tied to a run and optionally a tool invocation. |
| `agent_eval_cases` | Versioned benchmark prompts, expected routes/providers, tool constraints, tags, and rubrics. |
| `agent_eval_results` | Per-case result, routing correctness, tool score, source quality, groundedness, cost, and safety scores. |
| `work_order_graph_links` | Links sponsored work orders to research runs, tool traces, graph evals, evidence, claims, and outcomes. |

## Leo x402 Research Flow

```text
Telegram/API question
-> agent_research_runs
-> agent_tool_invocations
-> agent_research_sources
-> agent_eval_results when a benchmark case applies
-> work_order_graph_links when a paid work order or graph artifact is involved
```

For paid research, `agent_research_runs.sponsored_work_order_id` and
`payment_receipt_id` carry the external work-order/payment anchors. The payment
receipt table is still owned by the economic/payment layer; this schema only
keeps references.
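
The flow above might look like this against an in-memory SQLite database. This is a sketch under stated assumptions: the table names, `sponsored_work_order_id`, and `payment_receipt_id` come from this document, while every other column (`channel`, `prompt_hash`, `seq`, `tool_name`, `decision`, `url`) and all sample values are hypothetical placeholders for whatever the real DDL defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,            -- assumed column
    prompt_hash TEXT NOT NULL,        -- assumed column
    sponsored_work_order_id TEXT,     -- external anchor, reference only
    payment_receipt_id TEXT           -- external anchor, reference only
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    seq INTEGER NOT NULL,             -- assumed column
    tool_name TEXT NOT NULL,          -- assumed column
    decision TEXT NOT NULL            -- assumed column
);
CREATE TABLE agent_research_sources (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_invocation_id INTEGER REFERENCES agent_tool_invocations(id),
    url TEXT NOT NULL                 -- assumed column
);
""")

# Step 1: the incoming question becomes one research-run row.
run_id = conn.execute(
    "INSERT INTO agent_research_runs "
    "(channel, prompt_hash, sponsored_work_order_id, payment_receipt_id) "
    "VALUES (?, ?, ?, ?)",
    ("telegram", "sha256:placeholder", "wo-001", "rcpt-001"),
).lastrowid

# Step 2: every tool decision is recorded, including tools that were not run.
invocation_ids = {}
for seq, tool, decision in [(1, "market_data", "rejected"),
                            (2, "web_search", "executed")]:
    cur = conn.execute(
        "INSERT INTO agent_tool_invocations (run_id, seq, tool_name, decision) "
        "VALUES (?, ?, ?, ?)",
        (run_id, seq, tool, decision),
    )
    invocation_ids[tool] = cur.lastrowid

# Step 3: sources are tied to the run and, optionally, to the producing tool.
conn.execute(
    "INSERT INTO agent_research_sources (run_id, tool_invocation_id, url) "
    "VALUES (?, ?, ?)",
    (run_id, invocation_ids["web_search"], "https://example.com/report"),
)
conn.commit()

executed = [
    r[0] for r in conn.execute(
        "SELECT tool_name FROM agent_tool_invocations "
        "WHERE run_id = ? AND decision = 'executed' ORDER BY seq",
        (run_id,),
    )
]
```

The run row holds only the work-order and receipt identifiers; nothing from the payment layer itself is copied in.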

## Ranger Liquidation Guard

The Ranger benchmark class should be represented as:

- `agent_eval_cases.expected_route = 'web_search'`
- `agent_eval_cases.tags_json` includes `ranger_liquidated`
- `agent_eval_cases.must_not_use_tools_json` includes market-data-only routes
- `agent_tool_invocations` records market data as `rejected` or `skipped` when
  it is not the right tool
- `agent_eval_results.routing_correct = 1` only if Leo routed to source-backed
  research instead of live-token valuation

This ensures that "Ranger is liquidated/gone" is verified before any valuation
framing, and that the request is never silently treated as an ordinary
live-token fair-value question.
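
One way to derive `routing_correct` from the recorded tool decisions is sketched below. The function name, the `tool_name` and `decision` keys, the tool name `market_data`, and the rule that the expected route must appear among the executed tools are assumptions for illustration; the actual scoring logic may differ.

```python
import json


def routing_correct(case: dict, invocations: list) -> int:
    """Return 1 if the run honoured the case's routing constraints, else 0."""
    forbidden = set(json.loads(case["must_not_use_tools_json"]))
    executed = {
        inv["tool_name"] for inv in invocations if inv["decision"] == "executed"
    }
    # Executing any forbidden tool fails the case outright.
    if executed & forbidden:
        return 0
    # Otherwise the expected route must actually have been executed.
    return 1 if case["expected_route"] in executed else 0


ranger_case = {
    "expected_route": "web_search",
    "tags_json": json.dumps(["ranger_liquidated"]),
    "must_not_use_tools_json": json.dumps(["market_data"]),
}

# Market data was considered but rejected, then web search was executed.
good_run = [
    {"tool_name": "market_data", "decision": "rejected"},
    {"tool_name": "web_search", "decision": "executed"},
]

# Market data was executed as if Ranger were a live token.
bad_run = [{"tool_name": "market_data", "decision": "executed"}]
```

Under these assumptions, `routing_correct(ranger_case, good_run)` yields `1` and `routing_correct(ranger_case, bad_run)` yields `0`.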

## Minimum Invariants

- No row may set `secret_values_included = 1`.
- A benchmark result must link to both an eval case and a research run.
- Tool invocation sequence numbers are unique per research run.
- Scores are bounded between `0` and `1`.
- Research runs store prompt and answer hashes plus optional redacted excerpts,
  not raw private prompts.
- `outcome_observations` remains the downstream business-value layer; raw tool
  traces belong here, not there.
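
Several of these invariants can be enforced declaratively. The sketch below assumes SQLite and uses hypothetical column names (`seq`, `case_id`, `run_id`, `tool_score`); it is not the project's DDL, only an illustration of how `CHECK`, `UNIQUE`, and `NOT NULL` foreign keys could encode the rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    secret_values_included INTEGER NOT NULL DEFAULT 0
        CHECK (secret_values_included = 0)
);
CREATE TABLE agent_eval_cases (id INTEGER PRIMARY KEY);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    seq INTEGER NOT NULL,
    UNIQUE (run_id, seq)              -- sequence numbers unique per run
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    case_id INTEGER NOT NULL REFERENCES agent_eval_cases(id),
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_score REAL CHECK (tool_score BETWEEN 0 AND 1)
);
""")


def violates(sql: str, params: tuple = ()) -> bool:
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        return True
    return False


conn.execute("INSERT INTO agent_research_runs (id) VALUES (1)")
conn.execute("INSERT INTO agent_eval_cases (id) VALUES (1)")
conn.execute("INSERT INTO agent_tool_invocations (run_id, seq) VALUES (1, 1)")

secret_rejected = violates(
    "INSERT INTO agent_research_runs (id, secret_values_included) VALUES (2, 1)")
duplicate_seq_rejected = violates(
    "INSERT INTO agent_tool_invocations (run_id, seq) VALUES (1, 1)")
score_rejected = violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 1, 1.5)")
orphan_result_rejected = violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 99, 0.5)")
valid_result_accepted = not violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 1, 0.5)")
```

Here `BETWEEN 0 AND 1` treats both bounds as inclusive, which is an assumption about how the score range is meant to be read.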