teleo-infrastructure/schemas/teleo-agent-research-eval-v1.md
twentyOne2x 1a71efcde2
Add Teleo research eval schema
Adds graph schema prerequisite plus research-eval schema/docs/tests for Leo tool-use benchmarks and x402 research telemetry. Validated by full local pytest and green CI.
2026-06-24 14:21:03 +02:00

Teleo Agent Research Eval Schema v1

Apply this schema after teleo-agent-graph-v1.sql.

This schema records how Leo and other agents answer research requests, which tools they choose, what sources they cite, and whether benchmark cases passed. It is operational/economic telemetry, not the claim/evidence graph itself.

Design Commitments

  • The graph schema remains the knowledge spine: persona, strategy, beliefs, claims, evidence, graph evals, and cascades.
  • Research-eval rows explain how a request was handled and whether the route was good enough to trust or ship.
  • Payment funds work. It does not directly mutate claims, confidence, beliefs, or rewards.
  • Tool-use benchmarking must distinguish candidates, selected tools, executed tools, skipped tools, and rejected tools.
  • Secrets and private payloads are never stored. Tables store hashes, redacted excerpts, proof references, source metadata, and receipt ids.
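
The last commitment can be illustrated with a small helper that turns a raw prompt into storable fields. This is a minimal sketch, not the project's implementation: the field names `prompt_hash` and `prompt_excerpt`, the choice of SHA-256, and the secret-matching pattern are all assumptions made for illustration. Only `secret_values_included` is named in this document.

```python
import hashlib
import re

# Hypothetical pattern for illustration only; a real deployment would need
# a much broader secret detector than "key = value" pairs.
_SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|secret)\s*[:=]\s*\S+")


def sha256_hex(text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def redacted_excerpt(text: str, max_chars: int = 80) -> str:
    """Mask secret-looking values first, then truncate to a short excerpt."""
    cleaned = _SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)
    return cleaned[:max_chars]


def storable_prompt_fields(prompt: str) -> dict:
    """Build the fields a run row could keep instead of the raw prompt."""
    return {
        "prompt_hash": sha256_hex(prompt),           # assumed column name
        "prompt_excerpt": redacted_excerpt(prompt),  # assumed column name
        "secret_values_included": 0,                 # set by the writer, not verified here
    }


fields = storable_prompt_fields("Is Ranger still trading? api_key=sk-123")
```

The hash lets two runs be compared for identical prompts without retaining the prompt itself, while the excerpt stays optional and redacted.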

Main Tables

| Table | Purpose |
| --- | --- |
| agent_research_runs | One row per research request from Telegram, API, checkout, CLI, or benchmark. |
| agent_tool_invocations | One row per candidate, selected, executed, skipped, rejected, fallback, or failed tool decision. |
| agent_research_sources | Retrieved or cited source rows tied to a run and optionally a tool invocation. |
| agent_eval_cases | Versioned benchmark prompts, expected routes/providers, tool constraints, tags, and rubrics. |
| agent_eval_results | Per-case result, routing correctness, tool score, source quality, groundedness, cost, and safety scores. |
| work_order_graph_links | Links sponsored work orders to research runs, tool traces, graph evals, evidence, claims, and outcomes. |

Leo x402 Research Flow

Telegram/API question
-> agent_research_runs
-> agent_tool_invocations
-> agent_research_sources
-> agent_eval_results when a benchmark case applies
-> work_order_graph_links when a paid work order or graph artifact is involved

For paid research, agent_research_runs.sponsored_work_order_id and payment_receipt_id carry the external work-order/payment anchors. The payment receipt table is still owned by the economic/payment layer; this schema only keeps references.
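
The write order of the flow above can be sketched against an in-memory SQLite database. The table names, `sponsored_work_order_id`, and `payment_receipt_id` come from this document. Every other column (`channel`, `prompt_hash`, `sequence_no`, `tool_name`, `decision`, `source_ref`) and the helper `record_paid_run` are hypothetical stand-ins, so treat this as an illustration of the ordering rather than the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Simplified, assumed table shapes; the real DDL lives in the schema file.
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,
    prompt_hash TEXT NOT NULL,
    sponsored_work_order_id TEXT,  -- reference only; owned by the payment layer
    payment_receipt_id TEXT        -- reference only; owned by the payment layer
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    sequence_no INTEGER NOT NULL,
    tool_name TEXT NOT NULL,
    decision TEXT NOT NULL
);
CREATE TABLE agent_research_sources (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_invocation_id INTEGER REFERENCES agent_tool_invocations(id),
    source_ref TEXT NOT NULL
);
""")


def record_paid_run(conn, prompt_hash, work_order_id, receipt_id):
    """Insert run -> tool invocation -> source, in the order of the flow."""
    cur = conn.execute(
        "INSERT INTO agent_research_runs "
        "(channel, prompt_hash, sponsored_work_order_id, payment_receipt_id) "
        "VALUES (?, ?, ?, ?)",
        ("telegram", prompt_hash, work_order_id, receipt_id),
    )
    run_id = cur.lastrowid
    cur = conn.execute(
        "INSERT INTO agent_tool_invocations "
        "(run_id, sequence_no, tool_name, decision) VALUES (?, ?, ?, ?)",
        (run_id, 1, "web_search", "executed"),
    )
    invocation_id = cur.lastrowid
    conn.execute(
        "INSERT INTO agent_research_sources "
        "(run_id, tool_invocation_id, source_ref) VALUES (?, ?, ?)",
        (run_id, invocation_id, "placeholder-source-ref"),
    )
    conn.commit()
    return run_id


run_id = record_paid_run(conn, "0" * 64, "wo-example", "receipt-example")
```

Note that the run row stores only the receipt id as an opaque reference; no payment table is created here, matching the ownership split described above.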

Ranger Liquidation Guard

The Ranger benchmark class should be represented as:

  • agent_eval_cases.expected_route = 'web_search'
  • agent_eval_cases.tags_json includes ranger_liquidated
  • agent_eval_cases.must_not_use_tools_json includes market-data-only routes
  • agent_tool_invocations records market data as rejected or skipped when it is not the right tool
  • agent_eval_results.routing_correct = 1 only if Leo routed to source-backed research instead of live-token valuation

This ensures that the claim "Ranger is liquidated/gone" is verified before any valuation framing, so the question is never silently treated as a normal fair-value question about a live token.
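
One way to read the Ranger guard is as a scoring rule over a run's tool invocations. The sketch below is an assumption-laden illustration: it assumes `expected_route` names a tool, that each invocation is a dict with hypothetical `tool_name` and `decision` keys, and that `market_data` is the name of a market-data-only tool. None of these details are confirmed by the schema.

```python
import json


def routing_correct(case: dict, invocations: list) -> int:
    """Return 1 only if the expected route ran and no forbidden tool ran."""
    forbidden = set(json.loads(case["must_not_use_tools_json"]))
    # Only executed tools count as "used"; rejected or skipped ones do not.
    executed = [i["tool_name"] for i in invocations if i["decision"] == "executed"]
    used_expected = case["expected_route"] in executed
    used_forbidden = any(tool in forbidden for tool in executed)
    return 1 if used_expected and not used_forbidden else 0


# Hypothetical Ranger case; "market_data" is an assumed tool name.
ranger_case = {
    "expected_route": "web_search",
    "tags_json": json.dumps(["ranger_liquidated"]),
    "must_not_use_tools_json": json.dumps(["market_data"]),
}

good_trace = [
    {"tool_name": "market_data", "decision": "rejected"},
    {"tool_name": "web_search", "decision": "executed"},
]
bad_trace = [
    {"tool_name": "market_data", "decision": "executed"},
]

good_score = routing_correct(ranger_case, good_trace)
bad_score = routing_correct(ranger_case, bad_trace)
```

Recording the rejected market-data invocation in the good trace is what lets a reviewer see that the tool was considered and deliberately not used, rather than simply absent.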

Minimum Invariants

  • No row may set secret_values_included = 1.
  • A benchmark result must link to both an eval case and a research run.
  • Tool invocation sequence numbers are unique per research run.
  • Scores are bounded between 0 and 1.
  • Research runs store prompt and answer hashes plus optional redacted excerpts, not raw private prompts.
  • outcome_observations remain the downstream business-value layer; raw tool traces belong here, not there.
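
Several of these invariants map naturally onto database constraints. The following sketch shows how that could look in SQLite; whether the real schema enforces them this way is not stated here. The columns `run_id`, `sequence_no`, `eval_case_id`, `research_run_id`, and `tool_score`, and the helper `violates`, are assumed names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified table shapes that encode four of the invariants.
conn.executescript("""
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL,
    sequence_no INTEGER NOT NULL,
    -- No row may set secret_values_included = 1.
    secret_values_included INTEGER NOT NULL DEFAULT 0
        CHECK (secret_values_included = 0),
    -- Sequence numbers are unique per research run.
    UNIQUE (run_id, sequence_no)
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    -- A result must link to both an eval case and a research run.
    eval_case_id INTEGER NOT NULL,
    research_run_id INTEGER NOT NULL,
    -- Scores are bounded between 0 and 1 (NULL is allowed by this CHECK).
    tool_score REAL CHECK (tool_score BETWEEN 0 AND 1)
);
""")


def violates(sql: str, params: tuple = ()) -> bool:
    """Return True if the statement is rejected by a constraint."""
    try:
        conn.execute(sql, params)
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.commit()
    return False


# A well-formed first invocation for run 1 is accepted.
first_ok = not violates(
    "INSERT INTO agent_tool_invocations (run_id, sequence_no) VALUES (?, ?)",
    (1, 1),
)
```

Repeating the same `(run_id, sequence_no)` pair, setting `secret_values_included` to 1, or inserting a score of 1.5 would each be rejected under these assumed constraints.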