Adds graph schema prerequisite plus research-eval schema/docs/tests for Leo tool-use benchmarks and x402 research telemetry. Validated by full local pytest and green CI.

2026-06-24 14:21:03 +02:00

3.3 KiB


Teleo Agent Research Eval Schema v1

Apply this schema after teleo-agent-graph-v1.sql.

This schema records how Leo and other agents answer research requests, which tools they choose, what sources they cite, and whether benchmark cases passed. It is operational/economic telemetry, not the claim/evidence graph itself.

Design Commitments

The graph schema remains the knowledge spine: persona, strategy, beliefs, claims, evidence, graph evals, and cascades.
Research-eval rows explain how a request was handled and whether the route was good enough to trust or ship.
Payment funds work. It does not directly mutate claims, confidence, beliefs, or rewards.
Tool-use benchmarking must distinguish candidates, selected tools, executed tools, skipped tools, and rejected tools.
Secrets and private payloads are never stored. Tables store hashes, redacted excerpts, proof references, source metadata, and receipt ids.
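
The last commitment above can be illustrated with a minimal sketch of how a prompt might be reduced to a hash plus a redacted excerpt before it is stored. The function name `fingerprint_text`, the `SECRET_PATTERN` regex, and the 80-character excerpt length are all assumptions made for illustration; this document does not specify the actual redaction rules or hash algorithm.

```python
import hashlib
import re

# Hypothetical pattern for secret-looking key/value pairs. The real redaction
# rules are not defined in this document; a production filter would be stricter.
SECRET_PATTERN = re.compile(r"(?i)\b(?:api[_-]?key|token|secret)\s*[:=]\s*\S+")


def fingerprint_text(text: str, excerpt_chars: int = 80) -> dict:
    """Return a SHA-256 hash of the full text plus a short redacted excerpt."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Redact first, then truncate, so a secret is never cut in half and leaked.
    redacted = SECRET_PATTERN.sub("[REDACTED]", text)
    return {"hash": digest, "redacted_excerpt": redacted[:excerpt_chars]}


record = fingerprint_text("Is Ranger still trading? api_key=sk-12345")
print(record["redacted_excerpt"])
```

Only the hash and the excerpt would be written to a table; the raw prompt is discarded by the caller.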

Main Tables

| Table | Purpose |
| --- | --- |
| `agent_research_runs` | One row per research request from Telegram, API, checkout, CLI, or benchmark. |
| `agent_tool_invocations` | One row per candidate, selected, executed, skipped, rejected, fallback, or failed tool decision. |
| `agent_research_sources` | Retrieved or cited source rows tied to a run and optionally a tool invocation. |
| `agent_eval_cases` | Versioned benchmark prompts, expected routes/providers, tool constraints, tags, and rubrics. |
| `agent_eval_results` | Per-case result, routing correctness, tool score, source quality, groundedness, cost, and safety scores. |
| `work_order_graph_links` | Links sponsored work orders to research runs, tool traces, graph evals, evidence, claims, and outcomes. |

Leo x402 Research Flow

Telegram/API question
-> agent_research_runs
-> agent_tool_invocations
-> agent_research_sources
-> agent_eval_results when a benchmark case applies
-> work_order_graph_links when a paid work order or graph artifact is involved

For paid research, `agent_research_runs.sponsored_work_order_id` and `agent_research_runs.payment_receipt_id` carry the external work-order and payment anchors. The payment receipt table remains owned by the economic/payment layer; this schema keeps only references to it.
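
The flow above can be sketched against an in-memory SQLite database. This is an illustration under stated assumptions, not the actual DDL: the table names, `sponsored_work_order_id`, and `payment_receipt_id` come from this document, while every other column (`channel`, `prompt_hash`, `sequence_no`, `tool_name`, `decision`, `url`) and the helper `record_run` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,            -- hypothetical column
    prompt_hash TEXT NOT NULL,        -- hypothetical column
    sponsored_work_order_id TEXT,     -- reference only, named in this document
    payment_receipt_id TEXT           -- reference only, named in this document
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    sequence_no INTEGER NOT NULL,     -- hypothetical column
    tool_name TEXT NOT NULL,          -- hypothetical column
    decision TEXT NOT NULL            -- hypothetical column
);
CREATE TABLE agent_research_sources (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    invocation_id INTEGER REFERENCES agent_tool_invocations(id),
    url TEXT NOT NULL                 -- hypothetical column
);
""")


def record_run(conn, channel, prompt_hash, tool_decisions, source_urls,
               work_order_id=None, receipt_id=None):
    """Write one run, its ordered tool decisions, and its sources."""
    cur = conn.execute(
        "INSERT INTO agent_research_runs "
        "(channel, prompt_hash, sponsored_work_order_id, payment_receipt_id) "
        "VALUES (?, ?, ?, ?)",
        (channel, prompt_hash, work_order_id, receipt_id),
    )
    run_id = cur.lastrowid
    last_executed = None
    for seq, (tool, decision) in enumerate(tool_decisions, start=1):
        cur = conn.execute(
            "INSERT INTO agent_tool_invocations "
            "(run_id, sequence_no, tool_name, decision) VALUES (?, ?, ?, ?)",
            (run_id, seq, tool, decision),
        )
        if decision == "executed":
            last_executed = cur.lastrowid
    # Simplification: attribute every source to the last executed tool.
    for url in source_urls:
        conn.execute(
            "INSERT INTO agent_research_sources (run_id, invocation_id, url) "
            "VALUES (?, ?, ?)",
            (run_id, last_executed, url),
        )
    conn.commit()
    return run_id


run_id = record_run(
    conn, "telegram", "abc123",
    [("market_data", "rejected"), ("web_search", "executed")],
    ["https://example.com/ranger-notice"],
    work_order_id="wo-1", receipt_id="rcpt-1",
)
```

The run row holds only the work-order and receipt identifiers; no payment table is created here, which mirrors the ownership boundary described above.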

Ranger Liquidation Guard

The Ranger benchmark class should be represented as:

agent_eval_cases.expected_route = 'web_search'
agent_eval_cases.tags_json includes ranger_liquidated
agent_eval_cases.must_not_use_tools_json includes market-data-only routes
agent_tool_invocations records market data as rejected or skipped when it is not the right tool
agent_eval_results.routing_correct = 1 only if Leo routed to source-backed research instead of live-token valuation

This ensures that the claim "Ranger is liquidated/gone" is verified before any valuation framing, and that the request is never silently treated as an ordinary fair-value question about a live token.
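
A minimal sketch of how `routing_correct` might be derived for a Ranger-class case follows. The field names `expected_route`, `tags_json`, and `must_not_use_tools_json` come from this document; the tool name `market_data`, the `decision` values, the shape of the invocation dicts, and the assumption that `expected_route` matches an executed tool's name are all hypothetical.

```python
import json


def routing_correct(case: dict, invocations: list) -> int:
    """Return 1 only if the expected route ran and no forbidden tool ran."""
    forbidden = set(json.loads(case["must_not_use_tools_json"]))
    executed = [i["tool_name"] for i in invocations if i["decision"] == "executed"]
    used_expected = case["expected_route"] in executed
    used_forbidden = any(tool in forbidden for tool in executed)
    return 1 if used_expected and not used_forbidden else 0


ranger_case = {
    "expected_route": "web_search",
    "tags_json": json.dumps(["ranger_liquidated"]),
    "must_not_use_tools_json": json.dumps(["market_data"]),  # hypothetical tool name
}

# Market data was considered but rejected; source-backed search was executed.
good_trace = [
    {"tool_name": "market_data", "decision": "rejected"},
    {"tool_name": "web_search", "decision": "executed"},
]
# Market data was executed, i.e. the question was treated as a live-token query.
bad_trace = [
    {"tool_name": "market_data", "decision": "executed"},
]

good_score = routing_correct(ranger_case, good_trace)
bad_score = routing_correct(ranger_case, bad_trace)
```

A rejected or skipped market-data row does not count against the run; only an executed forbidden tool does, which is why the trace must record the decision and not just the tool name.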

Minimum Invariants

No row may set secret_values_included = 1.
A benchmark result must link to both an eval case and a research run.
Tool invocation sequence numbers are unique per research run.
Scores are bounded between 0 and 1.
Research runs store prompt and answer hashes plus optional redacted excerpts, not raw private prompts.
`outcome_observations` remains the downstream business-value layer; raw tool traces belong here, not there.
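
Several of these invariants map naturally onto database constraints. The sketch below shows one way they could be expressed in SQLite; it is not the actual schema. Only the table names and `secret_values_included` come from this document; the other columns (`prompt_hash`, `name`, `sequence_no`, `tool_name`, `case_id`, `run_id`, `tool_score`) and the helper `violates` are assumptions.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE agent_research_runs (
    id INTEGER PRIMARY KEY,
    prompt_hash TEXT NOT NULL,
    secret_values_included INTEGER NOT NULL DEFAULT 0
        CHECK (secret_values_included = 0)
);
CREATE TABLE agent_eval_cases (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE agent_tool_invocations (
    id INTEGER PRIMARY KEY,
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    sequence_no INTEGER NOT NULL,
    tool_name TEXT NOT NULL,
    UNIQUE (run_id, sequence_no)
);
CREATE TABLE agent_eval_results (
    id INTEGER PRIMARY KEY,
    case_id INTEGER NOT NULL REFERENCES agent_eval_cases(id),
    run_id INTEGER NOT NULL REFERENCES agent_research_runs(id),
    tool_score REAL CHECK (tool_score BETWEEN 0 AND 1)
);
""")


def violates(sql, params=()):
    """Return True if the statement is rejected by a constraint."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


# Seed one valid run, one case, and one tool invocation.
with db:
    db.execute("INSERT INTO agent_research_runs (id, prompt_hash) VALUES (1, 'h1')")
    db.execute("INSERT INTO agent_eval_cases (id, name) VALUES (1, 'ranger_liquidated')")
    db.execute("INSERT INTO agent_tool_invocations (run_id, sequence_no, tool_name) "
               "VALUES (1, 1, 'web_search')")

secret_rejected = violates(
    "INSERT INTO agent_research_runs (prompt_hash, secret_values_included) VALUES ('h2', 1)")
duplicate_seq_rejected = violates(
    "INSERT INTO agent_tool_invocations (run_id, sequence_no, tool_name) "
    "VALUES (1, 1, 'market_data')")
score_out_of_range_rejected = violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 1, 1.5)")
missing_run_rejected = violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, NULL, 0.9)")
valid_result_rejected = violates(
    "INSERT INTO agent_eval_results (case_id, run_id, tool_score) VALUES (1, 1, 0.9)")
```

Enforcing the rules in the database rather than in application code means a buggy writer fails loudly at insert time instead of leaving rows that break the invariants.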
