teleo-infrastructure/lib/extraction_prompt.py
m3taversal d79ff60689 epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features
Pipeline reliability (8 fixes, reviewed by Ganymede+Rhea+Leo+Rio):
1. Merge API recovery — pre-flight approval check, transient/permanent distinction, jitter
2. Ghost PR detection — ls-remote branch check in reconciliation, network guard
3. Source status contract — directory IS status, no code change needed
4. Batch-state markers eliminated — two-gate skip (archive-check + batched branch-check)
5. Branch SHA tracking — batched ls-remote, auto-reset verdicts, dismiss stale reviews
6. Mirror pre-flight permissions — chown check in sync-mirror.sh
7. Telegram archive commit-after-write — git add/commit/push with rebase --abort fallback
8. Post-merge source archiving — queue/ → archive/{domain}/ after merge
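Fix 1's transient/permanent distinction plus jitter is a standard retry shape; a minimal sketch of that pattern (status codes and parameter names are illustrative, not the actual merge-API code):

```python
import random
import time

TRANSIENT = {500, 502, 503}  # assumed transient HTTP statuses

def retry_with_jitter(call, attempts=3, base_delay=1.0):
    """Retry a callable on transient failures; permanent failures return immediately."""
    for attempt in range(attempts):
        status, result = call()
        if status not in TRANSIENT:
            return status, result  # success or permanent failure: stop retrying
        if attempt < attempts - 1:
            # Exponential backoff plus jitter to avoid thundering-herd retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return status, result
```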

Pipeline fixes:
- merge_cycled flag — eval attempts preserved during merge-failure cycling (Ganymede+Rhea)
- merge_failures diagnostic counter
- Startup recovery preserves eval_attempts (was incorrectly resetting to 0)
- No-diff PRs auto-closed by eval (root cause of 17 zombie PRs)
- GC threshold aligned with substantive fixer budget (was 2, now 4)
- Conflict retry with 3-attempt budget + permanent conflict handler
- Local ff-merge fallback for Forgejo 405 errors

Telegram bot:
- KB retrieval: 3-layer (entity resolution → claim search → agent context)
- Reply-to-bot handler (context.bot.id check)
- Tag regex: @teleo|@futairdbot
- Prompt rewrite for natural analyst voice
- Market data API integration (Ben's token price endpoint)
- Conversation windows (5-message unanswered counter, per-user-per-chat)
- Conversation history in prompt (last 5 exchanges)
- Worktree file lock for archive writes

Infrastructure:
- worktree_lock.py — file-based lock (flock) for main worktree coordination
- backfill-sources.py — source DB registration for Argus funnel
- batch-extract-50.sh v3 — two-gate skip, batched ls-remote, network guard
- sync-mirror.sh — auto-PR creation for mirrored GitHub branches, permission pre-flight
- Argus dashboard — conflicts + reviewing in backlog, queue count in funnel
- Enrichment-inside-frontmatter bug fix (regex anchor, not --- split)
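worktree_lock.py is described above as a file-based flock lock; a minimal sketch of that pattern (path and function name are illustrative — the real module's API may differ):

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def worktree_lock(lock_path="/tmp/worktree.lock"):
    """Hold an exclusive flock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

# Usage: serialize archive writes to the main worktree
# with worktree_lock():
#     ...write files, git add/commit/push...
```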

Pentagon-Agent: Epimetheus <3D35839A-7722-4740-B93D-51157F7D5E70>
2026-03-20 20:17:27 +00:00


"""Lean extraction prompt — judgment only, mechanical rules in code.
The extraction prompt focuses on WHAT to extract:
- Separate facts from claims from enrichments
- Classify confidence honestly
- Identify entity data
- Check for duplicates against KB index
Mechanical enforcement (frontmatter format, wiki links, dates, filenames)
is handled by post_extract.py AFTER the LLM returns.
Design principle (Leo): mechanical rules in code, judgment in prompts.
Epimetheus owns this module. Leo reviews changes.
"""
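# For illustration of the design principle above, a mechanical rule belongs in
# code like this hypothetical filename normalizer (post_extract.py's actual
# rules may differ), while the prompt only asks the LLM for judgment calls:
#
#     import re
#
#     def slugify_filename(title: str) -> str:
#         """Enforce lowercase kebab-case .md filenames in code, not in the prompt."""
#         stem = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
#         return stem + ".md"

The same sketch as a runnable block:

```python
import re

def slugify_filename(title: str) -> str:
    """Hypothetical mechanical rule: lowercase kebab-case .md filenames."""
    stem = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return stem + ".md"
```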
from datetime import date
def build_extraction_prompt(
source_file: str,
source_content: str,
domain: str,
agent: str,
kb_index: str,
*,
today: str | None = None,
rationale: str | None = None,
intake_tier: str | None = None,
proposed_by: str | None = None,
) -> str:
"""Build the lean extraction prompt.
Args:
source_file: Path to the source being extracted
source_content: Full text of the source
domain: Primary domain for this source
agent: Agent name performing extraction
kb_index: Pre-generated KB index text (claim titles for dedup)
today: Override date for testing (default: today)
rationale: Contributor's natural-language thesis about the source (optional)
intake_tier: undirected | directed | challenge (optional)
proposed_by: Contributor handle who submitted the source (optional)
Returns:
The complete prompt string
"""
today = today or date.today().isoformat()
# Build contributor directive section (if rationale provided)
if rationale and rationale.strip():
contributor_name = proposed_by or "a contributor"
tier_label = intake_tier or "directed"
contributor_directive = f"""
## Contributor Directive (intake_tier: {tier_label})
**{contributor_name}** submitted this source and said:
> {rationale.strip()}
This is an extraction directive — use it to focus your extraction:
- Extract claims that relate to the contributor's thesis
- If the source SUPPORTS their thesis, extract the supporting evidence as claims
- If the source CONTRADICTS their thesis, extract the contradiction — that's even more valuable
- Evaluate whether the contributor's own thesis is extractable as a standalone claim
- If specific enough to disagree with and supported by the source: extract it with `source: "{contributor_name}, original analysis"`
- If too vague or already in the KB: use it as a directive only
- If the contributor references existing claims ("I disagree with X"), identify those claims by filename from the KB index and include them in the `challenges` field
- ALSO extract anything else valuable in the source — the directive is a spotlight, not a filter
Set `contributor_thesis_extractable: true` if you extracted the contributor's thesis as a claim, `false` otherwise.
"""
else:
contributor_directive = ""
return f"""You are {agent}, extracting knowledge from a source for TeleoHumanity's collective knowledge base.
## Your Task
Read the source below. Be SELECTIVE — extract only what genuinely expands the KB's understanding. Most sources produce 0-3 claims. A source that produces 5+ claims is almost certainly over-extracting.
For each insight, classify it as one of:
**CLAIM** — An arguable proposition someone could disagree with. Must name a specific mechanism.
- Good: "futarchy is manipulation-resistant because attack attempts create profitable opportunities for defenders"
- Bad: "futarchy has interesting governance properties"
- Test: "This note argues that [title]" must work as a sentence.
- MAXIMUM 5 claims per source (most sources warrant 0-3). If you find more, keep only the most novel and surprising.
**ENRICHMENT** — New evidence that strengthens, challenges, or extends an existing claim in the KB.
- If an insight supports something already in the KB index below, it's an enrichment, NOT a new claim.
- Enrichment over duplication: ALWAYS prefer adding evidence to an existing claim.
- Most sources should produce more enrichments than new claims.
**ENTITY** — Factual data about a company, protocol, person, organization, or market. Not arguable.
- Entity types: company, person, protocol, organization, market (core). Domain-specific: lab, fund, token, exchange, therapy, research_program, benchmark.
- One file per entity. If the entity already exists, append a timeline entry — don't create a new file.
- New entities must have raised real capital (>$10K), launched a product, or been discussed by 2+ sources.
- Skip: test proposals, spam, trivial projects.
- Filing: `entities/{{domain}}/{{entity-name}}.md`
**DECISION** — A governance decision, futarchic proposal, funding vote, or policy action. Separate from entities.
- Decisions are events with terminal states (passed/failed/expired). Entities are persistent objects.
- Each significant decision gets its own file in `decisions/{{domain}}/`.
- ALSO output a timeline entry for the parent entity: `- **YYYY-MM-DD** — [[decision-filename]] Outcome: one-line summary`
- Only extract a CLAIM from a decision if it reveals a novel MECHANISM INSIGHT (~1 per 10-15 decisions).
- Routine decisions (minor budgets, operational tweaks, uncontested votes) → timeline entry on parent entity only, no decision file.
- Filing: `decisions/{{domain}}/{{parent}}-{{slug}}.md`
**FACT** — A verifiable data point no one would disagree with. Store in source notes, not as a claim.
- "Jupiter DAO vote reached 75% support" is a fact, not a claim.
- Individual data points about specific events are facts. Generalizable patterns from multiple data points are claims.
## Selectivity Rules
**Novelty gate — argument, not topic:** Before extracting a claim, check the KB index below. The question is NOT "does the KB cover this topic?" but "does the KB already make THIS SPECIFIC ARGUMENT?" A new argument in a well-covered topic IS a new claim. A new data point supporting an existing argument is an enrichment.
- New data point for existing argument → ENRICHMENT (add evidence to existing claim)
- New argument the KB doesn't have yet → CLAIM (even if the topic is well-covered)
- Same argument with different wording → ENRICHMENT (don't create near-duplicates)
**Challenge premium:** A single well-evidenced claim that challenges an existing KB position is worth more than 10 claims that confirm what we already know. Prioritize extraction of counter-evidence and boundary conditions.
**What would change an agent's mind?** Ask this for every potential claim. If the answer is "nothing — this is more evidence for what we already believe," it's an enrichment. If the answer is "this introduces a mechanism or argument we haven't considered," it's a claim.
## Confidence Calibration
Be honest about uncertainty:
- **proven**: Multiple independent confirmations, tested against challenges
- **likely**: 3+ corroborating sources with empirical data
- **experimental**: 1-2 sources with data, or strong theoretical argument
- **speculative**: Theory without data, single anecdote, or self-reported company claims
Single source = experimental at most. Pitch rhetoric or marketing copy = speculative.
## Source
**File:** {source_file}
{source_content}
{contributor_directive}
## KB Index (existing claims — check for duplicates and enrichment targets)
{kb_index}
## Output Format
Return valid JSON. The post-processor handles frontmatter formatting, wiki links, and dates — focus on the intellectual content.
```json
{{
"claims": [
{{
"filename": "descriptive-slug-matching-the-claim.md",
"domain": "{domain}",
"title": "Prose claim title that is specific enough to disagree with",
"description": "One sentence adding context beyond the title",
"confidence": "experimental",
"source": "author/org, key evidence reference",
"body": "Argument with evidence. Cite specific data, quotes, studies from the source. Explain WHY the claim is supported. This must be a real argument, not a restatement of the title.",
"related_claims": ["existing-claim-stem-from-kb-index"],
"scope": "structural|functional|causal|correlational",
"sourcer": "handle or name of the original author/source (e.g., @theiaresearch, Pine Analytics)"
}}
],
"enrichments": [
{{
"target_file": "existing-claim-filename.md",
"type": "confirm|challenge|extend",
"evidence": "The new evidence from this source",
"source_ref": "Brief source reference"
}}
],
"entities": [
{{
"filename": "entity-name.md",
"domain": "{domain}",
"action": "create|update",
"entity_type": "company|person|protocol|organization|market|lab|fund|token|exchange|therapy|research_program|benchmark",
"content": "Full markdown for new entities. For updates, leave empty.",
"timeline_entry": "- **YYYY-MM-DD** — Event with specifics"
}}
],
"decisions": [
{{
"filename": "parent-slug-decision-slug.md",
"domain": "{domain}",
"parent_entity": "parent-entity-filename.md",
"status": "passed|failed|active",
"category": "treasury|fundraise|hiring|mechanism|liquidation|grants|strategy",
"summary": "One-sentence description of the decision",
"content": "Full markdown for significant decisions. Empty for routine ones.",
"parent_timeline_entry": "- **YYYY-MM-DD** — [[decision-filename]] Passed: one-line summary"
}}
],
"facts": [
"Verifiable data points to store in source archive notes"
],
"extraction_notes": "Brief summary: N claims, N enrichments, N entities, N decisions. What was most interesting.",
"contributor_thesis_extractable": false
}}
```
## Rules
1. **Quality over quantity.** 0-3 precise claims beats 8 vague ones. If you can't name the specific mechanism in the title, don't extract it. Empty claims arrays are fine — not every source produces novel claims.
2. **Enrichment over duplication.** Check the KB index FIRST. If something similar exists, add evidence to it. New claims are only for genuinely novel propositions.
3. **Facts are not claims.** Individual data points go in `facts`. Only generalized patterns from multiple data points become claims.
4. **Proposals are entities, not claims.** A governance proposal, token launch, or funding event is structured data (entity). Only extract a claim if the event reveals a novel mechanism insight that generalizes beyond this specific case.
5. **Scope your claims.** Say whether you're claiming a structural, functional, causal, or correlational relationship.
6. **OPSEC.** Never extract specific dollar amounts, valuations, equity percentages, or deal terms for LivingIP/Teleo. General market data is fine.
7. **Read the Agent Notes.** If the source has "Agent Notes" or "Curator Notes" sections, they contain context about why this source matters.
Return valid JSON only. No markdown fencing, no explanation outside the JSON.
"""
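# The prompt asks for bare JSON, but callers typically defend against stray
# markdown fences anyway. A hypothetical parsing helper, shown for illustration
# (post_extract.py may handle this differently):

```python
import json

def parse_extraction_output(raw: str) -> dict:
    """Parse the model's JSON response, tolerating stray markdown fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (``` or ```json) and any closing fence.
        lines = text.splitlines()
        if lines[-1].strip().startswith("```"):
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    data = json.loads(text)
    # Default the optional flag the prompt defines.
    data.setdefault("contributor_thesis_extractable", False)
    return data
```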
def build_entity_enrichment_prompt(
entity_file: str,
entity_content: str,
new_data: list[dict],
domain: str,
) -> str:
"""Build prompt for batch entity enrichment (runs on main, not extraction branch).
This is separate from claim extraction to avoid merge conflicts.
Entity enrichments are additive timeline entries — commutative, auto-mergeable.
Args:
entity_file: Path to the entity being enriched
entity_content: Current content of the entity file
new_data: List of timeline entries from recent extractions
domain: Entity domain
Returns:
Prompt for entity enrichment
"""
entries_text = "\n".join(
f"- Source: {d.get('source', '?')}\n Entry: {d.get('timeline_entry', '')}"
for d in new_data
)
return f"""You are a Teleo knowledge base agent. Merge these new timeline entries into an existing entity.
## Current Entity: {entity_file}
{entity_content}
## New Data Points
{entries_text}
## Rules
1. Append new entries to the Timeline section in chronological order
2. Deduplicate: skip entries that describe events already in the timeline
3. Preserve all existing content — append only
4. If a new data point updates a metric (revenue, valuation, user count), add it as a new timeline entry, don't modify existing entries
Return the complete updated entity file content.
"""
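# For illustration, the entries_text block above renders new_data like this
# (the data below is made up; the real entries come from recent extractions):

```python
new_data = [
    {"source": "forum-post.md", "timeline_entry": "- **2026-03-18** — Raised $2M seed"},
    {"timeline_entry": "- **2026-03-19** — Launched mainnet"},  # missing source -> '?'
]
entries_text = "\n".join(
    f"- Source: {d.get('source', '?')}\n  Entry: {d.get('timeline_entry', '')}"
    for d in new_data
)
```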