Archive schema migration: 49 source files standardized with status + claims_extracted. schemas/source.md merged with main version (resolved conflict, kept more complete schema). Reviewed by Rio.
# Source Schema
Sources are the raw material that feeds claim extraction. Every piece of external content that enters the knowledge base gets archived in inbox/archive/ with standardized frontmatter so agents can track what's been processed, what's pending, and what yielded claims.
## YAML Frontmatter

```yaml
---
type: source
title: "Article or thread title"
author: "Name (@handle if applicable)"
url: https://example.com/article
date: YYYY-MM-DD
domain: internet-finance | entertainment | ai-alignment | health | grand-strategy
format: essay | newsletter | tweet | thread | whitepaper | paper | report | news
status: unprocessed | processing | processed | null-result
processed_by: agent-name
processed_date: YYYY-MM-DD
claims_extracted:
  - "claim title 1"
  - "claim title 2"
enrichments:
  - "existing claim title that was enriched"
tags: [topic1, topic2]
linked_set: set-name-if-part-of-a-group
---
```
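A filled-in example may be easier to copy from than the template. All values below (title, author, URL, agent name, claim title) are invented for illustration:

```yaml
---
type: source
title: "Example Essay on Stablecoin Liquidity"
author: "Jane Doe (@jdoe)"
url: https://example.com/stablecoin-liquidity
date: 2026-02-22
domain: internet-finance
format: essay
status: processed
processed_by: vida
processed_date: 2026-02-23
claims_extracted:
  - "stablecoin liquidity concentrates in two venues"
tags: [stablecoins, liquidity]
---
```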
## Required Fields

| Field | Type | Description |
|---|---|---|
| `type` | enum | Always `source` |
| `title` | string | Human-readable title of the source material |
| `author` | string | Who wrote it: name and handle |
| `url` | string | Original URL (even if content was provided manually) |
| `date` | date | Publication date |
| `domain` | enum | Primary domain for routing |
| `status` | enum | Processing state (see lifecycle below) |
## Optional Fields

| Field | Type | Description |
|---|---|---|
| `format` | enum | `paper`, `essay`, `newsletter`, `tweet`, `thread`, `whitepaper`, `report`, `news`. Source format affects evidence-weight assessment (a peer-reviewed paper carries different weight than a tweet) |
| `processed_by` | string | Which agent extracted claims from this source |
| `processed_date` | date | When extraction happened |
| `claims_extracted` | list | Titles of standalone claims created from this source |
| `enrichments` | list | Titles of existing claims enriched with evidence from this source |
| `tags` | list | Topic tags for discovery |
| `linked_set` | string | Group identifier when sources form a debate or series (e.g., `ai-intelligence-crisis-divergence-feb2026`) |
| `cross_domain_flags` | list | Flags for other agents/domains surfaced during extraction |
| `flagged_for_{agent}` | list | Items flagged for a specific agent's domain (e.g., `flagged_for_theseus`, `flagged_for_vida`) |
| `notes` | string | Extraction notes: why a null result, what was paywalled, etc. |
## Status Lifecycle

`unprocessed → processing → processed | null-result`

| Status | Meaning |
|---|---|
| `unprocessed` | Content archived; no agent has extracted from it yet |
| `processing` | An agent is actively working on extraction |
| `processed` | Extraction complete; `claims_extracted` and/or `enrichments` populated |
| `null-result` | Agent reviewed and determined no extractable claims (must include `notes` explaining why) |

Note: Legacy files may use `partial` for paywalled content. Treat it as equivalent to `processing` with a `notes` field explaining the limitation.
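The lifecycle above is one-directional. A minimal sketch of a transition check (a hypothetical helper, not part of the schema itself) that also tolerates the legacy `partial` alias:

```python
# Legal forward transitions in the status lifecycle.
# "partial" is a legacy alias treated like "processing".
TRANSITIONS = {
    "unprocessed": {"processing"},
    "processing": {"processed", "null-result"},
    "partial": {"processed", "null-result"},  # legacy files
}

def can_transition(old: str, new: str) -> bool:
    """Return True if moving from `old` to `new` follows the lifecycle."""
    return new in TRANSITIONS.get(old, set())
```

Terminal states (`processed`, `null-result`) have no outgoing transitions, so any attempt to move a finished source back to `unprocessed` is rejected.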
## Filing Convention

Filename: `YYYY-MM-DD-{author-handle}-{brief-slug}.md`

Examples:

- `2026-02-22-citriniresearch-2028-global-intelligence-crisis.md`
- `2026-03-06-time-anthropic-drops-rsp.md`
- `2024-01-doppler-whitepaper-liquidity-bootstrapping.md`
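A sketch of how the convention could be applied programmatically (the helper name is an assumption; the pattern is the one above):

```python
import re
from datetime import date

def source_filename(pub_date: date, author_handle: str, slug: str) -> str:
    """Build a YYYY-MM-DD-{author-handle}-{brief-slug}.md filename."""
    # Lowercase, collapse any non-alphanumeric run into a single hyphen.
    def clean(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    return f"{pub_date.isoformat()}-{clean(author_handle)}-{clean(slug)}.md"
```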
Body: After the frontmatter, include a summary of the source content. This serves two purposes:

- Agents can extract claims without re-fetching the URL
- Content persists even if the original URL goes down

The body is NOT a claim; it's a reference document. Use descriptive sections, not argumentative ones.
## Governance

- **Who archives:** Any agent can archive sources. The `processed_by` field tracks who extracted, not who archived.
- **When to archive:** Archive at ingestion time, before extraction begins. Set `status: unprocessed`.
- **After extraction:** Update frontmatter with `status: processed`, `processed_by`, `processed_date`, `claims_extracted`, and `enrichments`.
- **Null results:** Set `status: null-result` and explain in `notes` why no claims were extracted. Null results are valuable: they prevent duplicate work.
- **No deletion:** Sources are never deleted from the archive, even if they yield no claims.
## Legacy Fields

Older archive files (pre-migration) may use different field names:

- `type: evidence` or `type: archive` instead of `type: source`
- `source:` instead of `url:`
- `date_published:` or `date_archived:` instead of `date:`
- `source_type:` instead of `format:`
- `archived_by:` (still valid, just not required)
- `status: partial` instead of `processing` with notes
These are accepted for backward compatibility. New files should use the canonical field names above.
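A normalization sketch for migration tooling, assuming the legacy-to-canonical mapping in the list above (when both `date_published` and `date_archived` appear, this simple version keeps whichever comes last):

```python
# Legacy key -> canonical key, per the Legacy Fields list.
LEGACY_KEYS = {
    "source": "url",
    "date_published": "date",
    "date_archived": "date",
    "source_type": "format",
}

def normalize(fm: dict) -> dict:
    """Rewrite legacy frontmatter keys and values to canonical names."""
    out = {}
    for key, value in fm.items():
        out[LEGACY_KEYS.get(key, key)] = value
    if out.get("type") in ("evidence", "archive"):
        out["type"] = "source"
    if out.get("status") == "partial":
        out["status"] = "processing"  # legacy alias; keep the notes field
    return out
```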