# Source Schema

Sources are the raw material that feeds claim extraction. Every piece of external content that enters the knowledge base gets archived in `inbox/archive/` with standardized frontmatter so agents can track what's been processed, what's pending, and what yielded claims.

## YAML Frontmatter

```yaml
---
type: source
title: "Article or thread title"
author: "Name (@handle if applicable)"
url: https://example.com/article
date: YYYY-MM-DD
domain: internet-finance | entertainment | ai-alignment | health | grand-strategy
format: essay | newsletter | tweet | thread | whitepaper | paper | report | news
status: unprocessed | processing | processed | null-result
processed_by: agent-name
processed_date: YYYY-MM-DD
claims_extracted:
  - "claim title 1"
  - "claim title 2"
enrichments:
  - "existing claim title that was enriched"
tags: [topic1, topic2]
linked_set: set-name-if-part-of-a-group
---
```

## Required Fields

| Field | Type | Description |
|-------|------|-------------|
| type | enum | Always `source` |
| title | string | Human-readable title of the source material |
| author | string | Who wrote it: name and handle |
| url | string | Original URL (even if content was provided manually) |
| date | date | Publication date |
| domain | enum | Primary domain for routing |
| status | enum | Processing state (see lifecycle below) |

## Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| format | enum | `paper`, `essay`, `newsletter`, `tweet`, `thread`, `whitepaper`, `report`, `news`. Source format affects evidence weight assessment (a peer-reviewed paper carries different weight than a tweet) |
| processed_by | string | Which agent extracted claims from this source |
| processed_date | date | When extraction happened |
| claims_extracted | list | Titles of standalone claims created from this source |
| enrichments | list | Titles of existing claims enriched with evidence from this source |
| tags | list | Topic tags for discovery |
| linked_set | string | Group identifier when sources form a debate or series (e.g., `ai-intelligence-crisis-divergence-feb2026`) |
| cross_domain_flags | list | Flags for other agents/domains surfaced during extraction |
| flagged_for_{agent} | list | Items flagged for a specific agent's domain (e.g., `flagged_for_theseus`, `flagged_for_vida`) |
| notes | string | Extraction notes: why null result, what was paywalled, etc. |

## Status Lifecycle

```
unprocessed → processing → processed | null-result
```

| Status | Meaning |
|--------|---------|
| `unprocessed` | Content archived, no agent has extracted from it yet |
| `processing` | An agent is actively working on extraction |
| `processed` | Extraction complete; `claims_extracted` and/or `enrichments` populated |
| `null-result` | Agent reviewed and determined no extractable claims (must include `notes` explaining why) |

Note: Legacy files may use `partial` for paywalled content. Treat as equivalent to `processing` with a `notes` field explaining the limitation.

## Filing Convention

**Filename:** `YYYY-MM-DD-{author-handle}-{brief-slug}.md`

Examples:
- `2026-02-22-citriniresearch-2028-global-intelligence-crisis.md`
- `2026-03-06-time-anthropic-drops-rsp.md`
- `2024-01-doppler-whitepaper-liquidity-bootstrapping.md`

**Body:** After the frontmatter, include a summary of the source content. This serves two purposes:
1. Agents can extract claims without re-fetching the URL
2. Content persists even if the original URL goes down

The body is NOT a claim; it's a reference document. Use descriptive sections, not argumentative ones.

## Governance

- **Who archives:** Any agent can archive sources. The `processed_by` field tracks who extracted, not who archived.
- **When to archive:** Archive at ingestion time, before extraction begins. Set `status: unprocessed`.
- **After extraction:** Update frontmatter with `status: processed`, `processed_by`, `processed_date`, `claims_extracted`, and `enrichments`.
- **Null results:** Set `status: null-result` and explain in `notes` why no claims were extracted. Null results are valuable: they prevent duplicate work.
- **No deletion:** Sources are never deleted from the archive, even if they yield no claims.

## Legacy Fields

Older archive files (pre-migration) may use different field names:

- `type: evidence` or `type: archive` instead of `type: source`
- `source:` instead of `url:`
- `date_published:` or `date_archived:` instead of `date:`
- `source_type:` instead of `format:`
- `archived_by:` (still valid, just not required)
- `status: partial` instead of `processing` with notes

These are accepted for backward compatibility. New files should use the canonical field names above.
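
The legacy mappings, required fields, status rules, and filename convention described in this schema lend themselves to a mechanical check. The following is a minimal sketch, not an official tool: it assumes the frontmatter has already been parsed into a Python dict, and the names `normalize_legacy`, `validate_source`, and `FILENAME_RE` are hypothetical. Two further assumptions are made where the schema is silent: a canonical field wins when both it and its legacy name are present (with `date_published` preferred over `date_archived`), and the filename pattern treats the day as optional because one of the listed examples omits it.

```python
import re

REQUIRED_FIELDS = ("type", "title", "author", "url", "date", "domain", "status")
STATUSES = {"unprocessed", "processing", "processed", "null-result"}

# Legacy field name -> canonical field name, per the Legacy Fields list.
LEGACY_FIELD_NAMES = {
    "source": "url",
    "date_published": "date",
    "date_archived": "date",
    "source_type": "format",
}

# Assumption: lowercase hyphenated segments; the day is optional because one
# documented example omits it. This is a loose check, not a strict parser.
FILENAME_RE = re.compile(r"^\d{4}-\d{2}(?:-\d{2})?(?:-[a-z0-9]+){2,}\.md$")


def normalize_legacy(meta):
    """Return a copy of meta with legacy names and values mapped to canonical ones."""
    out = {k: v for k, v in meta.items() if k not in LEGACY_FIELD_NAMES}
    for legacy, canonical in LEGACY_FIELD_NAMES.items():
        # Assumption: a canonical value already present wins over a legacy one.
        if legacy in meta:
            out.setdefault(canonical, meta[legacy])
    if out.get("type") in ("evidence", "archive"):
        out["type"] = "source"
    if out.get("status") == "partial":
        out["status"] = "processing"
    return out


def validate_source(meta, filename=None):
    """Return a list of problems; an empty list means the checks passed."""
    meta = normalize_legacy(meta)
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if not meta.get(f)]
    if meta.get("type") and meta["type"] != "source":
        problems.append("type must be 'source'")
    status = meta.get("status")
    if status and status not in STATUSES:
        problems.append(f"unknown status: {status}")
    if status == "null-result" and not meta.get("notes"):
        problems.append("null-result requires notes")
    if status == "processed" and not (meta.get("claims_extracted") or meta.get("enrichments")):
        problems.append("processed requires claims_extracted and/or enrichments")
    if filename is not None and not FILENAME_RE.match(filename):
        problems.append(f"filename does not follow convention: {filename}")
    return problems
```

Called as `validate_source(frontmatter_dict, "2026-03-06-time-anthropic-drops-rsp.md")`, the function returns a list of strings, so an agent could log the problems or refuse to mark a file `processed` until the list is empty. Returning problems instead of raising on the first one is a design choice for this sketch: it lets a single pass over the archive report every issue per file.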