- What: Added type extensibility rules (domain types are agent-managed, core types require schema PR) and cross-domain entity dedup protocol (one entity per real-world object, secondary_domains for visibility). - Why: Leo flagged both gaps in PR #593 review. Pentagon-Agent: Rio <760F7FE7-5D50-4C2E-8B7C-9F1A8FEE8A46>
149 lines
6.1 KiB
Markdown
# Entity Extraction Field Guide

How to extract entities from source material. This skill works alongside `extract.md` (claim extraction); both run during source processing.

## When to Extract Entities

Every source may contain entity data. During extraction, ask:

1. **Does this source mention an organization, person, product, or market we don't already track?** → Create a new entity
2. **Does this source contain updated information about an entity we already track?** → Update the existing entity (timeline, metrics, status)
3. **Does this source describe a decision, proposal, or market outcome?** → Create a decision_market entity (if it meets the significance threshold)

## The Dual Extraction Loop

```
Source → Read completely
↓
Extract claims (propositions about the world) → domains/{domain}/
Extract entities (objects in the world) → entities/{domain}/
Update existing entities (new timeline events, metrics)
↓
Both in the same PR
```

## Entity Extraction Process

### Step 1: Identify Entity Mentions

Read the source and list every entity mentioned. For each:
- Is it already in `entities/{domain}/`? → Flag for update
- Is it new and significant enough to track? → Flag for creation
- Is it mentioned in passing with no meaningful data? → Skip

**Significance test:** Would tracking this entity help us evaluate claims or form positions? If the entity is just background context, skip it.

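
The three outcomes in Step 1 can be sketched as a small triage function. This is a minimal illustration, not part of any existing tooling: the `Mention` record, its boolean fields, and the `triage` function are hypothetical names, and the order of the checks (skip first when the source gives no meaningful data) is one reading of the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    UPDATE = "update"
    CREATE = "create"
    SKIP = "skip"


@dataclass
class Mention:
    """Hypothetical record for one entity mention found in a source."""
    name: str
    already_tracked: bool        # a file exists under entities/{domain}/
    has_meaningful_data: bool    # more than a passing mention
    helps_evaluate_claims: bool  # the significance test


def triage(mention: Mention) -> Action:
    # Passing mentions carry nothing to record, tracked or not (assumed ordering).
    if not mention.has_meaningful_data:
        return Action.SKIP
    # Known entity with new data: flag for update.
    if mention.already_tracked:
        return Action.UPDATE
    # New entity: create only if it passes the significance test.
    if mention.helps_evaluate_claims:
        return Action.CREATE
    return Action.SKIP


# Illustrative call with a made-up mention.
print(triage(Mention("Example Labs", False, True, True)))  # prints: Action.CREATE
```

The judgment calls (is the data meaningful, does it help evaluate claims) stay with the extractor; the sketch only fixes the order in which they are applied.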
### Step 2: Select Entity Type

Use the most specific type available. See `schemas/entity.md` for the full type system.

```
Is it a person?                         → person (or domain-specific: creator)
Is it a government/regulatory body?     → organization (or domain-specific: governance_body)
Is it a governance proposal or market?  → decision_market
Is it a specific product/tool?          → product (or domain-specific: drug, model, vehicle)
Is it an organization that operates?    → company (or domain-specific: lab, studio, insurer)
Is it a market segment?                 → market
```

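
The decision tree above reads as an ordered list of rules with an optional domain-specific override. A minimal sketch under that reading follows; the trait labels (`"person"`, `"product_or_tool"`, and so on), the `TYPE_RULES` table, and `select_entity_type` are hypothetical names introduced for illustration, while the type names on the right come from the tree itself. `schemas/entity.md` remains the authority on which types exist.

```python
from typing import Dict, Optional, Set

# (trait label, core type) pairs in the same order as the decision tree.
# The trait labels are hypothetical; the core types come from the guide.
TYPE_RULES = [
    ("person", "person"),
    ("government_or_regulatory_body", "organization"),
    ("governance_proposal_or_market", "decision_market"),
    ("product_or_tool", "product"),
    ("operating_organization", "company"),
    ("market_segment", "market"),
]


def select_entity_type(
    traits: Set[str],
    domain_specific: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return the first matching type, preferring a domain-specific one."""
    overrides = domain_specific or {}
    for trait, core_type in TYPE_RULES:
        if trait in traits:
            # Use the more specific domain type when the domain defines one.
            return overrides.get(core_type, core_type)
    return None  # nothing matched: consult schemas/entity.md


print(select_entity_type({"product_or_tool"}, {"product": "drug"}))  # prints: drug
print(select_entity_type({"person"}))  # prints: person
```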
### Step 3: Extract Frontmatter

Fill in every field you have data for. Don't guess: leave fields empty rather than fabricating data.

**Required fields** (every entity):
- `type: entity`
- `entity_type`: the specific type
- `name`: canonical display name
- `domain`: primary domain
- `status`: current status
- `tracked_by`: your agent name
- `created`: today's date

**Optional but valuable:**
- `handles`: social media handles (from the source or quick lookup)
- `website`: primary web presence
- `tags`: discovery tags
- `secondary_domains`: if the entity spans domains

**Type-specific fields:** Fill in whatever the source provides. The schema lists all available fields; use the ones that have data.

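
A check for the required fields listed in Step 3 might look like the sketch below. It assumes the frontmatter has already been parsed into a plain dict; `missing_required` is a hypothetical helper, and the example values (the domain identifier, the status string, the ISO date format) are illustrative assumptions rather than values taken from `schemas/entity.md`.

```python
from __future__ import annotations

from datetime import date

# Field names copied from the "Required fields" list above.
REQUIRED_FIELDS = (
    "type", "entity_type", "name", "domain", "status", "tracked_by", "created",
)


def missing_required(frontmatter: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    problems = [f for f in REQUIRED_FIELDS if frontmatter.get(f) in (None, "")]
    # `type` is fixed: every entity file declares `type: entity`.
    if frontmatter.get("type") not in (None, "", "entity"):
        problems.append("type (must be 'entity')")
    return problems


# Illustrative frontmatter; the concrete values are assumptions.
example = {
    "type": "entity",
    "entity_type": "company",
    "name": "MetaDAO",
    "domain": "internet-finance",        # assumed domain identifier
    "status": "active",                  # assumed status value
    "tracked_by": "rio",
    "created": date.today().isoformat(), # assumed date format
}
print(missing_required(example))  # prints: []
```

Optional and type-specific fields are deliberately not checked: the guide asks for them only when the source provides data.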
### Step 4: Write the Body

Follow the body format from `schemas/entity.md`:

1. **Overview**: What this entity is, why we track it (2-3 sentences)
2. **Current State**: Latest known attributes from this source
3. **Timeline**: Key events with dates (at minimum, the event from this source)
4. **Competitive Position**: Where it sits relative to competitors (if known)
5. **Relationship to KB**: Wiki-links to related claims and entities

### Step 5: Check for Duplicates

Before creating a new entity, search **all** `entities/` directories (not just your domain) for:
- Same name (exact or variant spelling)
- Same handles
- Same website

If a match exists in **your domain**, update the existing entity.

If a match exists in **another domain**, don't create a duplicate. Instead, add your domain to the existing entity's `secondary_domains` list and propose updates via PR. See `schemas/entity.md` → "Cross-Domain Entity Dedup" for the full protocol.

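
The duplicate check and its three outcomes can be sketched as follows. This is an assumption-heavy illustration: `EntityRecord`, `find_match`, and `dedup_action` are hypothetical names, the records are assumed to be loaded from every `entities/` directory beforehand, and the normalization only catches differences in case, punctuation, URL scheme, and `www.`; genuine variant spellings still need a human look. The full protocol lives in `schemas/entity.md`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityRecord:
    """Hypothetical in-memory view of an entity file's frontmatter."""
    name: str
    domain: str
    handles: list[str] = field(default_factory=list)
    website: str = ""
    secondary_domains: list[str] = field(default_factory=list)


def _norm(text: str) -> str:
    # Lowercase and drop everything except letters and digits.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def _norm_site(url: str) -> str:
    url = url.lower().strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def find_match(candidate: EntityRecord, existing: list[EntityRecord]) -> EntityRecord | None:
    cand_handles = {_norm(h) for h in candidate.handles if _norm(h)}
    for entity in existing:
        same_name = _norm(entity.name) == _norm(candidate.name)
        same_handle = bool(cand_handles & {_norm(h) for h in entity.handles})
        same_site = bool(candidate.website) and (
            _norm_site(entity.website) == _norm_site(candidate.website)
        )
        if same_name or same_handle or same_site:
            return entity
    return None


def dedup_action(candidate: EntityRecord, existing: list[EntityRecord]) -> str:
    match = find_match(candidate, existing)
    if match is None:
        return "create"
    if match.domain == candidate.domain:
        return "update"
    # Match in another domain: no duplicate, extend secondary_domains via PR.
    return "add_secondary_domain"


# Made-up records; names, domains, and URLs are placeholders.
known = [EntityRecord("Example Labs", "ai-alignment", ["@examplelabs"], "https://example.com")]
new = EntityRecord("example labs", "health", website="http://www.example.com/")
print(dedup_action(new, known))  # prints: add_secondary_domain
```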
### Step 6: Update Parent Entities

If the new entity has a `parent` or `parent_entity` field, update the parent:
- Add the new entity to the parent's Relevant Entities section
- If it's a decision_market, add it to the parent's Key Decisions table (if significant)
- Add a timeline entry on the parent

## What Makes a Good Entity

**Good entities have:**
- Concrete, verifiable attributes (dates, metrics, names)
- Clear relevance to at least one domain claim
- Enough data to be useful (not just a name)
- A reason to track changes over time

**Bad entity candidates:**
- Mentioned once in passing with no data
- Purely historical with no ongoing relevance
- Duplicates of existing entities under different names
- Too granular (not every tweet needs an entity)

## Domain-Specific Guidance

### Internet Finance (Rio)
- Protocols and tokens are separate entities (MetaDAO = company, META = token)
- Every futardio launch that raises significant capital gets a company entity
- Governance proposals that materially change direction get decision_market entities
- Regulatory bodies (CFTC, SEC) get organization entities

### Space (Astra)
- Vehicles (Starship, New Glenn) are distinct from their makers (SpaceX, Blue Origin)
- Programs (Artemis, Commercial Crew) are distinct from the agencies running them
- Missions get entities when they're historically significant or produce notable data

### Health (Vida)
- Drugs are distinct from the companies that make them
- Insurers and providers are separate entity types; don't conflate them
- Policies (legislation, CMS rules) get two entities: an organization entity for the issuing body and a policy entity for the rule itself

### Entertainment (Clay)
- Creators are distinct from their companies (MrBeast vs Beast Industries)
- Franchises/IP are distinct from the studios that own them
- Platforms (YouTube, TikTok) get product or platform entities

### AI/Alignment (Theseus)
- Labs are distinct from their models (Anthropic vs Claude)
- Frameworks (RSP, Constitutional AI) get their own entities when they influence multiple claims
- Governance bodies (AISI, FLI) get organization entities

## Eval Checklist (for reviewers)

1. `entity_type` is the most specific available type
2. Required fields are all populated
3. No fabricated data; empty fields are better than guesses
4. Not a duplicate of an existing entity
5. Meets significance threshold
6. Wiki links resolve to real files
7. Parent entity updated if applicable
8. Filing location is correct: `entities/{domain}/{slug}.md`

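
For the filing-location item, the expected path can be derived mechanically from the domain and the entity name. The sketch below assumes a lowercase, hyphen-separated slug and a domain folder named `space`; neither convention is stated in this guide, and `slugify` and `entity_path` are hypothetical helpers.

```python
import re
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    # Assumed convention: lowercase, runs of non-alphanumerics become one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def entity_path(domain: str, name: str) -> PurePosixPath:
    # Mirrors the pattern entities/{domain}/{slug}.md from the checklist.
    return PurePosixPath("entities") / domain / f"{slugify(name)}.md"


print(entity_path("space", "New Glenn"))  # prints: entities/space/new-glenn.md
```

A reviewer could compare this derived path with the path in the PR; a mismatch means either the file is misfiled or the repository uses a different slug convention.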