Separates three concerns currently conflated in contributors table:
contributors — people + agents we credit (kind in 'person','agent')
publishers — news orgs / academic venues / platforms (not credited)
sources — gains publisher_id + content_type + original_author columns
Rationale (Cory directive Apr 24): livingip.xyz leaderboard was showing CNBC,
SpaceNews, TechCrunch etc. at the top because the attribution pipeline credited
news org names as if they were contributors. The mechanism-level fix is a
schema split — orgs live in publishers, individuals in contributors, each
table has one semantics.
Migration v26:
- CREATE TABLE publishers (id PK, name UNIQUE, kind CHECK IN
news|academic|social_platform|podcast|self|internal|legal|government|
research_org|commercial|other, url_pattern, created_at)
- CREATE TABLE contributor_identities (contributor_handle, platform CHECK IN
x|telegram|github|email|web|internal, platform_handle, verified, created_at)
Composite PK on (platform, platform_handle) + index on contributor_handle.
Enables one contributor to unify X + TG + GitHub handles.
- ALTER TABLE sources ADD COLUMN publisher_id REFERENCES publishers(id)
- ALTER TABLE sources ADD COLUMN content_type
(article|paper|tweet|conversation|self_authored|webpage|podcast)
- ALTER TABLE sources ADD COLUMN original_author TEXT
(free-text fallback, e.g., "Kim et al." — not credit-bearing)
- ALTER TABLE sources ADD COLUMN original_author_handle REFERENCES contributors(handle)
(set only when the author is in our contributor network)
- ALTER wrapped in try/except on "duplicate column" for replay safety
- Both SCHEMA_SQL (fresh installs) + migration block (upgrades) updated
- SCHEMA_VERSION bumped 25 -> 26
Migration is non-breaking. No data moves yet. Existing publishers-polluting-
contributors row state is preserved until the classifier runs. Writer routing
to these tables lands in a separate branch (Phase B writer changes).
Classifier (scripts/classify-contributors.py):
Analyzes existing contributors rows, buckets into:
keep_agent — 9 Pentagon agents
keep_person — 21 real humans + reachable pseudonymous X/TG handles
publisher — 100 news orgs, academic venues, formal-citation names,
brand/platform names
garbage — 9 parse artifacts (containing /, parens, 3+ hyphens)
review_needed — 0 (fully covered by current allowlists)
Hand-curated allowlists for news/academic/social/internal publisher kinds.
Garbage detection via regex on special chars and length > 50.
Named pseudonyms without @ prefix (karpathy, simonw, swyx, metaproph3t,
sjdedic, ceterispar1bus, etc.) classified as keep_person — they're real
X/TG contributors missing an @ prefix because extraction frontmatter
didn't normalize. Cory's auto-create rule catches these on first reference.
Formal-citation names (Firstname-Lastname form — Clayton Christensen, Hayek,
Ostrom, Friston, Bostrom, Bak, etc.) classified as academic publishers —
these are cited, not reachable via @ handle. Get promoted to contributors
if/when they sign up with an @ handle.
Apply path is transactional (BEGIN / COMMIT / ROLLBACK on error). Publisher
insert happens before contributor delete, and contributor delete is gated
on successful insert so we never lose a row by moving it to a failed
publisher insert.
--apply path flags:
--delete-events : also DELETE contribution_events rows for moved handles
(default: keep events for audit trail)
--show <handle> : inspect a single row's classification
Smoke-tested end-to-end via local copy of VPS DB:
Before: 139 contributors total (polluted with orgs)
After: 30 contributors (9 agent + 21 person), 100 publishers, 9 deleted
contribution_events: 3,705 preserved
contributors <-> publishers overlap: 0
Named contributors verified present after --apply:
alexastrum (claims=6) thesensatore (5) cameron-s1 (1) m3taversal (1011)
Pentagon agent 'pipeline' (claims_merged=771) intentionally retained — it's
the process name from old extract.py fallback path, not a real contributor.
Classified as agent (kind='agent') so doesn't appear in person leaderboard.
Deploy sequence after Ganymede review:
1. Branch ff-merge to main
2. scp lib/db.py + scripts/classify-contributors.py to VPS
3. Pipeline already at v26 (migration ran during earlier v26 restart)
4. Run dry-run: python3 ops/classify-contributors.py
5. Apply: python3 ops/classify-contributors.py --apply
6. Verify: livingip.xyz leaderboard stops showing CNBC/SpaceNews
7. Argus /api/contributors unaffected (reads contributors directly, now clean)
Follow-up branch (not in this commit):
- Writer routing in lib/contributor.py + extract.py:
org handles -> publishers table + sources.publisher_id
person handles with @ prefix -> auto-create contributor, tier='cited'
formal-citation names -> sources.original_author (free text)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>