teleo-infrastructure/scripts
m3taversal 58cc955c69 feat(schema): v26 — publishers + contributor_identities + sources provenance
Separates three concerns currently conflated in contributors table:
  contributors — people + agents we credit (kind in 'person','agent')
  publishers   — news orgs / academic venues / platforms (not credited)
  sources      — gains publisher_id + content_type + original_author columns

Rationale (Cory directive Apr 24): livingip.xyz leaderboard was showing CNBC,
SpaceNews, TechCrunch etc. at the top because the attribution pipeline credited
news org names as if they were contributors. The mechanism-level fix is a
schema split — orgs live in publishers, individuals in contributors, each
table has one semantics.

Migration v26:
  - CREATE TABLE publishers (id PK, name UNIQUE, kind CHECK IN
    news|academic|social_platform|podcast|self|internal|legal|government|
    research_org|commercial|other, url_pattern, created_at)
  - CREATE TABLE contributor_identities (contributor_handle, platform CHECK IN
    x|telegram|github|email|web|internal, platform_handle, verified, created_at)
    Composite PK on (platform, platform_handle) + index on contributor_handle.
    Enables one contributor to unify X + TG + GitHub handles.
  - ALTER TABLE sources ADD COLUMN publisher_id REFERENCES publishers(id)
  - ALTER TABLE sources ADD COLUMN content_type
    (article|paper|tweet|conversation|self_authored|webpage|podcast)
  - ALTER TABLE sources ADD COLUMN original_author TEXT
    (free-text fallback, e.g., "Kim et al." — not credit-bearing)
  - ALTER TABLE sources ADD COLUMN original_author_handle REFERENCES contributors(handle)
    (set only when the author is in our contributor network)
  - ALTER wrapped in try/except on "duplicate column" for replay safety
  - Both SCHEMA_SQL (fresh installs) + migration block (upgrades) updated
  - SCHEMA_VERSION bumped 25 -> 26

Migration is non-breaking. No data moves yet. Existing publishers-polluting-
contributors row state is preserved until the classifier runs. Writer routing
to these tables lands in a separate branch (Phase B writer changes).

Classifier (scripts/classify-contributors.py):
  Analyzes existing contributors rows, buckets into:
    keep_agent   — 9 Pentagon agents
    keep_person  — 21 real humans + reachable pseudonymous X/TG handles
    publisher    — 100 news orgs, academic venues, formal-citation names,
                   brand/platform names
    garbage      — 9 parse artifacts (containing /, parens, 3+ hyphens)
    review_needed — 0 (fully covered by current allowlists)

  Hand-curated allowlists for news/academic/social/internal publisher kinds.
  Garbage detection via regex on special chars and length > 50.
  Named pseudonyms without @ prefix (karpathy, simonw, swyx, metaproph3t,
  sjdedic, ceterispar1bus, etc.) classified as keep_person — they're real
  X/TG contributors missing an @ prefix because extraction frontmatter
  didn't normalize. Cory's auto-create rule catches these on first reference.

  Formal-citation names (Firstname-Lastname form — Clayton Christensen, Hayek,
  Ostrom, Friston, Bostrom, Bak, etc.) classified as academic publishers —
  these are cited, not reachable via @ handle. Get promoted to contributors
  if/when they sign up with an @ handle.

  Apply path is transactional (BEGIN / COMMIT / ROLLBACK on error). Publisher
  insert happens before contributor delete, and contributor delete is gated
  on successful insert so we never lose a row by moving it to a failed
  publisher insert.

  --apply path flags:
    --delete-events  : also DELETE contribution_events rows for moved handles
                       (default: keep events for audit trail)
  --show <handle>   : inspect a single row's classification

Smoke-tested end-to-end via local copy of VPS DB:
  Before: 139 contributors total (polluted with orgs)
  After:  30 contributors (9 agent + 21 person), 100 publishers, 9 deleted
  contribution_events: 3,705 preserved
  contributors <-> publishers overlap: 0

Named contributors verified present after --apply:
  alexastrum (claims=6)  thesensatore (5)  cameron-s1 (1)  m3taversal (1011)

Pentagon agent 'pipeline' (claims_merged=771) intentionally retained — it's
the process name from old extract.py fallback path, not a real contributor.
Classified as agent (kind='agent') so doesn't appear in person leaderboard.

Deploy sequence after Ganymede review:
  1. Branch ff-merge to main
  2. scp lib/db.py + scripts/classify-contributors.py to VPS
  3. Pipeline already at v26 (migration ran during earlier v26 restart)
  4. Run dry-run: python3 ops/classify-contributors.py
  5. Apply: python3 ops/classify-contributors.py --apply
  6. Verify: livingip.xyz leaderboard stops showing CNBC/SpaceNews
  7. Argus /api/contributors unaffected (reads contributors directly, now clean)

Follow-up branch (not in this commit):
  - Writer routing in lib/contributor.py + extract.py:
    org handles -> publishers table + sources.publisher_id
    person handles with @ prefix -> auto-create contributor, tier='cited'
    formal-citation names -> sources.original_author (free text)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 17:40:10 +01:00
..
audit-wiki-links.py feat: add wiki-link audit script for codex graph integrity 2026-04-21 10:46:55 +01:00
backfill-ci.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
backfill-descriptions.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
backfill-domains.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
backfill-events.py fix(backfill): normalize commit_date via datetime() in time-proximity query 2026-04-24 16:16:03 +01:00
backfill-reviewer-count.py fix: sync all code from VPS — repo is now authoritative source of truth 2026-04-15 13:18:01 +01:00
backfill-source-authors.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
backfill-sourcer-attribution.py fix(attribution): credit sourcer/extractor from claim frontmatter 2026-04-24 12:48:41 +01:00
backfill-sources.py fix(backfill): don't regress terminal source statuses to unprocessed 2026-04-22 21:29:33 +01:00
backfill-synthetic-recovery-prs.py fix(backfill): Ganymede review — fix tautological guard + origin='human' 2026-04-24 16:49:12 +01:00
bootstrap-contributors.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
classify-contributors.py feat(schema): v26 — publishers + contributor_identities + sources provenance 2026-04-24 17:40:10 +01:00
contributor-graph.py feat: contributor graph PNG generator + API endpoint 2026-04-21 11:01:02 +01:00
cumulative-growth.py feat: add /api/contributor-growth endpoint + cumulative growth script 2026-04-20 22:19:42 +01:00
embed-claims.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
extract-decisions.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
extract-graph-data.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
migrate-entity-schema.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
migrate-source-archive.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
nightly-reweave.sh fix: lower reweave threshold 0.70→0.55, increase batch 50→200 2026-04-16 14:18:50 +01:00
openrouter-extract-v2.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
reconcile-source-status.sh feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
reconcile-sources.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
scoring_digest.py feat: wire action-type CI into contributor profiles 2026-04-21 11:29:01 +01:00
tier0-gate.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00
vector-gc.py feat: reorganize repo with clear directory boundaries and agent ownership 2026-04-14 18:20:13 +01:00