feat(schema): v26 — publishers + contributor_identities + sources provenance
Separates three concerns currently conflated in contributors table:
contributors — people + agents we credit (kind in 'person','agent')
publishers — news orgs / academic venues / platforms (not credited)
sources — gains publisher_id + content_type + original_author columns
Rationale (Cory directive Apr 24): livingip.xyz leaderboard was showing CNBC,
SpaceNews, TechCrunch etc. at the top because the attribution pipeline credited
news org names as if they were contributors. The mechanism-level fix is a
schema split — orgs live in publishers, individuals in contributors, each
table has one semantics.
Migration v26:
- CREATE TABLE publishers (id PK, name UNIQUE, kind CHECK IN
news|academic|social_platform|podcast|self|internal|legal|government|
research_org|commercial|other, url_pattern, created_at)
- CREATE TABLE contributor_identities (contributor_handle, platform CHECK IN
x|telegram|github|email|web|internal, platform_handle, verified, created_at)
Composite PK on (platform, platform_handle) + index on contributor_handle.
Enables one contributor to unify X + TG + GitHub handles.
- ALTER TABLE sources ADD COLUMN publisher_id REFERENCES publishers(id)
- ALTER TABLE sources ADD COLUMN content_type
(article|paper|tweet|conversation|self_authored|webpage|podcast)
- ALTER TABLE sources ADD COLUMN original_author TEXT
(free-text fallback, e.g., "Kim et al." — not credit-bearing)
- ALTER TABLE sources ADD COLUMN original_author_handle REFERENCES contributors(handle)
(set only when the author is in our contributor network)
- ALTER wrapped in try/except on "duplicate column" for replay safety
- Both SCHEMA_SQL (fresh installs) + migration block (upgrades) updated
- SCHEMA_VERSION bumped 25 -> 26
Migration is non-breaking. No data moves yet. Existing publishers-polluting-
contributors row state is preserved until the classifier runs. Writer routing
to these tables lands in a separate branch (Phase B writer changes).
Classifier (scripts/classify-contributors.py):
Analyzes existing contributors rows, buckets into:
keep_agent — 9 Pentagon agents
keep_person — 21 real humans + reachable pseudonymous X/TG handles
publisher — 100 news orgs, academic venues, formal-citation names,
brand/platform names
garbage — 9 parse artifacts (containing /, parens, 3+ hyphens)
review_needed — 0 (fully covered by current allowlists)
Hand-curated allowlists for news/academic/social/internal publisher kinds.
Garbage detection via regex on special chars and length > 50.
Named pseudonyms without @ prefix (karpathy, simonw, swyx, metaproph3t,
sjdedic, ceterispar1bus, etc.) classified as keep_person — they're real
X/TG contributors missing an @ prefix because extraction frontmatter
didn't normalize. Cory's auto-create rule catches these on first reference.
Formal-citation names (Firstname-Lastname form — Clayton Christensen, Hayek,
Ostrom, Friston, Bostrom, Bak, etc.) classified as academic publishers —
these are cited, not reachable via @ handle. Get promoted to contributors
if/when they sign up with an @ handle.
Apply path is transactional (BEGIN / COMMIT / ROLLBACK on error). Publisher
insert happens before contributor delete, and contributor delete is gated
on successful insert so we never lose a row by moving it to a failed
publisher insert.
--apply path flags:
--delete-events : also DELETE contribution_events rows for moved handles
(default: keep events for audit trail)
--show <handle> : inspect a single row's classification
Smoke-tested end-to-end via local copy of VPS DB:
Before: 139 contributors total (polluted with orgs)
After: 30 contributors (9 agent + 21 person), 100 publishers, 9 deleted
contribution_events: 3,705 preserved
contributors <-> publishers overlap: 0
Named contributors verified present after --apply:
alexastrum (claims=6) thesensatore (5) cameron-s1 (1) m3taversal (1011)
Pentagon agent 'pipeline' (claims_merged=771) intentionally retained — it's
the process name from old extract.py fallback path, not a real contributor.
Classified as agent (kind='agent') so doesn't appear in person leaderboard.
Deploy sequence after Ganymede review:
1. Branch ff-merge to main
2. scp lib/db.py + scripts/classify-contributors.py to VPS
3. Pipeline already at v26 (migration ran during earlier v26 restart)
4. Run dry-run: python3 ops/classify-contributors.py
5. Apply: python3 ops/classify-contributors.py --apply
6. Verify: livingip.xyz leaderboard stops showing CNBC/SpaceNews
7. Argus /api/contributors unaffected (reads contributors directly, now clean)
Follow-up branch (not in this commit):
- Writer routing in lib/contributor.py + extract.py:
org handles -> publishers table + sources.publisher_id
person handles with @ prefix -> auto-create contributor, tier='cited'
formal-citation names -> sources.original_author (free text)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
f0f9388c1f
commit
45b2f6de20
2 changed files with 448 additions and 1 deletions
83
lib/db.py
83
lib/db.py
|
|
@ -9,7 +9,7 @@ from . import config
|
|||
|
||||
logger = logging.getLogger("pipeline.db")
|
||||
|
||||
SCHEMA_VERSION = 25
|
||||
SCHEMA_VERSION = 26
|
||||
|
||||
SCHEMA_SQL = """
|
||||
CREATE TABLE IF NOT EXISTS schema_version (
|
||||
|
|
@ -35,6 +35,15 @@ CREATE TABLE IF NOT EXISTS sources (
|
|||
feedback TEXT,
|
||||
-- eval feedback for re-extraction (JSON)
|
||||
cost_usd REAL DEFAULT 0,
|
||||
-- v26: provenance — publisher (news org / venue) + content author.
|
||||
-- publisher_id references publishers(id) when source is from a known org.
|
||||
-- original_author_handle references contributors(handle) when author is in our system.
|
||||
-- original_author is free-text fallback ("Kim et al.", "Robin Hanson") — not credit-bearing.
|
||||
publisher_id INTEGER REFERENCES publishers(id),
|
||||
content_type TEXT,
|
||||
-- article | paper | tweet | conversation | self_authored | webpage | podcast
|
||||
original_author TEXT,
|
||||
original_author_handle TEXT REFERENCES contributors(handle),
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
updated_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
|
|
@ -207,6 +216,33 @@ CREATE TABLE IF NOT EXISTS contributor_aliases (
|
|||
created_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_aliases_canonical ON contributor_aliases(canonical);
|
||||
|
||||
-- Publishers: news orgs, academic venues, social platforms. NOT contributors — these
|
||||
-- provide metadata/provenance for sources, never earn leaderboard credit. Separating
|
||||
-- these from contributors prevents CNBC/SpaceNews from dominating the leaderboard.
|
||||
-- (Apr 24 Cory directive: "only credit the original source if its on X or tg")
|
||||
CREATE TABLE IF NOT EXISTS publishers (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
name TEXT NOT NULL UNIQUE,
|
||||
kind TEXT CHECK(kind IN ('news', 'academic', 'social_platform', 'podcast', 'self', 'internal', 'legal', 'government', 'research_org', 'commercial', 'other')),
|
||||
url_pattern TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_publishers_name ON publishers(name);
|
||||
CREATE INDEX IF NOT EXISTS idx_publishers_kind ON publishers(kind);
|
||||
|
||||
-- Multi-platform identity: one contributor, many handles. Enables the leaderboard to
|
||||
-- unify @thesensatore (X) + thesensatore (TG) + thesensatore@github into one person.
|
||||
-- Writers check this table after resolving aliases to find canonical contributor handle.
|
||||
CREATE TABLE IF NOT EXISTS contributor_identities (
|
||||
contributor_handle TEXT NOT NULL,
|
||||
platform TEXT NOT NULL CHECK(platform IN ('x', 'telegram', 'github', 'email', 'web', 'internal')),
|
||||
platform_handle TEXT NOT NULL,
|
||||
verified INTEGER DEFAULT 0,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
PRIMARY KEY (platform, platform_handle)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_identities_contributor ON contributor_identities(contributor_handle);
|
||||
"""
|
||||
|
||||
|
||||
|
|
@ -764,6 +800,51 @@ def migrate(conn: sqlite3.Connection):
|
|||
conn.commit()
|
||||
logger.info("Migration v25: patched kind='agent' for pipeline handle")
|
||||
|
||||
if current < 26:
|
||||
# Add publishers + contributor_identities. Non-breaking — new tables only.
|
||||
# No existing data moved. Classification into publishers happens via a
|
||||
# separate script (scripts/reclassify-contributors.py) with Cory-reviewed
|
||||
# seed list. CHECK constraint on contributors.kind deferred to v27 after
|
||||
# classification completes. (Apr 24 Cory directive: "fix schema, don't
|
||||
# filter output" — separate contributors from publishers at the data layer.)
|
||||
conn.executescript("""
|
||||
CREATE TABLE IF NOT EXISTS publishers (
|
||||
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
||||
name TEXT NOT NULL UNIQUE,
|
||||
kind TEXT CHECK(kind IN ('news', 'academic', 'social_platform', 'podcast', 'self', 'internal', 'legal', 'government', 'research_org', 'commercial', 'other')),
|
||||
url_pattern TEXT,
|
||||
created_at TEXT DEFAULT (datetime('now'))
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_publishers_name ON publishers(name);
|
||||
CREATE INDEX IF NOT EXISTS idx_publishers_kind ON publishers(kind);
|
||||
|
||||
CREATE TABLE IF NOT EXISTS contributor_identities (
|
||||
contributor_handle TEXT NOT NULL,
|
||||
platform TEXT NOT NULL CHECK(platform IN ('x', 'telegram', 'github', 'email', 'web', 'internal')),
|
||||
platform_handle TEXT NOT NULL,
|
||||
verified INTEGER DEFAULT 0,
|
||||
created_at TEXT DEFAULT (datetime('now')),
|
||||
PRIMARY KEY (platform, platform_handle)
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_identities_contributor ON contributor_identities(contributor_handle);
|
||||
""")
|
||||
# Extend sources with provenance columns. ALTER TABLE ADD COLUMN is
|
||||
# idempotent-safe via try/except because SQLite doesn't support IF NOT EXISTS
|
||||
# on column adds.
|
||||
for col_sql in (
|
||||
"ALTER TABLE sources ADD COLUMN publisher_id INTEGER REFERENCES publishers(id)",
|
||||
"ALTER TABLE sources ADD COLUMN content_type TEXT",
|
||||
"ALTER TABLE sources ADD COLUMN original_author TEXT",
|
||||
"ALTER TABLE sources ADD COLUMN original_author_handle TEXT REFERENCES contributors(handle)",
|
||||
):
|
||||
try:
|
||||
conn.execute(col_sql)
|
||||
except sqlite3.OperationalError as e:
|
||||
if "duplicate column" not in str(e).lower():
|
||||
raise
|
||||
conn.commit()
|
||||
logger.info("Migration v26: added publishers + contributor_identities tables + sources provenance columns")
|
||||
|
||||
if current < SCHEMA_VERSION:
|
||||
conn.execute(
|
||||
"INSERT OR REPLACE INTO schema_version (version) VALUES (?)",
|
||||
|
|
|
|||
366
scripts/classify-contributors.py
Normal file
366
scripts/classify-contributors.py
Normal file
|
|
@ -0,0 +1,366 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Classify `contributors` rows into {keep_person, keep_agent, move_to_publisher, delete_garbage}.
|
||||
|
||||
Reads current contributors table, proposes reclassification per v26 schema design:
|
||||
- Real humans + Pentagon agents stay in contributors (kind='person'|'agent')
|
||||
- News orgs, publications, venues move to publishers table (new v26)
|
||||
- Multi-word hyphenated garbage (parsing artifacts) gets deleted
|
||||
- Their contribution_events are handled per category:
|
||||
* Publishers: DELETE events (orgs shouldn't have credit)
|
||||
* Garbage: DELETE events (bogus data)
|
||||
* Persons/agents: keep events untouched
|
||||
|
||||
Classification is heuristic — uses explicit allowlists + regex patterns + length gates.
|
||||
Ambiguous cases default to 'review_needed' (human decision).
|
||||
|
||||
Usage:
|
||||
python3 scripts/classify-contributors.py # dry-run analysis + report
|
||||
python3 scripts/classify-contributors.py --apply # write changes
|
||||
python3 scripts/classify-contributors.py --show <handle> # inspect a single row
|
||||
|
||||
Writes to pipeline.db only. Does NOT modify claim files.
|
||||
"""
|
||||
import argparse
|
||||
import os
|
||||
import re
|
||||
import sqlite3
|
||||
import sys
|
||||
from collections import Counter
|
||||
from pathlib import Path
|
||||
|
||||
DB_PATH = os.environ.get("PIPELINE_DB", "/opt/teleo-eval/pipeline/pipeline.db")
|
||||
|
||||
# Pentagon agents: kind='agent'. Authoritative list.
|
||||
PENTAGON_AGENTS = frozenset({
|
||||
"rio", "leo", "theseus", "vida", "clay", "astra",
|
||||
"oberon", "argus", "rhea", "ganymede", "epimetheus", "hermes", "ship",
|
||||
"pipeline",
|
||||
})
|
||||
|
||||
# Publisher/news-org handles seen in current contributors table.
|
||||
# Grouped by kind for the publishers row. Classified by inspection.
|
||||
# NOTE: This list is hand-curated — add to it as new orgs appear.
|
||||
PUBLISHERS_NEWS = {
|
||||
# News outlets / brands
|
||||
"cnbc", "al-jazeera", "axios", "bloomberg", "reuters", "bettorsinsider",
|
||||
"fortune", "techcrunch", "coindesk", "coindesk-staff", "coindesk-research",
|
||||
"coindesk research", "coindesk staff",
|
||||
"defense-one", "thedefensepost", "theregister", "the-intercept",
|
||||
"the-meridiem", "variety", "variety-staff", "variety staff", "spacenews",
|
||||
"nasaspaceflight", "thedonkey", "insidedefense", "techpolicypress",
|
||||
"morganlewis", "casinoorg", "deadline", "animationmagazine",
|
||||
"defensepost", "casino-org", "casino.org",
|
||||
"air & space forces magazine", "ieee spectrum", "techcrunch-staff",
|
||||
"blockworks", "blockworks-staff", "decrypt", "ainvest", "banking-dive", "banking dive",
|
||||
"cset-georgetown", "cset georgetown",
|
||||
"kff", "kff-health-news", "kff health news", "kff-health-news---cbo",
|
||||
"kff-health-news-/-cbo", "kff health news / cbo", "kffhealthnews",
|
||||
"bloomberg-law",
|
||||
"norton-rose-fulbright", "norton rose fulbright",
|
||||
"defence-post", "the-defensepost",
|
||||
"wilmerhale", "mofo", "sciencedirect",
|
||||
"yogonet", "csr", "aisi-uk", "aisi", "aisi_gov", "rand",
|
||||
"armscontrol", "eclinmed", "solana-compass", "solana compass",
|
||||
"pmc11919318", "pmc11780016",
|
||||
"healthverity", "natrium", "form-energy",
|
||||
"courtlistener", "curtis-schiff", "curtis-schiff-prediction-markets",
|
||||
"prophetx", "techpolicypress-staff",
|
||||
"npr", "venturebeat", "geekwire", "payloadspace", "the-ankler",
|
||||
"theankler", "tubefilter", "emarketer", "dagster",
|
||||
"numerai", # fund/project brand, not person
|
||||
"psl", "multistate",
|
||||
}
|
||||
PUBLISHERS_ACADEMIC = {
|
||||
# Academic orgs, labs, papers, journals, institutions
|
||||
"arxiv", "metr", "metr_evals", "apollo-research", "apollo research", "apolloresearch",
|
||||
"jacc-study-authors", "jacc-data-report-authors",
|
||||
"anthropic-fellows-program", "anthropic-fellows",
|
||||
"anthropic-fellows-/-alignment-science-team", "anthropic-research",
|
||||
"jmir-2024", "jmir 2024",
|
||||
"oettl-et-al.,-journal-of-experimental-orthopaedics",
|
||||
"oettl et al., journal of experimental orthopaedics",
|
||||
"jacc", "nct06548490", "pmc",
|
||||
"conitzer-et-al.-(2024)", "aquino-michaels-2026", "pan-et-al.",
|
||||
"pan-et-al.-'natural-language-agent-harnesses'",
|
||||
"stanford", "stanford-meta-harness",
|
||||
"hendershot", "annals-im",
|
||||
"nellie-liang,-brookings-institution", "nellie liang, brookings institution",
|
||||
"penn-state", "american-heart-association", "american heart association",
|
||||
"molt_cornelius", "molt-cornelius",
|
||||
# Companies / labs / brand-orgs (not specific humans)
|
||||
"anthropic", "anthropicai", "openai", "nasa", "icrc", "ecri",
|
||||
"epochairesearch", "metadao", "iapam", "icer",
|
||||
"who", "ama", "uspstf", "unknown",
|
||||
"futard.io", # protocol/platform
|
||||
"oxford-martin-ai-governance-initiative",
|
||||
"oxford-martin-ai-governance",
|
||||
"u.s.-food-and-drug-administration",
|
||||
"jitse-goutbeek,-european-policy-centre", # cited person+org string → publisher
|
||||
"adepoju-et-al.", # paper citation
|
||||
# Formal-citation names (Firstname-Lastname or Lastname-et-al) — classified
|
||||
# as academic citations, not reachable contributors. They'd need an @ handle
|
||||
# to get CI credit per Cory's growth-loop design.
|
||||
"senator-elissa-slotkin",
|
||||
"bostrom", "hanson", "kaufmann", "noah-smith", "doug-shapiro",
|
||||
"shayon-sengupta", "shayon sengupta",
|
||||
"robin-hanson", "robin hanson", "eliezer-yudkowsky",
|
||||
"leopold-aschenbrenner", "aschenbrenner",
|
||||
"ramstead", "larsson", "heavey",
|
||||
"dan-slimmon", "van-leeuwaarden", "ward-whitt", "adams",
|
||||
"tamim-ansary", "spizzirri",
|
||||
"dario-amodei", # formal-citation form (real @ is @darioamodei)
|
||||
"corless", "oxranga", "vlahakis",
|
||||
# Brand/project/DAO tokens — not individuals
|
||||
"areal-dao", "areal", "theiaresearch", "futard-io", "dhrumil",
|
||||
# Classic formal-citation names — famous academics/economists cited by surname.
|
||||
# Reachable via @ handle if/when they join (e.g. Ostrom has no X, Hayek deceased,
|
||||
# Friston has an institutional affiliation not an @ handle we'd track).
|
||||
"clayton-christensen", "hidalgo", "coase", "wiener", "juarrero",
|
||||
"ostrom", "centola", "hayek", "marshall-mcluhan", "blackmore",
|
||||
"knuth", "friston", "aquino-michaels", "conitzer", "bak",
|
||||
}
|
||||
# NOTE: pseudonymous X handles that MAY be real contributors stay in keep_person:
|
||||
# karpathy, simonw, swyx, metaproph3t, metanallok, mmdhrumil, sjdedic,
|
||||
# ceterispar1bus — these are real X accounts and match Cory's growth loop.
|
||||
# They appear without @ prefix because extraction frontmatter didn't normalize.
|
||||
# Auto-creating them as contributors tier='cited' is correct (A-path from earlier).
|
||||
PUBLISHERS_SOCIAL = {
|
||||
"x", "twitter", "telegram", "x.com",
|
||||
}
|
||||
PUBLISHERS_INTERNAL = {
|
||||
"teleohumanity-manifesto", "strategy-session-journal",
|
||||
"living-capital-thesis-development", "attractor-state-historical-backtesting",
|
||||
"web-research-compilation", "architectural-investing",
|
||||
"governance---meritocratic-voting-+-futarchy", # title artifact
|
||||
"sec-interpretive-release-s7-2026-09-(march-17", # title artifact
|
||||
"mindstudio", # tooling/platform, not contributor
|
||||
}
|
||||
# Merge into one kind→set map for classification
|
||||
PUBLISHER_KIND_MAP = {}
|
||||
for h in PUBLISHERS_NEWS:
|
||||
PUBLISHER_KIND_MAP[h.lower()] = "news"
|
||||
for h in PUBLISHERS_ACADEMIC:
|
||||
PUBLISHER_KIND_MAP[h.lower()] = "academic"
|
||||
for h in PUBLISHERS_SOCIAL:
|
||||
PUBLISHER_KIND_MAP[h.lower()] = "social_platform"
|
||||
for h in PUBLISHERS_INTERNAL:
|
||||
PUBLISHER_KIND_MAP[h.lower()] = "internal"
|
||||
|
||||
|
||||
# Garbage: handles that are clearly parse artifacts, not real names.
|
||||
# Pattern: contains parens, special chars, or >50 chars.
|
||||
def is_garbage(handle: str) -> bool:
|
||||
h = handle.strip()
|
||||
if len(h) > 50:
|
||||
return True
|
||||
if re.search(r"[()\[\]<>{}\/\\|@#$%^&*=?!:;\"']", h):
|
||||
# But @ can appear legitimately in handles like @thesensatore — allow if @ is only prefix
|
||||
if h.startswith("@") and not re.search(r"[()\[\]<>{}\/\\|#$%^&*=?!:;\"']", h):
|
||||
return False
|
||||
return True
|
||||
# Multi-word hyphenated with very specific artifact shape: 3+ hyphens in a row or trailing noise
|
||||
if "---" in h or "---meritocratic" in h or h.endswith("(march") or h.endswith("-(march"):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def classify(handle: str) -> tuple[str, str | None]:
|
||||
"""Return (category, publisher_kind).
|
||||
|
||||
category ∈ {'keep_agent', 'keep_person', 'publisher', 'garbage', 'review_needed'}
|
||||
publisher_kind ∈ {'news','academic','social_platform','internal', None}
|
||||
"""
|
||||
h = handle.strip().lower().lstrip("@")
|
||||
|
||||
if h in PENTAGON_AGENTS:
|
||||
return ("keep_agent", None)
|
||||
|
||||
if h in PUBLISHER_KIND_MAP:
|
||||
return ("publisher", PUBLISHER_KIND_MAP[h])
|
||||
|
||||
if is_garbage(handle):
|
||||
return ("garbage", None)
|
||||
|
||||
# @-prefixed handles or short-slug real-looking names → keep as person
|
||||
# (Auto-create rule from Cory: @ handles auto-join as tier='cited'.)
|
||||
if handle.startswith("@"):
|
||||
return ("keep_person", None)
|
||||
|
||||
# Short plausible handles (<=20 chars, alphanum + underscore/hyphen): treat as person
|
||||
if re.match(r"^[a-z0-9][a-z0-9_-]{0,19}$", h):
|
||||
return ("keep_person", None)
|
||||
|
||||
# Everything else: needs human review
|
||||
return ("review_needed", None)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--apply", action="store_true", help="Write changes to DB")
|
||||
parser.add_argument("--show", type=str, help="Inspect a single handle")
|
||||
parser.add_argument("--delete-events", action="store_true",
|
||||
help="DELETE contribution_events for publishers+garbage (default: keep for audit)")
|
||||
args = parser.parse_args()
|
||||
|
||||
if not Path(DB_PATH).exists():
|
||||
print(f"ERROR: DB not found at {DB_PATH}", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
conn = sqlite3.connect(DB_PATH, timeout=30)
|
||||
conn.row_factory = sqlite3.Row
|
||||
|
||||
# Sanity: publishers table must exist (v26 migration applied)
|
||||
try:
|
||||
conn.execute("SELECT 1 FROM publishers LIMIT 1")
|
||||
except sqlite3.OperationalError:
|
||||
print("ERROR: publishers table missing. Run migration v26 first.", file=sys.stderr)
|
||||
sys.exit(2)
|
||||
|
||||
rows = conn.execute(
|
||||
"SELECT handle, kind, tier, claims_merged FROM contributors ORDER BY claims_merged DESC"
|
||||
).fetchall()
|
||||
|
||||
if args.show:
|
||||
target = args.show.strip().lower().lstrip("@")
|
||||
for r in rows:
|
||||
if r["handle"].lower().lstrip("@") == target:
|
||||
category, pkind = classify(r["handle"])
|
||||
events_count = conn.execute(
|
||||
"SELECT COUNT(*) FROM contribution_events WHERE handle = ?",
|
||||
(r["handle"].lower().lstrip("@"),),
|
||||
).fetchone()[0]
|
||||
print(f"handle: {r['handle']}")
|
||||
print(f"current_kind: {r['kind']}")
|
||||
print(f"current_tier: {r['tier']}")
|
||||
print(f"claims_merged: {r['claims_merged']}")
|
||||
print(f"events: {events_count}")
|
||||
print(f"→ category: {category}")
|
||||
if pkind:
|
||||
print(f"→ publisher: kind={pkind}")
|
||||
return
|
||||
print(f"No match for '{args.show}'")
|
||||
return
|
||||
|
||||
# Classify all
|
||||
buckets: dict[str, list[dict]] = {
|
||||
"keep_agent": [],
|
||||
"keep_person": [],
|
||||
"publisher": [],
|
||||
"garbage": [],
|
||||
"review_needed": [],
|
||||
}
|
||||
for r in rows:
|
||||
category, pkind = classify(r["handle"])
|
||||
buckets[category].append({
|
||||
"handle": r["handle"],
|
||||
"kind_now": r["kind"],
|
||||
"tier": r["tier"],
|
||||
"claims": r["claims_merged"] or 0,
|
||||
"publisher_kind": pkind,
|
||||
})
|
||||
|
||||
print("=== Classification summary ===")
|
||||
for cat, items in buckets.items():
|
||||
print(f" {cat:18s} {len(items):5d}")
|
||||
|
||||
print("\n=== Sample of each category ===")
|
||||
for cat, items in buckets.items():
|
||||
print(f"\n--- {cat} (showing up to 10) ---")
|
||||
for item in items[:10]:
|
||||
tag = f" → {item['publisher_kind']}" if item["publisher_kind"] else ""
|
||||
print(f" {item['handle']:50s} claims={item['claims']:5d}{tag}")
|
||||
|
||||
print("\n=== Full review_needed list ===")
|
||||
for item in buckets["review_needed"]:
|
||||
print(f" {item['handle']:50s} claims={item['claims']:5d}")
|
||||
|
||||
if not args.apply:
|
||||
print("\n(dry-run — no writes. Re-run with --apply to execute.)")
|
||||
return
|
||||
|
||||
# ── Apply changes ──
|
||||
print("\n=== Applying changes ===")
|
||||
if buckets["review_needed"]:
|
||||
print(f"ABORT: {len(buckets['review_needed'])} rows need human review. Fix classifier before --apply.")
|
||||
sys.exit(3)
|
||||
|
||||
inserted_publishers = 0
|
||||
reclassified_agents = 0
|
||||
deleted_garbage = 0
|
||||
deleted_publisher_rows = 0
|
||||
deleted_events = 0
|
||||
|
||||
# Single transaction — if any step errors, roll back. This prevents the failure
|
||||
# mode where a publisher insert fails silently and we still delete the contributor
|
||||
# row, losing data.
|
||||
try:
|
||||
conn.execute("BEGIN")
|
||||
|
||||
# 1. Insert publishers. Track which ones succeeded so step 4 only deletes those.
|
||||
moved_to_publisher = set()
|
||||
for item in buckets["publisher"]:
|
||||
name = item["handle"].strip().lower().lstrip("@")
|
||||
conn.execute(
|
||||
"INSERT OR IGNORE INTO publishers (name, kind) VALUES (?, ?)",
|
||||
(name, item["publisher_kind"]),
|
||||
)
|
||||
moved_to_publisher.add(item["handle"])
|
||||
inserted_publishers += 1
|
||||
|
||||
# 2. Ensure Pentagon agents have kind='agent' (idempotent after v25 patch)
|
||||
for item in buckets["keep_agent"]:
|
||||
conn.execute(
|
||||
"UPDATE contributors SET kind = 'agent' WHERE handle = ?",
|
||||
(item["handle"].lower().lstrip("@"),),
|
||||
)
|
||||
reclassified_agents += 1
|
||||
|
||||
# 3. Delete garbage handles from contributors (and their events)
|
||||
for item in buckets["garbage"]:
|
||||
if args.delete_events:
|
||||
cur = conn.execute(
|
||||
"DELETE FROM contribution_events WHERE handle = ?",
|
||||
(item["handle"].lower().lstrip("@"),),
|
||||
)
|
||||
deleted_events += cur.rowcount
|
||||
cur = conn.execute(
|
||||
"DELETE FROM contributors WHERE handle = ?",
|
||||
(item["handle"],),
|
||||
)
|
||||
deleted_garbage += cur.rowcount
|
||||
|
||||
# 4. Delete publisher rows from contributors — ONLY for those successfully
|
||||
# inserted into publishers above. Guards against partial failure.
|
||||
for item in buckets["publisher"]:
|
||||
if item["handle"] not in moved_to_publisher:
|
||||
continue
|
||||
if args.delete_events:
|
||||
cur = conn.execute(
|
||||
"DELETE FROM contribution_events WHERE handle = ?",
|
||||
(item["handle"].lower().lstrip("@"),),
|
||||
)
|
||||
deleted_events += cur.rowcount
|
||||
cur = conn.execute(
|
||||
"DELETE FROM contributors WHERE handle = ?",
|
||||
(item["handle"],),
|
||||
)
|
||||
deleted_publisher_rows += cur.rowcount
|
||||
|
||||
conn.commit()
|
||||
except Exception as e:
|
||||
conn.rollback()
|
||||
print(f"ERROR: Transaction failed, rolled back. {e}", file=sys.stderr)
|
||||
sys.exit(4)
|
||||
|
||||
print(f" publishers inserted: {inserted_publishers}")
|
||||
print(f" agents kind='agent' ensured: {reclassified_agents}")
|
||||
print(f" garbage rows deleted: {deleted_garbage}")
|
||||
print(f" publisher rows removed from contributors: {deleted_publisher_rows}")
|
||||
if args.delete_events:
|
||||
print(f" contribution_events deleted: {deleted_events}")
|
||||
else:
|
||||
print(f" (events kept — re-run with --delete-events to clean them)")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
Loading…
Reference in a new issue