teleo-infrastructure

History

m3taversal 45b2f6de20 feat(schema): v26 — publishers + contributor_identities + sources provenance Separates three concerns currently conflated in contributors table: contributors — people + agents we credit (kind in 'person','agent') publishers — news orgs / academic venues / platforms (not credited) sources — gains publisher_id + content_type + original_author columns Rationale (Cory directive Apr 24): livingip.xyz leaderboard was showing CNBC, SpaceNews, TechCrunch etc. at the top because the attribution pipeline credited news org names as if they were contributors. The mechanism-level fix is a schema split — orgs live in publishers, individuals in contributors, each table has one semantics. Migration v26: - CREATE TABLE publishers (id PK, name UNIQUE, kind CHECK IN news\|academic\|social_platform\|podcast\|self\|internal\|legal\|government\| research_org\|commercial\|other, url_pattern, created_at) - CREATE TABLE contributor_identities (contributor_handle, platform CHECK IN x\|telegram\|github\|email\|web\|internal, platform_handle, verified, created_at) Composite PK on (platform, platform_handle) + index on contributor_handle. Enables one contributor to unify X + TG + GitHub handles. - ALTER TABLE sources ADD COLUMN publisher_id REFERENCES publishers(id) - ALTER TABLE sources ADD COLUMN content_type (article\|paper\|tweet\|conversation\|self_authored\|webpage\|podcast) - ALTER TABLE sources ADD COLUMN original_author TEXT (free-text fallback, e.g., "Kim et al." — not credit-bearing) - ALTER TABLE sources ADD COLUMN original_author_handle REFERENCES contributors(handle) (set only when the author is in our contributor network) - ALTER wrapped in try/except on "duplicate column" for replay safety - Both SCHEMA_SQL (fresh installs) + migration block (upgrades) updated - SCHEMA_VERSION bumped 25 -> 26 Migration is non-breaking. No data moves yet. Existing publishers-polluting- contributors row state is preserved until the classifier runs. Writer routing to these tables lands in a separate branch (Phase B writer changes). Classifier (scripts/classify-contributors.py): Analyzes existing contributors rows, buckets into: keep_agent — 9 Pentagon agents keep_person — 21 real humans + reachable pseudonymous X/TG handles publisher — 100 news orgs, academic venues, formal-citation names, brand/platform names garbage — 9 parse artifacts (containing /, parens, 3+ hyphens) review_needed — 0 (fully covered by current allowlists) Hand-curated allowlists for news/academic/social/internal publisher kinds. Garbage detection via regex on special chars and length > 50. Named pseudonyms without @ prefix (karpathy, simonw, swyx, metaproph3t, sjdedic, ceterispar1bus, etc.) classified as keep_person — they're real X/TG contributors missing an @ prefix because extraction frontmatter didn't normalize. Cory's auto-create rule catches these on first reference. Formal-citation names (Firstname-Lastname form — Clayton Christensen, Hayek, Ostrom, Friston, Bostrom, Bak, etc.) classified as academic publishers — these are cited, not reachable via @ handle. Get promoted to contributors if/when they sign up with an @ handle. Apply path is transactional (BEGIN / COMMIT / ROLLBACK on error). Publisher insert happens before contributor delete, and contributor delete is gated on successful insert so we never lose a row by moving it to a failed publisher insert. --apply path flags: --delete-events : also DELETE contribution_events rows for moved handles (default: keep events for audit trail) --show <handle> : inspect a single row's classification Smoke-tested end-to-end via local copy of VPS DB: Before: 139 contributors total (polluted with orgs) After: 30 contributors (9 agent + 21 person), 100 publishers, 9 deleted contribution_events: 3,705 preserved contributors <-> publishers overlap: 0 Named contributors verified present after --apply: alexastrum (claims=6) thesensatore (5) cameron-s1 (1) m3taversal (1011) Pentagon agent 'pipeline' (claims_merged=771) intentionally retained — it's the process name from old extract.py fallback path, not a real contributor. Classified as agent (kind='agent') so doesn't appear in person leaderboard. Deploy sequence after Ganymede review: 1. Branch ff-merge to main 2. scp lib/db.py + scripts/classify-contributors.py to VPS 3. Pipeline already at v26 (migration ran during earlier v26 restart) 4. Run dry-run: python3 ops/classify-contributors.py 5. Apply: python3 ops/classify-contributors.py --apply 6. Verify: livingip.xyz leaderboard stops showing CNBC/SpaceNews 7. Argus /api/contributors unaffected (reads contributors directly, now clean) Follow-up branch (not in this commit): - Writer routing in lib/contributor.py + extract.py: org handles -> publishers table + sources.publisher_id person handles with @ prefix -> auto-create contributor, tier='cited' formal-citation names -> sources.original_author (free text) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>		2026-04-24 20:47:21 +01:00
..
__init__.py	Initial commit: Pipeline v2 daemon + infrastructure docs	2026-03-12 14:11:18 +00:00
analytics.py	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features	2026-03-20 20:17:27 +00:00
attribution.py	fix(attribution): Phase A followup — bug #1 + 4 nits + refactor (Ganymede review)	2026-04-24 14:13:54 +01:00
breaker.py	ganymede: add dev infrastructure — pyproject.toml, CI, deploy script	2026-03-13 14:24:27 +00:00
cascade.py	fix: sync all code from VPS — repo is now authoritative source of truth	2026-04-15 13:18:01 +01:00
claim_index.py	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features	2026-03-20 20:17:27 +00:00
config.py	fix: close cooldown-dependence gaps in extract.py (Ganymede review)	2026-04-22 11:33:10 +01:00
connect.py	fix: sync all code from VPS — repo is now authoritative source of truth	2026-04-15 13:18:01 +01:00
contributor.py	fix(attribution): Phase A followup — bug #1 + 4 nits + refactor (Ganymede review)	2026-04-24 14:13:54 +01:00
costs.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
cross_domain.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
db.py	feat(schema): v26 — publishers + contributor_identities + sources provenance	2026-04-24 20:47:21 +01:00
dedup.py	fix: enrichment idempotency — three-layer dedup prevents duplicate evidence blocks	2026-03-31 13:18:23 +01:00
digest.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
domains.py	Wire rejection_reason into review records + fix ingestion domain routing	2026-04-20 18:03:34 +01:00
entity_batch.py	fix: enrichment idempotency — three-layer dedup prevents duplicate evidence blocks	2026-03-31 13:18:23 +01:00
entity_queue.py	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features	2026-03-20 20:17:27 +00:00
eval_actions.py	fix(eval): treat empty diff as conservative fallback in auto-close gate	2026-04-23 11:24:16 +01:00
eval_parse.py	refactor: Phase 4 — extract eval_actions.py, drop underscore prefixes in eval_parse	2026-04-16 12:57:51 +01:00
evaluate.py	Wire rejection_reason into review records + fix ingestion domain routing	2026-04-20 18:03:34 +01:00
extract.py	fix: close cooldown-dependence gaps in extract.py (Ganymede review)	2026-04-22 11:33:10 +01:00
extraction_prompt.py	Reduce near-duplicate and frontmatter schema rejections	2026-04-20 18:03:26 +01:00
feedback.py	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features	2026-03-20 20:17:27 +00:00
fixer.py	refactor: Phase 2 — wire pr_state into fixer.py and substantive_fixer.py	2026-04-16 12:21:40 +01:00
forgejo.py	leo: handle non-JSON 200 from Forgejo merge API	2026-03-13 17:38:00 +00:00
frontmatter.py	fix: quote YAML edge values containing colons, skip unparseable files in reweave merge	2026-04-18 12:07:28 +01:00
github_feedback.py	fix: wrap breaker calls in stage_loop to prevent permanent task death	2026-04-20 12:37:28 +01:00
health.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
llm.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
log.py	Initial commit: Pipeline v2 daemon + infrastructure docs	2026-03-12 14:11:18 +00:00
merge.py	feat: bidirectional source↔claim linking	2026-04-21 13:00:59 +01:00
post_extract.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
post_merge.py	feat: bidirectional source↔claim linking	2026-04-21 13:00:59 +01:00
pr_state.py	refactor: Phase 3 — fix close_pr ghost bug, wire stale_pr, extract eval_parse	2026-04-16 12:40:23 +01:00
pre_screen.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
search.py	Consolidate pipeline code from teleo-codex + VPS into single repo	2026-04-07 16:52:26 +01:00
stale_pr.py	refactor: Phase 3 — fix close_pr ghost bug, wire stale_pr, extract eval_parse	2026-04-16 12:40:23 +01:00
substantive_fixer.py	refactor: Phase 2 — wire pr_state into fixer.py and substantive_fixer.py	2026-04-16 12:21:40 +01:00
validate.py	Reduce near-duplicate and frontmatter schema rejections	2026-04-20 18:03:26 +01:00
watchdog.py	fix: sync all code from VPS — repo is now authoritative source of truth	2026-04-15 13:18:01 +01:00
worktree_lock.py	epimetheus: sync VPS-deployed code to repo — Mar 18-20 reliability + features	2026-03-20 20:17:27 +00:00