Addresses two findings from the review of commit 762fd42:
1. BUG: guard query was tautological. `SELECT MAX(number) FROM prs WHERE
number < 900000` filters out exactly what the `>= 900000` check tests.
Replaced with a direct check for unexpected rows in the synthetic range
(excluding our known 900068/900088).
2. WARNING: origin defaults to 'pipeline' via schema default. lib/merge.py
convention is origin='human' for external contributors. Synthetic rows
now set origin='human', priority='high' — matches discover_external_prs
for real GitHub PRs. Prevents Phase B origin-based filtering from
misclassifying Alex/Cameron as machine-authored.
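A minimal sketch of why the old guard could never fire, next to a direct check of the synthetic range. Only the `prs.number` column, the 900000 threshold, and the two known numbers come from this commit; the helper name `unexpected_synthetic_rows` and the sample rows are hypothetical.

```python
import sqlite3

SYNTHETIC_FLOOR = 900000
KNOWN_SYNTHETIC = (900068, 900088)


def unexpected_synthetic_rows(conn):
    """Return PR numbers in the synthetic range that are not our known rows."""
    rows = conn.execute(
        "SELECT number FROM prs "
        "WHERE number >= ? AND number NOT IN (?, ?) ORDER BY number",
        (SYNTHETIC_FLOOR, *KNOWN_SYNTHETIC),
    ).fetchall()
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prs (number INTEGER PRIMARY KEY)")
# Hypothetical data: one real PR, one known synthetic row, one intruder.
conn.executemany("INSERT INTO prs VALUES (?)", [(3941,), (900068,), (900123,)])

# Old guard: the WHERE clause removes every row the threshold test could
# catch, so MAX(number) is always below the floor.
old_max = conn.execute(
    "SELECT MAX(number) FROM prs WHERE number < ?", (SYNTHETIC_FLOOR,)
).fetchone()[0]
old_guard_fires = old_max is not None and old_max >= SYNTHETIC_FLOOR

# New guard: look directly at the synthetic range.
new_guard_hits = unexpected_synthetic_rows(conn)
```

With this data the old guard stays silent while the direct check reports the intruding row.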
Also flagged in review: credit projection was optimistic. Author events are
PR-level (not per-claim), so Alex gets 1×0.30 author credit, not 6×0.30.
Same for Cameron. Per-claim originator credit goes to the 7 frontmatter
sourcers where applicable. Not a code change, just an expectation reset for
Cory.
Two historical GitHub PRs merged before our sync-mirror.sh tracked github_pr:
- GitHub PR #68: alexastrum, 6 claims, merged Mar 9 2026 via squash merge
- GitHub PR #88: Cameron-S1, 1 claim, merged early April
Their claim files were lost during a Forgejo→GitHub mirror overwrite and later
recovered via direct-to-main commits (dba00a79, da64f805). Because the
recovery commits bypassed the pipeline, our 'prs' table has no row to attach
originator events to — all 4 backfill-events.py strategies returned None,
leaving Alex + Cameron at 0 originator credits despite real historical work.
This reconstructs synthetic 'prs' rows so the existing github_pr strategy in
backfill-events.py attaches 7 originator events on re-run:
- Numbers 900068 / 900088 live in a clearly-synthetic range that cannot
collide with real Forgejo PRs (current max: 3941)
- github_pr=68/88 wires up the existing lookup strategy
- submitted_by=alexastrum / cameron-s1 establishes author attribution
- merged_at from the recovery commit messages (not recovery-commit time)
- last_error tags the rows as synthetic for future audits
Idempotent: INSERT OR IGNORE via check on number OR github_pr. Safe to replay.
Reversible: DELETE FROM prs WHERE number IN (900068, 900088).
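A sketch of the replay-safe insert and its rollback, under an assumed schema. The column list is reduced to the fields named in this message, the `last_error` tag text is a placeholder, and `insert_synthetic_pr` is a hypothetical helper name rather than the script's actual function.

```python
import sqlite3


def insert_synthetic_pr(conn, number, github_pr, submitted_by, merged_at):
    """Insert one synthetic row unless number or github_pr already exists."""
    conn.execute(
        """
        INSERT OR IGNORE INTO prs
            (number, github_pr, submitted_by, origin, priority,
             merged_at, last_error)
        SELECT ?, ?, ?, 'human', 'high', ?, 'synthetic row (placeholder tag)'
        WHERE NOT EXISTS (
            SELECT 1 FROM prs WHERE number = ? OR github_pr = ?
        )
        """,
        (number, github_pr, submitted_by, merged_at, number, github_pr),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prs (
        number INTEGER PRIMARY KEY,
        github_pr INTEGER,
        submitted_by TEXT,
        origin TEXT DEFAULT 'pipeline',
        priority TEXT,
        merged_at TEXT,
        last_error TEXT
    )
    """
)

# Replaying the insert must not create a second row.
for _ in range(2):
    insert_synthetic_pr(conn, 900068, 68, "alexastrum", "2026-03-09")

count_after_replay = conn.execute("SELECT COUNT(*) FROM prs").fetchone()[0]
origin = conn.execute(
    "SELECT origin FROM prs WHERE number = 900068"
).fetchone()[0]

# Rollback, exactly as stated above.
conn.execute("DELETE FROM prs WHERE number IN (900068, 900088)")
count_after_rollback = conn.execute("SELECT COUNT(*) FROM prs").fetchone()[0]
```

The explicit `origin='human'` overrides the schema default, which is the point of finding 2.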
After applying this script:
python3 ops/backfill-events.py
will credit Alex with 6 author + 6 originator events (author=1.80, originator=0.90)
and Cameron with 1 author + 1 originator (0.30 + 0.15), all dated to the
historical merge dates — so 7d/30d leaderboard windows show them correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite compares TEXT timestamps lexicographically, so comparisons across
ISO-T and space-separator formats go wrong: '2026-03-27 18:00:14' sorts
before '2026-03-27T17:43:04+00:00' because space (0x20) < T (0x54). PRs
merged same-day but earlier than the commit hour were silently excluded
from the time-proximity cascade.
Shaga's 3 stigmergic-coordination claims resolved to PR #2032 (later, wrong)
instead of #2025 (earlier, correct). Fixed by wrapping both sides in
datetime(), which normalizes to space-separator before comparison.
Verified: all 3 Shaga claims now resolve to #2025 via git_time_proximity.
No change to totals (126 originator events, 5 proximity hits) — the fix
corrects WHICH PR each proximity-matched claim resolves to, not whether.
Caught by Ganymede review of 1d6b515.
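The failure and the fix can be reproduced in isolation. This sketch uses the two timestamps quoted above; the variable names and the assumption that the cascade tests "merged at or before the commit" are illustrative, not taken from the actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

merged_at = "2026-03-27T17:43:04+00:00"  # ISO-T form
commit_at = "2026-03-27 18:00:14"        # space-separator form

# Raw TEXT comparison: 'T' (0x54) sorts after ' ' (0x20), so the earlier
# merge looks later than the commit.
raw = conn.execute(
    "SELECT ? <= ?", (merged_at, commit_at)
).fetchone()[0]

# datetime() parses both sides and emits 'YYYY-MM-DD HH:MM:SS'.
normalized = conn.execute(
    "SELECT datetime(?) <= datetime(?)", (merged_at, commit_at)
).fetchone()[0]

normalized_merged_at = conn.execute(
    "SELECT datetime(?)", (merged_at,)
).fetchone()[0]
```

Here `raw` is 0 (the merge is wrongly excluded) and `normalized` is 1.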
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite claim-level pass in backfill-events.py to recover the Forgejo PR
that introduced each claim via a cascade of 4 strategies (reliability
order), replacing the single title→description match that missed PRs
with NULL description (Cameron #3377) and bare-subject extracts (Shaga's
Leo research PR).
## Strategies
1. sourced_from frontmatter → prs.source_path stem match
2. git log first-add commit → subject pattern → prs.branch
   - "<agent>: extract claims from <slug>" → extract/<slug>
   - "<agent>: research session YYYY-MM-DD" → <agent>/research-<date>
   - "<agent>: (challenge|contrib|entity|synthesize)" → <agent>/*
   - "Recover X from GitHub PR #N" → prs.github_pr=N
   - "Extract N claims from X" (no prefix) → time-proximity on
     agent-owned branches within 24h
3. Current title_desc fallback for anything the above miss
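The subject patterns in strategy 2 can be sketched as a small rule table. The regular expressions, the function name `lookup_hint`, and the `(kind, value)` return shape are assumptions made for illustration; the real matcher in backfill-events.py may differ.

```python
import re

AGENT = r"(?P<agent>[\w-]+)"


def lookup_hint(subject):
    """Map a first-add commit subject to a (kind, value) lookup hint."""
    m = re.match(rf"^{AGENT}: extract claims from (?P<slug>\S+)$", subject)
    if m:
        return ("branch", f"extract/{m['slug']}")

    m = re.match(
        rf"^{AGENT}: research session (?P<date>\d{{4}}-\d{{2}}-\d{{2}})$",
        subject,
    )
    if m:
        return ("branch", f"{m['agent']}/research-{m['date']}")

    m = re.match(rf"^{AGENT}: (challenge|contrib|entity|synthesize)\b", subject)
    if m:
        return ("branch_glob", f"{m['agent']}/*")

    m = re.search(r"Recover .+ from GitHub PR #(\d+)", subject)
    if m:
        return ("github_pr", int(m[1]))

    # No agent prefix: fall through to time-proximity on agent-owned branches.
    if re.match(r"^Extract \d+ claims from ", subject):
        return ("time_proximity", None)

    return None
```

A `None` result means no pattern matched and the title_desc fallback applies.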
## Dry-run projection (1,662 merged PRs)
Before:
Claims processed: 33
Originator events: 6
Breakdown: {no_pr_match: 1608, no_sourcer: 26, invalid_handle: 21, skip_self: 6}
After:
Claims processed: 505 (+472)
Originator events: 126 (+120)
Strategy hits: git_subject=412, sourced_from=88, git_time_proximity=5
Breakdown: {no_pr_match: 1095, no_sourcer: 67, invalid_handle: 359, skip_self: 20}
## Verified on real VPS data
- @thesensatore claims: 3/5 resolve via git_time_proximity to leo/ PRs
- Cameron-S1, alexastrum: remain None — their recovery commits
(dba00a79, da64f805) bypassed the pipeline entirely, no Forgejo PR
record exists. Requires synthetic prs rows — deferred to separate
commit with its own Ganymede review (write operation, larger blast
radius than this pure-read backfill change).
## Implementation
- New find_pr_for_claim(conn, repo, md) helper returns (pr_number, strategy)
- Claim-level pass uses it first, falls back to title_desc map
- Strategy counter surfaced in summary output for operator visibility
Idempotent — backfill re-runs skip duplicate events via the partial
UNIQUE index on contribution_events.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Apr 24 review of 58fa8c52. All 6 findings landed.
Bug #1 — git log -1 returns latest commit, not first (semantic mismatch
with "original author" comment):
Drop -1 flag, take last line of default-ordered log output (= oldest).
Fixes mis-credit on multi-commit PRs where a reviewer rebased/force-pushed.
Nit #2 — forward writer didn't pass merged_at:
Fetch merged_at in the prs SELECT, thread pr_merged_at through all 5
insert_contribution_event call sites. Keeps forward-emitted and backfilled
event timestamps on the same timeline after merge retries.
Nit #3 — legacy-counts fallback paths emit no events (parity gap):
git-author and prs.agent fallback paths now emit challenger/synthesizer
events via the TRAILER_EVENT_ROLE map when refined_type matches. Closes
the gap where external-contributor challenge/enrich PRs would accumulate
legacy counts but disappear from event-sourced leaderboards.
Nit #4 — migration v24 agent seed missing 'pipeline':
Added "pipeline" to the seed list. Plus new migration v25 with idempotent
corrective UPDATE so existing envs (where v24 already ran) pick up the
fix on restart without requiring manual SQL. Verified on VPS state:
pipeline row was kind='person', will flip to 'agent' on redeploy.
Nit #5 — backfill summary prints originator attempted=0 in wrong pass:
Split the "=== Summary ===" header into "=== PR-level events ===" and
"=== Claim-level originator pass ===" with originator counts in the
right block. Operator-facing cosmetic.
Refactor #6 — AGENT_BRANCH_PREFIXES duplicated in 2 sites:
Extracted to lib/attribution.py as single source of truth. contributor.py
imports it. backfill-events.py keeps its local copy (runs standalone
without pipeline package import) with a sync-reference comment.
No behavioral drift for the common case. Backfill re-runs cleanly against
existing forward-written events (UNIQUE-index idempotency).
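For Bug #1, a sketch of taking the oldest commit's author from default-ordered log output. The helper names and the `rev_range` parameter are hypothetical; only `git log` with `--format=%an` and the "last line is oldest" reading come from the fix described above.

```python
import subprocess


def oldest_author_from_log(log_output):
    """git log prints newest first, so the last non-empty line is the oldest."""
    lines = [ln.strip() for ln in log_output.splitlines() if ln.strip()]
    return lines[-1] if lines else None


def first_author(repo, rev_range):
    """Hypothetical wrapper: author name of the oldest commit in rev_range."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%an", rev_range],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return oldest_author_from_log(out)
```

With `-1` the same command would return only the newest line, which is the reviewer after a rebase or force-push.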
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces contribution_events table + non-breaking double-write. Schema
lands today, forward traffic writes events alongside existing count upserts,
backfill script replays history. Phase B will add leaderboard API reading
from events; Phase C switches Argus dashboard over.
## Schema v24 (lib/db.py)
- contribution_events: one row per credit-earning event
(id, handle, kind, role, weight, pr_number, claim_path, domain, channel, timestamp)
Partial UNIQUE indexes handle SQLite's NULL != NULL semantics:
idx_ce_unique_claim on (handle, role, pr_number, claim_path) WHERE claim_path IS NOT NULL
idx_ce_unique_pr on (handle, role, pr_number) WHERE claim_path IS NULL
PR-level events (evaluator, author, challenger, synthesizer) dedup on 3-tuple.
Per-claim events (originator) dedup on 4-tuple. Idempotent on replay.
- contributor_aliases: canonical handle mapping
Seeded: @thesensatore → thesensatore, cameron → cameron-s1
- contributors.kind TEXT DEFAULT 'person'
Migration seeds 'agent' for known Pentagon agent handles.
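A reduced sketch of the two partial UNIQUE indexes and how they make replays idempotent. The table here carries only the columns needed for dedup, and the handles, PR number, and claim paths are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contribution_events (
        id INTEGER PRIMARY KEY,
        handle TEXT NOT NULL,
        role TEXT NOT NULL,
        pr_number INTEGER,
        claim_path TEXT
    );
    -- Per-claim events dedup on the 4-tuple.
    CREATE UNIQUE INDEX idx_ce_unique_claim
        ON contribution_events (handle, role, pr_number, claim_path)
        WHERE claim_path IS NOT NULL;
    -- PR-level events dedup on the 3-tuple; a plain 4-column UNIQUE index
    -- would not, because NULL claim_path values never compare equal.
    CREATE UNIQUE INDEX idx_ce_unique_pr
        ON contribution_events (handle, role, pr_number)
        WHERE claim_path IS NULL;
    """
)


def emit(handle, role, pr_number, claim_path=None):
    conn.execute(
        "INSERT OR IGNORE INTO contribution_events "
        "(handle, role, pr_number, claim_path) VALUES (?, ?, ?, ?)",
        (handle, role, pr_number, claim_path),
    )


# Replay the same three events twice.
for _ in range(2):
    emit("alice", "author", 101)
    emit("bob", "originator", 101, "domains/example-a.md")
    emit("bob", "originator", 101, "domains/example-b.md")

total = conn.execute("SELECT COUNT(*) FROM contribution_events").fetchone()[0]
```

After the second pass the table still holds three rows.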
## Role model (confirmed by Cory Apr 24)
Weights: author 0.30, challenger 0.25, synthesizer 0.20, originator 0.15, evaluator 0.05
- author: human who submitted the PR (curation + submission work)
- originator: person who authored the underlying content (rewards external creators)
- challenger: agent/person who brought a productive disagreement
- synthesizer: cross-domain work (enrichments, research sessions)
- evaluator: reviewer who approved (Leo + domain agent)
Humans-are-always-author: agent credit is capped at evaluator/synthesizer/
challenger. Pentagon agents classify as kind='agent' and surface in the
agent-view leaderboard, not the default person view.
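As a worked example of the weights, a minimal scoring sketch. The weights and the agent role cap are from this section; the function names and the use of `Decimal` (to keep sums like 0.30 + 0.10 exact) are choices made for the illustration.

```python
from decimal import Decimal

ROLE_WEIGHTS = {
    "author": Decimal("0.30"),
    "challenger": Decimal("0.25"),
    "synthesizer": Decimal("0.20"),
    "originator": Decimal("0.15"),
    "evaluator": Decimal("0.05"),
}

# Roles an agent may be credited with under humans-are-always-author.
AGENT_ROLES = {"evaluator", "synthesizer", "challenger"}


def role_allowed(kind, role):
    """Persons may hold any role; agents only the capped set."""
    return kind == "person" or role in AGENT_ROLES


def score(role_counts):
    """Sum weight * event count over a {role: count} mapping."""
    return sum(
        (ROLE_WEIGHTS[role] * n for role, n in role_counts.items()),
        Decimal("0"),
    )


# One PR-level author event plus two evaluator events.
example = score({"author": 1, "evaluator": 2})
```

Here `example` is 0.30 + 2 × 0.05 = 0.40.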
## Writer (lib/contributor.py)
- New insert_contribution_event(): idempotent INSERT OR IGNORE with alias
normalization + kind classification. Falls back silently on pre-v24 DBs.
- record_contributor_attribution double-writes alongside existing
upsert_contributor calls. Zero risk to current dashboard.
- Author event: emitted once per PR from prs.submitted_by → git author →
agent-branch-prefix.
- Originator events: emitted per claim from frontmatter sourcer, skipping
when sourcer == author (avoids self-credit double-count).
- Evaluator events: Leo (always when leo_verdict='approve') + domain_agent
(when domain_verdict='approve' and not Leo).
- Challenger/Synthesizer: emitted from Pentagon-Agent trailer on
agent-owned branches (theseus/*, rio/*, etc.) based on commit_type.
Pipeline-owned branches (extract/*, reweave/*) get no trailer-based event —
infrastructure work isn't contribution credit.
## Helpers (lib/attribution.py)
- normalize_handle(raw, conn=None): lowercase + strip @ + alias lookup
- classify_kind(handle): returns 'agent' for PENTAGON_AGENTS, else 'person'
Intentionally narrow. Orgs get classified by operator review, not heuristics.
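A sketch of the two helpers. It swaps the database-backed alias lookup for an in-memory dict, so the `aliases` parameter replaces the real `conn=None` argument, and the `PENTAGON_AGENTS` set below is only an illustrative subset.

```python
# Stand-in for the contributor_aliases table (assumption: dict, not DB).
ALIASES = {"cameron": "cameron-s1"}

# Illustrative subset only; the real list lives in lib/attribution.py.
PENTAGON_AGENTS = {"leo", "theseus", "rio", "pipeline"}


def normalize_handle(raw, aliases=ALIASES):
    """Lowercase, strip a leading @, then apply the alias mapping."""
    handle = raw.strip().lower().lstrip("@")
    return aliases.get(handle, handle)


def classify_kind(handle):
    """Return 'agent' for known agent handles, else 'person'."""
    return "agent" if handle in PENTAGON_AGENTS else "person"
```

Anything not in the agent set, organizations included, falls through to 'person' until an operator reclassifies it.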
## Backfill (scripts/backfill-events.py)
Replays all merged PRs into events. Idempotent (safe to re-run). Emits:
- PR-level: author, evaluator, challenger, synthesizer
- Per-claim: originator (walks knowledge tree, matches via description titles)
Known limitation: post-merge PR branches are deleted from Forgejo, so we
can't diff them for granular per-claim events. Claim→PR mapping uses
prs.description (pipe-separated titles). Misses some edge cases but
recovers the bulk of historical originator credit. Forward traffic gets
clean per-claim events via the normal record_contributor_attribution path.
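The description-based mapping can be sketched as a title index. The function name, the lowercase normalization, and the first-PR-wins rule are assumptions; the message only states that prs.description holds pipe-separated titles.

```python
def build_title_index(pr_rows):
    """Build {normalized title: pr_number} from (number, description) rows."""
    index = {}
    for number, description in pr_rows:
        if not description:
            continue  # NULL or empty description: nothing to match on
        for title in description.split("|"):
            key = title.strip().lower()
            if key:
                # Keep the first PR seen for a title (assumed tie-break).
                index.setdefault(key, number)
    return index


# Hypothetical rows: PR 11 has a NULL description and cannot be matched.
index = build_title_index(
    [(10, "Claim A | Claim B"), (11, None), (12, "claim a")]
)
```

PRs with a NULL description contribute nothing to the index, so their claims cannot be matched this way.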
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three layers of contributor-attribution bug surfaced by Apr 24 leaderboard
investigation. alexastrum, thesensatore, cameron-s1 all had real merged
contributions but zero credit in the contributors table.
1. lib/attribution.py: parse_attribution() only read `attribution_sourcer:`
prefix-keyed flat fields. ~42% of claim files (535/1280) use the bare-key
form `sourcer: alexastrum` written by extract.py. Added bare-key handling
between the prefixed-flat path and the legacy-source-field fallback.
Block format (`attribution: { sourcer: [...] }`) still wins when present.
2. lib/contributor.py: record_contributor_attribution() parsed the diff text
with regex looking for `+- handle: "X"` lines. This matched neither the
bare-key flat format nor the `attribution: { sourcer: [...] }` block
format Leo uses for manual extractions. Replaced the regex parser with
a file walker that calls attribution.parse_attribution_from_file() on
each changed knowledge file — single source of truth for both formats.
3. scripts/backfill-sourcer-attribution.py: walks all merged knowledge files,
re-attributes via the canonical parser, upserts contributors. Default
additive mode preserves existing high counts (e.g. m3taversal.sourcer=1011
reflects Telegram-curator credit accumulated via a different code path
that this fix does not touch). --reset flag for the destructive case.
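The precedence between the three frontmatter shapes can be sketched as follows. The function takes an already-parsed frontmatter dict, the name `parse_sourcers` is hypothetical, the `handle` key for list entries is inferred from the old regex, and the legacy source-field fallback is deliberately left out.

```python
def _as_handles(value):
    """Accept a string, a list of strings, or a list of {handle: ...} dicts."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    handles = []
    for item in items:
        if isinstance(item, dict):
            item = item.get("handle")
        if item:
            handles.append(str(item))
    return handles


def parse_sourcers(frontmatter):
    """Return raw sourcer handles: block, then prefixed flat, then bare key."""
    block = frontmatter.get("attribution")
    if isinstance(block, dict) and block.get("sourcer"):
        return _as_handles(block["sourcer"])
    if frontmatter.get("attribution_sourcer"):
        return _as_handles(frontmatter["attribution_sourcer"])
    if frontmatter.get("sourcer"):
        return _as_handles(frontmatter["sourcer"])
    return []  # legacy source-field fallback omitted in this sketch
```

Routing both the forward writer and the backfill through one function like this is what keeps the two formats from drifting apart again.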
Dry-run preview (additive mode):
- 670 NEW contributors to insert (mostly source-citation handles)
- 77 EXISTING contributors with under-counted role columns
- alexastrum: 0 → 6, thesensatore: 0 → 5, cameron-s1: 0 → 2
- astra.sourcer: 0 → 96, leo.sourcer: 0 → 44, theseus.sourcer: 0 → 18
- m3taversal.sourcer: 1011 (preserved, not 22 from file walk)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backfill-sources.py runs every 15 minutes and derives sources.status
purely from directory location. If a source file is in inbox/queue/,
it blindly overwrites the DB status to 'unprocessed' — even when the
DB already had 'extracted' or 'null_result'.
This is why the 43 zombies kept coming back after manual backfill:
cron re-reset them every 15 minutes, then each 4h cooldown expiry
re-triggered runaway extraction on the same source.
Fix: never regress from a terminal status (extracted, null_result,
error, ghost_no_file) to 'unprocessed'. File location is ambiguous
(legitimately new vs. zombie from failed archive); DB is authoritative.
Legitimate re-extraction still works — it goes through the needs_reextraction
path which is unaffected by this gate.
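The gate reduces to one comparison. In this sketch the terminal statuses are the four listed above; the function name and the two-argument shape are hypothetical.

```python
TERMINAL_STATUSES = {"extracted", "null_result", "error", "ghost_no_file"}


def reconcile_status(db_status, location_status):
    """Pick the status to store, never regressing terminal -> unprocessed."""
    if location_status == "unprocessed" and db_status in TERMINAL_STATUSES:
        return db_status  # DB is authoritative; the file is likely a zombie
    return location_status
```

A file sitting in inbox/queue/ with no prior DB status still becomes 'unprocessed'; only rows that already reached a terminal status are protected.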
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- contribution_scores table stores per-PR CI with action type
- Profile endpoint returns action_ci alongside role-based ci_score
- Branch-name attribution: contrib/NAME/ PRs attributed to NAME
- Cameron now shows 0.32 CI + BELIEF MOVER badge from challenge
- Handle variant matching (cameron-s1 → cameron) for cross-system lookup
- Full historical backfill: 985 scores across 9 contributors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
matplotlib chart with dual axes — cumulative claims (#00d4aa) and
contributors (#7c3aed) on dark background. 1200x630 for Twitter.
Auto-regenerates hourly via /api/contributor-graph endpoint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Classifies merged PRs by action type, scores with importance multiplier
(confidence, domain maturity, connectivity bonus), updates contributor
records, posts summary to Telegram, serves via /api/digest/latest.
Cron: 7:07 UTC daily (8:07 AM London).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Crawls domains/foundations/core/decisions for [[wiki-links]], resolves
against claim files, entities, maps, and agents. Reports dead links,
orphans, and connectivity stats. Prerequisite for CI scoring connectivity
bonus — broken links would inflate scores.
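A minimal sketch of the link audit over in-memory documents. The regex, the function name, and the definition of an orphan as "no inbound links from other documents" are assumptions for illustration; the real crawler also resolves entities, maps, and agents, which appear here only as `extra_targets`.

```python
import re

# Captures the target of [[target]], [[target|label]] or [[target#anchor]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def audit_links(docs, extra_targets=()):
    """docs: {name: markdown text}. Returns (dead_links, orphans)."""
    known = set(docs) | set(extra_targets)
    inbound = {name: 0 for name in docs}
    dead = []
    for name, text in docs.items():
        for target in WIKI_LINK.findall(text):
            target = target.strip()
            if target not in known:
                dead.append((name, target))
            elif target in inbound and target != name:
                inbound[target] += 1
    orphans = sorted(n for n, count in inbound.items() if count == 0)
    return dead, orphans


docs = {
    "a": "see [[b]] and [[missing]]",
    "b": "back to [[a|alias]]",
    "c": "no links here",
}
dead, orphans = audit_links(docs)
```

Dead links are reported separately and never counted toward connectivity, so a broken link cannot inflate a score.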
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds async git-log-based endpoint for cumulative contributor and claim
tracking. 5-minute cache, excludes bot accounts, tags founding contributors.
Standalone CLI script also included for ad-hoc data generation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Orphan ratio at 39.6% (443/1118 claims) vs <15% target. Root cause:
reweave threshold 0.70 too strict for text-embedding-3-small — 56% of
orphans found "no neighbors." At 0.55, dry-run shows 0% no-neighbor
skips. Batch size 200 clears backlog in ~3-4 nights at ~$0.20/run.
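To illustrate the threshold effect, a toy sketch assuming the reweave score is cosine similarity between embeddings (the message does not name the metric). The vectors and function names are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors(query, candidates, threshold):
    """Names of candidates whose similarity to query meets the threshold."""
    return [
        name for name, vec in candidates.items()
        if cosine(query, vec) >= threshold
    ]


# Toy 2-d vectors: "near" has similarity 0.6 to the query.
query = (1.0, 0.0)
candidates = {"near": (0.6, 0.8), "far": (0.0, 1.0)}

strict = neighbors(query, candidates, 0.70)
relaxed = neighbors(query, candidates, 0.55)
```

At 0.70 the claim finds no neighbors and stays orphaned; at 0.55 it picks up one.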
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>