Commit graph

8 commits

Author SHA1 Message Date
540ba97b9d fix(attribution): Phase A followup — bug #1 + 4 nits + refactor (Ganymede review)
Some checks are pending
CI / lint-and-test (push) Waiting to run
Addresses Apr 24 review of 58fa8c52. All 6 findings landed.

Bug #1 — git log -1 returns latest commit, not first (semantic mismatch
with "original author" comment):
  Drop -1 flag, take last line of default-ordered log output (= oldest).
  Fixes mis-credit on multi-commit PRs where a reviewer rebased/force-pushed.

Nit #2 — forward writer didn't pass merged_at:
  Fetch merged_at in the prs SELECT, thread pr_merged_at through all 5
  insert_contribution_event call sites. Keeps forward-emitted and backfilled
  event timestamps on the same timeline after merge retries.

Nit #3 — legacy-counts fallback paths emit no events (parity gap):
  git-author and prs.agent fallback paths now emit challenger/synthesizer
  events via the TRAILER_EVENT_ROLE map when refined_type matches. Closes
  the gap where external-contributor challenge/enrich PRs would accumulate
  legacy counts but disappear from event-sourced leaderboards.

Nit #4 — migration v24 agent seed missing 'pipeline':
  Added "pipeline" to the seed list. Plus new migration v25 with idempotent
  corrective UPDATE so existing envs (where v24 already ran) pick up the
  fix on restart without requiring manual SQL. Verified on VPS state:
  pipeline row was kind='person', will flip to 'agent' on redeploy.

Nit #5 — backfill summary prints originator attempted=0 in wrong pass:
  Split the "=== Summary ===" header into "=== PR-level events ===" and
  "=== Claim-level originator pass ===" with originator counts in the
  right block. Operator-facing cosmetic.

Refactor #6 — AGENT_BRANCH_PREFIXES duplicated in 2 sites:
  Extracted to lib/attribution.py as single source of truth. contributor.py
  imports it. backfill-events.py keeps its local copy (runs standalone
  without pipeline package import) with a sync-reference comment.

No behavioral drift for the common case. Backfill re-runs cleanly against
existing forward-written events (UNIQUE-index idempotency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 14:13:54 +01:00
58fa8c5276 feat(attribution): Phase A — event-sourced contribution ledger (schema v24)
Some checks are pending
CI / lint-and-test (push) Waiting to run
Introduces contribution_events table + non-breaking double-write. Schema
lands today, forward traffic writes events alongside existing count upserts,
backfill script replays history. Phase B will add leaderboard API reading
from events; Phase C switches Argus dashboard over.

## Schema v24 (lib/db.py)

- contribution_events: one row per credit-earning event
  (id, handle, kind, role, weight, pr_number, claim_path, domain, channel, timestamp)
  Partial UNIQUE indexes handle SQLite's NULL != NULL semantics:
    idx_ce_unique_claim on (handle, role, pr_number, claim_path) WHERE claim_path NOT NULL
    idx_ce_unique_pr    on (handle, role, pr_number)             WHERE claim_path IS NULL
  PR-level events (evaluator, author, challenger, synthesizer) dedup on 3-tuple.
  Per-claim events (originator) dedup on 4-tuple. Idempotent on replay.
- contributor_aliases: canonical handle mapping
  Seeded: @thesensatore → thesensatore, cameron → cameron-s1
- contributors.kind TEXT DEFAULT 'person'
  Migration seeds 'agent' for known Pentagon agent handles.

## Role model (confirmed by Cory Apr 24)

Weights: author 0.30, challenger 0.25, synthesizer 0.20, originator 0.15, evaluator 0.05
- author:     human who submitted the PR (curation + submission work)
- originator: person who authored the underlying content (rewards external creators)
- challenger: agent/person who brought a productive disagreement
- synthesizer: cross-domain work (enrichments, research sessions)
- evaluator:  reviewer who approved (Leo + domain agent)

Humans-are-always-author: agents credit is capped at evaluator/synthesizer/
challenger. Pentagon agents classify as kind='agent' and surface in the
agent-view leaderboard, not the default person view.

## Writer (lib/contributor.py)

- New insert_contribution_event(): idempotent INSERT OR IGNORE with alias
  normalization + kind classification. Falls back silently on pre-v24 DBs.
- record_contributor_attribution double-writes alongside existing
  upsert_contributor calls. Zero risk to current dashboard.
- Author event: emitted once per PR from prs.submitted_by → git author →
  agent-branch-prefix.
- Originator events: emitted per claim from frontmatter sourcer, skipping
  when sourcer == author (avoids self-credit double-count).
- Evaluator events: Leo (always when leo_verdict='approve') + domain_agent
  (when domain_verdict='approve' and not Leo).
- Challenger/Synthesizer: emitted from Pentagon-Agent trailer on
  agent-owned branches (theseus/*, rio/*, etc.) based on commit_type.
  Pipeline-owned branches (extract/*, reweave/*) get no trailer-based event —
  infrastructure work isn't contribution credit.

## Helpers (lib/attribution.py)

- normalize_handle(raw, conn=None): lowercase + strip @ + alias lookup
- classify_kind(handle): returns 'agent' for PENTAGON_AGENTS, else 'person'
  Intentionally narrow. Orgs get classified by operator review, not heuristics.

## Backfill (scripts/backfill-events.py)

Replays all merged PRs into events. Idempotent (safe to re-run). Emits:
- PR-level: author, evaluator, challenger, synthesizer
- Per-claim: originator (walks knowledge tree, matches via description titles)

Known limitation: post-merge PR branches are deleted from Forgejo, so we
can't diff them for granular per-claim events. Claim→PR mapping uses
prs.description (pipe-separated titles). Misses some edge cases but
recovers the bulk of historical originator credit. Forward traffic gets
clean per-claim events via the normal record_contributor_attribution path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:59:22 +01:00
93917f9fc2 fix(attribution): --diff-filter=A + handle sanity filter + remove legacy fallback
Some checks are pending
CI / lint-and-test (push) Waiting to run
Ganymede review findings on epimetheus/contributor-attribution-fix branch:

1. BUG: record_contributor_attribution used `git diff --name-only` (all modified
   files), not just added. Enrich/challenge PRs re-credited the sourcer on every
   subsequent modification. Fixed: --diff-filter=A restricts to new files only.
   The synthesizer/challenger/reviewer roles for enrich PRs are still credited
   via the Pentagon-Agent trailer path, so this doesn't lose any correct credit.

2. WARNING: Legacy `source`-field heuristic fabricated garbage handles from
   descriptive strings ("sec-interpretive-release-s7-2026-09-(march-17",
   "governance---meritocratic-voting-+-futarchy"). Removed outright + added
   regex handle sanity filter (`^[a-z0-9][a-z0-9_-]{0,38}$`). Applied before
   every return path in parse_attribution (the nested-block early return was
   previously bypassing the filter).

   Dry-run impact: unique handles 83→70 (13 garbage filtered), NEW contributors
   49→48, EXISTING drift rows 34→22. The filter drops rows where the literal
   garbage string lives in frontmatter (Slotkin case: attribution.sourcer.handle
   was written as "senator-elissa-slotkin-/-the-hill" by the buggy legacy path).

3. NIT: Aligned knowledge_prefixes in the file walker to match is_knowledge_pr
   (removed entities/, convictions/). Widening those requires Cory sign-off
   since is_knowledge_pr currently gates entity-only PRs out of CI.

Tests: 17 pass (added test_bad_handles_filtered, test_valid_handle_with_hyphen_passes,
updated test_legacy_source_fallback → test_legacy_source_fallback_removed).

Ganymede review — 3-message protocol msg 3 pending.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 12:58:55 +01:00
3fe0f4b744 fix(attribution): credit sourcer/extractor from claim frontmatter
Three layers of contributor-attribution bug surfaced by Apr 24 leaderboard
investigation. alexastrum, thesensatore, cameron-s1 all had real merged
contributions but zero credit in the contributors table.

1. lib/attribution.py: parse_attribution() only read `attribution_sourcer:`
   prefix-keyed flat fields. ~42% of claim files (535/1280) use the bare-key
   form `sourcer: alexastrum` written by extract.py. Added bare-key handling
   between the prefixed-flat path and the legacy-source-field fallback.
   Block format (`attribution: { sourcer: [...] }`) still wins when present.

2. lib/contributor.py: record_contributor_attribution() parsed the diff text
   with regex looking for `+- handle: "X"` lines. This matched neither the
   bare-key flat format nor the `attribution: { sourcer: [...] }` block
   format Leo uses for manual extractions. Replaced the regex parser with
   a file walker that calls attribution.parse_attribution_from_file() on
   each changed knowledge file — single source of truth for both formats.

3. scripts/backfill-sourcer-attribution.py: walks all merged knowledge files,
   re-attributes via the canonical parser, upserts contributors. Default
   additive mode preserves existing high counts (e.g. m3taversal.sourcer=1011
   reflects Telegram-curator credit accumulated via a different code path
   that this fix does not touch). --reset flag for the destructive case.

Dry-run preview (additive mode):
  - 670 NEW contributors to insert (mostly source-citation handles)
  - 77 EXISTING contributors with under-counted role columns
  - alexastrum: 0 → 6, thesensatore: 0 → 5, cameron-s1: 0 → 2
  - astra.sourcer: 0 → 96, leo.sourcer: 0 → 44, theseus.sourcer: 0 → 18
  - m3taversal.sourcer: 1011 (preserved, not 22 from file walk)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 12:48:41 +01:00
c29049924e fix: wire commit_type into contributor role assignment
The contributor attribution always recorded "extractor" regardless of
the PR's refined commit_type. Added COMMIT_TYPE_TO_ROLE mapping and
applied it in all three attribution paths (Pentagon-Agent trailer,
git author fallback, PR agent fallback).

Backfill script resets and re-derives role counts from prs.commit_type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 10:27:36 +01:00
0f868aefab Add GitHub PR feedback module and fix attribution for mirrored PRs
Some checks failed
CI / lint-and-test (push) Has been cancelled
github_feedback.py posts pipeline status to GitHub PRs at three touchpoints:
discovery ack, eval review result, and merge/close outcome. Only fires for
PRs with a github_pr link (set by sync-mirror.sh). All calls non-fatal.

contributor.py: expanded git author fallback to scan all non-merge commits
(was only checking last commit), added teleo-bot and github-actions[bot]
to bot filter list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 18:16:28 +01:00
13f21f7732 feat: external contributor pipeline — fork PR handling, attribution, prefix recognition
- Mirror: fetch GitHub fork PR refs (refs/pull/*/head), push to Forgejo as gh-pr-N/branch
- Mirror: fork PRs auto-create Forgejo PR with GitHub PR title, link github_pr in DB
- db.py: add contrib + gh-pr-* to classify_branch for external contributor branches
- contributor.py: git commit author as attribution fallback (before branch agent)
- contributor.py: skip bot/generic authors (m3taversal, teleo, pipeline)
- Tests: fix fallback test for new git author path, add external contributor test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 18:14:01 +01:00
53dc18afd5 Phase 5: Extract contributor.py from merge.py (−234 lines)
Some checks are pending
CI / lint-and-test (push) Waiting to run
5 functions extracted: is_knowledge_pr, refine_commit_type,
record_contributor_attribution, upsert_contributor, recalculate_tier.

git_fn parameter injection avoids circular import (merge→contributor,
contributor needs _git from merge). Single call site passes _git.

merge.py: 1912 → 1678 lines. 23 new tests, zero regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:08:26 +01:00