Resolves the format inconsistency between the forward fix and the 304-row
backfill. Both halves now produce prs.submitted_by = "rio (self-directed)":
- research-session.sh: drop proposed_by from the frontmatter template.
  extract.py path 1 (proposed_by-driven) no longer fires; path 2 fires
  instead and constructs f"{agent} (self-directed)", matching the backfill.
- attribution.py: normalize_handle now strips the "(self-directed)" suffix
  immediately after lowercase+@-strip, before alias lookup. Closes the
  phantom-person-event class on any future replay through
  record_contributor_attribution. Round-trips through alias rules keyed
  on bare agent names.
Tests (5 cases) still pass; suffix-strip behavior verified against
hostile inputs (whitespace, casing, mid-string occurrences). Mid-string
occurrences must NOT match; only the trailing pattern is stripped.
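The suffix-strip order described above can be sketched as follows. This is a minimal illustration, not the code in attribution.py: the ALIASES table and its single entry are hypothetical, and the exact regex is an assumption.

```python
import re

# Hypothetical alias table; the real alias rules are not shown in this message.
ALIASES = {"m3ta": "m3taversal"}

# Matches "(self-directed)" only at the end of the handle, with optional
# whitespace before and after it. Mid-string occurrences do not match.
_SELF_DIRECTED_SUFFIX = re.compile(r"\s*\(self-directed\)\s*$")


def normalize_handle(raw: str) -> str:
    """Lowercase, strip a leading @, drop the trailing suffix, then resolve aliases."""
    handle = raw.strip().lower().lstrip("@")
    handle = _SELF_DIRECTED_SUFFIX.sub("", handle)
    return ALIASES.get(handle, handle)
```

Because the suffix is removed before the alias lookup, "rio (self-directed)" resolves through the same alias rule as the bare handle "rio".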
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests against the real contribution_events schema (lib/db.py:181-209):
- pr-level dedup with NULL claim_path via idx_ce_unique_pr partial index
- per-claim dedup with non-NULL claim_path via idx_ce_unique_claim partial index
- pr-level and per-claim events coexist on the same pr_number
- backfill (INSERT correct + DELETE wrong) is a true no-op on replay
- replay against already-backfilled state preserves unrelated events
Schema case identified: case 2 with partial-index split solution already in
place. Two partial UNIQUE indexes target disjoint row sets (claim_path IS NULL
vs IS NOT NULL), bypassing SQLite's NULL-not-equal-NULL UNIQUE quirk.
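A small in-memory sketch of the partial-index split. The index names come from this message; the column list (handle, role, pr_number, claim_path) is an assumption and the real schema in lib/db.py has more columns.

```python
import sqlite3

# Assumed, simplified schema. Only the two index names are taken from the commit.
SCHEMA = """
CREATE TABLE contribution_events (
    id         INTEGER PRIMARY KEY,
    handle     TEXT NOT NULL,
    role       TEXT NOT NULL,
    pr_number  INTEGER NOT NULL,
    claim_path TEXT              -- NULL for pr-level events
);
CREATE UNIQUE INDEX idx_ce_unique_pr
    ON contribution_events (handle, role, pr_number)
    WHERE claim_path IS NULL;
CREATE UNIQUE INDEX idx_ce_unique_claim
    ON contribution_events (handle, role, pr_number, claim_path)
    WHERE claim_path IS NOT NULL;
"""


def insert_event(conn, handle, role, pr_number, claim_path=None):
    """INSERT OR IGNORE: a row that violates either partial index is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO contribution_events "
        "(handle, role, pr_number, claim_path) VALUES (?, ?, ?, ?)",
        (handle, role, pr_number, claim_path),
    )


def count_events(conn):
    return conn.execute("SELECT COUNT(*) FROM contribution_events").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert_event(conn, "rio", "author", 101)           # pr-level event
insert_event(conn, "rio", "author", 101)           # duplicate pr-level, ignored
insert_event(conn, "rio", "author", 101, "a.md")   # per-claim event, coexists
insert_event(conn, "rio", "author", 101, "a.md")   # duplicate per-claim, ignored
print(count_events(conn))  # 2
```

A single UNIQUE index over all four columns would not catch the second pr-level insert, since SQLite treats NULL claim_path values as distinct from each other.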
Production replay verified: re-running backfill --apply against the live DB
returns "misattributed PRs found: 0" because the first-run UPDATE flipped the
WHERE predicate. Total contribution_events count: 3839 → 3839.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-part fix for a bug where every claim extracted from agent overnight
research sessions was being credited to m3taversal in contribution_events
(visible in the activity feed as "@m3taversal" on agent-derived claims).
Forward fix (research/research-session.sh):
The frontmatter template the agent prompt instructs Claude to use now
includes `proposed_by: ${AGENT}` and `intake_tier: research-task`. With
those fields present, extract.py path 1 (line 687) takes precedence and
sets prs.submitted_by to the agent handle, which then propagates into
contribution_events as a kind='agent' author event for the agent.
Without the fields, extract.py fell through to the default branch on
line 695 and set submitted_by='@m3taversal'.
Backfill (scripts/backfill-research-session-attribution.py):
Identifies research-session-derived PRs by finding teleo-codex commits
matching `^<agent>: research session YYYY-MM-DD —`, listing the
inbox/queue/*.md files added in each commit's diff, and matching those
filename basenames against prs.source_path. Only PRs currently
submitted_by='@m3taversal' AND merged within the configurable window
are touched. Default --dry-run; --apply to commit.
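The commit-subject match and queue-file filter might look like the sketch below. The agent-name character class, the function names, and the exact path check are assumptions; only the subject shape and the inbox/queue/*.md location come from this message.

```python
import re
from pathlib import PurePosixPath

# "<agent>: research session YYYY-MM-DD" followed by an em dash (U+2014).
# The character class allowed for agent names is an assumption.
SUBJECT_RE = re.compile(
    r"^(?P<agent>[a-z0-9_-]+): research session (?P<date>\d{4}-\d{2}-\d{2}) \u2014"
)

QUEUE_DIR = PurePosixPath("inbox/queue")


def parse_subject(subject):
    """Return (agent, date) for a research-session commit subject, else None."""
    m = SUBJECT_RE.match(subject)
    return (m.group("agent"), m.group("date")) if m else None


def queue_basenames(added_paths):
    """Basenames of .md files added directly under inbox/queue/."""
    result = set()
    for raw in added_paths:
        path = PurePosixPath(raw)
        if path.parent == QUEUE_DIR and path.suffix == ".md":
            result.add(path.name)
    return result
```

The resulting basenames would then be compared against prs.source_path to find candidate PRs.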
For each match the script:
1. UPDATE prs SET submitted_by = '<agent> (self-directed)'
2. INSERT OR IGNORE the agent author event (kind='agent', weight=0.30)
   with the original PR's domain, channel, merged_at preserved
3. DELETE the misattributed m3taversal author event
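The three steps can be sketched for a single PR as below. This is an illustration under assumptions: the table layouts are simplified, the column names (number, role, timestamp) and the stored handle 'm3taversal' are guesses, and reattribute is a hypothetical name rather than a function in the script.

```python
import sqlite3

# Assumed, simplified schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE prs (
    number INTEGER PRIMARY KEY, submitted_by TEXT,
    domain TEXT, channel TEXT, merged_at TEXT
);
CREATE TABLE contribution_events (
    id INTEGER PRIMARY KEY, handle TEXT, kind TEXT, role TEXT, weight REAL,
    pr_number INTEGER, claim_path TEXT, domain TEXT, channel TEXT, timestamp TEXT
);
CREATE UNIQUE INDEX idx_ce_unique_pr
    ON contribution_events (handle, role, pr_number) WHERE claim_path IS NULL;
"""


def reattribute(conn, pr_number, agent):
    """Run the three steps for one PR in a single transaction.

    Returns False when the PR is not currently attributed to @m3taversal,
    so a second run over the same PR changes nothing.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE prs SET submitted_by = ? "
            "WHERE number = ? AND submitted_by = '@m3taversal'",
            (f"{agent} (self-directed)", pr_number),
        )
        if cur.rowcount == 0:
            return False
        # Copy domain, channel and merged_at from the PR row into the new event.
        conn.execute(
            "INSERT OR IGNORE INTO contribution_events "
            "(handle, kind, role, weight, pr_number, domain, channel, timestamp) "
            "SELECT ?, 'agent', 'author', 0.30, number, domain, channel, merged_at "
            "FROM prs WHERE number = ?",
            (agent, pr_number),
        )
        conn.execute(
            "DELETE FROM contribution_events "
            "WHERE pr_number = ? AND handle = 'm3taversal' AND role = 'author'",
            (pr_number,),
        )
    return True
```

Guarding the UPDATE on the old submitted_by value is what lets a replay skip PRs that were already fixed.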
Applied 30-day backfill on VPS:
- 304 PRs re-attributed (rio 74, clay 70, astra 53, vida 48,
  theseus 30, leo 29)
- 297 m3taversal author events deleted, 304 agent author events
  inserted (delta of 7 = pre-v24 PRs that never had m3ta events
  in the first place; we still create the new agent event)
- m3taversal author count: 1368 → 1071 (−22%)
- Pre-backfill DB snapshot: pipeline.db.bak-pre-research-attribution
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-up fixes from Ganymede's review of d0fb4c9:
1. is_publisher_handle: narrow `except Exception` to sqlite3.OperationalError.
   Pre-v26 DB fallback only needs to catch the "table doesn't exist" case;
   broader exceptions (programming errors, locks, corruption) should propagate.
2. upsert_contributor gate: add comment documenting the alias-resolution
   asymmetry between insert_contribution_event (alias-resolved via
   normalize_handle) and upsert_contributor (bare lower+lstrip-@). Today this
   is fine because the v26 classifier produced one publisher row per canonical
   handle. Branch 3 will normalize alias→canonical at writer entry points,
   tightening this gate transparently.
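A sketch of the narrowed fallback in item 1, assuming a publishers table with id and name columns. The message check is an extra step in this sketch, not something the commit is stated to do: Python's sqlite3 also raises OperationalError for "database is locked", so catching the class alone would still swallow lock errors.

```python
import sqlite3


def is_publisher_handle(handle, conn):
    """Return the publisher id for `handle`, or None if it is not a publisher."""
    name = handle.strip().lower().lstrip("@")
    try:
        row = conn.execute(
            "SELECT id FROM publishers WHERE lower(name) = ?", (name,)
        ).fetchone()
    except sqlite3.OperationalError as exc:
        # Pre-v26 DB: the publishers table does not exist yet.
        if "no such table" in str(exc):
            return None
        # Anything else (for example "database is locked") propagates.
        raise
    return row[0] if row else None
```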
Unit tests for the gates (positive + negative + alias resolution) deferred to
Branch 3 alongside the auto-create flow tests.
Smoke-tested:
- pre-v26 fallback (no publishers table) → None (correct)
- case-insensitive match (CNBC → id=1) → correct
- @ prefix strip (@cnbc → id=1) → correct
- non-publisher handle (alexastrum) → None (correct)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Schema v26 (commit 3fe524d) split orgs/citations from contributors into
the publishers table. Without a writer-side gate, every merged PR with
`sourcer: cnbc` (or similar) re-creates CNBC as a contributor and
undoes the v26 classifier cleanup. Once normal pipeline traffic resumes,
the contributors table re-pollutes within hours.
Fix: belt-and-suspenders gate at both writer surfaces.
1. `lib/attribution.py::is_publisher_handle(handle, conn)`: returns
   publisher.id if handle exists in publishers.name, else None. Falls
   back gracefully on pre-v26 DBs (no publishers table → returns None →
   writer behaves like before, no regression).
2. `lib/contributor.py::insert_contribution_event`: checks
   is_publisher_handle on canonical handle before INSERT. If it's a
   publisher, debug-log + return False. Prevents originator events for
   CNBC/SpaceNews/etc.
3. `lib/contributor.py::upsert_contributor`: same gate at top. Prevents
   the contributors table from re-acquiring publisher rows.
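The two writer gates could look roughly like this. Everything here is a stand-in: the function signatures, the table columns, the inline handle normalization, and the boolean return of upsert_contributor are assumptions made for the sketch.

```python
import logging
import sqlite3

log = logging.getLogger("contributor")


def is_publisher_handle(handle, conn):
    """Minimal stand-in for the lib/attribution.py lookup."""
    try:
        row = conn.execute(
            "SELECT id FROM publishers WHERE lower(name) = lower(?)", (handle,)
        ).fetchone()
    except sqlite3.OperationalError:
        return None  # pre-v26 DB without a publishers table
    return row[0] if row else None


def insert_contribution_event(conn, handle, role, pr_number):
    """Gate 1: refuse to record events for publisher handles."""
    canonical = handle.strip().lower().lstrip("@")  # stand-in for normalize_handle
    if is_publisher_handle(canonical, conn) is not None:
        log.debug("skipping event for publisher handle %r", canonical)
        return False
    conn.execute(
        "INSERT INTO contribution_events (handle, role, pr_number) VALUES (?, ?, ?)",
        (canonical, role, pr_number),
    )
    return True


def upsert_contributor(conn, handle):
    """Gate 2: refuse to create contributor rows for publisher handles."""
    bare = handle.lower().lstrip("@")
    if is_publisher_handle(bare, conn) is not None:
        return False
    conn.execute("INSERT OR IGNORE INTO contributors (handle) VALUES (?)", (bare,))
    return True
```

Checking at both writers means a publisher handle is blocked even if only one of the two code paths is reached for a given PR.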
Verified end-to-end against live VPS DB snapshot:
- CNBC originator event: blocked (insert returns False)
- CNBC contributors row: blocked (no row created)
- alexastrum, thesensatore, newhandle_xyz: pass through unchanged
- is_publisher_handle handles case-insensitive lookup correctly
  (CNBC and cnbc both match publisher_id=3)
Pre-deploy event count was 3705. Post-classifier cleanup: 3623 (82 org
events purged). Going forward, no new org events accumulate.
Branch 2 of the schema-v26 rollout. Branch 3 (auto-create at tier='cited',
extract.py sources.publisher_id wiring) is separate scope and not required
for regression prevention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>