fix(attribution): canonicalize submitted_by at write time + historical normalizer #10

Merged
fwazb merged 1 commit from fix/canonicalize-submitted-by into main 2026-05-13 03:19:28 +00:00

Problem

prs.submitted_by (and sources.submitted_by) were being written with decorated strings like Vida (self-directed), @m3taversal, and pipeline (reweave). The activity-feed API surfaces this field as contributor, and the frontend routes /contributors/{contributor} against it — decorated strings 404 because no such contributor row exists.

The sibling PR (fix/activity-feed-canonical-handle) normalizes at read time. This PR stops the bad data from ever entering the DB in the first place, so the read-side fix becomes defense-in-depth rather than load-bearing.

Changes

Write sites — all three now write canonical handles (lowercase, no @, no trailing parenthetical):

  • lib/extract.py:690 — extraction-stage source attribution
  • diagnostics/backfill_submitted_by.py — legacy backfill (still referenced in ops)
  • scripts/backfill-research-session-attribution.py — research-session re-attribution

Canonical form derived from lib/attribution._HANDLE_RE (^[a-z0-9][a-z0-9_-]{0,38}$).

One-time historical fix — scripts/normalize-submitted-by.py:

  • Lowercases + strips trailing parenthetical from prs.submitted_by and sources.submitted_by
  • Defaults to --dry-run; --apply to commit
  • Dry-run against live DB: 3008 rows in prs, 730 rows in sources → 3738 total updates, 0 invalid handles produced
  • Idempotent: re-running is a no-op

Verification

  • All four files py_compile clean
  • Dry-run mapping verified row-by-row (counts shown per source variant)
  • No schema changes — pure string normalization on existing TEXT columns

Run order after merge

  1. Auto-deploy lands the code changes
  2. SSH to VPS, run once: python3 /opt/teleo-eval/pipeline/scripts/normalize-submitted-by.py --apply
  3. (Optional) verify: sqlite3 pipeline.db "SELECT DISTINCT submitted_by FROM prs WHERE submitted_by LIKE '%(%'" should return 0 rows
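Step 3's spot-check can also be done in one pass over both tables. A sketch using stdlib `sqlite3`; the function name, the default DB path, and the exact definition of "non-canonical" here are assumptions:

```python
import sqlite3

def count_noncanonical(db_path: str = "pipeline.db") -> dict[str, int]:
    """Count rows whose submitted_by still has a decorator, a leading @, or uppercase."""
    db = sqlite3.connect(db_path)
    counts = {}
    for table in ("prs", "sources"):
        (n,) = db.execute(
            f"SELECT COUNT(*) FROM {table} "
            "WHERE submitted_by LIKE '%(%' "
            "   OR submitted_by LIKE '@%' "
            "   OR submitted_by != lower(submitted_by)"
        ).fetchone()
        counts[table] = n
    return counts  # expect all zeros after a successful --apply run
```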

Stack context

Part of a 3-PR set fixing timeline-page 404s end-to-end. See sibling Forgejo PR (fix/activity-feed-canonical-handle) and GitHub PR living-ip/livingip-web#33 for the read-side and frontend contract pieces respectively.

m3taversal added 1 commit 2026-05-13 03:08:48 +00:00
fix(attribution): canonicalize submitted_by at write time + historical normalizer
74bf0461e8
Companion / write-side fix to fix/activity-feed-canonical-handle.

The activity-feed canonicalization was a read-side guard. The bug at the
source is that extract.py and two backfill scripts write decorated
strings (Vida (self-directed), pipeline (reweave), @m3taversal) into
prs.submitted_by and sources.submitted_by. Downstream readers
(lib.contributor.insert_contribution_event, scripts/scoring_digest,
diagnostics/activity_feed_api) all strip the decorator on read — but
anything that reads the column verbatim (like /api/activity-feed before
the read-side fix) 404s on /contributors/{decorated-handle}.

Stop writing the decorator. The self-directed signal is already carried
by intake_tier == research-task plus the prs.agent column; the suffix
is redundant string noise that costs us correctness at every consumer
that forgets to strip.

Changes:

- lib/extract.py:690 — write canonical handle via attribution.normalize_handle.
  Direct elif for intake_tier == research-task now stores just agent_name.
  @m3taversal -> m3taversal.

- diagnostics/backfill_submitted_by.py — same fix in two branches plus
  the reweave branch (pipeline (reweave) -> pipeline).

- scripts/backfill-research-session-attribution.py — UPDATE prs sets
  agent handle alone, no suffix. Docstring + log line updated.

- scripts/normalize-submitted-by.py (new) — one-time backfill that
  canonicalizes existing prs.submitted_by and sources.submitted_by rows.
  Strips trailing parenthetical decorators, lowercases, drops @. Defaults
  to dry-run; --apply to commit. Skips rows that would normalize to
  invalid handles (no garbage falls through silently).

Dry-run against live pipeline.db:
  prs:     3008 rows need normalization (clean mappings, 0 invalid)
  sources: 730 rows need normalization (clean mappings, 0 invalid)
  Total:   3738 rows. All map to existing handle column values.

After this lands + auto-deploys, the operator should run
  python3 scripts/normalize-submitted-by.py --apply
once to clean historical rows. The read-side canonicalization in
diagnostics/activity_feed_api.py (fix/activity-feed-canonical-handle)
becomes redundant defense-in-depth instead of load-bearing.

No KB writes.
fwazb merged commit b29ec95dd8 into main 2026-05-13 03:19:28 +00:00
Reference: teleo/teleo-infrastructure#10