Reweave PRs modify existing files (appending YAML edges). Cherry-pick
fails ~75% when main moves between PR creation and merge.
_merge_reweave_pr() reads each changed file from both main HEAD and
branch HEAD, unions the edge arrays (order-preserving, main-first),
and writes the result. Eliminates merge conflicts structurally.
Key design decisions (Ganymede + Theseus approved):
- Order-preserving dedup: main's edges first, branch-new appended
- Superset assertion: logs warning if branch missing main edges
- Uses main's body text (reweave only touches frontmatter)
- Loud failure on parse errors (no cherry-pick fallback)
- Append-only contract: reweave adds edges, never removes
18 tests covering parse, union, serialize, superset, and full workflow.
Also fixes _is_entity path check to use Path.parts instead of string
containment, preventing false positives on paths like "domains/entities-overview/".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: _group_into_windows never checked time gaps or chat_id.
All messages went into one stream, capped at 10 per window. 120 msgs
from one chat → 12 windows → 12 source files → 12 extraction branches.
Fix:
- Group by chat_id first (different chats = different windows always)
- Split on actual time gaps (>window_seconds between messages)
- Cap at 50 messages per window (not 10)
- Consolidate substantive windows from same chat into one source file
at triage time (one source per chat per triage cycle)
6 tests in tests/test_tg_batching.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Layer 1: Insertion-time dedup in openrouter-extract-v2.py — skip if source_slug
already appears in claim content.
Layer 2: Insertion-time dedup in entity_batch.py — skip if PR number already
enriched this claim.
Layer 3: Post-rebase dedup in merge.py — scan rebased files for duplicate
evidence blocks (same source reference) and remove them before force-push.
Root cause: multiple enrichment branches modify the same claim at the same
insertion point. When rebased sequentially, evidence blocks are duplicated.
(Leo: PRs #1751, #1752)
lib/dedup.py: standalone module — parses evidence headers, deduplicates by
source key, preserves trailing content (Relevant Notes, Topics sections).
9 tests covering all patterns including the real PR #1751 duplication case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Atomic extract-and-connect (lib/connect.py):
- After extraction writes claim files, each new claim is embedded via
OpenRouter, searched against Qdrant, and top-5 neighbors (cosine > 0.55)
are added as `related` edges in the claim's frontmatter
- Edges written on NEW claim only — avoids merge conflicts
- Cross-domain connections enabled, non-fatal on Qdrant failure
- Wired into openrouter-extract-v2.py post-extraction step
Stale PR monitor (lib/stale_pr.py):
- Every watchdog cycle checks open extract/* PRs
- If open >30 min AND 0 claim files → auto-close with comment
- After 2 stale closures → marks source as extraction_failed
- Wired into watchdog.py as check #6
Response audit system:
- response_audit table (migration v8), persistent audit conn in bot.py
- 90-day retention cleanup, tool_calls JSON column
- Confidence tag stripping, systemd ReadWritePaths for pipeline.db
Supporting infrastructure:
- reweave.py: nightly edge reconnection for orphan claims
- reconcile-sources.py: source status reconciliation
- backfill-domains.py: domain classification backfill
- ops/reconcile-source-status.sh: operational reconciliation script
- Attribution improvements, post-extract enrichments, merge improvements
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>