feat(ingestion): metadao.fi scraper to replace broken futard.io ingestion #6
Open
m3taversal
wants to merge 4 commits from
ship/metadao-scraper into main
pull from: ship/metadao-scraper
merge into: teleo:main
4 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
| 353c4a57b9 |
fix(deploy): add scripts/ to deploy.sh + auto-deploy.sh
Some checks failed
CI / lint-and-test (pull_request) Has been cancelled
Per Ganymede review of PR #6: scripts/ was in neither deploy script, so 25 root-level Python scripts (metadao-scrape.py, embed-claims.py, tier0-gate.py, etc.) lived in repo but never reached VPS.

Changes (identical pattern in both files):
- Add scripts/*.py to pre-deploy syntax check glob
- Add scripts/ rsync to $PIPELINE_DIR/scripts/

Restart trigger NOT updated: scripts/ are cron-invoked (not daemon-imported), same pattern as fetch_coins.py.

All 25 scripts/*.py pre-flight syntax check passed locally. |
|||
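The pre-deploy syntax check mentioned in commit 353c4a57b9 lives in the shell deploy scripts, whose contents are not shown here. As a rough illustration of the same idea in Python, the sketch below compiles every file matched by a list of globs without executing it. The function names `check_source` and `syntax_check` are hypothetical, and only the `scripts/*.py` pattern comes from the commit message.

```python
import glob


def check_source(path, source):
    """Return None if the source compiles, else a short error description."""
    try:
        # compile() only parses and byte-compiles; it never runs the code.
        compile(source, path, "exec")
    except SyntaxError as exc:
        return f"{path}:{exc.lineno}: {exc.msg}"
    return None


def syntax_check(patterns):
    """Check every file matched by the glob patterns; return a list of errors."""
    failures = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as handle:
                error = check_source(path, handle.read())
            if error is not None:
                failures.append(error)
    return failures


if __name__ == "__main__":
    # Hypothetical pattern list: "scripts/*.py" is the glob this commit adds.
    problems = syntax_check(["scripts/*.py"])
    for line in problems:
        print(line)
    raise SystemExit(1 if problems else 0)
```

Exiting non-zero on any failure lets a deploy script abort before the rsync step, which is the point of running the check first.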
| dde055fdbf |
fix(metadao-scrape): STAT_BLEED word boundaries + min-render gate
Some checks failed
CI / lint-and-test (pull_request) Has been cancelled
Ganymede review on PR #6 (commit
|
|||
| 800d1d8b8e |
fix(metadao-scrape): YAML escape + URL regex + dry_run consistency
Some checks failed
CI / lint-and-test (pull_request) Has been cancelled
Ganymede review on PR #6:
- WARNING: title and project["name"] flowed unescaped into YAML, would corrupt frontmatter on quote-bearing inputs (e.g. 'Adopt "Conservative" Pricing'). New _yaml_str helper routes free-text values through json.dumps (JSON strings are valid YAML strings). Applied to title, author, url, project_slug, proposal_address, proposal_status, squads_proposal, squads_status.
- NIT: URL_ADDR_RE didn't match new metadao.fi URLs: pattern segment couldn't span /projects/{slug}/proposal/. Added (?:/[^/...]*)*? for variable path depth. Verified against three URL shapes.
- NIT: dry_run key was omitted from JSON output on early --limit exit but present on normal exit. Trivial consistency fix.
- NIT (deferred): STAT_BLEED_RE protection is accidental rather than designed; only matters if MetaDAO breaks DP-NNNNN naming convention. Per Ganymede 'optional - current behavior fine.'

Verified: URL regex matches futard.io legacy + metadao.fi new + hypothetical no-slug shapes. YAML escape survives embedded quotes, newlines, backslashes, em-dashes. |
|||
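The YAML-escaping fix in commit 800d1d8b8e rests on the observation that a JSON string is also a valid double-quoted YAML scalar. The body of the real `_yaml_str` helper is not shown on this page, so the following is only a minimal sketch of that technique. The names `yaml_str_sketch` and `render_frontmatter` are hypothetical, and the field names are taken from the commit message.

```python
import json


def yaml_str_sketch(value):
    """Encode a free-text value as a double-quoted YAML scalar via JSON.

    json.dumps escapes quotes, backslashes and newlines, so the result
    stays on one line and cannot terminate the scalar early.
    """
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX.
    return json.dumps(str(value), ensure_ascii=False)


def render_frontmatter(fields):
    """Render a flat mapping as a YAML frontmatter block."""
    lines = ["---"]
    for key, value in fields.items():
        lines.append(f"{key}: {yaml_str_sketch(value)}")
    lines.append("---")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_frontmatter({
        "title": 'Adopt "Conservative" Pricing',
        "author": "line one\nline two",
    }))
```

Because the encoding is plain JSON, `json.loads` recovers the original text exactly, which gives a cheap round-trip check for awkward inputs.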
| b8fba8195f |
feat(ingestion): metadao.fi scraper to replace broken futard.io ingestion
Some checks failed
CI / lint-and-test (pull_request) Has been cancelled
Background:
- futard.io retired its /api/graphql endpoint between Apr 17–20
- Cloud Scheduler ingest-futard has been firing into 500s ever since (the AttributeError on e.url masked the real 404 for 5 days; fixed in living-ip/teleo-api@b8eb441 which surfaced the actual root cause)
- The ecosystem migrated to metadao.fi, which is Vercel-protected
- Direct curl is blocked by Vercel's anti-bot challenge regardless of headers; a real headless browser passes it cleanly

Approach:
- Playwright-driven scraper, runs as a one-shot
- Discovery: scrape /projects DOM for project slugs, then each /projects/{slug} for proposal addresses
- For each NEW proposal: visit page for prose body + call /api/decode-proposal/{addr} via in-browser fetch (bypasses challenge via the primed Vercel cookies in the browser context) for structured on-chain instructions
- Idempotent: dedup against existing proposal addresses in archive frontmatter AND filename basenames
- Filename embeds 8-char address fragment for stable cross-run dedup even on projects that don't use DP-NNNNN naming convention

Tested locally against 6 active projects (p2p-protocol, paystream, zklsol, loyal, ranger, solomon). Captured 13 new proposals, including the Solomon Gigabus DP-00003 that triggered this work, with proper titles, status, on-chain instruction decoding (Squads transactions, SPL transfers, memos), and project metadata.

Output schema matches existing futardio source files (type: source, event_type: proposal, domain: internet-finance, status: unprocessed) so the existing extract pipeline picks them up unchanged.

Architectural note: this script is intentionally NOT wired to systemd yet: VPS deploy needs Playwright + Chromium system libs which require apt sudo (currently scoped to teleo-* services only). Reviewing the script first; deploy path is a separate decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
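The idempotency rule in the first commit (dedup against frontmatter addresses and filename basenames, with an 8-character address fragment in each filename) can be sketched as follows. This is not the scraper's actual code. The function names, the `proposal_address:` line layout, the filename shape, and the choice of the first 8 characters as the fragment are all assumptions made for illustration. The archive is modelled as an in-memory mapping of basename to file text.

```python
import re

# Assumption: frontmatter carries a line like
#   proposal_address: "<base58 address>"
ADDR_LINE_RE = re.compile(
    r'^proposal_address:\s*"?([1-9A-HJ-NP-Za-km-z]{32,44})"?\s*$',
    re.MULTILINE,
)


def address_fragment(address, length=8):
    """Assumed fragment rule: the first `length` characters of the address."""
    return address[:length]


def archive_filename(project_slug, address):
    """Hypothetical filename shape that embeds the address fragment."""
    return f"{project_slug}-{address_fragment(address)}.md"


def seen_addresses(archive):
    """Collect every proposal address found in archived frontmatter."""
    found = set()
    for text in archive.values():
        found.update(ADDR_LINE_RE.findall(text))
    return found


def is_new(address, archive):
    """True only if neither the frontmatter nor any basename knows the address."""
    if address in seen_addresses(archive):
        return False
    fragment = address_fragment(address)
    return not any(fragment in basename for basename in archive)
```

Checking both sources means a proposal is skipped even if an older file lacks the fragment in its name, or if a file's frontmatter was edited after archiving.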