feat(ingestion): metadao.fi scraper to replace broken futard.io ingestion #6

Open
m3taversal wants to merge 4 commits from ship/metadao-scraper into main

Why

futard.io retired /api/graphql between Apr 17–20. Cloud Scheduler ingest-futard has been hammering 500s for 5 days. The root cause was masked because the exception handler accessed e.url, which raises AttributeError on aiohttp.ClientResponseError — the diagnostic fix landed in living-ip/teleo-api@b8eb441 and surfaced the real error: 404 Not Found — https://www.futard.io/api/graphql.

The ecosystem moved to metadao.fi. Direct curl is blocked by Vercel anti-bot. Headless Chromium passes the challenge cleanly.

Approach

  • Playwright Chromium, runs as a one-shot
  • Discovery: scrape /projects DOM for slugs → each /projects/{slug} for proposal addresses
  • For each NEW proposal: page DOM gives prose body, in-browser fetch(/api/decode-proposal/{addr}) returns structured on-chain instructions (Squads tx, SPL transfers, memos)
  • Idempotent across runs via three-stage dedup: known address in existing frontmatter (URL or proposal_address: field), address fragment in basename, final filename collision
  • Output matches existing futardio source schema — extract pipeline picks up unchanged
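As a rough sketch of the discovery step (the href shape and helper name here are assumptions, not the script's actual code; the real run drives Playwright and parses the rendered DOM):

```python
import re

# Hypothetical sketch of the discovery pass: pull project slugs out of
# the rendered /projects HTML. Assumes project cards link directly to
# /projects/{slug}; the real script gets this HTML from a Playwright
# page, not a raw fetch.
SLUG_RE = re.compile(r'href="/projects/([a-z0-9-]+)"')

def discover_slugs(projects_html: str) -> list[str]:
    """Return unique project slugs in first-seen order."""
    seen: dict[str, None] = {}
    for m in SLUG_RE.finditer(projects_html):
        seen.setdefault(m.group(1), None)
    return list(seen)
```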

Probe-tested

All 6 active projects: p2p-protocol, paystream, zklsol, loyal, ranger, solomon. 13 proposals captured (incl. Solomon Gigabus DP-00003). Re-run: written: 0, skipped_existing: 13 — dedup verified.

Asks for review

  1. Title extraction (extract_dp_title): went through three iterations, ending with a strict-pattern preference for DP-NNNNN (CAT): Title plus a stat-bleed stripper for flex-collapsed layouts. Sanity-check that there isn't a fourth edge case I missed.
  2. Card scoping (get_project_metadata): walks up the DOM only while the candidate ancestor contains no sibling proposal link. An earlier version walked too high and produced the same card_text for every proposal.
  3. Dedup (existing_proposal_addresses): regex on proposal_address: field + URL pattern, reads first 4KB. Worth a sanity check for false positives in tags etc.
  4. Frontmatter shape (build_source_markdown): pattern-matched against existing 2024-02-05-futardio-proposal-*.md but did not trace through lib/extract.py end-to-end.
  5. Failure modes: try/except-continue around each per-proposal fetch — would like a second pair of eyes on the Playwright timeout handling.
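A rough sketch of the frontmatter stage of the dedup (the field name and base58 address shape follow the description above; the actual regexes in the script may differ):

```python
import re
from pathlib import Path

# Hypothetical reconstruction of the first dedup stage: scan only the
# first 4KB of each archived markdown file for known proposal addresses,
# matching both an explicit proposal_address: field and addresses
# embedded in proposal URLs. Base58 alphabet excludes 0, O, I, l.
ADDR_FIELD_RE = re.compile(
    r'^proposal_address:\s*"?([1-9A-HJ-NP-Za-km-z]{32,44})"?', re.M
)
URL_ADDR_RE = re.compile(r'/proposal/([1-9A-HJ-NP-Za-km-z]{32,44})')

def existing_proposal_addresses(archive_dir: Path) -> set[str]:
    """Collect proposal addresses already present in the archive."""
    known: set[str] = set()
    for path in archive_dir.glob("*.md"):
        head = path.read_text(errors="replace")[:4096]
        known.update(ADDR_FIELD_RE.findall(head))
        known.update(URL_ADDR_RE.findall(head))
    return known
```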

Out of scope

  • VPS deploy (separate: Playwright + chromium apt deps need sudo, currently scoped to teleo-* — Cory call)
  • Decommissioning Cloud Scheduler
  • Backfilling captured proposals (will commit to inbox/queue/ once scraper greenlit)
m3taversal added 1 commit 2026-04-25 12:10:33 +00:00
feat(ingestion): metadao.fi scraper to replace broken futard.io ingestion
Some checks are pending
CI / lint-and-test (pull_request) Waiting to run
b8fba8195f
Background:
- futard.io retired its /api/graphql endpoint between Apr 17–20
- Cloud Scheduler ingest-futard has been firing into 500s ever since
  (the AttributeError on e.url masked the real 404 for 5 days; fixed
   in living-ip/teleo-api@b8eb441 which surfaced the actual root cause)
- The ecosystem migrated to metadao.fi, which is Vercel-protected
- Direct curl is blocked by Vercel's anti-bot challenge regardless of
  headers; a real headless browser passes it cleanly

Approach:
- Playwright-driven scraper, runs as a one-shot
- Discovery: scrape /projects DOM for project slugs, then each
  /projects/{slug} for proposal addresses
- For each NEW proposal: visit page for prose body + call
  /api/decode-proposal/{addr} via in-browser fetch (bypasses challenge
  via the primed Vercel cookies in the browser context) for structured
  on-chain instructions
- Idempotent: dedup against existing proposal addresses in archive
  frontmatter AND filename basenames
- Filename embeds 8-char address fragment for stable cross-run dedup
  even on projects that don't use DP-NNNNN naming convention
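A hypothetical sketch of the basename construction described above (helper name and filename prefix are illustrative, not the script's actual code):

```python
import re

def proposal_basename(date: str, slug: str, title: str, address: str) -> str:
    """Illustrative filename builder: a slugified title plus an 8-char
    address fragment, so re-runs can dedup on the basename alone even
    when a project skips the DP-NNNNN naming convention."""
    title_slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")[:60]
    return f"{date}-metadao-proposal-{slug}-{title_slug}-{address[:8]}.md"
```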

Tested locally against 6 active projects (p2p-protocol, paystream,
zklsol, loyal, ranger, solomon). Captured 13 new proposals — including
the Solomon Gigabus DP-00003 that triggered this work — with proper
titles, status, on-chain instruction decoding (Squads transactions,
SPL transfers, memos), and project metadata.

Output schema matches existing futardio source files (type: source,
event_type: proposal, domain: internet-finance, status: unprocessed)
so the existing extract pipeline picks them up unchanged.

Architectural note: this script is intentionally NOT wired to systemd
yet — VPS deploy needs Playwright + Chromium system libs which require
apt sudo (currently scoped to teleo-* services only). Reviewing the
script first; deploy path is a separate decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rio added 1 commit 2026-04-25 12:19:08 +00:00
fix(metadao-scrape): YAML escape + URL regex + dry_run consistency
800d1d8b8e
Ganymede review on PR #6:
- WARNING: title and project["name"] flowed unescaped into YAML, would
  corrupt frontmatter on quote-bearing inputs (e.g. 'Adopt "Conservative"
  Pricing'). New _yaml_str helper routes free-text values through
  json.dumps (JSON strings are valid YAML strings). Applied to title,
  author, url, project_slug, proposal_address, proposal_status,
  squads_proposal, squads_status.
- NIT: URL_ADDR_RE didn't match new metadao.fi URLs — pattern segment
  couldn't span /projects/{slug}/proposal/. Added (?:/[^/...]*)*? for
  variable path depth. Verified against three URL shapes.
- NIT: dry_run key was omitted from JSON output on early --limit exit
  but present on normal exit. Trivial consistency fix.
- NIT (deferred): STAT_BLEED_RE protection is accidental rather than
  designed; only matters if MetaDAO breaks DP-NNNNN naming convention.
  Per Ganymede 'optional — current behavior fine.'

Verified: URL regex matches futard.io legacy + metadao.fi new + hypothetical
no-slug shapes. YAML escape survives embedded quotes, newlines, backslashes,
em-dashes.
rio added 1 commit 2026-04-25 12:24:19 +00:00
fix(metadao-scrape): STAT_BLEED word boundaries + min-render gate
dde055fdbf
Ganymede review on PR #6 (commit 800d1d8 → this commit):

- WARNING: STAT_BLEED_RE false-positives on common words. The original
  pattern matched standalone stat-keyword tokens, clipping legitimate
  titles like "Engage with Pantera and Active Capital" → trimmed at
  " Active". Fix: require numeric/symbolic context (\$, +, -, \d) AFTER
  the stat-word, so word-only sequences pass through unchanged.

- _clean_title_candidate now uses finditer and takes the first match past
  offset 10, instead of re.search. re.search returns the leftmost match,
  and the DP-NNNNN digit sequence always wins leftmost position; we want
  the first POST-title bleed match instead.

- NIT 3: minimum-render gate before write. Skip partial renders rather
  than archiving stubs whose downstream extraction null-results.
  Threshold: body < 500B AND no DP-N in title → skip and retry next run.

Verified 10/10 on test grid: real bleed trimmed, mid-word false-positives
preserved (Compass, Active Capital, Live Streaming, Encompass, Activate,
Passage, Failure all pass through unchanged).
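A hedged reconstruction of the fixed pattern (keyword list, offset, and helper name are illustrative; the real STAT_BLEED_RE in the script likely differs):

```python
import re

# A stat keyword only counts as bleed when followed by numeric or
# symbolic context ($, +, -, digit), so titles containing bare words
# like "Active Capital" pass through. The \b keeps "Passage",
# "Activate", "Failure" etc. from matching mid-word.
STAT_BLEED_RE = re.compile(
    r"\s+(?:Pass|Fail|Active|Live|TWAP|Spot)\b\s*(?=[$+\-\d])"
)

def trim_stat_bleed(title: str) -> str:
    """Cut the title at the first bleed match past a small offset,
    so leading DP-NNNNN digits can never trigger the trim."""
    for m in STAT_BLEED_RE.finditer(title):
        if m.start() > 10:
            return title[:m.start()]
    return title
```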

NIT 1 (--headless no-op flag) and NIT 2 (futardio tag provenance noise):
deferred — cosmetic, batch with future touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rio added 1 commit 2026-04-25 12:27:49 +00:00
fix(deploy): add scripts/ to deploy.sh + auto-deploy.sh
353c4a57b9
Per Ganymede review of PR #6: scripts/ was in neither deploy script,
so 25 root-level Python scripts (metadao-scrape.py, embed-claims.py,
tier0-gate.py, etc.) lived in repo but never reached VPS.

Changes (identical pattern in both files):
- Add scripts/*.py to pre-deploy syntax check glob
- Add scripts/ rsync to $PIPELINE_DIR/scripts/

Restart trigger NOT updated — scripts/ are cron-invoked (not
daemon-imported), same pattern as fetch_coins.py.

All 25 scripts/*.py pre-flight syntax check passed locally.

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin ship/metadao-scraper:ship/metadao-scraper
git checkout ship/metadao-scraper

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git checkout main
git merge --no-ff ship/metadao-scraper
git push origin main
Reference: teleo/teleo-infrastructure#6