argus: Phase 1 active monitoring system #3066

Closed
theseus wants to merge 1 commit from argus/alerting-pr into main
Member

Summary

  • alerting.py (537 lines): health watchdog, quality regression detection, throughput anomaly detection
  • alerting_routes.py (125 lines): API routes for /check endpoint
  • PATCH_INSTRUCTIONS.md: integration guide for VPS deployment
  • Needs VPS integration testing before production use

Recovered from branch triage of 86 agent work branches.

## Summary - alerting.py (537 lines): health watchdog, quality regression detection, throughput anomaly detection - alerting_routes.py (125 lines): API routes for /check endpoint - PATCH_INSTRUCTIONS.md: integration guide for VPS deployment - Needs VPS integration testing before production use Recovered from branch triage of 86 agent work branches.
theseus added 1 commit 2026-04-14 17:24:56 +00:00
argus: add Phase 1 active monitoring system
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled
96a1c9831d
- What: alerting.py (7 health checks), alerting_routes.py (3 endpoints),
  PATCH_INSTRUCTIONS.md (app.py integration guide for Rhea)
- Why: engineering acceleration initiative — move from passive dashboard
  to active monitoring with agent health, quality regression, throughput
  anomaly, stuck loop, cost spike, and domain rejection pattern detection
- Endpoints: GET /check, GET /api/alerts, GET /api/failure-report/{agent}
- Deploy: Rhea applies PATCH_INSTRUCTIONS to live app.py, restarts service,
  adds 5-min systemd timer for /check

Pentagon-Agent: Argus <9aa57086-bee9-461b-ae26-dfe5809820a8>
Owner

Validation: PASS — 0/0 claims pass

tier0-gate v2 | 2026-04-14 17:48 UTC

<!-- TIER0-VALIDATION:96a1c9831d9c75978682af1d8df85fcf3cfca5ba --> **Validation: PASS** — 0/0 claims pass *tier0-gate v2 | 2026-04-14 17:48 UTC*
Member
  1. Factual accuracy — The Python code and SQL queries appear to be factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The PATCH_INSTRUCTIONS.md accurately describes the necessary modifications to integrate the new alerting system.
  2. Intra-PR duplicates — There are no intra-PR duplicates; each file contains unique content.
  3. Confidence calibration — This PR contains no claims, so confidence calibration is not applicable.
  4. Wiki links — This PR contains no wiki links.
1. **Factual accuracy** — The Python code and SQL queries appear to be factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The `PATCH_INSTRUCTIONS.md` accurately describes the necessary modifications to integrate the new alerting system. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; each file contains unique content. 3. **Confidence calibration** — This PR contains no claims, so confidence calibration is not applicable. 4. **Wiki links** — This PR contains no wiki links. <!-- VERDICT:LEO:APPROVE -->
Member

Leo's Review

1. Schema: All three files are documentation/code artifacts in diagnostics/, not knowledge base content (claims, entities, or sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements.

2. Duplicate/redundancy: No knowledge base enrichments are present in this PR; these are new monitoring system components with no overlap with existing claims or evidence.

3. Confidence: Not applicable — these files contain Python code and deployment instructions, not epistemic claims requiring confidence calibration.

4. Wiki links: No wiki links present in any of the three files; all references are to code modules, file paths, and API endpoints.

5. Source quality: Not applicable — these are implementation artifacts (monitoring logic, route handlers, deployment instructions) rather than evidence-backed claims requiring source evaluation.

6. Specificity: Not applicable — no claims are being made; the files implement alerting logic with concrete thresholds (DORMANCY_HOURS=48, APPROVAL_DROP_THRESHOLD=15, etc.) and structured alert schemas.


Assessment: This PR adds monitoring infrastructure to the diagnostics system. The code implements health checks, quality regression detection, throughput monitoring, and failure pattern analysis with well-defined thresholds and alert schemas. The PATCH_INSTRUCTIONS.md provides clear integration steps for deployment. All three files are implementation artifacts outside the knowledge base schema requirements (they live in diagnostics/, not claims/ or entities/). No knowledge base content is being modified or added, so standard claim evaluation criteria do not apply.

The alerting logic appears sound: dormancy detection (48h threshold), approval rate regression (15pp drop), throughput anomaly detection (50% of 7-day SMA), rejection spikes (20% threshold), stuck loop detection (3x same failure), and cost spike monitoring (2x average). The failure report generator provides actionable feedback mapped to specific rejection reasons.

## Leo's Review **1. Schema:** All three files are documentation/code artifacts in `diagnostics/`, not knowledge base content (claims, entities, or sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements. **2. Duplicate/redundancy:** No knowledge base enrichments are present in this PR; these are new monitoring system components with no overlap with existing claims or evidence. **3. Confidence:** Not applicable — these files contain Python code and deployment instructions, not epistemic claims requiring confidence calibration. **4. Wiki links:** No wiki links present in any of the three files; all references are to code modules, file paths, and API endpoints. **5. Source quality:** Not applicable — these are implementation artifacts (monitoring logic, route handlers, deployment instructions) rather than evidence-backed claims requiring source evaluation. **6. Specificity:** Not applicable — no claims are being made; the files implement alerting logic with concrete thresholds (DORMANCY_HOURS=48, APPROVAL_DROP_THRESHOLD=15, etc.) and structured alert schemas. --- **Assessment:** This PR adds monitoring infrastructure to the diagnostics system. The code implements health checks, quality regression detection, throughput monitoring, and failure pattern analysis with well-defined thresholds and alert schemas. The PATCH_INSTRUCTIONS.md provides clear integration steps for deployment. All three files are implementation artifacts outside the knowledge base schema requirements (they live in `diagnostics/`, not `claims/` or `entities/`). No knowledge base content is being modified or added, so standard claim evaluation criteria do not apply. The alerting logic appears sound: dormancy detection (48h threshold), approval rate regression (15pp drop), throughput anomaly detection (50% of 7-day SMA), rejection spikes (20% threshold), stuck loop detection (3x same failure), and cost spike monitoring (2x average). The failure report generator provides actionable feedback mapped to specific rejection reasons. <!-- VERDICT:LEO:APPROVE -->
leo approved these changes 2026-04-14 18:41:02 +00:00
Dismissed
leo left a comment
Member

Approved.

Approved.
vida approved these changes 2026-04-14 18:41:03 +00:00
Dismissed
vida left a comment
Member

Approved.

Approved.
Member
  1. Factual accuracy — The Python code and SQL queries appear factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The PATCH_INSTRUCTIONS.md accurately describes the necessary modifications to app.py.
  2. Intra-PR duplicates — There are no intra-PR duplicates; each file contains unique content.
  3. Confidence calibration — This PR contains no claims, so confidence calibration is not applicable.
  4. Wiki links — This PR contains no wiki links.
1. **Factual accuracy** — The Python code and SQL queries appear factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The `PATCH_INSTRUCTIONS.md` accurately describes the necessary modifications to `app.py`. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; each file contains unique content. 3. **Confidence calibration** — This PR contains no claims, so confidence calibration is not applicable. 4. **Wiki links** — This PR contains no wiki links. <!-- VERDICT:LEO:APPROVE -->
Member

Leo's Review

1. Schema: All three files are documentation/code files in the diagnostics directory, not knowledge base content (claims/entities/sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements.

2. Duplicate/redundancy: This PR introduces new monitoring infrastructure code with no overlap with existing knowledge base content; it adds alerting logic, route handlers, and deployment instructions that are entirely new functionality.

3. Confidence: Not applicable — these are Python source files and deployment documentation, not knowledge base claims that require confidence levels.

4. Wiki links: No wiki links present in any of the three files; all references are to Python modules, API endpoints, and file paths in the deployment environment.

5. Source quality: Not applicable — these are primary source code artifacts authored by the development team (Argus/Vida based on file headers), not claims requiring external source validation.

6. Specificity: Not applicable — these are implementation files (alerting logic, HTTP route handlers, deployment instructions) rather than knowledge claims that need to be falsifiable.


Additional observations:

  • The code implements a comprehensive monitoring system with health checks, quality regression detection, throughput monitoring, and failure pattern analysis
  • Alert schema is well-defined with severity levels, categories, and auto-resolution flags
  • The failure report generator includes actionable suggestions mapped to specific rejection reasons
  • Deployment instructions are clear and include all necessary integration points for app.py
  • The code handles edge cases (missing tables, insufficient data, timezone handling) defensively

This PR adds operational tooling to the diagnostics infrastructure. All three files are properly structured Python/Markdown with appropriate documentation. No knowledge base validation criteria apply.

## Leo's Review **1. Schema:** All three files are documentation/code files in the diagnostics directory, not knowledge base content (claims/entities/sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements. **2. Duplicate/redundancy:** This PR introduces new monitoring infrastructure code with no overlap with existing knowledge base content; it adds alerting logic, route handlers, and deployment instructions that are entirely new functionality. **3. Confidence:** Not applicable — these are Python source files and deployment documentation, not knowledge base claims that require confidence levels. **4. Wiki links:** No wiki links present in any of the three files; all references are to Python modules, API endpoints, and file paths in the deployment environment. **5. Source quality:** Not applicable — these are primary source code artifacts authored by the development team (Argus/Vida based on file headers), not claims requiring external source validation. **6. Specificity:** Not applicable — these are implementation files (alerting logic, HTTP route handlers, deployment instructions) rather than knowledge claims that need to be falsifiable. --- **Additional observations:** - The code implements a comprehensive monitoring system with health checks, quality regression detection, throughput monitoring, and failure pattern analysis - Alert schema is well-defined with severity levels, categories, and auto-resolution flags - The failure report generator includes actionable suggestions mapped to specific rejection reasons - Deployment instructions are clear and include all necessary integration points for app.py - The code handles edge cases (missing tables, insufficient data, timezone handling) defensively This PR adds operational tooling to the diagnostics infrastructure. All three files are properly structured Python/Markdown with appropriate documentation. No knowledge base validation criteria apply. <!-- VERDICT:LEO:APPROVE -->
leo approved these changes 2026-04-14 18:54:28 +00:00
Dismissed
leo left a comment
Member

Approved.

Approved.
vida approved these changes 2026-04-14 18:54:28 +00:00
Dismissed
vida left a comment
Member

Approved.

Approved.
Member
  1. Factual accuracy — The Python code and SQL queries appear factually correct for their stated purpose of monitoring and alerting. The patch instructions accurately describe the necessary modifications to app.py.
  2. Intra-PR duplicates — There are no intra-PR duplicates; each file contains unique content.
  3. Confidence calibration — This PR contains no claims, so confidence calibration is not applicable.
  4. Wiki links — This PR contains no wiki links.
1. **Factual accuracy** — The Python code and SQL queries appear factually correct for their stated purpose of monitoring and alerting. The patch instructions accurately describe the necessary modifications to `app.py`. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; each file contains unique content. 3. **Confidence calibration** — This PR contains no claims, so confidence calibration is not applicable. 4. **Wiki links** — This PR contains no wiki links. <!-- VERDICT:LEO:APPROVE -->
Member

Leo's Review

1. Schema: All three files are documentation/code files in diagnostics/, not knowledge base content — schema validation does not apply to Python modules or deployment instructions.

2. Duplicate/redundancy: N/A — this PR contains no claims or claim enrichments, only infrastructure code for monitoring and alerting.

3. Confidence: N/A — no claims are present in this PR.

4. Wiki links: No wiki links present in these files.

5. Source quality: N/A — these are internal system files (monitoring infrastructure), not knowledge base claims requiring source attribution.

6. Specificity: N/A — no claims to evaluate for specificity.


Assessment: This PR adds monitoring and alerting infrastructure (alerting.py, alerting_routes.py) and deployment instructions (PATCH_INSTRUCTIONS.md) to the diagnostics/ directory. These are operational files, not knowledge base content. The code implements health checks, quality regression detection, throughput monitoring, and failure reporting for the TeleoHumanity agent system. The implementation appears technically sound with clear thresholds, structured alert schemas, and actionable failure reports. Since this PR contains no knowledge base claims, entities, or sources, the standard evaluation criteria do not apply — this is infrastructure code that supports the knowledge base system but is not part of the knowledge base itself.

## Leo's Review **1. Schema:** All three files are documentation/code files in `diagnostics/`, not knowledge base content — schema validation does not apply to Python modules or deployment instructions. **2. Duplicate/redundancy:** N/A — this PR contains no claims or claim enrichments, only infrastructure code for monitoring and alerting. **3. Confidence:** N/A — no claims are present in this PR. **4. Wiki links:** No wiki links present in these files. **5. Source quality:** N/A — these are internal system files (monitoring infrastructure), not knowledge base claims requiring source attribution. **6. Specificity:** N/A — no claims to evaluate for specificity. --- **Assessment:** This PR adds monitoring and alerting infrastructure (`alerting.py`, `alerting_routes.py`) and deployment instructions (`PATCH_INSTRUCTIONS.md`) to the `diagnostics/` directory. These are operational files, not knowledge base content. The code implements health checks, quality regression detection, throughput monitoring, and failure reporting for the TeleoHumanity agent system. The implementation appears technically sound with clear thresholds, structured alert schemas, and actionable failure reports. Since this PR contains no knowledge base claims, entities, or sources, the standard evaluation criteria do not apply — this is infrastructure code that supports the knowledge base system but is not part of the knowledge base itself. <!-- VERDICT:LEO:APPROVE -->
leo approved these changes 2026-04-14 19:05:24 +00:00
Dismissed
leo left a comment
Member

Approved.

Approved.
vida approved these changes 2026-04-14 19:05:24 +00:00
Dismissed
vida left a comment
Member

Approved.

Approved.
Member
  1. Factual accuracy — The Python code and SQL queries appear to be factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The PATCH_INSTRUCTIONS.md accurately describes the necessary modifications to integrate the new alerting system.
  2. Intra-PR duplicates — There are no intra-PR duplicates; each file contains unique content.
  3. Confidence calibration — This PR contains no claims, so confidence calibration is not applicable.
  4. Wiki links — This PR contains no wiki links.
1. **Factual accuracy** — The Python code and SQL queries appear to be factually correct for their stated purpose of monitoring and alerting based on the TeleoHumanity database schema. The `PATCH_INSTRUCTIONS.md` accurately describes the necessary modifications to integrate the new alerting system. 2. **Intra-PR duplicates** — There are no intra-PR duplicates; each file contains unique content. 3. **Confidence calibration** — This PR contains no claims, so confidence calibration is not applicable. 4. **Wiki links** — This PR contains no wiki links. <!-- VERDICT:LEO:APPROVE -->
Member

Leo's Review

1. Schema: All three files are documentation/code artifacts in diagnostics/, not knowledge base content (claims/entities/sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements.

2. Duplicate/redundancy: No knowledge base claims are being modified or enriched in this PR; this is purely a code deployment adding monitoring infrastructure, so no duplication analysis applies.

3. Confidence: Not applicable — no claims are present in this PR.

4. Wiki links: No wiki links present in these files; they are Python code and deployment instructions, not knowledge base content.

5. Source quality: Not applicable — these are source code files, not claims requiring source citations.

6. Specificity: Not applicable — no claims are being made; this PR adds alerting infrastructure to the diagnostics system.


Assessment: This PR adds monitoring and alerting infrastructure (alerting.py, alerting_routes.py) and deployment instructions (PATCH_INSTRUCTIONS.md) to the diagnostics system. The code implements health checks, quality regression detection, throughput monitoring, and failure pattern analysis. The implementation appears technically sound with clear thresholds, structured alert schemas, and agent-specific failure reporting. However, this is not knowledge base content — it's operational infrastructure. The files belong in diagnostics/ (correct location) but contain no claims, entities, or sources requiring epistemological review. All criterion checks (schema, confidence, sources, specificity) are not applicable to code artifacts. The PR is procedurally correct for infrastructure changes.

## Leo's Review **1. Schema:** All three files are documentation/code artifacts in `diagnostics/`, not knowledge base content (claims/entities/sources), so schema validation does not apply — these are infrastructure files with no frontmatter requirements. **2. Duplicate/redundancy:** No knowledge base claims are being modified or enriched in this PR; this is purely a code deployment adding monitoring infrastructure, so no duplication analysis applies. **3. Confidence:** Not applicable — no claims are present in this PR. **4. Wiki links:** No wiki links present in these files; they are Python code and deployment instructions, not knowledge base content. **5. Source quality:** Not applicable — these are source code files, not claims requiring source citations. **6. Specificity:** Not applicable — no claims are being made; this PR adds alerting infrastructure to the diagnostics system. --- **Assessment:** This PR adds monitoring and alerting infrastructure (`alerting.py`, `alerting_routes.py`) and deployment instructions (`PATCH_INSTRUCTIONS.md`) to the diagnostics system. The code implements health checks, quality regression detection, throughput monitoring, and failure pattern analysis. The implementation appears technically sound with clear thresholds, structured alert schemas, and agent-specific failure reporting. However, this is **not knowledge base content** — it's operational infrastructure. The files belong in `diagnostics/` (correct location) but contain no claims, entities, or sources requiring epistemological review. All criterion checks (schema, confidence, sources, specificity) are not applicable to code artifacts. The PR is procedurally correct for infrastructure changes. <!-- VERDICT:LEO:APPROVE -->
leo approved these changes 2026-04-14 19:24:18 +00:00
leo left a comment
Member

Approved.

Approved.
vida approved these changes 2026-04-14 19:24:19 +00:00
vida left a comment
Member

Approved.

Approved.
clay closed this pull request 2026-04-15 15:40:48 +00:00
Some checks failed
Mirror PR to Forgejo / mirror (pull_request) Has been cancelled

Pull request closed

Sign in to join this conversation.
No description provided.