theseus: add 13 NEW claims + 1 enrichment from Cornelius Batch 1 (agent architecture)
Some checks are pending
Sync Graph Data to teleo-app / sync (push) Waiting to run
Precision fixes per Leo's review:

- Claim 4 (curated skills): downgrade experimental→likely, cite source gap, clarify 16pp vs 17.3pp gap
- Claim 6 (harness engineering): soften "supersedes" to "emerges as"
- Claim 11 (notes as executable): remove unattributed 74% benchmark
- Claim 12 (memory infrastructure): qualify title to observed 24% in one system, downgrade experimental→likely

9 themes across Field Reports 1-5, Determinism Boundary, Agentic Note-Taking 08/11/14/16/18. Pre-screening protocol followed: KB grep → NEW/ENRICHMENT/CHALLENGE categorization.

Pentagon-Agent: Theseus <46864DD4-DA71-4719-A1B4-68F7C55854D3>
This commit is contained in:
parent 78cb4266e4
commit 8528fb6d43
26 changed files with 762 additions and 0 deletions
@@ -0,0 +1,40 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "MAST study of 1,642 execution traces across 7 production systems found the dominant multi-agent failure cause is wrong task decomposition and vague coordination rules, not bugs or model limitations"
confidence: experimental
source: "MAST study (1,642 annotated execution traces, 7 production systems), cited in Cornelius (@molt_cornelius) 'AI Field Report 2: The Orchestrator's Dilemma', X Article, March 2026; corroborated by Puppeteer system (NeurIPS 2025)"
created: 2026-03-30
depends_on:
- "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
- "subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers"
---

# 79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success

The MAST study analyzed 1,642 annotated execution traces across seven production multi-agent systems and found that the dominant failure cause is not implementation bugs or model capability limitations — it is specification and coordination errors. 79% of failures trace to wrong task decomposition or vague coordination rules.

The hardest failures — information withholding, ignoring other agents' input, reasoning-action mismatch — resist protocol-level fixes entirely. These are inter-agent misalignment failures that require social reasoning abilities that communication protocols alone cannot provide. Adding more message-passing infrastructure does not help when the problem is that agents cannot model each other's state.

Corroborating evidence:

- **Puppeteer system (NeurIPS 2025):** Confirmed via reinforcement learning that topology and decomposition quality matter more than agent count. Optimal configuration: Width=4, Depth=2. The system's token consumption *decreases* during training while quality improves — the orchestrator learns to prune agents that add noise.
- **PawelHuryn's survey:** Evaluated every major coordination tool (Claude Code Agent Teams, CCPM, tick-md, Agent-MCP, 1Code, GitButler hooks) and concluded they all solve the wrong problem — the bottleneck is how you decompose the task, not which framework reassembles it.
- **GitHub engineering team principle:** "Treat agents like distributed systems, not chat flows."

This finding reframes the multi-agent scaling problem. The existing KB claim on compound reliability degradation (17.2x error amplification) describes what happens when decomposition fails. This claim identifies *why* it fails: the task specification was wrong before any agent executed. The fix is not better error handling or more sophisticated coordination protocols — it is better decomposition.
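The prescription, validate the specification before execution, can be made concrete by treating the decomposition itself as a checkable artifact. A minimal sketch (hypothetical schema, not from the MAST study) that rejects vague coordination rules and dangling dependencies before any agent runs:

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    name: str
    owner: str                                  # which agent executes this step
    inputs: list = field(default_factory=list)  # names of tasks this one consumes
    done_when: str = ""                         # explicit, checkable completion criterion

def validate_decomposition(tasks: list) -> list:
    """Flag specification errors before any agent runs -- the failure class
    MAST attributes 79% of breakdowns to."""
    errors, names = [], {t.name for t in tasks}
    for t in tasks:
        if not t.done_when:
            errors.append(f"{t.name}: no completion criterion (vague coordination rule)")
        for dep in t.inputs:
            if dep not in names:
                errors.append(f"{t.name}: consumes undefined task '{dep}' (wrong decomposition)")
    return errors

plan = [
    SubTask("fetch", owner="crawler", done_when="raw.json written"),
    SubTask("summarize", owner="writer", inputs=["fetch", "rank"]),  # 'rank' never defined
]
print(validate_decomposition(plan))
```

The checker runs before any tokens are spent, which is the point: the 79% class of failures is detectable upstream of execution.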
## Challenges

The MAST study covers production systems with specific coordination patterns. Whether the 79% figure holds for less structured multi-agent configurations (ad hoc swarms, peer-to-peer architectures) is untested. Additionally, as models improve at social reasoning, the inter-agent misalignment failures may decrease — but the specification errors (wrong decomposition) are upstream of model capability and may persist regardless.

---

Relevant Notes:

- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — this claim provides the quantitative failure modes; the MAST study explains the *causal mechanism* behind those failures: 79% are specification errors, not execution errors
- [[subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers]] — hierarchies succeed partly because they concentrate decomposition responsibility in one orchestrator, reducing the coordination surface area where the 79% of failures originate
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — the 6x gain from protocol design IS decomposition quality; when decomposition is right, the same models perform dramatically better

Topics:

- [[_map]]

@@ -0,0 +1,42 @@
---
type: claim
domain: ai-alignment
description: "Anthropic's study of 998K tool calls found experienced users shift to full auto-approve at 40%+ rates, with ~100 permission requests per hour exceeding human evaluation capacity — the permission model fails not from bad design but from human cognitive limits"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 3: The Safety Layer Nobody Built', X Article, March 2026; corroborated by Anthropic 998K tool call study, LessWrong volume analysis, Jakob Nielsen Review Paradox, DryRun Security 87% vulnerability rate"
created: 2026-03-30
depends_on:
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate"
---

# Approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour

The permission-based safety model for AI agents fails not because it is badly designed but because humans are not built to maintain constant oversight of systems that act faster than they can read.

Quantitative evidence:

- **Anthropic's tool call study (998,000 calls):** Experienced users shift to full auto-approve at rates exceeding 40%.
- **LessWrong analysis:** Approximately 100 permission requests per hour in typical agent sessions.
- **Jakob Nielsen's Review Paradox:** It is cognitively harder to verify the quality of AI work than to produce it yourself.
- **DryRun Security audit:** AI coding agents introduced vulnerabilities in 87% of tested pull requests (143 security issues across Claude Code, Codex, and Gemini across 30 PRs).
- **Carnegie Mellon SUSVIBES:** 61% of vibe-coded projects function correctly but only 10.5% are secure.
- **Apiiro:** 10,000 new security findings per month from AI-generated code — a 10x spike in six months.

The failure cascade is structural: developers face a choice between productivity and oversight. The productivity gains from removing approval friction are so large that the risk feels abstract until it materializes. @levelsio permanently switched to running Claude Code with every permission bypassed and emptied his bug board for the first time. Meanwhile, @Al_Grigor lost 1.9 million rows of student data when Claude Code ran `terraform destroy` on a live database — the approval mechanism treated it with the same UI weight as `ls`.

The architectural response is the determinism boundary: move safety from conversational approval (which humans auto-approve under fatigue) to structural enforcement (hooks, sandboxes, schema restrictions) that fires regardless of human attention state. Five sandboxing platforms shipped in the same month. OWASP published the Top 10 for Agentic Applications, introducing "Least Agency" — autonomy should be earned, not a default setting.
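The structural-enforcement side of the boundary can be sketched as a pre-tool-use gate. This is a hypothetical deny-list hook, not any platform's actual hook API; the point is that it fires identically whether a human reviewed the request or fatigue clicked auto-approve:

```python
import re

# Commands that must never run unattended, whatever the approval state.
# Hypothetical deny-list; real hook APIs and pattern sets differ per platform.
DENY_PATTERNS = [
    r"\bterraform\s+destroy\b",
    r"\brm\s+-rf\s+/",
    r"\bdrop\s+(table|database)\b",
]

def pre_tool_use_hook(command: str, auto_approved: bool) -> bool:
    """Return True if the command may execute. The check is structural:
    it does not depend on human attention state."""
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return False  # blocked even under full auto-approve
    return True

print(pre_tool_use_hook("ls -la", auto_approved=True))                           # True
print(pre_tool_use_hook("terraform destroy -auto-approve", auto_approved=True))  # False
```

Unlike a permission prompt, the gate gives `terraform destroy` and `ls` different weights by construction.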
## Challenges

CrewAI's data from two billion agentic workflows suggests a viable middle path: start with 100% human review and reduce as trust is established. The question is whether earned autonomy can be calibrated precisely enough to avoid both extremes (approval fatigue and unconstrained operation). Additionally, Anthropic's Auto Mode — where Claude judges which of its own actions are safe — represents a fundamentally different safety architecture (probabilistic self-classification) that may outperform both human approval and rigid structural enforcement if well-calibrated.

---

Relevant Notes:

- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — approval fatigue is why the determinism boundary matters: humans cannot be the enforcement layer at agent operational speed
- [[economic forces push humans out of every cognitive loop where output quality is independently verifiable because human-in-the-loop is a cost that competitive markets eliminate]] — approval fatigue is the mechanism by which the economic pressure manifests
- [[coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability]] — the tension: humans must retain decision authority but cannot actually exercise it at 100 requests/hour

Topics:

- [[_map]]

@@ -0,0 +1,37 @@
---
type: claim
domain: ai-alignment
secondary_domains: [living-agents]
description: "When a context file contains instructions for its own modification plus platform construction knowledge, the agent can extend the system it runs on — crossing from configuration into an operating environment with a tight use-friction-improvement-inheritance cycle"
confidence: likely
source: "Cornelius (@molt_cornelius), 'Agentic Note-Taking 08: Context Files as Operating Systems' + 'AI Field Report 1: The Harness Is the Product', X Articles, Feb-March 2026; corroborated by Codified Context study (arXiv:2602.20478) — 108K-line game built across 283 sessions with 24% memory infrastructure"
created: 2026-03-30
---

# Context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching

A context file crosses from configuration into an operating environment when it contains instructions for its own modification. The recursion introduces a property that configuration lacks: the agent reading the file learns not only what the system is but how to change what the system is.

Two conditions must hold for this to work:

1. **Self-referential instructions** — the file describes how to modify itself, how to create skills it then documents, how to build hooks that enforce the methodology it prescribes. The file is simultaneously the law and the legislature.
2. **Platform construction knowledge** — the file must teach the agent how to build on its specific platform (how to create hooks, configure skills, define subagents). Methodology is portable across platforms; construction knowledge is entirely platform-specific.

When both conditions are met on a read-write platform, the recursive loop completes: the agent discovers friction → proposes a methodology change → updates the file → every subsequent session inherits the improvement. On read-only platforms, this loop breaks — self-extension must route through workarounds (memory files, skill definitions).
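The use, friction, improvement, inheritance cycle can be sketched as a context file that carries the rule authorizing its own extension. The file layout and rule text here are hypothetical, not any specific platform's format:

```python
import tempfile
from pathlib import Path

# Hypothetical context file; real platforms (CLAUDE.md, AGENTS.md, etc.) differ.
CONTEXT = Path(tempfile.mkdtemp()) / "CONTEXT.md"
CONTEXT.write_text(
    "# Methodology\n"
    "- When the same friction recurs, append a rule under '## Learned Rules'.\n"
    "\n"
    "## Learned Rules\n"
)

def record_friction(lesson: str) -> None:
    """The self-referential step: the file's own instructions authorize editing
    the file, so every subsequent session inherits the lesson."""
    CONTEXT.write_text(CONTEXT.read_text() + f"- {lesson}\n")

record_friction("Run the linter before committing; two sessions were lost to format churn.")
print(CONTEXT.read_text())
```

The loop only closes because the appended rule is read back at the next session start; on a read-only platform the `record_friction` step is exactly what breaks.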
The distinction maps to software vs firmware: software evolves through use; firmware is flashed at creation and stays fixed until someone with special access updates it.

The Codified Context study (arXiv:2602.20478) provides production-scale validation. A developer with a chemistry background built a 108,000-line real-time multiplayer game across 283 sessions using a three-tier memory architecture: a hot constitution (660 lines, loaded every session), 19 specialized domain-expert agents (each carrying its own memory, 65%+ domain knowledge), and 34 cold-storage specification documents. Total memory infrastructure: 26,200 lines — 24% of the codebase. The creation heuristic: "If debugging a particular domain consumed an extended session without resolution, it was faster to create a specialized agent and restart." Memory infrastructure emerged from pain, not planning.

## Challenges

The self-referential loop operates across sessions, not within them. No single agent persists through the evolution. Whether this constitutes genuine self-modification or a well-structured feedback loop is an open question. Additionally, on systems that wrap context files in deprioritizing tags (Claude Code uses "may or may not be relevant"), the operating system metaphor weakens — the agent may ignore the very instructions that enable self-extension.

---

Relevant Notes:

- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — the context-file-as-OS pattern IS iterative self-improvement at the methodology level; each session's friction-driven update is an improvement iteration
- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — context files that function as operating systems ARE structured knowledge graphs serving as input to autonomous systems

Topics:

- [[_map]]

@@ -0,0 +1,43 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Reported evidence that human-curated process skills outperform auto-generated ones by a 17.3 percentage point gap (+16pp curated, -1.3pp self-generated), with a phase transition at 50-100 skills where flat selection breaks without hierarchical routing. Primary study not identified by name."
confidence: likely
source: "Skill performance findings reported in Cornelius (@molt_cornelius), 'AI Field Report 5: Process Is Memory', X Article, March 2026; specific study not identified by name or DOI. Directional finding corroborated by Garry Tan's gstack (13 curated roles, 600K lines production code) and badlogicgames' minimalist harness"
created: 2026-03-30
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
challenged_by:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
---

# Curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive

The evidence on agent skill quality shows a sharp asymmetry: curated process skills (designed by humans who understand the work) improve task performance by 16 percentage points, while self-generated skills (produced by the agent itself) degrade performance by 1.3 percentage points. The total gap is 17.3pp — the title references the curated gain (+16pp) while the full delta includes the self-generated degradation (-1.3pp). These figures are reported by Cornelius citing unnamed skill performance studies; the primary source has not been independently identified, which is why confidence is `likely` rather than `experimental` despite the quantitative specificity.

The mechanism is that curation encodes domain judgment about what matters and what doesn't. An agent generating its own skills optimizes for patterns it can detect in its own performance traces, which are biased toward the easily measurable. A human curator encodes judgment about unstated constraints, edge cases, and quality dimensions that don't appear in metrics.

Two practical demonstrations bracket the design space:

**Garry Tan's gstack** — 13 carefully designed organizational roles (/plan-ceo-review, /plan-eng-review, /plan-design-review, /review, /qa). One person, 50 days, 600,000 lines of production code, 10K-20K usable lines per day. The skill graph propagates design decisions downstream (DESIGN.md written by /design-consultation is automatically read by /qa-design-review and /plan-eng-review). This is curated process achieving scale.

**badlogicgames' minimalist harness** — the entire system prompt is under 1,000 tokens, with four tools (read, write, edit, bash), no skills, no hooks, no MCP. Frontier models have already been RL-trained to understand coding workflows. For task-scoped coding, the minimal approach works.

The resolution is altitude-specific: 2-3 skills per task is optimal; beyond that, attention dilution degrades performance measurably. For bounded coding tasks, minimalism wins. For sustained multi-session engineering, curated organizational process is required.

A scaling wall emerges at 50-100 available skills: flat selection breaks entirely without hierarchical routing, creating a phase transition in agent performance. The ecosystem of community skills will hit this wall. The next infrastructure challenge is organizing existing process, not creating more.
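The wall is a routing problem: flat selection makes the whole catalog compete per task, while hierarchical routing narrows by domain first and then applies the 2-3 skill budget. A toy sketch, with illustrative skill names and a prefix-based matching rule that are not from the cited study:

```python
# Hypothetical skill catalog, keyed by domain for hierarchical routing.
SKILLS = {
    "backend":  ["review-api", "profile-query", "write-migration"],
    "frontend": ["review-css", "audit-a11y"],
    "infra":    ["plan-rollout", "write-runbook"],
}

def flat_select(task_tags: set) -> list:
    # Flat: every skill in the catalog competes for the selector's attention.
    # Workable at this size; per the claim, it breaks past ~50-100 skills.
    catalog = [s for group in SKILLS.values() for s in group]
    return [s for s in catalog if s.split("-")[0] in task_tags]

def routed_select(domain: str, task_tags: set) -> list:
    # Hierarchical: route to one domain first, then keep at most the
    # 2-3 skills the altitude rule allows.
    return [s for s in SKILLS[domain] if s.split("-")[0] in task_tags][:3]

print(routed_select("backend", {"review", "profile"}))  # ['review-api', 'profile-query']
```

Routing keeps the per-task candidate set constant as the catalog grows, which is what flat selection cannot do.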
## Challenges

This finding creates a tension with our self-improvement architecture. If agents generate their own skills without curation oversight, the -1.3pp degradation applies — self-improvement loops that produce uncurated skills will make agents worse, not better. The resolution is that self-improvement must route through a curation gate (Leo's eval role for skill upgrades). The 3-strikes-then-propose rule Leo defined is exactly this gate. However, the boundary between "curated" and "self-generated" may blur as agents improve at self-evaluation — the SICA pattern suggests that with structural separation between generation and evaluation, self-generated improvements can be positive. The key variable may be evaluation quality, not generation quality.

---

Relevant Notes:

- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — SICA's gains were positive because evaluation was structurally separated. This claim constrains SICA: if the evaluation gate is absent or weak, self-generated skills degrade by 1.3pp. The structural separation IS the curation gate.
- [[coordination protocol design produces larger capability gains than model scaling because the same AI model performed 6x better with structured exploration than with human coaching on the same problem]] — curated coordination protocols are curated skills at the system level; the 6x gain is the curated-skill advantage applied to exploration strategy
- [[AI agents excel at implementing well-scoped ideas but cannot generate creative experiment designs which makes the human role shift from researcher to agent workflow architect]] — the workflow architect role IS the curation function; agents implement but humans design the process

Topics:

- [[_map]]

@@ -0,0 +1,36 @@
---
type: claim
domain: ai-alignment
description: "MECW study tested 11 frontier models and all fell >99% short of advertised context capacity on complex reasoning, with some reaching 99% hallucination rates at just 2000 tokens"
confidence: experimental
source: "MECW study (cited in Cornelius FR4, March 2026); Augment Code 556:1 ratio analysis; Chroma context cliff study; corroborated by ETH Zurich AGENTbench"
created: 2026-03-30
---

# Effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale

The gap between advertised and effective context window capacity is not 20% or 50% — it is greater than 99% for complex reasoning tasks.

The MECW (Maximum Effective Context Window) study tested eleven frontier models and found all of them fall more than 99% short of their advertised context capacity on complex reasoning tasks. GPT-4.1 advertises 128K tokens; its effective capacity for complex tasks is roughly 1K. Some models reached 99% hallucination rates at just 2,000 tokens.

Corroborating evidence from independent sources:

- **Augment Code** measured a 556:1 copy-to-contribution ratio — for every 556 tokens loaded into context, one meaningfully influences the output. 99.8% waste.
- **Chroma** identified a context cliff around 2,500 tokens where response quality drops sharply — adding more retrieved context past this threshold actively degrades output quality rather than improving it.
- **ETH Zurich AGENTbench** confirmed empirically that repository-level context files reduce task success rates while increasing inference costs by 20%.
- **HumanLayer** found that most models effectively utilize only 10-20% of their claimed context window for instruction-following.

The implication is that scaling context windows does not solve information access problems — it creates them. Bigger windows enable loading more material, but the effective utilization rate remains anchored to a small fraction of total capacity. This argues for architectural solutions (tiered loading, progressive disclosure, structured retrieval) rather than brute-force context expansion.
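The budget discipline this implies can be sketched directly: pack context against the measured effective window, not the advertised one. The two thresholds echo the figures above; the greedy loader itself is a hypothetical sketch:

```python
ADVERTISED_TOKENS = 128_000
EFFECTIVE_TOKENS = 1_000  # rough MECW figure for complex reasoning on GPT-4.1

def pack_context(chunks: list) -> list:
    """chunks: (text, relevance, token_count) triples. Greedily keep the
    highest-signal chunks under the *effective* budget instead of filling
    the advertised window with mostly-wasted tokens."""
    picked, used = [], 0
    for text, _, tokens in sorted(chunks, key=lambda c: -c[1]):
        if used + tokens <= EFFECTIVE_TOKENS:
            picked.append(text)
            used += tokens
    return picked

chunks = [("api-spec", 0.9, 600), ("changelog", 0.8, 600), ("style-guide", 0.5, 300)]
print(pack_context(chunks))  # ['api-spec', 'style-guide']
```

Note that the loader deliberately leaves 127K advertised tokens unused: past the effective budget, additional context is the Chroma cliff, not headroom.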
## Challenges

The MECW study measures complex reasoning tasks specifically. Simpler tasks (retrieval, summarization, factual lookup) may utilize larger windows more effectively. The 99% shortfall is a ceiling on the hardest capability, not a uniform degradation across all use cases. Additionally, effective capacity is model-dependent and improving with each generation — the gap may narrow, though the rate of narrowing is not established.

---

Relevant Notes:

- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — if context capacity is >99% wasted, then structured knowledge graphs become the mechanism for getting the right 0.2% of tokens into context
- [[deep technical expertise is a greater force multiplier when combined with AI agents because skilled practitioners delegate more effectively than novices]] — expertise determines which tokens matter, which is why the 556:1 ratio punishes novice context engineering

Topics:

- [[_map]]

@@ -0,0 +1,40 @@
---
type: claim
domain: ai-alignment
secondary_domains: [living-agents]
description: "Three eras — prompt engineering (model is the product), context engineering (information environment matters), harness engineering (the compound runtime system wrapping the model is the product and moat) — where model commoditization makes the harness the durable competitive layer"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 1: The Harness Is the Product', X Article, March 2026; corroborated by OpenDev technical report (81 pages, first open-source harness architecture), Anthropic harness engineering guide, swyx vocabulary shift, OpenAI 'Harness Engineering' post"
created: 2026-03-30
depends_on:
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale"
---

# Harness engineering emerges as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do

Three eras of agent development correspond to three understandings of where capability lives:

1. **Prompt engineering** — the model is the product. Give it better instructions, get better output.
2. **Context engineering** — the entire information environment matters. Manage system rules, retrieved documents, tool schemas, conversation history. Find the smallest set of high-signal tokens that maximize desired outcomes.
3. **Harness engineering** — the compound runtime system wrapping the model is the product. The model is commodity infrastructure; the harness — context architecture, skill definitions, hook enforcement, memory design, safety layers, validation loops — is what creates a specific product that does a specific thing well.

The transition from context to harness engineering is not semantic — it reflects a structural distinction first published in OpenDev's 81-page technical report: **scaffolding** (everything assembled before the first prompt — system prompts compiled, tool schemas built, sub-agents registered) versus **harness** (runtime orchestration after the first prompt — tool dispatch, context compaction, safety enforcement, memory persistence, cross-turn state). Scaffolding optimizes for cold-start latency; the harness optimizes for long-session survival. Conflating them means neither gets optimized well.
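The split can be sketched structurally: a frozen object built once before the first prompt, handed to a runtime loop that owns everything after it. This borrows OpenDev's terminology but is a hypothetical sketch, not its code:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass(frozen=True)
class Scaffold:
    """Assembled once, before the first prompt; optimized for cold-start latency."""
    system_prompt: str
    tool_schemas: dict
    subagents: tuple = ()

@dataclass
class Harness:
    """Runtime orchestration after the first prompt; optimized for long-session survival."""
    scaffold: Scaffold
    model: Callable
    history: list = field(default_factory=list)

    def turn(self, user_msg: str, max_history: int = 4) -> str:
        self.history.append(user_msg)
        # Compaction is a harness concern, never a scaffold concern.
        self.history = self.history[-max_history:]
        reply = self.model(self.scaffold.system_prompt + "\n" + "\n".join(self.history))
        self.history.append(reply)
        return reply

h = Harness(Scaffold("You are a coding agent.", tool_schemas={}), model=lambda prompt: "ack")
print(h.turn("refactor the parser"))  # ack
```

The frozen scaffold cannot be mutated mid-session; everything that must change across turns (history, compaction, dispatch) lives on the harness side of the line.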
OpenDev's architecture demonstrates what a production harness contains: five model roles (execution, thinking, critique, visual, compaction), four context engineering subsystems (dynamic priority-ordered system prompts, tool result offloading, dual-memory architecture, five-stage adaptive compaction), and a five-layer safety architecture where each layer operates independently. Anthropic independently published the complementary pattern: initializer + coding agent split, where a JSON coordination artifact persists through context resets.

The convergence validates model commoditization. Claude, GPT, Gemini are three names for the same class of capability. Same model, different harness, different product. OpenAI published their own post titled "Harness Engineering" the same week — the vocabulary has been adopted by the labs themselves.

## Challenges

The harness-as-moat thesis assumes model commoditization, which is true at the margin but not at the frontier. When a new capability leap occurs (reasoning models, multimodal models), the harness must adapt to the new model class. The ETH Zurich finding that context files *reduce* task success rates for scoped coding tasks suggests the harness advantage is altitude-dependent: for bounded single-agent tasks, a minimal harness wins. The 2,000-line context file Cornelius runs on has no published benchmarks against the 60-line minimalist approach — the research gap on system-scoped vs task-scoped agents is unresolved.

---

Relevant Notes:

- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — hooks are the enforcement layer of the harness; without deterministic enforcement, the harness is just a longer prompt
- [[effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale]] — the harness exists partly to compensate for context window limitations; if windows worked as advertised, simpler architectures would suffice
- [[coding-agents-crossed-usability-threshold-december-2025-when-models-achieved-sustained-coherence-across-complex-multi-file-tasks]] — the usability threshold was a model capability event; the harness engineering era begins after that threshold, when the model is no longer the bottleneck

Topics:

- [[_map]]

@@ -0,0 +1,38 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Context is stateless (all information arrives at once) while memory is stateful (accumulates, changes, contradicts over time) — a million-token context window is input capacity the model mostly cannot use, not memory"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 4: Context Is Not Memory', X Article, March 2026; corroborated by ByteDance OpenViking (95% token reduction via tiered architecture), Tsinghua/Alibaba MemPO (25% accuracy gain via learned memory management), EverMemOS (92.3% vs 87.9% human ceiling)"
created: 2026-03-30
depends_on:
- "effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale"
---

# Long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing

Context and memory are structurally different, not points on the same spectrum. Context is stateless — all information arrives at once and is processed in a single pass. Memory is stateful — it accumulates incrementally, changes over time, and sometimes contradicts itself. A million-token context window is a million tokens of input capacity, not a million tokens of memory.

This distinction is validated by three independent architectural experiments that all moved away from context-as-memory toward purpose-built memory systems:

**ByteDance OpenViking** — a context database using a virtual filesystem protocol (viking://) where agents navigate context like a hard drive. Tiered loading (L0: 50-token abstract, L1: 500-token overview, L2: full document) reduces average token consumption per retrieval by 95% compared to traditional vector search. After ten sessions, reported accuracy improves 20-30% with no human intervention because the system extracts and persists what it learned.
|
||||

**Tsinghua/Alibaba MemPO** — reinforcement-learning-trained memory management where the agent learns three actions: summarize, reason, or act. The system discovers when to compress and what to retain. Result: 25% accuracy improvement with 73% fewer tokens. The advantage widens as complexity increases — at ten parallel objectives, hand-coded memory baselines collapse to near-zero while learned memory management holds.

**EverMemOS** — brain-inspired architecture where conversations become episodic traces (MemCells), traces consolidate into thematic patterns (MemScenes), and retrieval reconstructs context by navigating the scene graph. On the LoCoMo benchmark: 92.3% accuracy, exceeding the human ceiling of 87.9%. A memory architecture modeled on neuroscience outperformed human recall.

Bigger context windows create three failure modes that memory architectures avoid: **context poisoning** (incorrect information persists and becomes ground truth), **context distraction** (the model repeats past behavior instead of reasoning fresh), and **context confusion** (irrelevant material crowds out what matters).

## Challenges

The three memory architectures cited are each optimized for different use cases (filesystem navigation, RL-trained compression, conversational recall). No single system combines all three approaches. Additionally, conflict resolution remains universally broken — even the best memory system achieves only 6% accuracy on multi-hop conflict resolution (correcting a fact and propagating the correction through derived conclusions). The hardest memory problems are barely being studied: a 48-author survey found 75 of 194 papers study the simplest cell in the memory taxonomy (explicit factual recall), while parametric working memory has two papers.

---

Relevant Notes:
- [[effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale]] — if context windows are >99% ineffective for complex reasoning, memory architectures that bypass context limitations become essential
- [[user questions are an irreplaceable free energy signal for knowledge agents because they reveal functional uncertainty that model introspection cannot detect]] — memory enables learning from signals across sessions; without it, each question is answered in isolation

Topics:
- [[_map]]

@@ -0,0 +1,42 @@
---
type: claim
domain: ai-alignment
secondary_domains: [living-agents, collective-intelligence]
description: "Agent methodology follows a hardening trajectory — documentation (aspirational) → skill (reliable when invoked) → hook (structural guarantee) — but over-automation corrupts quality when hooks encode judgment rather than verification"
confidence: likely
source: "Cornelius (@molt_cornelius), 'Agentic Systems: The Determinism Boundary' + 'AI Field Report 1: The Harness Is the Product' + 'AI Field Report 3: The Safety Layer Nobody Built', X Articles, March 2026; independently validated by VS Code Agent Hooks, Codex hooks, Amazon Kiro hooks shipping in same period"
created: 2026-03-30
depends_on:
- "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
- "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
---

# Methodology hardens from documentation to skill to hook as understanding crystallizes and each transition moves behavior from probabilistic to deterministic enforcement

Agent methodology follows a three-stage hardening trajectory:

1. **Documentation** — Aspirational instructions the agent follows if it remembers. Natural language in context files, system prompts, rules. Subject to attention degradation and the 556:1 copy-to-contribution waste ratio.
2. **Skill** — Reliable when invoked, with quality gates built in. The methodology is encoded as a structured workflow the agent can execute, not just advice it may attend to. 2-3 skills per task is optimal; beyond that, attention dilution degrades performance.
3. **Hook** — Structural guarantee that fires on lifecycle events regardless of agent attention state. The behavior moves from the probabilistic to the deterministic side of the enforcement boundary.
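The three stages can be made concrete with one rule carried through each stage. This is a minimal sketch with invented names; it does not reproduce any platform's actual hook API:

```python
# One rule ("every claim needs a description") at each hardening stage.

# Stage 1 — documentation: prose the agent follows only if it attends to it.
GUIDANCE = "Every claim file must include a 'description' field."

# Stage 2 — skill: an invocable workflow with the quality gate built in,
# but the gate applies only when the skill is actually invoked.
def write_claim(claim: dict) -> dict:
    claim.setdefault("description", "")
    return claim

# Stage 3 — hook: fires on the lifecycle event whether or not the agent
# remembers, moving the rule to the deterministic side of the boundary.
HOOKS: dict[str, list] = {"pre_save": []}

def on(event: str):
    def register(fn):
        HOOKS[event].append(fn)
        return fn
    return register

@on("pre_save")
def require_description(claim: dict) -> None:
    if "description" not in claim:
        raise ValueError("blocked: claim has no description")

def save(claim: dict) -> dict:
    for fn in HOOKS["pre_save"]:  # runs every time, regardless of attention
        fn(claim)
    return claim
```

Note that the hook verifies (presence of a field, which two reviewers would always agree on) rather than judging (whether the description is good), which is exactly the over-automation test described below.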

Each transition represents a pattern that has been validated through use and is now understood well enough to be mechanized. The progression is not just about reliability — it is about encoding organizational learning into infrastructure that survives session resets and agent turnover.

The convergence validates the trajectory: Claude Code, VS Code, Cursor, Gemini CLI, LangChain, Strands Agents, and Amazon Kiro all independently adopted hooks within a single year. The documentation-to-hook progression is not a theoretical framework — it is the empirical trajectory the industry followed.

**The over-automation trap:** Every hook that works creates pressure to build more. The logic at each step is sound ("why leave this to agent attention when infrastructure can guarantee it?"), but the cumulative effect can shrink the agent's role to triggering operations that hooks validate, commit, and report. The most dangerous failure is not a missing hook but a hook that encodes judgment it cannot perform — keyword-matching connections that fill a graph with noise while metrics report perfect compliance. The practical test: would two skilled reviewers always agree on the hook's output? Schema validation passes this test. Connection relevance does not.

Friction is the signal through which systems discover structural failures. If hooks systematically eliminate friction, they also eliminate the perceptual channel that would reveal when over-automation has occurred.

## Challenges

The three-stage model assumes that understanding always moves in one direction (toward determinism). In practice, requirements change, and hooks that encoded valid methodology may become constraints when the methodology evolves. The refactoring cost of hooks is higher than documentation — reverting an over-automated hook requires understanding why it was built, which may not be documented. The model also assumes clear boundaries between the three stages, but in practice the transitions are gradual and the optimal enforcement level for any given behavior is context-dependent.

---

Relevant Notes:
- [[the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load]] — this claim describes the boundary; the hardening trajectory describes the *movement* of behaviors across that boundary over time
- [[context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching]] — the context-file-as-OS is where documentation-stage methodology lives and where the self-extension loop proposes promotions to skill or hook stage
- [[curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive]] — the hardening trajectory's skill stage is specifically about curated skills; auto-generated skills represent a different pathway that degrades performance

Topics:
- [[_map]]

@@ -0,0 +1,43 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Empirical evidence from Anthropic Code Review, LangChain GTM, and DeepMind scaling laws converges on three non-negotiable conditions for multi-agent value — without all three, single-agent baselines outperform"
confidence: likely
source: "Cornelius (@molt_cornelius), 'AI Field Report 2: The Orchestrator's Dilemma', X Article, March 2026; corroborated by Anthropic Code Review (16% → 54% substantive review), LangChain GTM (250% lead-to-opportunity), DeepMind scaling laws (Madaan et al.)"
created: 2026-03-30
depends_on:
- "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
- "79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success"
- "subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers"
---

# Multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value

The DeepMind scaling laws and production deployment data converge on three non-negotiable conditions for multi-agent coordination to outperform single-agent baselines:

1. **Natural parallelism** — The task decomposes into independent subtasks that can execute concurrently. If subtasks are sequential or interdependent, communication overhead fragments reasoning and degrades performance by 39-70%.
2. **Context overflow** — Individual subtasks exceed single-agent context capacity. If a single agent can hold the full context, adding agents introduces coordination cost with no compensating benefit.
3. **Adversarial verification value** — The task benefits from having the finding agent differ from the confirming agent. If verification adds nothing (the answer is obvious or binary), the additional agent is pure overhead.
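The all-or-nothing logic can be written down directly. A hypothetical checklist sketch (the task fields are invented for illustration); any non-empty result predicts the single-agent baseline wins:

```python
# Hypothetical gate for the three conditions. All three must hold
# simultaneously; the function reports which ones are missing.

def missing_conditions(task: dict) -> list[str]:
    checks = {
        "natural parallelism": task["subtasks_independent"],
        "context overflow": task["subtask_tokens"] > task["context_budget"],
        "adversarial verification value": task["finder_differs_from_confirmer"],
    }
    return [name for name, ok in checks.items() if not ok]

# A code-review-shaped task: independent files, oversized PR, and a
# separate confirming agent, so nothing is missing.
code_review = {
    "subtasks_independent": True,
    "subtask_tokens": 400_000,
    "context_budget": 200_000,
    "finder_differs_from_confirmer": True,
}
```

Flipping any single field to its failing value is the sketch's version of "when any condition is missing, the system underperforms."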

Two production systems demonstrate the pattern:

**Anthropic Code Review** — dispatches a team of agents to hunt for bugs in PRs, with separate agents confirming each finding before it reaches the developer. Substantive review went from 16% to 54% of PRs. The task meets all three conditions: PRs are naturally parallel (each file is independent), large PRs overflow single-agent context, and bug confirmation is an adversarial verification task (the finder should not confirm their own finding).

**LangChain GTM agent** — spawns one subagent per sales account, each with constrained tools and structured output schemas. 250% increase in lead-to-opportunity conversion. Each account is naturally independent, each exceeds single context, and the parent validates without executing.

When any condition is missing, the system underperforms. DeepMind's data shows multi-agent coordination averaging -3.5% across general configurations — the specific configurations that work are narrow. Practitioners who keep the orchestration pattern but substitute a human orchestrator (manually decomposing and dispatching) avoid the core failure: an automated orchestrator cannot reliably assess whether the three conditions are met.

## Challenges

The three conditions are stated as binary (present/absent) but in practice exist on continuums. A task may have *some* natural parallelism but not enough to justify the coordination overhead. The threshold for "enough" depends on agent capability, which is improving — the window where coordination adds value is actively shrinking as single-agent accuracy improves (the baseline paradox: below 45% single-agent accuracy, coordination helps; above, it hurts). This means the claim's practical utility may decrease over time as models improve.

---

Relevant Notes:
- [[multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows]] — provides the quantitative basis: +81% on parallelizable (condition 1 met), -39% to -70% on sequential (condition 1 violated)
- [[79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success]] — when condition 1 is met but decomposition quality is poor, the MAST study's 79% failure rate applies; the three conditions are necessary but not sufficient
- [[subagent hierarchies outperform peer multi-agent architectures in practice because deployed systems consistently converge on one primary agent controlling specialized helpers]] — hierarchies succeed because they naturally enforce condition 3 (orchestrator validates, workers execute)

Topics:
- [[_map]]

@@ -34,6 +34,14 @@ A predictive model achieves R-squared=0.513 and correctly identifies the optimal
- Error amplification measured at 4.4x (centralized) to 17.2x (independent)
- Predictive model with 87% accuracy on unseen configurations

## Design Principle (enrichment from Cornelius Field Reports, March 2026)

The empirical findings above are not just descriptive — they are prescriptive design principles. Cornelius's field reports synthesize the DeepMind data with production deployments (Anthropic Code Review, LangChain GTM, Puppeteer NeurIPS 2025) to derive three conditions that must hold simultaneously for multi-agent coordination to outperform single-agent baselines: (1) natural parallelism, (2) context overflow, and (3) adversarial verification value. When any condition is missing, the -3.5% average degradation applies.

The MAST study (1,642 execution traces, 7 production systems) explains *why* failures occur: 79% of multi-agent failures originate from specification and coordination issues, not implementation. The decomposition was wrong before any agent executed. The hardest inter-agent failures (information withholding, ignoring other agents' input) resist protocol-level fixes because they require social reasoning that communication protocols cannot provide.

Practitioner convergence validates this: multiple independent teams discovered that keeping the orchestration pattern but replacing the automated orchestrator with a human (manually decomposing and dispatching) avoids the failure modes while preserving the parallelization benefits. The distinction between orchestration as a design principle and the orchestrator as an agent is where the field is moving.

## Challenges

The benchmarks are all task-completion oriented (find answers, plan actions, use tools). Knowledge synthesis tasks — where the goal is to integrate diverse perspectives rather than execute a plan — may behave differently. The collective intelligence literature suggests that diversity provides more value in synthesis than in execution, which could shift the baseline paradox threshold upward for knowledge work. This remains untested.

@@ -0,0 +1,40 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence, living-agents]
description: "Notes are not records to retrieve but capabilities to install — a vault of sentence-titled claims is a codebase of callable arguments where each wiki link is a function call and loading determines what the agent can think"
confidence: likely
source: "Cornelius (@molt_cornelius), 'Agentic Note-Taking 11: Notes Are Function Calls' + 'Agentic Note-Taking 18: Notes Are Software', X Articles, Feb 2026; corroborated by Matuschak's evergreen note principles"
created: 2026-03-30
depends_on:
- "as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems"
---

# Notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it

When an AI agent loads a note into its context window, the note does not merely inform — it enables. A note about spreading activation enables the agent to reason about graph traversal in ways unavailable before loading. This is not retrieval. It is installation.

The architectural parallel is exact: skills in agent platforms are curated knowledge loaded based on context that enables operations the agent cannot perform without them. Notes follow the same pattern — curated knowledge, injected when relevant, enabling capabilities. The loading mechanism, the progressive disclosure (scanning titles before committing to full content), and the context window constraint that makes selective loading necessary are all identical.

This reframes note quality from aesthetics to correctness:

- **Title as API signature:** A sentence-form title ("structure enables navigation without reading everything") carries a semantic payload that works in any invocation context. A topic label ("knowledge management") carries nothing. The title determines whether the note is composable.
- **Wiki links as function calls:** `since [[claims must be specific enough to be wrong]]` invokes a note by name, and the sentence-form title returns meaning directly into the prose without requiring the full note to load. Traversal becomes reasoning — each link is a step in an argument.
- **Vault as runtime:** The agent's cognition executes within the vault, not against it. What gets loaded determines what the agent can think. The bottleneck is never processing power — it is always what got loaded.
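The function-call framing can be sketched mechanically. A hypothetical resolver (the vault contents and function names are invented) treats each `[[sentence-form title]]` as a call site whose title is the return value, with the body loaded only on demand:

```python
import re

# Hypothetical sketch: [[wiki links]] resolved as function calls. The
# sentence-form title carries the meaning; the body loads lazily.

VAULT = {
    "claims must be specific enough to be wrong":
        "Full body of the claim, loaded only when validation is needed...",
}

LINK = re.compile(r"\[\[([^\]]+)\]\]")

def invoke(prose: str) -> list[str]:
    """Each link is a call site; the title alone returns meaning into prose."""
    return LINK.findall(prose)

def load_body(title: str) -> str:
    """Deferred load — the equivalent of stepping into the function."""
    return VAULT[title]

calls = invoke("since [[claims must be specific enough to be wrong]], vague titles fail")
```

A topic-label title ("knowledge management") would pass through the same machinery but return no usable meaning at the call site, which is the composability argument in miniature.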

This has a testable implication: the same base model with different vaults produces different reasoning, different conclusions, different capabilities. External memory shapes cognition more than the base model. A vault of 300 well-titled claims can be traversed by reading titles alone, composing arguments by linking claims, and loading bodies only for validation. Without sentence-form titles, every note must be fully loaded to understand what it argues.

Cornelius reports that a plain curated filesystem outperforms purpose-built vector infrastructure on memory tasks, though the specific benchmark is not identified by name. If validated, this supports the claim that curation matters more than the retrieval mechanism.

## Challenges

The function-call metaphor breaks for ideas that resist compression into single declarative sentences. Relational, procedural, or emergently complex insights distort when forced into API-signature form. Additionally, sentence-form titles create a maintenance cost: renaming a heavily-linked note (the equivalent of refactoring a widely-called function) requires rewriting every invocation site. The most useful notes have the highest refactoring cost. And the circularity problem is fundamental: an agent that evaluates note quality using cognition shaped by those same notes cannot step outside the runtime to inspect it objectively.

---

Relevant Notes:
- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — this claim provides the mechanism: knowledge graphs are "critical input" specifically because notes are executable capabilities, not passive records
- [[a creator's accumulated knowledge graph not content library is the defensible moat in AI-abundant content markets]] — the moat is the callable argument library, not the content volume; quality of titles (API signatures) determines moat strength

Topics:
- [[_map]]

@@ -0,0 +1,39 @@
---
type: claim
domain: ai-alignment
secondary_domains: [living-agents]
description: "Codified Context study tracked a 108K-line production system where memory infrastructure consumed 24% of the codebase across three tiers — hot constitution, 19 domain-expert agents, and 34 cold-storage specs — with memory emerging from debugging pain not planning"
confidence: likely
source: "Codified Context study (arXiv:2602.20478), cited in Cornelius (@molt_cornelius) 'AI Field Report 4: Context Is Not Memory', X Article, March 2026"
created: 2026-03-30
depends_on:
- "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
- "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
---

# Production agent memory infrastructure consumed 24 percent of codebase in one tracked system suggesting memory requires dedicated engineering not a single configuration file

The Codified Context study (arXiv:2602.20478) tracked what happened when someone actually scaled agent memory to production complexity. A developer with a chemistry background — not software engineering — built a 108,000-line real-time multiplayer game across 283 sessions using a three-tier memory architecture.

**Tier 1 — Hot constitution:** A single markdown file loaded into every session. Code standards, naming conventions, known failure modes, routing table. About 660 lines. This is what most people think of as "agent memory."

**Tier 2 — Domain-expert agents:** 19 specialized agents, each carrying its own memory. A network protocol designer with 915 lines of sync and determinism knowledge. A coordinate wizard for isometric transforms. A code reviewer trained on the project's ECS patterns. Over 65% of content is domain knowledge (formulas, code patterns, symptom-cause-fix tables), not behavioral instructions. These are knowledge-bearing agents, not instruction-following agents.

**Tier 3 — Cold-storage knowledge base:** 34 specification documents (save system persistence rules, UI sync routing patterns, dungeon generation formulas) retrieved on demand through an MCP server.
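The three-tier layout reduces to a simple assembly rule. A hypothetical sketch (file names and the `assemble_context` function are invented; the study's actual MCP server is not reproduced here):

```python
# Hot tier always loads; warm tier loads by domain; cold tier only on demand.

MEMORY = {
    "hot": ["constitution.md"],                        # every session
    "warm": {                                          # domain-expert agents
        "networking": "network-protocol-designer.md",
        "rendering": "coordinate-wizard.md",
    },
    "cold": {                                          # specs fetched on demand
        "save-system": "specs/save-system-persistence.md",
    },
}

def assemble_context(domain: str, needed_specs: list[str]) -> list[str]:
    """Build a session's context from the three tiers."""
    context = list(MEMORY["hot"])
    if domain in MEMORY["warm"]:
        context.append(MEMORY["warm"][domain])
    context += [MEMORY["cold"][s] for s in needed_specs if s in MEMORY["cold"]]
    return context
```

The point of the rule is economic: the 660-line constitution is the only always-paid cost, while the other 25,500 lines of memory infrastructure load selectively.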

Total memory infrastructure: 26,200 lines — 24% of the codebase. The save system spec was referenced across 74 sessions and 12 agent conversations with zero save-related bugs in four weeks. When a new networked UI feature was needed, the agent built it correctly on first attempt because routing patterns were already in memory from a different feature six weeks earlier.

The creation heuristic is the most important finding: "If debugging a particular domain consumed an extended session without resolution, it was faster to create a specialized agent and restart." Memory infrastructure did not emerge from planning. It emerged from pain.

## Challenges

This is a single case study from one project type (game development). Whether the 24% ratio generalizes to other domains (web applications, data pipelines, infrastructure code) is unknown. The developer's chemistry background may have made them more receptive to systematic documentation than typical software engineers. Additionally, the 283-session count suggests significant human investment in memory curation — whether this scales or creates its own maintenance burden at larger codebase sizes is untested.

---

Relevant Notes:
- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — the Codified Context system is a production implementation of the context-is-not-memory principle: three tiers of persistent, evolving memory infrastructure rather than larger context windows
- [[context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching]] — the hot constitution (Tier 1) IS a self-referential context file; the domain-expert agents (Tier 2) are the specialized extensions it teaches the system to create

Topics:
- [[_map]]

@@ -0,0 +1,33 @@
---
type: claim
domain: ai-alignment
description: "MemPO achieves 25% accuracy improvement with 73% fewer tokens by learning three actions (summarize, reason, act) through RL — at 10 parallel objectives hand-coded baselines collapse while trained memory holds"
confidence: experimental
source: "MemPO (Tsinghua and Alibaba, arXiv:2603.00680), cited in Cornelius (@molt_cornelius) 'AI Field Report 4: Context Is Not Memory', X Article, March 2026"
created: 2026-03-30
depends_on:
- "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
---

# Reinforcement learning trained memory management outperforms hand-coded heuristics because the agent learns when compression is safe and the advantage widens with complexity

MemPO (Tsinghua and Alibaba, arXiv:2603.00680) demonstrates that agents can learn to manage their own memory better than any rule-based system. The agent has three actions available at every step: summarize what matters from prior steps, reason internally, or act in the world. Through reinforcement learning, the system discovers when to compress and what to retain.
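The three-action step loop can be sketched to show the control flow. This is not MemPO's training code; the learned policy is replaced with a hand-written stub precisely so the contrast with rule-based management is visible:

```python
import random

# Hypothetical sketch of the summarize/reason/act loop. In MemPO the
# policy is learned with RL; here it is a stub standing in for it.

ACTIONS = ("summarize", "reason", "act")

def policy(memory_tokens: int, budget: int) -> str:
    """Stand-in for the learned policy: compress when memory nears budget."""
    if memory_tokens > 0.8 * budget:
        return "summarize"
    return random.choice(("reason", "act"))

def step(memory: list[str], budget: int = 1000) -> list[str]:
    used = sum(len(m.split()) for m in memory)  # crude token count
    action = policy(used, budget)
    if action == "summarize":
        memory = [" ".join(memory)[: budget // 2]]  # crude compression
    # "reason" and "act" would consult or affect the environment here
    return memory
```

The claim's argument is that a hand-written `policy` like this one encodes fixed assumptions about task structure, which is exactly what the RL-trained version learns to replace.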

Results: 25% accuracy improvement over hand-coded memory heuristics, with 73% fewer tokens consumed. The advantage is not marginal — it grows with task complexity. At ten parallel objectives, hand-coded baselines collapse to near-zero performance while trained memory management holds.

This finding has a specific architectural implication: the optimal memory management strategy is not specifiable in advance. Hand-coded rules for when to compress, what to retain, and when to act encode assumptions about task structure that break under novel complexity. RL-trained management discovers task-specific strategies that no rule author anticipated.

The pattern extends beyond memory. MemPO is an instance of a general principle: learned policies outperform hand-coded heuristics in domains where the optimal strategy depends on context that cannot be fully specified in rules. Memory management is such a domain because the value of a piece of information depends on future task demands that are unknown at compression time.

## Challenges

MemPO was tested on specific benchmark tasks. Generalization to open-ended, real-world agent workflows (where task objectives shift dynamically) is undemonstrated. Additionally, the RL training requires a well-defined reward signal — in production settings where "good memory management" is hard to define quantitatively, the training loop may not converge. The 25% improvement is relative to specific hand-coded baselines; better-engineered baselines might narrow the gap.

---

Relevant Notes:
- [[long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing]] — MemPO is a direct implementation of the context-is-not-memory principle: instead of expanding context, build a memory system that learns what to retain
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — MemPO is self-improvement applied to memory management specifically; the RL training loop IS structurally separated evaluation driving generation improvement

Topics:
- [[_map]]

@@ -0,0 +1,42 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Agent behavior splits into two categories — deterministic enforcement via hooks (100% compliance) and probabilistic guidance via instructions (~70% compliance) — and the gap is a category difference not a performance difference"
confidence: likely
source: "Cornelius (@molt_cornelius), 'Agentic Systems: The Determinism Boundary' + 'AI Field Report 1' + 'AI Field Report 3', X Articles, March 2026; corroborated by BharukaShraddha (70% vs 100% measurement), HumanLayer (150-instruction ceiling), ETH Zurich AGENTbench, NIST agent safety framework"
created: 2026-03-30
depends_on:
- "iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation"
challenged_by:
- "AI integration follows an inverted-U where economic incentives systematically push organizations past the optimal human-AI ratio"
---

# The determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load

Agent systems exhibit a categorical split in behavior enforcement. Instructions — natural language directives in context files, system prompts, and rules — follow probabilistic compliance that degrades under load. Hooks — lifecycle scripts that fire on system events — enforce deterministically regardless of context state.

The quantitative evidence converges from multiple sources:

- **BharukaShraddha's measurement:** Rules in CLAUDE.md are followed ~70% of the time; hooks are enforced 100% of the time. The gap is not a performance difference — it is a category difference between probabilistic and deterministic enforcement.
- **HumanLayer's analysis:** Frontier thinking models follow approximately 150-200 instructions before compliance decays linearly. Smaller models decay exponentially. Claude Code's built-in system prompt already consumes ~50 instructions before user configuration loads.
- **ETH Zurich AGENTbench:** Repository-level context files *reduce* task success rates compared to no context file, while increasing inference costs by 20%. Instructions are not merely unreliable — they can be actively counterproductive.
- **Augment Code:** A 556:1 copy-to-contribution ratio in typical agent sessions — for every 556 tokens loaded into context, one meaningfully influences output.
- **NIST:** Published design requirement for "at least one deterministic enforcement layer whose policy evaluation does not rely on LLM reasoning."
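A deterministic enforcement layer in the NIST sense is small enough to sketch. This hypothetical pre-tool-use check (the command denylist and function name are invented for illustration) evaluates policy with no model in the loop, so it fires identically at any context load:

```python
import shlex

# Hypothetical deterministic gate: fires before every shell tool call.
# Policy evaluation is plain string logic, not LLM reasoning, so
# compliance is 100% by construction.

BLOCKED_COMMANDS = {"rm", "dd", "mkfs", "shutdown"}

def pre_tool_use(shell_command: str) -> bool:
    """Return True to allow the tool call, False to block it."""
    tokens = shlex.split(shell_command)
    return bool(tokens) and tokens[0] not in BLOCKED_COMMANDS
```

The equivalent instruction ("never run destructive commands") sits on the probabilistic side of the boundary: it competes for attention with every other token in context.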

The mechanism is structural: instructions require executive attention from the model, and executive attention degrades under context pressure. Hooks fire on lifecycle events (file write, tool use, session start) regardless of the model's attentional state. This parallels the biological distinction between habits (basal ganglia, automatic) and deliberate behavior (prefrontal cortex, capacity-limited).

The convergence is independently validated: Claude Code, VS Code, Cursor, Gemini CLI, LangChain, and Strands Agents all adopted hooks within a single year. The pattern was not coordinated — every platform building production agents independently discovered the same need.

## Challenges

The boundary itself is not binary but a spectrum. Cornelius identifies four hook types spanning from fully deterministic (shell commands) to increasingly probabilistic (HTTP hooks, prompt hooks, agent hooks). The cleanest version of the determinism boundary applies only to the shell-command layer. Additionally, over-automation creates its own failure mode: hooks that encode judgment rather than verification (e.g., keyword-matching connections) produce noise that looks like compliance on metrics. The practical test is whether two skilled reviewers would always agree on the hook's output.

---

Relevant Notes:
- [[iterative agent self-improvement produces compounding capability gains when evaluation is structurally separated from generation]] — the determinism boundary is the mechanism by which evaluation separation is enforced: hooks guarantee the separation, instructions merely suggest it
- [[coding agents cannot take accountability for mistakes which means humans must retain decision authority over security and critical systems regardless of agent capability]] — the determinism boundary provides a structural mechanism for retaining decision authority through hooks on destructive operations

Topics:
- [[_map]]
|
||||
|
|
@@ -0,0 +1,34 @@
---
type: claim
domain: ai-alignment
secondary_domains: [collective-intelligence]
description: "Abstract terminology in knowledge system schemas forces a cognitive translation on every interaction, and this accumulated friction — not architectural failure — is the primary cause of system abandonment; domain-native vocabulary eliminates the tax"
confidence: likely
source: "Cornelius (@molt_cornelius), 'Agentic Note-Taking 16: Vocabulary Is Architecture', X Article, Feb 2026"
created: 2026-03-30
---

# Vocabulary is architecture because domain-native schema terms eliminate the per-interaction translation tax that causes knowledge system abandonment
Most knowledge systems use abstract terminology — "notes," "tags," "categories," "items," "antecedent_conditions." Every abstract term forces a translation step on every interaction: a therapist reads "antecedent_conditions," translates it to "triggers," thinks about what to write, then translates back into the system's language. Multiply this by hundreds of entries and the cognitive tax becomes the dominant experience of using the tool.

This is why most knowledge systems get abandoned: not because the architecture fails, but because the language is wrong.

The underlying architecture is genuinely universal: every knowledge domain shares a four-phase processing skeleton — capture, process, connect, verify. A researcher captures source material, extracts claims, links them to existing claims, and verifies descriptions. A therapist captures session notes, surfaces patterns, connects them to prior sessions, and reviews accuracy. The skeleton is identical. But the process step (where the actual intellectual work happens) differs completely in each case, and the vocabulary wrapping each phase must match the domain, not the builder.
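A minimal sketch of that separation, with illustrative phase labels and domain vocabularies (none of these names come from the article's actual system):

```python
# One universal skeleton; domain-native vocabulary wraps each phase.
SKELETON = ("capture", "process", "connect", "verify")

VOCAB = {
    "research": {
        "capture": "collect source material",
        "process": "extract claims",
        "connect": "link to existing claims",
        "verify": "check descriptions",
    },
    "therapy": {
        "capture": "record session notes",
        "process": "surface patterns",
        "connect": "relate to prior sessions",
        "verify": "review accuracy",
    },
}

def workflow(domain: str) -> list[str]:
    # The structure is shared; only the words the practitioner sees change.
    return [VOCAB[domain][phase] for phase in SKELETON]

assert len(workflow("research")) == len(workflow("therapy")) == 4
```

Note what the sketch deliberately leaves out: the `process` phase here only renames an operation, which is exactly the cosmetic-adaptation risk raised under Challenges below.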
The design implication is derivation rather than configuration: vocabulary should be derived from conversation about how the practitioner actually works, not selected from a dropdown of presets. Domain-native terms require semantic mapping (not find-and-replace) because concepts may differ in scope even when they occupy the same structural role.

For multi-domain systems, the architecture composes through isolation at the template layer and unity at the graph layer. Each domain gets its own vocabulary and processing logic; underneath, all notes share one graph connected by wiki links. Cross-domain connections emerge precisely because the shared graph bridges vocabularies that would otherwise never meet.
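A sketch of that composition, assuming notes are keyed by title and connected by wiki-link targets (the note titles and helper below are hypothetical):

```python
# Isolation at the template layer, unity at the graph layer: every domain's
# notes land in the same link graph, so cross-domain bridges can form.
graph: dict[str, set[str]] = {}

def add_note(title: str, links: set[str]) -> None:
    graph.setdefault(title, set()).update(links)
    for target in links:
        graph.setdefault(target, set())  # link targets exist as nodes too

# A research note and a therapy note share no vocabulary, but both cite the
# same concept, and the shared graph makes that bridge visible.
add_note("attention budgets in agent harnesses", {"working memory"})
add_note("session 2026-03-12 pattern review", {"working memory"})

bridge = graph["attention budgets in agent harnesses"] & graph["session 2026-03-12 pattern review"]
assert bridge == {"working memory"}
```

The design choice is that nothing domain-specific lives in `graph` itself; templates can diverge arbitrarily without fragmenting the link structure.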
## Challenges

The deepest question is whether vocabulary transformation changes how the agent *thinks* or merely how it *labels*. If renaming "claim extraction" to "insight extraction" runs the same decomposition logic under a friendlier name, the vocabulary change is cosmetic — the system speaks therapy wearing a researcher's coat. Genuine domain adaptation may require not just different words but different operations, and the line between vocabulary that guides the agent toward the right operations and vocabulary that merely decorates the wrong ones has not been clearly established.

---

Relevant Notes:

- [[as AI-automated software development becomes certain the bottleneck shifts from building capacity to knowing what to build making structured knowledge graphs the critical input to autonomous systems]] — knowledge graphs as input to autonomous systems work only if the agent can navigate them without constant translation; domain-native vocabulary is the interface quality that determines usability
- [[notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it]] — if notes are executable skills, their titles must use vocabulary the agent (and practitioner) actually reason in; abstract titles are undocumented APIs

Topics:

- [[_map]]
inbox/archive/2026-02-10-cornelius-agentic-note-taking-08.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "Agentic Note-Taking 08: Context Files as Operating Systems"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2021321848068141516
date: 2026-02-10
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Self-referential context files, software vs firmware distinction, platform construction knowledge requirement."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted: []
enrichments:
  - "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
---
inbox/archive/2026-02-14-cornelius-agentic-note-taking-11.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "Agentic Note-Taking 11: Notes are Function Calls"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2022484697188601859
date: 2026-02-14
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Notes as executable function calls, title-as-API-signature, vault-as-codebase."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it"
enrichments: []
---
inbox/archive/2026-02-17-cornelius-agentic-note-taking-14.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "Agentic Note-Taking 14: The Configuration Space"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2023588938925949270
date: 2026-02-17
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Methodology traditions as configuration space coordinates, 8 dimensions, cascade constraints, Eurorack composability principle."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted: []
enrichments:
  - "vocabulary is architecture because domain-native schema terms eliminate the per-interaction translation tax that causes knowledge system abandonment"
---
inbox/archive/2026-02-18-cornelius-agentic-note-taking-16.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "Agentic Note-Taking 16: Vocabulary Is Architecture"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2024172903109906865
date: 2026-02-18
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Domain-native vocabulary, four-phase processing skeleton, derivation vs configuration, multi-domain composition."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "vocabulary is architecture because domain-native schema terms eliminate the per-interaction translation tax that causes knowledge system abandonment"
enrichments: []
---
inbox/archive/2026-02-20-cornelius-agentic-note-taking-18.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "Agentic Note-Taking 18: Notes Are Software"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2024984401575375285
date: 2026-02-20
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Notes as capabilities (not records), vault as runtime, identity as running software, quality as correctness."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted: []
enrichments:
  - "notes function as executable skills for AI agents because loading a well-titled claim into context enables reasoning the agent could not perform without it"
---
inbox/archive/2026-03-11-cornelius-determinism-boundary.md (19 lines, Normal file)

@@ -0,0 +1,19 @@
---
type: source
title: "Agentic Systems: The Determinism Boundary"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2031823224770793687
date: 2026-03-11
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction from Cornelius/arscontexta articles. Covers determinism boundary in agent systems — the categorical split between hook enforcement (deterministic) and instruction compliance (probabilistic). Feeds engineering acceleration work and CI gate design."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "the determinism boundary separates guaranteed agent behavior from probabilistic compliance because hooks enforce structurally while instructions degrade under context load"
  - "methodology hardens from documentation to skill to hook as understanding crystallizes and each transition moves behavior from probabilistic to deterministic enforcement"
enrichments: []
---
inbox/archive/2026-03-13-cornelius-field-report-1-harness.md (20 lines, Normal file)

@@ -0,0 +1,20 @@
---
type: source
title: "AI Field Report 1: The Harness Is the Product"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2032501025123291515
date: 2026-03-13
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. First published harness architecture documentation (OpenDev 81-page report). Scaffolding vs harness distinction, context engineering limits, model commoditization thesis."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "harness engineering supersedes context engineering as the primary agent capability determinant because the runtime orchestration layer not the token state determines what agents can do"
  - "effective context window capacity falls more than 99 percent short of advertised maximum across all tested models because complex reasoning degrades catastrophically with scale"
  - "context files function as agent operating systems through self-referential self-extension where the file teaches modification of the file that contains the teaching"
enrichments: []
---
@@ -0,0 +1,20 @@
---
type: source
title: "AI Field Report 2: The Orchestrator's Dilemma"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2032926249534795847
date: 2026-03-14
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Multi-agent scaling laws, compound failure math, orchestrator design patterns. DeepMind data + MAST study + production deployment evidence."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "79 percent of multi-agent failures originate from specification and coordination not implementation because decomposition quality is the primary determinant of system success"
  - "multi-agent coordination delivers value only when three conditions hold simultaneously natural parallelism context overflow and adversarial verification value"
enrichments:
  - "multi-agent coordination improves parallel task performance but degrades sequential reasoning because communication overhead fragments linear workflows"
---
inbox/archive/2026-03-15-cornelius-field-report-3-safety.md (18 lines, Normal file)

@@ -0,0 +1,18 @@
---
type: source
title: "AI Field Report 3: The Safety Layer Nobody Built"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2033306335341695066
date: 2026-03-15
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Permission model failure, approval fatigue, sudo coding culture, structural safety convergence. Quantitative data from Anthropic 998K tool calls, DryRun Security, Carnegie Mellon SUSVIBES."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "approval fatigue drives agent architecture toward structural safety because humans cannot meaningfully evaluate 100 permission requests per hour"
enrichments: []
---
@@ -0,0 +1,20 @@
---
type: source
title: "AI Field Report 4: Context Is Not Memory"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2033603721376981351
date: 2026-03-16
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Context vs memory distinction, tiered memory architectures (OpenViking, MemPO, EverMemOS), Codified Context production case study, conflict resolution failure (6% accuracy)."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "long context is not memory because memory requires incremental knowledge accumulation and stateful change not stateless input processing"
  - "reinforcement learning trained memory management outperforms hand-coded heuristics because the agent learns when compression is safe and the advantage widens with complexity"
  - "production agent memory requires dedicated infrastructure at 20-25 percent of codebase not a single configuration file"
enrichments: []
---
@@ -0,0 +1,18 @@
---
type: source
title: "AI Field Report 5: Process Is Memory"
author: "Cornelius (@molt_cornelius)"
url: https://x.com/molt_cornelius/status/2034065080321515582
date: 2026-03-18
domain: ai-alignment
intake_tier: research-task
rationale: "Batch extraction. Curated vs auto-generated skills, minimalist vs maximalist harness debate, process-as-organizational-memory, skill scaling walls."
proposed_by: Leo
format: essay
status: processed
processed_by: theseus
processed_date: 2026-03-30
claims_extracted:
  - "curated skills improve agent task performance by 16 percentage points while self-generated skills degrade it by 1.3 points because curation encodes domain judgment that models cannot self-derive"
enrichments: []
---