Executive Summary
The primary bottleneck in autonomous software development is not model intelligence, but context management and architectural determinism. Current “Agentic” approaches fail at scale because they rely on probabilistic guidance (prompts) for deterministic engineering tasks (builds, security, state management). Furthermore, the linear cost of token consumption versus the non-linear degradation of model attention creates a “Context Trap” that prevents complex multi-phase execution.
This paper details the architecture of the Praetorian Development Platform, which solves these problems by treating the Large Language Model (LLM) not as a chatbot, but as a nondeterministic kernel process wrapped in a deterministic runtime environment. We present a five-layer architecture that enforces strict separation of concerns, enables linear scaling of complexity, and achieves “escape velocity”—where the AI system contributes net-positive value to the development lifecycle.
1. The Core Problem: The Context-Capability Paradox
Anthropic’s research and our internal telemetry confirm that token usage alone explains 80% of performance variance in agent tasks. This creates a fundamental paradox:
1. To handle complex tasks, agents need comprehensive instructions (skills).
2. Comprehensive instructions consume the context window.
3. Consumed context reduces the model's ability to reason about the actual task.
```mermaid
graph TD
    subgraph "Legacy: The Monolith"
        M[Monolithic Agent]
        M -->|Contains| I[All Instructions]
        M -->|Contains| T[All Tools]
        M -->|Contains| S[Full State]
        M -->|Result| C[Context Overflow]
    end
    subgraph "Platform: Distributed Architecture"
        Orch[Orchestrator Skill] -->|Spawns| Worker[Specialized Agent]
        Worker -->|JIT Load| Skill[Skill Library]
        Worker -->|JIT Load| Tool[MCP Tools]
        Hook[Deterministic Hooks] -->|Enforces| Loop[Validation Loop]
        Loop -->|Gates| Worker
        Worker -->|Output| Artifact[Structured State]
    end
```
Early iterations of our platform utilized “Monolithic Agents” with 1,200+ line agent bodies. These agents suffered from Attention Dilution (ignoring instructions late in the prompt) and Context Starvation (insufficient space for code analysis).
1.1 The Solution: Inverting the Control Structure
We moved from a “Thick Agent” model to a “Thin Agent / Fat Platform” architecture.
- Agents are reduced to stateless, ephemeral workers (<150 lines).
- Skills hold the knowledge, loaded strictly on-demand (Just-in-Time).
- Hooks provide the enforcement, operating outside the LLM's context.
- Orchestration manages the lifecycle of specialized roles.
2. Agent Architecture: The "Thin Agent" Pattern
2.1 Architectural Constraints
The architecture is defined by one hard constraint in the Claude Code runtime: Sub-agents cannot spawn other sub-agents. This prevents infinite recursion but necessitates a flat, “Leaf Node” execution model.
2.2 The "Thin Agent" Specification
Agents are specialized workers that execute specific tasks and return results. They do not manage state or coordinate workflows.
Gold Standard Specification:
- Line Count: Strictly <150 lines.
- Discovery Cost: ~500-1,000 characters (visible to the orchestrator).
- Execution Cost: ~2,700 tokens per spawn (down from ~24,000 in early versions).
2.3 Sub-Agent Isolation
```mermaid
graph LR
    User["Orchestrator Skill"] -->|Task Tool| Spawn["Spawn Sub-Agent"]
    Spawn -->|Load| Context["Clean Context Window"]
    Context -->|Read| Gateway["Gateway Skill"]
    Gateway -->|Route| Library["Library Skills (On-Demand)"]
    Context -->|Execute| Work["Task Execution"]
    Work -->|Output| Result["Structured JSON Return"]
    Result -->|Destroy| Context
```
Every agent spawn creates a fresh instance with zero shared history from previous siblings. This solves “Context Drift” where agents confuse current requirements with past attempts. The parent orchestrator selectively injects only the necessary context (e.g., the Architecture Plan) into the prompt.
3. Skill Architecture: Two-Tier Progressive Loading
To bypass the hard limit of ~15,000 characters for skill definitions and preserve as much context as possible for thinking and execution, we implemented a two-tier (Librarian Pattern) file system architecture for agent capabilities.
3.1 Tier 1: Core Skills (The "BIOS")
- Location: `.claude/skills/`
- Count: 49 high-frequency skills.
- Mechanism: Registered with the LLM as executable tools.
- Purpose: Fundamental workflows (`debugging-systematically`, `developing-with-tdd`) and Gateways.
3.2 Tier 2: Library Skills (The "Hard Drive")
- Location: `.claude/skill-library/`
- Count: 304+ specialized skills.
- Mechanism: Invisible to the LLM until explicitly loaded via `Read()`.
- Purpose: Deep domain knowledge (e.g., `optimizing-react-performance`, `configuring-aws-lambda`).
3.3 The Gateway Pattern (The Router)
Agents do not hardcode library paths. They invoke a Gateway Skill (e.g., gateway-frontend), which acts as a dynamic router based on intent detection.
Architecture Diagram: Gateway Routing
```
Agent: "I need to fix a React infinite loop"
        │
        ▼
Invokes: gateway-frontend
        │
        ├─► Detects Intent: "infinite loop", "useEffect"
        │
        ├─► Matches Routing Pattern:
        │     • preventing-react-hook-infinite-loops
        │     • debugging-react-infinite-loops
        │
        └─► Output: "Read these specific file paths..."
```
This implements Intent-Based Context Loading, ensuring agents only load the specific patterns relevant to their current task rather than the entire domain knowledge base.
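To make the routing concrete, here is a minimal TypeScript sketch of the gateway's decision logic, assuming a hypothetical `routeIntent` helper and illustrative keyword/path mappings; the production gateways are skill documents, so this only mirrors their logic rather than reproducing an actual artifact.

```typescript
// Hypothetical sketch of gateway routing (keywords and paths are illustrative).
type Route = { keywords: string[]; skills: string[] };

const frontendRoutes: Route[] = [
  {
    keywords: ["infinite loop", "useEffect", "re-render"],
    skills: [
      ".claude/skill-library/preventing-react-hook-infinite-loops/SKILL.md",
      ".claude/skill-library/debugging-react-infinite-loops/SKILL.md",
    ],
  },
  // ...additional routing patterns per domain intent
];

// Returns only the file paths the agent should Read() for this specific task.
function routeIntent(request: string, routes: Route[]): string[] {
  const text = request.toLowerCase();
  const matches = routes.filter((r) => r.keywords.some((k) => text.includes(k)));
  return [...new Set(matches.flatMap((r) => r.skills))];
}

console.log(routeIntent("I need to fix a React infinite loop", frontendRoutes));
```

The agent then reads only the returned paths, keeping the rest of the library invisible to its context window.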
4. Orchestration Architecture: The Kernel
Since agents cannot spawn agents, orchestration must occur in the Main Conversation Thread. We treat the Main Thread as the “Kernel Mode” process, and sub-agents as “User Mode” processes.
4.1 The Orchestrator-Worker Pattern
The Orchestrator is a Skill (e.g., orchestrating-feature-development) running in the main thread. It holds the global state machine.
The Tool Restriction Boundary:
- Orchestrator (Main Thread): Has `Task`, `TodoWrite`, `Read`. NO `Edit` or `Write`.
  - Constraint Enforcement: It physically cannot write code. It must delegate to a worker.
- Worker (Sub-Agent): Has `Edit`, `Write`, `Bash`. NO `Task`.
  - Constraint Enforcement: It physically cannot delegate. It must work.
4.2 Coordinator vs. Executor Models
We enforce a strict separation between agents that plan and agents that do. This manifests in two mutually exclusive execution models:
| Model | Skill | Role | Tools | Best For |
|---|---|---|---|---|
| Coordinator | | Spawns specialists | `Task`, `TodoWrite`, `Read` (no `Edit`/`Write`) | Complex, multi-phase features requiring parallelization. |
| Executor | | Implements directly | `Edit`, `Write`, `Bash` (no `Task`) | Tightly coupled tasks requiring frequent human oversight. |
Key Insight: An agent cannot be both. If it has the Task tool (Coordinator), it is stripped of Edit permissions to prevent “doing it yourself.” If it has Edit permissions (Executor), it is stripped of Task permissions to prevent delegation loops.
4.3 The Standard 16-Phase Orchestration Template
All complex workflows follow a rigorous 16-phase state machine to ensure consistency.
```mermaid
graph LR
    Start([User Req]) --> Setup[Setup]
    Setup --> Discovery[Discovery]
    Discovery --> Gate1{⛔ Gate}
    Gate1 -- "Context >85%" --> Block1["Require /compact"]
    Gate1 -- "OK" --> Design[Design]
    Design --> Check1{⏸ Check}
    Check1 -- "OK" --> Impl[Impl]
    Impl --> Gate2{⛔ Gate}
    Gate2 -- "Context >85%" --> Block2["Require /compact"]
    Gate2 -- "OK" --> Test[Test]
    Test --> Gate3{⛔ Gate}
    Gate3 -- "OK" --> Completion[Done]
    Completion --> End([Complete])
```
Compaction Gate (⛔):
We enforce token hygiene programmatically. Before entering heavy execution phases (3, 8, 13), the system checks context usage (gate logic sketched below):
- < 75%: Proceed.
- 75-85%: Warning (should compact).
- > 85%: Hard Block. The system refuses to spawn new agents until `precompact-context.sh` runs.
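A minimal sketch of the gate thresholds above, assuming a hypothetical `checkCompactionGate` helper that receives context usage as a fraction; the production check runs inside the hooks and the orchestrator skill rather than as a standalone function.

```typescript
// Sketch of the compaction gate thresholds (helper name and shape assumed).
type GateDecision =
  | { action: "proceed" }
  | { action: "warn"; message: string }
  | { action: "block"; message: string };

function checkCompactionGate(contextUsage: number): GateDecision {
  if (contextUsage > 0.85) {
    return {
      action: "block",
      message: "Context >85%: refuse to spawn agents until precompact-context.sh runs.",
    };
  }
  if (contextUsage >= 0.75) {
    return { action: "warn", message: "Context 75-85%: compaction recommended." };
  }
  return { action: "proceed" };
}

console.log(checkCompactionGate(0.9)); // { action: "block", ... }
```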
Phase | Name | Purpose |
|---|---|---|
1 | Setup | Worktree creation, output directory, MANIFEST.yaml |
2 | Triage | Classify work type, select phases to execute |
3 | Codebase Discovery | Two-phase discovery: explore patterns, detect technologies |
4 | Skill Discovery | Map technologies to skills |
5 | Complexity | Technical assessment, execution strategy |
6 | Brainstorming | Design refinement with human-in-loop |
7 | Architecting Plan | Technical design AND task decomposition |
8 | Implementation | Code development |
9 | Design Verification | Verify implementation matches plan |
10 | Domain Compliance | Domain-specific mandatory patterns |
11 | Code Quality | Code review for maintainability |
12 | Test Planning | Test strategy and plan creation |
13 | Testing | Test implementation and execution |
14 | Coverage Verification | Verify test coverage meets threshold |
15 | Test Quality | No low-value tests, correct assertions |
16 | Completion | Final verification, PR, cleanup |
4.4 State Management & Locking
To survive session resets and context exhaustion, state is persisted to disk.
- The Process Control Block (`MANIFEST.yaml`): Located in `.claude/.output/features/{id}/`. Tracks current phase, active agents, and validation status. This allows the orchestration to be "resumed" seamlessly across different chat sessions.
- Distributed File Locking: When multiple developer agents run in parallel, they utilize a lockfile mechanism (`.claude/locks/{agent}.lock`) to prevent race conditions on shared source files (a minimal lock sketch follows).
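The lockfile protocol itself is not specified in this paper; the TypeScript sketch below shows one plausible exclusive-create implementation over `.claude/locks/{agent}.lock`, purely as an illustration of the idea.

```typescript
// Sketch: advisory lockfile for a parallel developer agent (assumed protocol).
import { mkdirSync, openSync, closeSync, unlinkSync } from "node:fs";
import { join } from "node:path";

const LOCK_DIR = ".claude/locks";

function acquireLock(agentId: string): (() => void) | null {
  mkdirSync(LOCK_DIR, { recursive: true });
  const lockPath = join(LOCK_DIR, `${agentId}.lock`);
  try {
    // "wx" fails if the file already exists, giving an exclusive create.
    const fd = openSync(lockPath, "wx");
    closeSync(fd);
    return () => unlinkSync(lockPath); // release callback
  } catch {
    return null; // another agent holds the lock; caller must wait or re-plan
  }
}

const release = acquireLock("backend-developer-1");
if (release) {
  try {
    // ...edit shared source files...
  } finally {
    release();
  }
}
```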
4.5 The Five-Role Development Pattern
Real-world application of this architecture is demonstrated in our /feature, /integration, and /capability workflows (derived from orchestrating-multi-agent-workflows), which utilize a specialized five-role assembly line. This pattern ensures that distinct cognitive modes (designing, coding, critiquing, planning, testing) remain isolated and unpolluted.
| Role | Agent | Responsibility | Output |
|---|---|---|---|
| Specialized Lead | | Architecture & Strategy. Decomposes requirements into atomic tasks. Does not write code. | Architecture Plan (JSON) |
| Specialized Developer | | Implementation. Executes specific sub-tasks from the plan. Focuses purely on logic. | Source Code |
| Specialized Reviewer | | Compliance. Validates code against specs and patterns. Rejects non-compliant work. | Review Report |
| Test Lead | | Strategy. Analyzes the implementation to determine what needs testing (Unit vs E2E vs Integration). | Test Plan |
| Specialized Tester | | Verification. Writes and runs the tests defined by the Test Lead. | Test Cases |
The Workflow:
```mermaid
sequenceDiagram
    participant User
    participant Orch as Orchestrator
    participant Lead
    participant Dev
    participant Reviewer
    participant TestLead
    participant Tester
    User->>Orch: "Add Feature X"
    Orch->>Lead: "Design X"
    Lead-->>Orch: Architecture Plan
    loop Implementation Cycle
        Orch->>Dev: "Implement Task 1"
        Dev-->>Orch: Code
        Orch->>Reviewer: "Review Task 1"
        Reviewer-->>Orch: Approval/Rejection
    end
    Orch->>TestLead: "Plan Tests for X"
    TestLead-->>Orch: Test Strategy
    loop Verification Cycle
        Orch->>Tester: "Execute Test Suite"
        Tester-->>Orch: Pass/Fail
    end
```
This specialization prevents the “Jack of All Trades” failure mode, where a single agent compromises architectural integrity to make a test pass, or skips testing to finish implementation.
4.6 Orchestration Skills: The Coordination Infrastructure
The 16-phase template defines what happens; a family of orchestration skills defines how agents achieve autonomous completion without human intervention at every step.
The Iteration Problem
Without explicit termination signals, agents either exit prematurely ("I think I'm done") or loop infinitely (retrying the same fix). The iterating-to-completion skill solves this with three mechanisms (sketched after the list):

1. Completion Promises: An explicit string (e.g., `ALL_TESTS_PASSING`, `IMPLEMENTATION_COMPLETE`) that the agent outputs only when success criteria are met. The orchestrator pattern-matches for this signal—no fuzzy interpretation.
2. Scratchpads: A persistent file (`.claude/.output/scratchpad-{task}.md`) where agents record what they accomplished, what failed, and what to try next. Each iteration reads the scratchpad first, preventing the "Groundhog Day" failure where agents repeat the same failed approach.
3. Loop Detection: If three consecutive iterations produce outputs with >90% string similarity (e.g., "Fixed auth.ts – TypeError" three times), the system detects a stuck state and escalates rather than burning tokens.
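A small sketch of mechanisms 1 and 3, assuming an illustrative token-overlap similarity function; the platform's actual similarity metric is not specified here.

```typescript
// Sketch: completion-promise matching and stuck-loop detection (illustrative).
const COMPLETION_PROMISES = ["ALL_TESTS_PASSING", "IMPLEMENTATION_COMPLETE"];

function isComplete(agentOutput: string): boolean {
  return COMPLETION_PROMISES.some((p) => agentOutput.includes(p));
}

// Crude token-overlap similarity; the production metric may differ.
function similarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const shared = [...ta].filter((t) => tb.has(t)).length;
  return shared / Math.max(ta.size, tb.size);
}

// Escalate if three consecutive outputs are >90% similar to each other.
function isStuck(lastOutputs: string[]): boolean {
  if (lastOutputs.length < 3) return false;
  const [a, b, c] = lastOutputs.slice(-3);
  return similarity(a, b) > 0.9 && similarity(b, c) > 0.9;
}

console.log(isComplete("All green. ALL_TESTS_PASSING")); // true
console.log(isStuck(["Fixed auth.ts - TypeError", "Fixed auth.ts - TypeError", "Fixed auth.ts - TypeError"])); // true
```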
The Persistence Problem
Complex features span multiple sessions. Context windows exhaust. Sessions crash. The persisting-agent-outputs and persisting-progress-across-sessions skills provide the external memory:
- Discovery Protocol: When an agent spawns, it doesn't guess where to write. It follows a deterministic protocol: check for `OUTPUT_DIRECTORY` in the prompt → find recent `MANIFEST.yaml` files → create a new directory only if none exist. This ensures all agents in a workflow write to the same location.
- Blocked Agent Routing: When an agent returns `status: "blocked"` with a `blocked_reason` (e.g., `missing_requirements`, `architecture_decision`), the orchestrator consults a routing table to determine the next action—escalate to user, spawn a different agent, or abort. No improvisation (a routing-table sketch follows the list).
- Context Compaction: As workflows progress, completed phase outputs are summarized (full content archived to disk) to prevent "context rot"—the degradation in model performance as the window fills with stale information.
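A sketch of the blocked-agent routing table; the `blocked_reason` values come from the text above, while the concrete actions and the agent name used here are illustrative assumptions.

```typescript
// Sketch: routing table for blocked agents (actions and agent names illustrative).
type BlockedReason = "missing_requirements" | "architecture_decision" | "unknown";
type NextAction =
  | { kind: "escalate_to_user"; question: string }
  | { kind: "spawn_agent"; agent: string }
  | { kind: "abort" };

const BLOCKED_ROUTING: Record<BlockedReason, NextAction> = {
  missing_requirements: {
    kind: "escalate_to_user",
    question: "Requirements are ambiguous; please clarify before implementation continues.",
  },
  architecture_decision: { kind: "spawn_agent", agent: "lead" }, // agent name assumed
  unknown: { kind: "abort" },
};

function routeBlockedAgent(reason: string): NextAction {
  if (reason in BLOCKED_ROUTING) return BLOCKED_ROUTING[reason as BlockedReason];
  return BLOCKED_ROUTING.unknown;
}

console.log(routeBlockedAgent("missing_requirements")); // escalate_to_user
```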
The Parallelization Problem
When six tests fail across three files, sequential debugging wastes time. The dispatching-parallel-agents skill identifies independent failures—those that can be investigated without shared state—and spawns concurrent agents:
```
Agent 1 (frontend-tester) → Fix auth-abort.test.ts (3 failures)
Agent 2 (frontend-tester) → Fix batch-completion.test.ts (2 failures)
Agent 3 (frontend-tester) → Fix race-conditions.test.ts (1 failure)
```

All three run simultaneously. When they return, the orchestrator verifies no conflicts (agents edited different files) and integrates the fixes. Time to resolution: 1x instead of 3x (a dispatch sketch follows).
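A sketch of the dispatch-and-merge flow in TypeScript; `spawnAgent` is a stand-in for the Task tool, and the conflict check mirrors the "agents edited different files" verification described above.

```typescript
// Sketch: dispatching independent test-fix tasks concurrently.
type FixResult = { task: string; editedFiles: string[]; passed: boolean };

// Placeholder: in the platform this is the Task tool spawning a sub-agent.
async function spawnAgent(agent: string, task: string): Promise<FixResult> {
  return { task, editedFiles: [], passed: true };
}

async function dispatchParallelFixes(): Promise<FixResult[]> {
  const tasks = [
    spawnAgent("frontend-tester", "Fix auth-abort.test.ts (3 failures)"),
    spawnAgent("frontend-tester", "Fix batch-completion.test.ts (2 failures)"),
    spawnAgent("frontend-tester", "Fix race-conditions.test.ts (1 failure)"),
  ];
  const results = await Promise.all(tasks); // all three run concurrently

  // Conflict check: no two agents may have edited the same file.
  const seen = new Set<string>();
  for (const r of results) {
    for (const f of r.editedFiles) {
      if (seen.has(f)) throw new Error(`Conflict: ${f} edited by multiple agents`);
      seen.add(f);
    }
  }
  return results;
}

dispatchParallelFixes().then((r) => console.log(`${r.length} fixes integrated`));
```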
Skill Composition
These skills compose hierarchically:
```
orchestrating-feature-development (16-phase workflow)
├── persisting-agent-outputs (shared workspace)
├── persisting-progress-across-sessions (cross-session resume)
├── iterating-to-completion (intra-task loops)
└── dispatching-parallel-agents (concurrent debugging)
```

The orchestrator invokes `persisting-agent-outputs` at startup to establish the workspace, uses `iterating-to-completion` within phases when agents need retries, and calls `dispatching-parallel-agents` when multiple independent failures are detected. State flows through `MANIFEST.yaml`, enabling any session to resume from the last checkpoint.
5. The Runtime: Deterministic Hooks
While Skills provide guidance, Hooks provide enforcement. We utilize the Claude Code lifecycle events (PreToolUse, PostToolUse, Stop) to inject deterministic logic that the LLM cannot bypass.
5.1 Defense in Depth: Eight-Layer Enforcement
Layer | Description |
|---|---|
LAYER 1: CLAUDE.md | Full ruleset loaded at session start. Establishes norms. |
LAYER 2: Skills | Procedural workflows invoked on-demand. "How to do X." |
LAYER 3: Agent Definitions | Role-specific behavior, mandatory skill lists, output formats. |
LAYER 4: UserPromptSubmit Hooks | Inject reminders every prompt. Gateway → library skill pattern. |
LAYER 5: PreToolUse Hooks | Block BEFORE action. Agent-first enforcement, compaction gates. |
LAYER 6: PostToolUse Hooks | Validate agent work before completion. Output location, skill compliance. |
LAYER 7: SubagentStop Hooks | Block premature sub-agent exit. Quality gates, iteration limits, feedback loops. |
LAYER 8: Stop Hooks | Block premature exit. Quality gates, iteration limits, feedback loops. |
Example
| Layer | Mechanism | Catches? |
|---|---|---|
| 3 | Agent definition says "reviewer validates your work" | Rationalized |
| 6 | `track-modifications.sh` creates `feedback-loop-state.json` | State initialized |
| 8 | `feedback-loop-stop.sh` blocks exit until review phase passes | **Blocked** |
| 8 | `quality-gate-stop.sh` provides backup check | **Blocked** |
5.2 Agent-First Enforcement
```bash
# PreToolUse hook: agent-first-enforcement.sh intercepts Edit/Write
# 1. Parse tool_input.file_path
# 2. Determine domain (backend, frontend, capability, tool)
# 3. Check if a developer agent exists for that domain
# 4. If yes → BLOCK with: "Spawn {domain}-developer instead"
```

**Before (rationalized):**

```
User: "Fix the authentication bug in login.go"
Claude: "I'll just make this quick edit myself..."
→ Writes buggy code, no review, no tests
```

**After (enforced):**

```
User: "Fix the authentication bug in login.go"
Claude: Attempts Edit on login.go
→ BLOCKED: "backend-developer exists. Spawn it instead of editing directly."
Claude: Spawns backend-developer with clear task
→ Agent follows TDD, gets reviewed, tests pass
```

5.3 The Three-Level Loop System
We architected three nested enforcement loops to guarantee quality. Critically, the limits for these loops are not hardcoded but defined in a central configuration file: .claude/config/orchestration-limits.yaml. This “Configuration as Code” approach allows us to tune system behavior (e.g., tightening retry limits for expensive models) without modifying the underlying shell scripts.
Level 1: Intra-Task Loop (Hook: `iteration-limit-stop.sh`)
- Scope: Single agent.
- Function: Prevents an agent from spinning endlessly on a single shell command.
- Limit: Max 10 iterations (configurable).

Level 2: Inter-Phase Loop (Hook: `feedback-loop-stop.sh`)
- Scope: The Implementation -> Review -> Test cycle.
- Function: Enforces that code cannot be marked complete until independent Reviewer and Tester agents have passed it.
- Logic (re-expressed as a TypeScript sketch at the end of this section):
  1. Listens for `Edit`/`Write` tools.
  2. Sets a "Dirty Bit" in `feedback-loop-state.json`.
  3. Intercepts the `Stop` event.
  4. If the Dirty Bit is set and `tests_passed != true`, BLOCK EXIT.
  5. Returns JSON: `{"decision": "block", "reason": "Tests failed. You must fix and retry."}`.
Level 3: Orchestrator Loop (Skill Logic)
- Scope: The 16-phase workflow.
- Function: Re-invokes entire phases if macro-goals are missed.
```mermaid
sequenceDiagram
    participant Agent as Agent (User Mode)
    participant Hook as Hook (Kernel Mode)
    participant State as State File
    Agent->>Hook: Tool Use (Edit/Write)
    Note over Hook: PreToolUse Event
    Hook->>State: Set Dirty Bit (Needs Review)
    Hook-->>Agent: Allow Execution
    Agent->>Agent: Completes Task
    Agent->>Hook: Attempt Exit (Stop)
    Note over Hook: Stop Event
    Hook->>State: Check Status
    State-->>Hook: Tests Passed = False
    Hook-->>Agent: BLOCK: {"decision": "block"}
    Note over Agent: Forced to stay in loop
```
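The production Level 2 gate is the bash hook `feedback-loop-stop.sh`; the sketch below re-expresses its decision logic in TypeScript, assuming the hook reads `feedback-loop-state.json` (path shown is illustrative) and writes its block decision as JSON to stdout.

```typescript
// Sketch of the Level 2 decision; the production hook is feedback-loop-stop.sh.
import { readFileSync, existsSync } from "node:fs";

interface FeedbackLoopState {
  dirty: boolean;         // set when an Edit/Write was observed
  tests_passed: boolean;  // flipped to true once the tester phase passes
}

// State file location is illustrative; only the filename is given in the text.
const STATE_PATH = ".claude/hooks/state/feedback-loop-state.json";

function main(): void {
  if (!existsSync(STATE_PATH)) return; // no tracked work: allow exit silently
  const state: FeedbackLoopState = JSON.parse(readFileSync(STATE_PATH, "utf8"));
  if (state.dirty && state.tests_passed !== true) {
    // Emitting this decision object is what forces the agent back into the loop.
    process.stdout.write(
      JSON.stringify({ decision: "block", reason: "Tests failed. You must fix and retry." })
    );
  }
}

main();
```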
5.4 Ephemeral vs. Persistent State
We employ a dual-state architecture to ensure resilience:
- Ephemeral State (Hooks): Stored in `feedback-loop-state.json`. Used for Runtime Enforcement (blocking exit, tracking dirty bits). Cleared on session restart.
- Persistent State (Agents): Stored in `MANIFEST.yaml`. Used for Workflow Coordination (resuming tasks, tracking phases). Survives session restarts.
This duality ensures that if a session crashes (losing ephemeral state), the workflow can still be resumed from the last checkpoint using the persistent manifest.
5.5 The Escalation Advisor
When an agent gets stuck in a loop (e.g., repeating the same failing test fix), standard retries fail. We implemented an Out-of-Band Advisor.
- Trigger: `Stop` event blocked > 3 times.
- Action: The hook invokes an external LLM (Gemini or Codex) with the session transcript.
- Prompt: "Analyze this loop. Why is the agent stuck? Provide a 1-sentence hint."
- Result: The hint is injected into the main context as a system message, breaking the cognitive deadlock.
5.6 Output Location Enforcement
To prevent workspace clutter, a SubagentStop hook enforces filesystem hygiene.
- Logic: `git ls-files --others`
- Check: Are there Markdown files outside `.claude/.output/`?
- Action: Block. Force the agent to move files to the structured directory (a minimal sketch follows).
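A minimal sketch of the check, using `git ls-files --others` as described above; the block-decision shape mirrors the other Stop-family hooks and is an assumption here, as is the added `--exclude-standard` flag.

```typescript
// Sketch: SubagentStop check for stray Markdown files (decision shape assumed).
import { execSync } from "node:child_process";

function findStrayMarkdown(): string[] {
  // Untracked files, per the logic above (`git ls-files --others`).
  const untracked = execSync("git ls-files --others --exclude-standard", {
    encoding: "utf8",
  }).split("\n").filter(Boolean);
  return untracked.filter(
    (f) => f.endsWith(".md") && !f.startsWith(".claude/.output/")
  );
}

const stray = findStrayMarkdown();
if (stray.length > 0) {
  process.stdout.write(JSON.stringify({
    decision: "block",
    reason: `Move these files into .claude/.output/: ${stray.join(", ")}`,
  }));
}
```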
5.7 Related Work & Architectural Evolution
Our architecture synthesizes and extends two foundational patterns from the Claude ecosystem:
- Ralph Wiggum (Geoffrey Huntley): A "dumb" `while` loop that restarts an agent until completion. We formalized this into the Intra-Task Loop but added configuration, loop detection, and safety guards.
- Continuous-Claude-v3 (parcadei): A "persistence" pattern using YAML handoffs to survive session resets. We adopted this for our Persistent State (`MANIFEST.yaml`) but integrated it with distributed locking and hooks.
- Superpowers (Jesse Vincent): An agentic skills framework emphasizing TDD, YAGNI, and sub-agent driven development. We adopted its "Brainstorming" and "Writing Plans" skills as the foundation for our Setup and Discovery phases. Jesse Vincent is absolutely brilliant and his work inspired our own.
Feature | Ralph Wiggum | Continuous-Claude-v3 | Superpowers | Praetorian Development Platform |
|---|---|---|---|---|
Scope | Single Agent Loop | Cross-Session Handoff | Skill Framework | Multi-Agent Orchestration |
State | None (Loop only) | YAML Handoffs | Context-based | Dual (Ephemeral + Persistent) |
Control | Prompt-based | Prompt-based | Skill-based | Deterministic Hooks |
Feedback | None | None | Human-in-loop | Inter-Phase Loops + Escalation Advisor (Independent LLM) |
Our unique contribution is the Inter-Phase Feedback Loop (Implementation → Review → Test), which enforces quality gates across multiple specialized agents, moving beyond single-agent iteration.
6. The Supply Chain: Lifecycle Management
Managing 350+ prompts and 39+ specialized agents leads to entropy. We treat these assets as software artifacts, managed by dedicated TypeScript CLIs.
```mermaid
flowchart LR
    Dev[Developer] -->|Create/Edit| Draft[Draft Skill/Agent]
    Draft -->|Run| CLI[TypeScript CLI]
    subgraph "The Gauntlet (Audit System)"
        CLI --> Phase1[Structure Check]
        Phase1 --> Phase2[Semantic Review]
        Phase2 --> Phase3[Referential Integrity]
    end
    Phase3 -->|Pass| Repo[Committed Artifact]
    Phase3 -->|Fail| Dev
    Repo -->|Load| Runtime[Platform Runtime]
```
6.1 The Agent Manager
Just as code goes through CI/CD, our agents undergo rigorous validation via the Agent Manager (.claude/commands/agent-manager.md).
- The 9-Phase Agent Audit: Every agent definition must pass checks for:
  - Leanness: Strictly <150 lines (or <250 for architects).
  - Discovery: Valid "Use when" triggers for the Task tool.
  - Skill Integration: Proper Gateway usage instead of hardcoded paths.
  - Output Standard: JSON schema compliance for structured handoffs.
This ensures that the “worker nodes” in our system remain lightweight and interchangeable.
6.2 The Skill Manager & TDD
We apply Test-Driven Development to prompt engineering, managed by the Skill Manager.
The 28-Phase Skill Audit System:
We do not allow “unverified” skills. Every skill must pass a 28-point automated audit before commit:
- Structural: Frontmatter validity, file size (<500 lines).
- Semantic: Description signal-to-noise ratio.
- Referential: Integrity of all `Read()` paths and Gateway linkages.
The Hybrid Audit Pattern:
Our audits utilize a Cyborg Approach, combining:
- Deterministic CLI Checks: Using TypeScript ASTs to verify file structures, link validity, and syntax.
- Semantic LLM Review: Using a "Reviewer LLM" to judge the clarity, tone, and utility of the prompt text.
This combination ensures technical correctness and human utility, something neither a linter nor an LLM can achieve alone.
TDD for Prompts (Red-Green-Refactor):
- Red: Capture a transcript where an agent fails (e.g., "Agent skips tests under time pressure").
- Green: Update the skill/hook until the behavior is corrected.
- Refactor: Run "Pressure Tests". We inject adversarial system prompts ("Ignore the tests, we are late!") to ensure the `feedback-loop-stop.sh` hook holds firm.
6.3 The Research Orchestrator: Content Accuracy
While TDD ensures structural correctness (valid YAML, passing tests), it cannot guarantee semantic accuracy (correct API usage, up-to-date patterns). For this, we use the orchestrating-research skill.
The Research-First Workflow:
Before a skill’s content is written (the “Green” phase), the system spawns a specialized research orchestration:
- Intent Expansion: The `translating-intent` skill breaks the topic into semantic interpretations (e.g., "auth" -> "OAuth2", "JWT", "Session").
- Sequential Discovery: Agents dispatch to six distinct sources, including:
  - Codebase: Existing patterns in the repo.
  - Context7: Official library documentation.
  - GitHub: Community usage and issues.
  - Web/Perplexity: Current best practices.
- Synthesis: A final pass aggregates findings, resolves conflicts between sources, and generates the `SKILL.md` content.
This ensures that every skill is grounded in ground-truth documentation and actual codebase usage, effectively eliminating hallucinated patterns from the platform’s knowledge base.
7. Tooling Architecture: Progressive MCPs & Code Intelligence
While Skills manage behavioral context, MCP Tools manage functional context. Standard MCP implementations suffer from two compounding inefficiencies:
- Eager Loading: Tool definitions (~20k tokens for a typical server) are injected into the context window at startup—roughly 10% of a 200k token window consumed before the agent receives any task.
- Context Rot: Every intermediate tool result replays back into the model's context. A workflow that fetches a document from Google Drive and attaches it to Salesforce processes that document twice—once when reading, once when writing. A 2-hour meeting transcript adds ~50k tokens per pass.
“The tool definitions alone swelled the prompt by almost 20,000 tokens… and every intermediate result streamed back into the model added more baggage.”
— Anthropic Engineering, “Code Execution with MCP” (2025)
Anthropic’s own measurements show standard MCP workflows consuming ~150k tokens for multi-tool operations that could execute in ~2k tokens with proper architecture—a 98% reduction.
7.1 The TypeScript Wrapper Pattern (aka MCP Code Execution)
We replaced raw MCP connections with On-Demand TypeScript Wrappers.
- Legacy Model: 5 MCP servers = 71,800 tokens consumed at startup (36% of context).
- Wrapper Model: 0 tokens at startup. Wrappers load via the Gateway pattern only when requested.
- Safety Layer: Wrappers enforce Zod schema validation on inputs and Response Filtering (truncation/summarization) on outputs, preventing "context flooding" from large API responses (a wrapper sketch follows the sequence diagram).
```mermaid
sequenceDiagram
    participant Agent
    participant Gateway
    participant Wrapper
    participant MCP as MCP Server
    Note over Agent, MCP: Session Start: 0 Tokens Loaded
    Agent->>Gateway: "I need to fetch a Linear issue"
    Gateway-->>Agent: Returns Path: .claude/tools/linear/get-issue.ts
    Agent->>Wrapper: Execute(issueId: "ENG-123")
    Note over Wrapper: 1. Zod Validation
    Wrapper->>MCP: Spawn Process & Request
    MCP-->>Wrapper: Large JSON Response (50kb)
    Note over Wrapper: 2. Response Filtering
    Wrapper-->>Agent: Optimized JSON (500b)
    Note over Agent, MCP: Process Ends. Memory Freed.
```
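A simplified sketch of such a wrapper, modeled on the `get-issue.ts` path in the diagram; the Zod schema and the response filtering follow the description above, while the MCP transport call and the specific fields kept in the filtered response are illustrative placeholders.

```typescript
// .claude/tools/linear/get-issue.ts -- sketch of an on-demand wrapper.
import { z } from "zod";

const Input = z.object({ issueId: z.string().regex(/^[A-Z]+-\d+$/) });

// Placeholder for the actual MCP invocation (a process spawned per request).
async function callMcpTool(tool: string, args: unknown): Promise<any> {
  // spawn the MCP server, issue the request, return the raw JSON payload
  return {};
}

export async function getIssue(raw: unknown) {
  const { issueId } = Input.parse(raw);                       // 1. Zod validation
  const response = await callMcpTool("linear.get_issue", { issueId });

  // 2. Response filtering: return only the fields agents actually need,
  //    instead of replaying a ~50 KB payload into the context window.
  return {
    id: response.id ?? issueId,
    title: response.title ?? "",
    state: response.state ?? "",
    assignee: response.assignee?.name ?? null,
  };
}
```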
7.2 Serena: Semantic Code Intelligence
While MCP wrappers solve tool definition bloat, code operations themselves present a far larger token sink. Standard agent workflows require reading entire files to understand structure, then performing grep-like searches that return irrelevant context.
The File-Reading Problem:
Consider an agent modifying a single function in a 2,000-line file:
- Traditional Approach: Read the full file (~8,000 tokens) → find the function via regex → generate a replacement → write the full file. For 5 related files, that's ~40,000 tokens just for context.
- Symbol-Level Approach: Query `find_symbol("processPayment")` → returns only the function body (~200 tokens) → edit at symbol level. The same 5-file task uses ~1,000 tokens.
We integrated Serena (an open-source MCP toolkit by Oraios with 19k+ GitHub stars), which provides IDE-like capabilities to agents via Language Server Protocol (LSP):
| Operation | Without Serena | With Serena |
|---|---|---|
| Find function definition | Read entire file(s), regex search | `find_symbol` query at symbol level |
| Trace call hierarchy | Read all potential callers | |
| Insert new method | Read file, string manipulation | |
| Navigate dependencies | Grep + manual file traversal | |
Why This Matters at Scale:
"Efficient operations are not only useful for saving costs, but also for generally improving the generated code's quality. This effect may be less pronounced in very small projects, but often becomes of crucial importance in larger ones."
Our codebase contains ~530k lines across 32 modules. Without semantic operations, architectural analysis tasks would consume entire context windows just loading files. With Serena, agents navigate the same codebase using a fraction of the tokens.
Performance Optimization: We implemented a custom Connection Pool architecture that maintains warm LSP processes, reducing query latency from ~3s cold-start to ~2ms warm, enabling high-frequency code queries during architectural analysis without process spawn overhead.
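A minimal sketch of the warm-pool idea, not the production implementation: language servers are spawned once and reused across queries, avoiding the per-query cold-start cost. The server command and arguments shown are illustrative.

```typescript
// Sketch: a minimal warm-process pool for LSP servers (illustrative).
import { spawn, ChildProcess } from "node:child_process";

class LspConnectionPool {
  private servers = new Map<string, ChildProcess>();

  // Returns a warm server for the language, spawning one only on first use.
  acquire(language: string, command: string, args: string[]): ChildProcess {
    const existing = this.servers.get(language);
    if (existing && existing.exitCode === null) return existing;
    const proc = spawn(command, args, { stdio: ["pipe", "pipe", "inherit"] });
    this.servers.set(language, proc);
    return proc;
  }

  shutdown(): void {
    for (const proc of this.servers.values()) proc.kill();
    this.servers.clear();
  }
}

// Usage (command and args are illustrative):
const pool = new LspConnectionPool();
const tsServer = pool.acquire("typescript", "typescript-language-server", ["--stdio"]);
console.log(`warm LSP pid: ${tsServer.pid}`);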
8. Infrastructure Integration: Zero-Trust Secrets
Injecting secrets (AWS keys, Database credentials) into the LLM context is a critical security vulnerability. We implemented a Just-in-Time (JIT) Injection architecture using 1Password.
8.1 The run-with-secrets Wrapper
We do not give agents API keys. We give them a tool: 1password.run-with-secrets.
Configuration (.claude/tools/1password/lib/config.ts):
```typescript
export const DEFAULT_CONFIG = {
  account: "praetorianlabs.1password.com",
  serviceItems: {
    "aws-dev": "op://Private/AWS Key/credential",
    "ci-cd": "op://Engineering/CI Key/credential",
  },
};
```

Execution Flow:
```mermaid
sequenceDiagram
    participant LLM as LLM Context
    participant Agent
    participant Tool as Wrapper
    participant OP as 1Password CLI
    participant AWS as AWS CLI
    LLM->>Agent: "List S3 Buckets"
    Agent->>Tool: run_with_secrets("aws s3 ls")
    rect rgb(200, 255, 200)
        Note right of Tool: SECURE ENCLAVE (Child Process)
        Tool->>OP: Request "AWS_ACCESS_KEY"
        OP-->>Tool: Inject as ENV VAR
        Tool->>AWS: Execute Command
        AWS-->>Tool: Output: "bucket-a, bucket-b"
    end
    Tool-->>Agent: Return Output
    Agent-->>LLM: "Here are the buckets..."
    Note over LLM: Secret never entered Context
```
1. The agent requests: `run_with_secrets("aws s3 ls", { envFile: ".secrets.env" })`
2. The Tool Wrapper intercepts.
3. The Wrapper executes: `op run --env-file=".claude/tools/1password/secrets.env" -- aws s3 ls`
4. Security Guarantee: The secret exists only in the child process environment variables. It is never printed to stdout, never logged, and never enters the LLM context window (a simplified wrapper sketch follows).
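A simplified TypeScript sketch of the wrapper's core call, assuming the `op` CLI is on the PATH; argument handling, validation, and error reporting in the real tool are more involved.

```typescript
// Sketch: run-with-secrets core (simplified). The op CLI resolves op:// refs
// from the env file and injects them only into the child process environment.
import { execFileSync } from "node:child_process";

export function runWithSecrets(
  command: string,
  opts: { envFile?: string } = {}
): string {
  const envFile = opts.envFile ?? ".claude/tools/1password/secrets.env";
  // Secrets live only inside the spawned child; the returned stdout is the
  // only thing that re-enters the LLM context.
  return execFileSync(
    "op",
    ["run", `--env-file=${envFile}`, "--", "bash", "-lc", command],
    { encoding: "utf8" }
  );
}

// Usage: runWithSecrets("aws s3 ls");
```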
9.0 Horizontal Scaling Architecture
Traditional software development is constrained by human limitation. In this architecture, the constraint becomes the development ecosystem itself: developers typically work on local hardware (laptop RAM/CPU), which limits analysis to 3-5 concurrent sessions. To remove this bottleneck, we decoupled the Control Plane (Laptop) from the Execution Plane (Cloud).
- Local: The engineer's laptop deploys a Docker instance using DevPod. The instance is loaded with a lightweight development environment that includes, among other things, Cursor, Claude Code, and the GitHub repository under development.
- Remote: The actual development environment ("DevPod") runs in an ephemeral Docker container on an AWS AMI; both building and deployment occur in the cloud.
- Bridge: A secure SSH tunnel forwards the remote Cursor terminal back to the developer's laptop.
9.1 Devpod, Docker, and AWS AMIs
Because the heavy lifting happens in the cloud, engineers can spawn effectively unlimited parallel DevPods:
- Isolation: Each feature or threat model runs in its own isolated container.
- Resources: We can provision 128GB RAM instances for massive monorepo analysis, impossible on a laptop.
- Security: Code never leaves the VPC. The laptop only sees the terminal pixels/text stream.
10.0 Roadmap: Beyond Orchestration
The current platform achieves Level 3 Autonomy (Orchestrated). Our roadmap targets Level 5 (Self-Evolving).
10.1 Heterogeneous LLM Routing
No single model excels at every task. The platform utilizes a routing matrix to send specific tasks to the models best architected to handle them. This “Heterogeneous Orchestration” optimizes for both performance and cost.
This routing is managed by a semantic decision layer that uses small, fast models as routers. These routers evaluate the user’s intent and select the appropriate specialist agent, ensuring that expensive reasoning models are reserved for logic, while high-throughput multimodal models handle visual and data-heavy tasks.
Development Task | Optimal Model Architecture | Technical Advantage |
|---|---|---|
Logic & Reasoning | DeepSeek-R1 / V3 | Reinforcement Learning (RL)-based chain-of-thought for complex inference. |
Document Processing | DeepSeek OCR 2 | 10x token efficiency utilizing visual causal flow for structural preservation. |
UI/UX & Frontend | Kimi 2.5 | Native MoonViT architecture; enables autonomous visual debugging loops. |
Parallel Research | Kimi 2.5 Swarm | PARL-driven optimization of the critical path across up to 100 agents. |
Massive Repository Mapping | DeepSeek-v4 Engram | O(1) constant-time lookup and tiered KV cache for million-token context. |
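Since this is a roadmap item, the following is purely an illustrative sketch of what a routing matrix and a lightweight intent heuristic could look like; the model identifier strings and the classification rules are assumptions, not the production router.

```typescript
// Sketch: heterogeneous routing matrix (task categories mirror the table above;
// model slugs and the keyword heuristic are illustrative assumptions).
type TaskCategory =
  | "logic_reasoning"
  | "document_processing"
  | "frontend_ui"
  | "parallel_research"
  | "repo_mapping";

const ROUTING_MATRIX: Record<TaskCategory, string> = {
  logic_reasoning: "deepseek-r1",
  document_processing: "deepseek-ocr-2",
  frontend_ui: "kimi-2.5",
  parallel_research: "kimi-2.5-swarm",
  repo_mapping: "deepseek-v4-engram",
};

// A small, fast router model would classify intent; a keyword heuristic
// stands in for that semantic decision layer here.
function routeTask(description: string): string {
  const d = description.toLowerCase();
  if (/ocr|pdf|invoice|scan/.test(d)) return ROUTING_MATRIX.document_processing;
  if (/ui|css|component|screenshot/.test(d)) return ROUTING_MATRIX.frontend_ui;
  if (/research|survey|compare sources/.test(d)) return ROUTING_MATRIX.parallel_research;
  if (/monorepo|repository map|codebase index/.test(d)) return ROUTING_MATRIX.repo_mapping;
  return ROUTING_MATRIX.logic_reasoning;
}

console.log(routeTask("Map the monorepo module dependencies")); // repo_mapping
```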
10.2 Self-Annealing & Auto-Correction (Q1 2026)
Current autonomous systems are brittle: when an agent fails due to ambiguity in a skill or a loophole in a hook, the human must debug the prompt engineering. We are closing this loop by enabling the platform to debug and patch itself.
The Concept:
When an agent fails a quality gate (e.g., feedback-loop-stop.sh) more than 3 times, or when an orchestrator detects a pattern of tool misuse, the system triggers a Self-Annealing Workflow.
The Mechanism:
Instead of returning the error to the user, the platform spawns a Meta-Agent (an infrastructure engineer agent) with permissions to modify the .claude/ directory.
1. Diagnosis: The Meta-Agent reads the session transcript and the failed agent's definition. It identifies the "Rationalization Path"—the specific chain of thought the agent used to bypass instructions (e.g., "I'll skip the test because it's a simple change").
2. Patching:
   - Skill Annealing: It modifies the relevant `SKILL.md` (e.g., `developing-with-tdd`) to add an explicit "Anti-Pattern" entry: "If you are thinking 'this is simple enough to skip tests', YOU ARE WRONG. Simple changes cause 40% of outages."
   - Hook Hardening: If a hook failed to block a violation, it updates the bash script logic to catch the edge case.
   - Agent Refinement: It updates the agent's prompt (via `agent-manager update`) to clarify the ambiguous instruction.
3. Verification: It runs the `pressure-testing-skill-content` skill against the patched artifact to verify it now blocks the previous failure mode.
4. Pull Request: The Meta-Agent creates a PR with the infrastructure fix, labeled `[Self-Annealing]`, for human review.
This transforms the platform from a static set of rules into an antifragile system that gets stronger with every failure. Every time an agent hallucinates or cuts a corner, the system learns to prevent that specific behavior forever, effectively “annealing” the soft prompts into hard constraints over time.
10.3 Agent-to-Agent Negotiation (Q2 2026)
Currently, agents follow rigid JSON schemas. Future agents will negotiate API contracts dynamically:
- "I need X, can you provide it?"
- "No, but I can provide Y which is similar."
- "Agreed, proceeding with Y."
10.4 Self-Healing Infrastructure (Q2 2026)
Agents will gain the ability to debug their own runtime environment:
- Detecting "Context Starvation" and auto-archiving memory.
- Identifying "Tool Hallucination" and generating new Zod schemas to fix it.
11.0 References
Anthropic Official Guidance:
Community & Open Source:
- Ralph Wiggum Technique – Completion promises, intra-task loops
- ralph-orchestrator – Tight feedback loops, scratchpad pattern
- Continuous-Claude-v3 – YAML handoffs, memory system
- obra/superpowers – REQUIRED SUB-SKILL pattern, Integration sections
- Serena – Semantic code analysis via LSP
- Context Parallelism – File scope boundaries, proactive conflict prevention
Standards & Protocols:
12.0 Conclusion
The Praetorian Development Platform achieves escape velocity not by “improving the model,” but by constraining the runtime. By architecting a system where agents are ephemeral, context is curated via gateways, and workflows are enforced by deterministic hooks, we transform AI into a deterministic component of the software supply chain.
12.1 Recap
By architecting a system where:
- Agents are ephemeral and stateless,
- Context is strictly curated via Gateways,
- Workflows are enforced by deterministic Kernel hooks,
- Tools are progressively loaded and type-safe, and
- Secrets never touch the context…
We transform the LLM from a “creative assistant” into a deterministic component of the software supply chain. This allows us to scale development throughput linearly with compute, untethered by the cognitive limits of human attention.
12.2 Constraint Forces Innovation
For the next 12 weeks, we are open sourcing one attack module per week as part of our "The 12 Caesars" marketing campaign. At some point, I'll sit down again and describe our AI attack platform architecture. As you will see, we apply similar principles to that platform as we do to development. This allows us to work around the constraints of our capital-light footprint, which would ordinarily limit our ability to execute. As DeepSeek is proving to the frontier model labs, I'm not sure the expensive way is the best way anymore. The problem with capital is that it allows you to do a lot of stupid things very fast. We do not have that luxury. We must be clever instead.
Build the machine that builds the machine that enables a team to hack all the things.