Deterministic AI Orchestration: A Platform Architecture for Autonomous Development

Executive Summary

The primary bottleneck in autonomous software development is not model intelligence, but context management and architectural determinism. Current “Agentic” approaches fail at scale because they rely on probabilistic guidance (prompts) for deterministic engineering tasks (builds, security, state management). Furthermore, the linear cost of token consumption versus the non-linear degradation of model attention creates a “Context Trap” that prevents complex multi-phase execution.

 

This paper details the architecture of the Praetorian Development Platform, which solves these problems by treating the Large Language Model (LLM) not as a chatbot, but as a nondeterministic kernel process wrapped in a deterministic runtime environment. We present a five-layer architecture that enforces strict separation of concerns, enables linear scaling of complexity, and achieves “escape velocity”—where the AI system contributes net-positive value to the development lifecycle.

1. The Core Problem: The Context-Capability Paradox

Anthropic’s research and our internal telemetry confirm that token usage alone explains 80% of performance variance in agent tasks. This creates a fundamental paradox:

  1. To handle complex tasks, agents need comprehensive instructions (skills).

  2. Comprehensive instructions consume the context window.

  3. Consumed context reduces the model's ability to reason about the actual task.

graph TD
    subgraph "Legacy: The Monolith"
    M[Monolithic Agent]
    M -->|Contains| I[All Instructions]
    M -->|Contains| T[All Tools]
    M -->|Contains| S[Full State]
    M -->|Result| C[Context Overflow]
    end

    subgraph "Platform: Distributed Architecture"
    Orch[Orchestrator Skill] -->|Spawns| Worker[Specialized Agent]

    Worker -->|JIT Load| Skill[Skill Library]
    Worker -->|JIT Load| Tool[MCP Tools]

    Hook[Deterministic Hooks] -->|Enforces| Loop[Validation Loop]
    Loop -->|Gates| Worker

    Worker -->|Output| Artifact[Structured State]
    end

Early iterations of our platform utilized “Monolithic Agents” with 1,200+ line agent bodies. These agents suffered from Attention Dilution (ignoring instructions late in the prompt) and Context Starvation (insufficient space for code analysis).

1.1 The Solution: Inverting the Control Structure

We moved from a “Thick Agent” model to a “Thin Agent / Fat Platform” architecture.

  • Agents are reduced to stateless, ephemeral workers (<150 lines).

  • Skills hold the knowledge, loaded strictly on-demand (Just-in-Time).

  • Hooks provide the enforcement, operating outside the LLM’s context.

  • Orchestration manages the lifecycle of specialized roles.

2. Agent Architecture: The "Thin Agent" Pattern

2.1 Architectural Constraints

The architecture is defined by one hard constraint in the Claude Code runtime: Sub-agents cannot spawn other sub-agents. This prevents infinite recursion but necessitates a flat, “Leaf Node” execution model.

2.2 The "Thin Agent" Specification

Agents are specialized workers that execute specific tasks and return results. They do not manage state or coordinate workflows.

Gold Standard Specification:

  • Line Count: Strictly <150 lines.

  • Discovery Cost: ~500-1000 characters (visible to the orchestrator).

  • Execution Cost: ~2,700 tokens per spawn (down from ~24,000 in early versions).

2.3 Sub-Agent Isolation

graph LR
    User["Orchestrator Skill"] -->|Task Tool| Spawn["Spawn Sub-Agent"]
    Spawn -->|Load| Context["Clean Context Window"]
    Context -->|Read| Gateway["Gateway Skill"]
    Gateway -->|Route| Library["Library Skills (On-Demand)"]
    Context -->|Execute| Work["Task Execution"]
    Work -->|Output| Result["Structured JSON Return"]
    Result -->|Destroy| Context

Every agent spawn creates a fresh instance with zero shared history from previous siblings. This solves “Context Drift” where agents confuse current requirements with past attempts. The parent orchestrator selectively injects only the necessary context (e.g., the Architecture Plan) into the prompt.

3. Skill Architecture: Two-Tier Progressive Loading

To bypass the hard limit of ~15,000 characters for skill definitions and preserve as much context as possible for thinking and execution, we implemented a two-tier (Librarian Pattern) file system architecture for agent capabilities.

3.1 Tier 1: Core Skills (The "BIOS")

  • Location: .claude/skills/

  • Count: 49 high-frequency skills.

  • Mechanism: Registered with the LLM as executable tools.

  • Purpose: Fundamental workflows (debugging-systematically, developing-with-tdd) and Gateways.

3.2 Tier 2: Library Skills (The "Hard Drive")

  • Location: .claude/skill-library/

  • Count: 304+ specialized skills.

  • Mechanism: Invisible to the LLM until explicitly loaded via Read().

  • Purpose: Deep domain knowledge (e.g., optimizing-react-performance, configuring-aws-lambda).

3.3 The Gateway Pattern (The Router)

Agents do not hardcode library paths. They invoke a Gateway Skill (e.g., gateway-frontend), which acts as a dynamic router based on intent detection.

Architecture Diagram: Gateway Routing

```
Agent: "I need to fix a React infinite loop"
   │
   ▼
Invokes: gateway-frontend
   │
   ├─► Detects Intent: "infinite loop", "useEffect"
   │
   ├─► Matches Routing Pattern:
   │     • preventing-react-hook-infinite-loops
   │     • debugging-react-infinite-loops
   │
   └─► Output: "Read these specific file paths..."
```

This implements Intent-Based Context Loading, ensuring agents only load the specific patterns relevant to their current task rather than the entire domain knowledge base.
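A minimal sketch of how a gateway might map detected intent to library skill paths. The routing entries, keywords, and file paths below are illustrative assumptions, not the actual gateway-frontend routing table:

```typescript
// Hypothetical sketch of intent-based routing inside a gateway skill.
// Keywords and skill paths are illustrative only.
type Route = { keywords: string[]; skills: string[] };

const FRONTEND_ROUTES: Route[] = [
  {
    keywords: ["infinite loop", "useeffect"],
    skills: [
      ".claude/skill-library/preventing-react-hook-infinite-loops/SKILL.md",
      ".claude/skill-library/debugging-react-infinite-loops/SKILL.md",
    ],
  },
  {
    keywords: ["slow render", "re-render"],
    skills: [".claude/skill-library/optimizing-react-performance/SKILL.md"],
  },
];

/** Return only the skill files relevant to the stated intent. */
export function routeIntent(intent: string): string[] {
  const text = intent.toLowerCase();
  const matches = FRONTEND_ROUTES.filter((r) =>
    r.keywords.some((k) => text.includes(k))
  );
  return [...new Set(matches.flatMap((r) => r.skills))];
}

// routeIntent("I need to fix a React infinite loop")
//   → returns only the two infinite-loop skills, nothing else.
```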

4. Orchestration Architecture: The Kernel

Since agents cannot spawn agents, orchestration must occur in the Main Conversation Thread. We treat the Main Thread as the “Kernel Mode” process, and sub-agents as “User Mode” processes.

4.1 The Orchestrator-Worker Pattern

The Orchestrator is a Skill (e.g., orchestrating-feature-development) running in the main thread. It holds the global state machine.

The Tool Restriction Boundary:

  • Orchestrator (Main Thread): Has Task, TodoWrite, Read. NO Edit or Write.

    • Constraint Enforcement: It physically cannot write code. It must delegate to a worker.

  • Worker (Sub-Agent): Has Edit, Write, Bash. NO Task.

    • Constraint Enforcement: It physically cannot delegate. It must work.

4.2 Coordinator vs. Executor Models

We enforce a strict separation between agents that plan and agents that do. This manifests in two mutually exclusive execution models:

| Model | Skill | Role | Tools | Best For |
|---|---|---|---|---|
| Coordinator | `orchestrating-*` | Spawns specialists | Task | Complex, multi-phase features requiring parallelization. |
| Executor | `executing-plans` | Implements directly | Edit, Write | Tightly coupled tasks requiring frequent human oversight. |

Key Insight: An agent cannot be both. If it has the Task tool (Coordinator), it is stripped of Edit permissions to prevent “doing it yourself.” If it has Edit permissions (Executor), it is stripped of Task permissions to prevent delegation loops.

4.3 The Standard 16-Phase Orchestration Template

All complex workflows follow a rigorous 16-phase state machine to ensure consistency.

graph LR
    Start([User Req]) --> Setup[Setup]
    Setup --> Discovery[Discovery]
    Discovery --> Gate1{⛔ Gate}
    Gate1 -- "Context >85%" --> Block1[/Require /compact/]
    Gate1 -- "OK" --> Design[Design]
    Design --> Check1{⏸ Check}
    Check1 -- "OK" --> Impl[Impl]
    Impl --> Gate2{⛔ Gate}
    Gate2 -- "Context >85%" --> Block2[/Require /compact/]
    Gate2 -- "OK" --> Test[Test]
    Test --> Gate3{⛔ Gate}
    Gate3 -- "OK" --> Completion[Done]
    Completion --> End([Complete])

Compaction Gate (⛔):

We enforce token hygiene programmatically. Before entering heavy execution phases (3, 8, 13), the system checks context usage:

  • < 75%: Proceed.

  • 75-85%: Warning (Should compact).

  • > 85%: Hard Block. The system refuses to spawn new agents until precompact-context.sh runs.
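A minimal sketch of the gate logic, assuming the orchestrator can read current context usage as a percentage. The thresholds mirror the list above; the function name and messages are illustrative:

```typescript
// Hypothetical compaction-gate check; thresholds mirror the prose above.
type GateResult =
  | { action: "proceed" }
  | { action: "warn"; message: string }
  | { action: "block"; message: string };

export function checkCompactionGate(contextUsedPct: number): GateResult {
  if (contextUsedPct < 75) return { action: "proceed" };
  if (contextUsedPct <= 85) {
    return { action: "warn", message: "Context above 75%: compaction recommended." };
  }
  // Hard block: refuse to spawn new agents until precompact-context.sh has run.
  return {
    action: "block",
    message: "Context above 85%: run precompact-context.sh before spawning agents.",
  };
}
```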

| Phase | Name | Purpose |
|---|---|---|
| 1 | Setup | Worktree creation, output directory, MANIFEST.yaml |
| 2 | Triage | Classify work type, select phases to execute |
| 3 | Codebase Discovery | Two-phase discovery: explore patterns, detect technologies |
| 4 | Skill Discovery | Map technologies to skills |
| 5 | Complexity | Technical assessment, execution strategy |
| 6 | Brainstorming | Design refinement with human-in-loop |
| 7 | Architecting Plan | Technical design and task decomposition |
| 8 | Implementation | Code development |
| 9 | Design Verification | Verify implementation matches plan |
| 10 | Domain Compliance | Domain-specific mandatory patterns |
| 11 | Code Quality | Code review for maintainability |
| 12 | Test Planning | Test strategy and plan creation |
| 13 | Testing | Test implementation and execution |
| 14 | Coverage Verification | Verify test coverage meets threshold |
| 15 | Test Quality | No low-value tests, correct assertions |
| 16 | Completion | Final verification, PR, cleanup |

4.4 State Management & Locking

To survive session resets and context exhaustion, state is persisted to disk.

  1. The Process Control Block (MANIFEST.yaml):
    Located in .claude/.output/features/{id}/. Tracks current phase, active agents, and validation status. This allows the orchestration to be “resumed” seamlessly across different chat sessions.

  2. Distributed File Locking:
When multiple developer agents run in parallel, they use a lockfile mechanism (.claude/locks/{agent}.lock) to prevent race conditions on shared source files.
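A minimal sketch of the lockfile mechanism under the .claude/locks/{agent}.lock convention above. The lock payload and helper names are assumptions:

```typescript
// Hypothetical advisory lockfile helper for parallel developer agents.
import { mkdirSync, writeFileSync, rmSync } from "node:fs";
import { join } from "node:path";

const LOCK_DIR = ".claude/locks";

export function acquireLock(agent: string): boolean {
  mkdirSync(LOCK_DIR, { recursive: true });
  const lockPath = join(LOCK_DIR, `${agent}.lock`);
  try {
    // "wx" fails if the file already exists, giving an atomic create-or-fail.
    writeFileSync(lockPath, JSON.stringify({ agent, pid: process.pid }), { flag: "wx" });
    return true;
  } catch {
    return false; // another agent already holds the lock
  }
}

export function releaseLock(agent: string): void {
  rmSync(join(LOCK_DIR, `${agent}.lock`), { force: true });
}
```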

4.5 The Five-Role Development Pattern

Real-world application of this architecture is demonstrated in our /feature, /integration, and /capability workflows (derived from orchestrating-multi-agent-workflows), which utilize a specialized five-role assembly line. This pattern ensures that distinct cognitive modes (designing, coding, critiquing, planning, testing) remain isolated and unpolluted.

Role

Agent

Responsibility

Output

Specialized Lead

*-lead

Architecture & Strategy. Decomposes requirements into atomic tasks. Does not write code.

Architecture Plan (JSON)

Specialized Developer

*-developer

Implementation. Executes specific sub-tasks from the plan. Focuses purely on logic.

Source Code

Specialized Reviewer

*-reviewer

Compliance. Validates code against specs and patterns. Rejects non-compliant work.

Review Report

Test Lead

test-lead

Strategy. Analyzes the implementation to determine what needs testing (Unit vs E2E vs Integration).

Test Plan

Specialized Tester

*-tester

Verification. Writes and runs the tests defined by the Test Lead.

Test Cases

The Workflow:

sequenceDiagram
    participant User
    participant Orch as Orchestrator
    participant Lead
    participant Dev
    participant Reviewer
    participant TestLead
    participant Tester

    User->>Orch: "Add Feature X"
    Orch->>Lead: "Design X"
    Lead-->>Orch: Architecture Plan

    loop Implementation Cycle
        Orch->>Dev: "Implement Task 1"
        Dev-->>Orch: Code
        Orch->>Reviewer: "Review Task 1"
        Reviewer-->>Orch: Approval/Rejection
    end

    Orch->>TestLead: "Plan Tests for X"
    TestLead-->>Orch: Test Strategy

    loop Verification Cycle
        Orch->>Tester: "Execute Test Suite"
        Tester-->>Orch: Pass/Fail
    end

This specialization prevents the “Jack of All Trades” failure mode, where a single agent compromises architectural integrity to make a test pass, or skips testing to finish implementation.

4.6 Orchestration Skills: The Coordination Infrastructure

The 16-phase template defines what happens; a family of orchestration skills defines how agents achieve autonomous completion without human intervention at every step.

The Iteration Problem

Without explicit termination signals, agents either exit prematurely (“I think I’m done”) or loop infinitely (retrying the same fix). The iterating-to-completion skill solves this with three mechanisms:

  1. Completion Promises: An explicit string (e.g., ALL_TESTS_PASSING, IMPLEMENTATION_COMPLETE) that the agent outputs only when success criteria are met. The orchestrator pattern-matches for this signal—no fuzzy interpretation.

  2. Scratchpads: A persistent file (.claude/.output/scratchpad-{task}.md) where agents record what they accomplished, what failed, and what to try next. Each iteration reads the scratchpad first, preventing the “Groundhog Day” failure where agents repeat the same failed approach.

  3. Loop Detection: If three consecutive iterations produce outputs with >90% string similarity (e.g., “Fixed auth.ts – TypeError” three times), the system detects a stuck state and escalates rather than burning tokens.
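A minimal sketch of the loop-detection mechanism. The similarity metric used here (a Dice coefficient over character bigrams) is an illustrative stand-in; the platform's actual metric is not specified above:

```typescript
// Hypothetical stuck-loop detector: flag three consecutive near-identical outputs.
function bigrams(s: string): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) grams.add(s.slice(i, i + 2));
  return grams;
}

/** Dice coefficient over character bigrams, in [0, 1]. */
function similarity(a: string, b: string): number {
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.size === 0 && gb.size === 0) return 1;
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  return (2 * shared) / (ga.size + gb.size);
}

export function isStuck(recentOutputs: string[], threshold = 0.9): boolean {
  if (recentOutputs.length < 3) return false;
  const [a, b, c] = recentOutputs.slice(-3);
  // Stuck if each adjacent pair of the last three outputs exceeds the similarity threshold.
  return similarity(a, b) > threshold && similarity(b, c) > threshold;
}
```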

The Persistence Problem

Complex features span multiple sessions. Context windows exhaust. Sessions crash. The persisting-agent-outputs and persisting-progress-across-sessions skills provide the external memory:

  • Discovery Protocol: When an agent spawns, it doesn’t guess where to write. It follows a deterministic protocol: check for OUTPUT_DIRECTORY in the prompt → find recent MANIFEST.yaml files → create a new directory only if none exist. This ensures all agents in a workflow write to the same location.

  • Blocked Agent Routing: When an agent returns status: "blocked" with a blocked_reason (e.g., missing_requirements, architecture_decision), the orchestrator consults a routing table to determine the next action—escalate to user, spawn a different agent, or abort. No improvisation.

  • Context Compaction: As workflows progress, completed phase outputs are summarized (full content archived to disk) to prevent “context rot”—the degradation in model performance as the window fills with stale information.
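A minimal sketch of the structured agent return and the blocked-agent routing table described above. Only status and blocked_reason come from the text; the other fields and the concrete next actions are assumptions:

```typescript
// Hypothetical structured agent return plus blocked-reason routing table.
type AgentResult = {
  status: "complete" | "blocked" | "failed";
  blocked_reason?: "missing_requirements" | "architecture_decision";
  summary: string;      // assumption: short human-readable summary
  artifacts?: string[]; // assumption: file paths written by the agent
};

type NextAction = "escalate_to_user" | "spawn_lead_agent" | "abort";

const BLOCKED_ROUTING: Record<NonNullable<AgentResult["blocked_reason"]>, NextAction> = {
  missing_requirements: "escalate_to_user", // only the human can supply missing intent
  architecture_decision: "spawn_lead_agent", // a lead agent can resolve design questions
};

export function routeBlockedAgent(result: AgentResult): NextAction {
  if (result.status !== "blocked" || !result.blocked_reason) return "abort";
  return BLOCKED_ROUTING[result.blocked_reason];
}
```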

The Parallelization Problem

When six tests fail across three files, sequential debugging wastes time. The dispatching-parallel-agents skill identifies independent failures—those that can be investigated without shared state—and spawns concurrent agents:

Agent 1 (frontend-tester) → Fix auth-abort.test.ts (3 failures)
Agent 2 (frontend-tester) → Fix batch-completion.test.ts (2 failures)
Agent 3 (frontend-tester) → Fix race-conditions.test.ts (1 failure)

All three run simultaneously. When they return, the orchestrator verifies no conflicts (agents edited different files) and integrates the fixes. Time to resolution: 1x instead of 3x.
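A minimal sketch of concurrent dispatch with a post-hoc conflict check, assuming a spawnAgent helper that wraps the platform's Task tool (the helper and its return shape are assumptions):

```typescript
// Hypothetical concurrent dispatch of independent test-fix tasks.
// spawnAgent is an assumed helper wrapping the platform's Task tool.
declare function spawnAgent(
  agent: string,
  task: string
): Promise<{ files: string[]; passed: boolean }>;

export async function dispatchParallelFixes(): Promise<boolean> {
  const tasks = [
    { agent: "frontend-tester", task: "Fix auth-abort.test.ts (3 failures)" },
    { agent: "frontend-tester", task: "Fix batch-completion.test.ts (2 failures)" },
    { agent: "frontend-tester", task: "Fix race-conditions.test.ts (1 failure)" },
  ];

  // All three agents run concurrently in isolated contexts.
  const results = await Promise.all(tasks.map((t) => spawnAgent(t.agent, t.task)));

  // Integration check: no two agents may have touched the same file.
  const touched = results.flatMap((r) => r.files);
  const conflicts = touched.filter((f, i) => touched.indexOf(f) !== i);
  if (conflicts.length > 0) throw new Error(`Conflicting edits: ${conflicts.join(", ")}`);

  return results.every((r) => r.passed);
}
```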

Skill Composition

These skills compose hierarchically:

orchestrating-feature-development (16-phase workflow)
    ├── persisting-agent-outputs (shared workspace)
    ├── persisting-progress-across-sessions (cross-session resume)
    ├── iterating-to-completion (intra-task loops)
    └── dispatching-parallel-agents (concurrent debugging)

The orchestrator invokes persisting-agent-outputs at startup to establish the workspace, uses iterating-to-completion within phases when agents need retries, and calls dispatching-parallel-agents when multiple independent failures are detected. State flows through MANIFEST.yaml, enabling any session to resume from the last checkpoint.

5. The Runtime: Deterministic Hooks

While Skills provide guidance, Hooks provide enforcement. We utilize the Claude Code lifecycle events (PreToolUse, PostToolUse, Stop) to inject deterministic logic that the LLM cannot bypass.

5.1 Defense in Depth: Eight-Layer Enforcement

Any single enforcement mechanism can fail. An agent rationalizes around a skill instruction. A hook has an edge case. A state file gets corrupted. Our architecture assumes failure at every layer and compensates with overlapping enforcement.

| Layer | Description |
|---|---|
| LAYER 1: CLAUDE.md | Full ruleset loaded at session start. Establishes norms. |
| LAYER 2: Skills | Procedural workflows invoked on-demand. "How to do X." |
| LAYER 3: Agent Definitions | Role-specific behavior, mandatory skill lists, output formats. |
| LAYER 4: UserPromptSubmit Hooks | Inject reminders every prompt. Gateway → library skill pattern. |
| LAYER 5: PreToolUse Hooks | Block BEFORE action. Agent-first enforcement, compaction gates. |
| LAYER 6: PostToolUse Hooks | Validate agent work before completion. Output location, skill compliance. |
| LAYER 7: SubagentStop Hooks | Block premature exit. Quality gates, iteration limits, feedback loops. |
| LAYER 8: Stop Hooks | Block premature exit. Quality gates, iteration limits, feedback loops. |

Example

A developer agent writes code without spawning a reviewer. How many layers catch this?

| Layer | Mechanism | Catches? |
|---|---|---|
| 3 | Agent definition says "reviewer validates your work" | Rationalized |
| 6 | `track-modifications.sh` creates feedback-loop-state.json | State initialized |
| 8 | `feedback-loop-stop.sh` blocks exit until review phase passes | **Blocked** |
| 8 | `quality-gate-stop.sh` provides backup check | **Blocked** |

The agent ignored Layer 3 guidance. Layers 6 and 8 caught it anyway.

5.2 Agent-First Enforcement

The platform doesn’t suggest delegation—it *forces* it.
```bash
# PreToolUse hook: agent-first-enforcement.sh intercepts Edit/Write
#
# 1. Parse tool_input.file_path
# 2. Determine the domain (backend, frontend, capability, tool)
# 3. Check whether a developer agent exists for that domain
# 4. If yes → BLOCK with: "Spawn {domain}-developer instead"
```
**Before (rationalized):**
```
User: "Fix the authentication bug in login.go"
Claude: "I'll just make this quick edit myself..."
→ Writes buggy code, no review, no tests
```
**After (enforced):**
```
User: "Fix the authentication bug in login.go"
Claude: Attempts Edit on login.go
→ BLOCKED: "backend-developer exists. Spawn it instead of editing directly."
Claude: Spawns backend-developer with clear task
→ Agent follows TDD, gets reviewed, tests pass
```
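A minimal sketch of the domain-detection logic the hook performs, written here in TypeScript for readability; the real hook is a shell script, and the path-to-domain mapping below is an illustrative assumption:

```typescript
// Hypothetical agent-first enforcement: map an edited file to a domain and
// block the edit if a specialized developer agent exists for that domain.
const DOMAIN_PATTERNS: Array<{ pattern: RegExp; domain: string }> = [
  { pattern: /\.(go|py)$/, domain: "backend" },   // illustrative mapping
  { pattern: /\.(tsx?|css)$/, domain: "frontend" },
  { pattern: /^\.claude\/tools\//, domain: "tool" },
];

const DEVELOPER_AGENTS = new Set(["backend-developer", "frontend-developer", "tool-developer"]);

export function checkAgentFirst(filePath: string): { allow: boolean; reason?: string } {
  const match = DOMAIN_PATTERNS.find((d) => d.pattern.test(filePath));
  if (!match) return { allow: true }; // no specialized domain: direct edits permitted
  const agent = `${match.domain}-developer`;
  if (!DEVELOPER_AGENTS.has(agent)) return { allow: true };
  return { allow: false, reason: `${agent} exists. Spawn it instead of editing directly.` };
}
```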

5.3 The Three-Level Loop System

We architected three nested enforcement loops to guarantee quality. Critically, the limits for these loops are not hardcoded but defined in a central configuration file: .claude/config/orchestration-limits.yaml. This “Configuration as Code” approach allows us to tune system behavior (e.g., tightening retry limits for expensive models) without modifying the underlying shell scripts.

Level 1: Intra-Task Loop (Hook: iteration-limit-stop.sh)

  • Scope: Single agent.

  • Function: Prevents an agent from spinning endlessly on a single shell command.

  • Limit: Max 10 iterations (configurable).

Level 2: Inter-Phase Loop (Hook: feedback-loop-stop.sh)

  • Scope: The Implementation -> Review -> Test cycle.

  • Function: Enforces that code cannot be marked complete until independent Reviewer and Tester agents have passed it.

  • Logic:

    1. Listens for Edit/Write tools.

    2. Sets a “Dirty Bit” in feedback-loop-state.json.

    3. Intercepts Stop event.

    4. If Dirty Bit is set and tests_passed != true, BLOCK EXIT.

    5. Returns JSON: {"decision": "block", "reason": "Tests failed. You must fix and retry."}.
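A minimal sketch of the Stop-event decision, assuming the dirty bit and test status live in feedback-loop-state.json as described above (the exact field names are assumptions):

```typescript
// Hypothetical Stop-hook decision based on feedback-loop-state.json.
import { readFileSync, existsSync } from "node:fs";

const STATE_FILE = "feedback-loop-state.json";

type LoopState = { dirty?: boolean; tests_passed?: boolean }; // field names assumed

export function stopDecision(): { decision: "allow" | "block"; reason?: string } {
  if (!existsSync(STATE_FILE)) return { decision: "allow" };
  const state: LoopState = JSON.parse(readFileSync(STATE_FILE, "utf8"));
  if (state.dirty && state.tests_passed !== true) {
    return { decision: "block", reason: "Tests failed. You must fix and retry." };
  }
  return { decision: "allow" };
}
```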

Level 3: Orchestrator Loop (Skill Logic)

  • Scope: The 16-phase workflow.

  • Function: Re-invokes entire phases if macro-goals are missed.

sequenceDiagram
    participant Agent as Agent (User Mode)
    participant Hook as Hook (Kernel Mode)
    participant State as State File

    Agent->>Hook: Tool Use (Edit/Write)
    Note over Hook: PreToolUse Event
    Hook->>State: Set Dirty Bit (Needs Review)
    Hook-->>Agent: Allow Execution

    Agent->>Agent: Completes Task

    Agent->>Hook: Attempt Exit (Stop)
    Note over Hook: Stop Event
    Hook->>State: Check Status
    State-->>Hook: Tests Passed = False
    Hook-->>Agent: BLOCK: {"decision": "block"}
    Note over Agent: Forced to stay in loop

5.4 Ephemeral vs. Persistent State

We employ a dual-state architecture to ensure resilience:

  • Ephemeral State (Hooks): Stored in feedback-loop-state.json. Used for Runtime Enforcement (blocking exit, tracking dirty bits). Cleared on session restart.

  • Persistent State (Agents): Stored in MANIFEST.yaml. Used for Workflow Coordination (resuming tasks, tracking phases). Survives session restarts.

This duality ensures that if a session crashes (losing ephemeral state), the workflow can still be resumed from the last checkpoint using the persistent manifest.

5.5 The Escalation Advisor

When an agent gets stuck in a loop (e.g., repeating the same failing test fix), standard retries fail. We implemented an Out-of-Band Advisor.

  • Trigger: Stop event blocked > 3 times.

  • Action: The hook invokes an external LLM (Gemini or Codex) with the session transcript.

  • Prompt: “Analyze this loop. Why is the agent stuck? Provide a 1-sentence hint.”

  • Result: The hint is injected into the main context as a system message, breaking the cognitive deadlock.

5.6 Output Location Enforcement

To prevent workspace clutter, a SubagentStop hook enforces filesystem hygiene.

  • Logic: git ls-files --others

  • Check: Are there Markdown files outside .claude/.output/?

  • Action: Block. Force the agent to move files to the structured directory.
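A minimal sketch of the check itself, built on the `git ls-files --others` logic above and treating any untracked Markdown outside .claude/.output/ as a violation:

```typescript
// Hypothetical output-location check built on `git ls-files --others`.
import { execSync } from "node:child_process";

export function findMisplacedMarkdown(): string[] {
  const untracked = execSync("git ls-files --others", { encoding: "utf8" })
    .split("\n")
    .filter(Boolean);
  // Violation: Markdown artifacts written anywhere other than the structured output directory.
  return untracked.filter((f) => f.endsWith(".md") && !f.startsWith(".claude/.output/"));
}

// A SubagentStop hook would block exit while this list is non-empty,
// forcing the agent to move the files into .claude/.output/.
```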

5.7 Related Work & Architectural Evolution

Our architecture synthesizes and extends three foundational patterns from the Claude ecosystem:

  1. Ralph Wiggum (Geoffrey Huntley): A “dumb” while loop that restarts an agent until completion. We formalized this into the Intra-Task Loop but added configuration, loop detection, and safety guards.

  2. Continuous-Claude-v3 (parcadei): A “persistence” pattern using YAML handoffs to survive session resets. We adopted this for our Persistent State (MANIFEST.yaml) but integrated it with distributed locking and hooks.

  3. Superpowers (Jesse Vincent): An agentic skills framework emphasizing TDD, YAGNI, and sub-agent driven development. We adopted its “Brainstorming” and “Writing Plans” skills as the foundation for our Setup and Discovery phases. Jesse Vincent is absolutely brilliant and his work inspired our own.

| Feature | Ralph Wiggum | Continuous-Claude-v3 | Superpowers | Praetorian Development Platform |
|---|---|---|---|---|
| Scope | Single Agent Loop | Cross-Session Handoff | Skill Framework | Multi-Agent Orchestration |
| State | None (Loop only) | YAML Handoffs | Context-based | Dual (Ephemeral + Persistent) |
| Control | Prompt-based | Prompt-based | Skill-based | Deterministic Hooks |
| Feedback | None | None | Human-in-loop | Inter-Phase Loops + Escalation Advisor (Independent LLM) |

Our unique contribution is the Inter-Phase Feedback Loop (Implementation → Review → Test), which enforces quality gates across multiple specialized agents, moving beyond single-agent iteration.

6. The Supply Chain: Lifecycle Management

Managing 350+ prompts and 39+ specialized agents leads to entropy. We treat these assets as software artifacts, managed by dedicated TypeScript CLIs.

flowchart LR
    Dev[Developer] -->|Create/Edit| Draft[Draft Skill/Agent]
    Draft -->|Run| CLI[TypeScript CLI]

    subgraph "The Gauntlet (Audit System)"
        CLI --> Phase1[Structure Check]
        Phase1 --> Phase2[Semantic Review]
        Phase2 --> Phase3[Referential Integrity]
    end

    Phase3 -->|Pass| Repo[Committed Artifact]
    Phase3 -->|Fail| Dev

    Repo -->|Load| Runtime[Platform Runtime]

6.1 The Agent Manager

Just as code goes through CI/CD, our agents undergo rigorous validation via the Agent Manager (.claude/commands/agent-manager.md).

  • The 9-Phase Agent Audit: Every agent definition must pass checks for:

    • Leanness: Strictly <150 lines (or <250 for architects).

    • Discovery: Valid “Use when” triggers for the Task tool.

    • Skill Integration: Proper Gateway usage instead of hardcoded paths.

    • Output Standard: JSON schema compliance for structured handoffs.

This ensures that the “worker nodes” in our system remain lightweight and interchangeable.

6.2 The Skill Manager & TDD

We apply Test-Driven Development to prompt engineering, managed by the Skill Manager.

The 28-Phase Skill Audit System:
We do not allow “unverified” skills. Every skill must pass a 28-point automated audit before commit:

  • Structural: Frontmatter validity, file size (<500 lines).

  • Semantic: Description signal-to-noise ratio.

  • Referential: Integrity of all Read() paths and Gateway linkages.

The Hybrid Audit Pattern:
Our audits utilize a Cyborg Approach, combining:

  1. Deterministic CLI Checks: Using TypeScript ASTs to verify file structures, link validity, and syntax.

  2. Semantic LLM Review: Using a “Reviewer LLM” to judge the clarity, tone, and utility of the prompt text.

This combination ensures technical correctness and human utility, something neither a linter nor an LLM can achieve alone.
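A minimal sketch of the deterministic half of the audit: size, frontmatter, and Read() reference checks over a SKILL.md file. The specific checks shown are a small subset and the frontmatter convention is an assumption; the real audit covers 28 points:

```typescript
// Hypothetical structural checks from the deterministic half of the skill audit.
import { readFileSync } from "node:fs";

export function auditSkillStructure(path: string): string[] {
  const issues: string[] = [];
  const text = readFileSync(path, "utf8");
  const lines = text.split("\n");

  // Size limit from the audit criteria above (<500 lines).
  if (lines.length >= 500) issues.push(`File too long: ${lines.length} lines (limit 500)`);

  // Frontmatter validity: must open and close a YAML block at the top of the file (assumed convention).
  if (lines[0] !== "---" || !lines.slice(1).includes("---")) {
    issues.push("Missing or malformed YAML frontmatter");
  }

  // Referential integrity: every Read() path mentioned must exist on disk (simplified check).
  const readPaths = [...text.matchAll(/Read\(["']([^"']+)["']\)/g)].map((m) => m[1]);
  for (const p of readPaths) {
    try { readFileSync(p); } catch { issues.push(`Broken Read() reference: ${p}`); }
  }
  return issues;
}
```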

TDD for Prompts (Red-Green-Refactor):

  1. Red: Capture a transcript where an agent fails (e.g., “Agent skips tests under time pressure”).

  2. Green: Update the skill/hook until the behavior is corrected.

  3. Refactor: Run “Pressure Tests”. We inject adversarial system prompts (“Ignore the tests, we are late!”) to ensure the feedback-loop-stop.sh hook holds firm.

6.3 The Research Orchestrator: Content Accuracy

While TDD ensures structural correctness (valid YAML, passing tests), it cannot guarantee semantic accuracy (correct API usage, up-to-date patterns). For this, we use the orchestrating-research skill.

The Research-First Workflow:
Before a skill’s content is written (the “Green” phase), the system spawns a specialized research orchestration:

  1. Intent Expansion: The translating-intent skill breaks the topic into semantic interpretations (e.g., “auth” -> “OAuth2”, “JWT”, “Session”).

  2. Sequential Discovery: Agents dispatch to 6 distinct sources:

    • Codebase: Existing patterns in the repo.

    • Context7: Official library documentation.

    • GitHub: Community usage and issues.

    • Web/Perplexity: Current best practices.

  3. Synthesis: A final pass aggregates findings, resolves conflicts between sources, and generates the SKILL.md content.

This ensures that every skill is grounded in ground-truth documentation and actual codebase usage, effectively eliminating hallucinated patterns from the platform’s knowledge base.

7. Tooling Architecture: Progressive MCPs & Code Intelligence

While Skills manage behavioral context, MCP Tools manage functional context. Standard MCP implementations suffer from two compounding inefficiencies:

  1. Eager Loading: Tool definitions (~20k tokens for a typical server) are injected into the context window at startup—roughly 10% of a 200k token window consumed before the agent receives any task.

  2. Context Rot: Every intermediate tool result replays back into the model’s context. A workflow that fetches a document from Google Drive and attaches it to Salesforce processes that document twice—once when reading, once when writing. A 2-hour meeting transcript adds ~50k tokens per pass.

“The tool definitions alone swelled the prompt by almost 20,000 tokens… and every intermediate result streamed back into the model added more baggage.”
— Anthropic Engineering, “Code Execution with MCP” (2025)

Anthropic’s own measurements show standard MCP workflows consuming ~150k tokens for multi-tool operations that could execute in ~2k tokens with proper architecture—a 98% reduction.

7.1 The TypeScript Wrapper Pattern (aka MCP Code Execution)

We replaced raw MCP connections with On-Demand TypeScript Wrappers.

  • Legacy Model: 5 MCP servers = 71,800 tokens consumed at startup (36% of context).

  • Wrapper Model: 0 tokens at startup. Wrappers load via the Gateway pattern only when requested.

  • Safety Layer: Wrappers enforce Zod schema validation on inputs and Response Filtering (truncation/summarization) on outputs, preventing “context flooding” from large API responses.

sequenceDiagram
    participant Agent
    participant Gateway
    participant Wrapper
    participant MCP as MCP Server

    Note over Agent, MCP: Session Start: 0 Tokens Loaded

    Agent->>Gateway: "I need to fetch a Linear issue"
    Gateway-->>Agent: Returns Path: .claude/tools/linear/get-issue.ts

    Agent->>Wrapper: Execute(issueId: "ENG-123")
    Note over Wrapper: 1. Zod Validation

    Wrapper->>MCP: Spawn Process & Request
    MCP-->>Wrapper: Large JSON Response (50kb)

    Note over Wrapper: 2. Response Filtering
    Wrapper-->>Agent: Optimized JSON (500b)

    Note over Agent, MCP: Process Ends. Memory Freed.
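A minimal sketch of one such wrapper, assuming a callMcpTool helper that spawns the underlying MCP server process; that helper and the Linear field names are assumptions, while the Zod input validation and response filtering mirror the safety layer described above:

```typescript
// Hypothetical on-demand wrapper, e.g. .claude/tools/linear/get-issue.ts
import { z } from "zod";

// Assumed helper that spawns the MCP server process and performs the request.
declare function callMcpTool(server: string, tool: string, args: unknown): Promise<unknown>;

const Input = z.object({ issueId: z.string().regex(/^[A-Z]+-\d+$/) });

// Response filtering: keep only the fields agents actually need; truncate large bodies.
const Output = z.object({
  id: z.string(),
  title: z.string(),
  state: z.string(),
  description: z.string().transform((d) => d.slice(0, 2000)),
});

export async function getIssue(raw: unknown) {
  const { issueId } = Input.parse(raw);                           // 1. Zod validation on inputs
  const response = await callMcpTool("linear", "get_issue", { id: issueId });
  return Output.parse(response);                                   // 2. Filter the large response
}
```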

7.2 Serena: Semantic Code Intelligence

While MCP wrappers solve tool definition bloat, code operations themselves present a far larger token sink. Standard agent workflows require reading entire files to understand structure, then performing grep-like searches that return irrelevant context.

The File-Reading Problem:

Consider an agent modifying a single function in a 2,000-line file:

  • Traditional Approach: Read full file (~8,000 tokens) → Find function via regex → Generate replacement → Write full file. For 5 related files, that’s ~40,000 tokens just for context.

  • Symbol-Level Approach: Query find_symbol("processPayment") → Returns only the function body (~200 tokens) → Edit at symbol level. Same 5-file task uses ~1,000 tokens.

We integrated Serena (an open-source MCP toolkit by Oraios with 19k+ GitHub stars), which provides IDE-like capabilities to agents via Language Server Protocol (LSP):

| Operation | Without Serena | With Serena |
|---|---|---|
| Find function definition | Read entire file(s), regex search | `find_symbol` → exact location |
| Trace call hierarchy | Read all potential callers | `find_referencing_symbols` → direct graph |
| Insert new method | Read file, string manipulation | `insert_after_symbol` → surgical placement |
| Navigate dependencies | Grep + manual file traversal | `find_symbol` on imports → semantic resolution |

Why This Matters at Scale:

"Efficient operations are not only useful for saving costs, but also for generally improving the generated code's quality. This effect may be less pronounced in very small projects, but often becomes of crucial importance in larger ones."

Our codebase contains ~530k lines across 32 modules. Without semantic operations, architectural analysis tasks would consume entire context windows just loading files. With Serena, agents navigate the same codebase using a fraction of the tokens.

Performance Optimization: We implemented a custom Connection Pool architecture that maintains warm LSP processes, reducing query latency from ~3s cold-start to ~2ms warm, enabling high-frequency code queries during architectural analysis without process spawn overhead.

8. Infrastructure Integration: Zero-Trust Secrets

Injecting secrets (AWS keys, Database credentials) into the LLM context is a critical security vulnerability. We implemented a Just-in-Time (JIT) Injection architecture using 1Password.

8.1 The run-with-secrets Wrapper

We do not give agents API keys. We give them a tool: 1password.run-with-secrets.

Configuration (.claude/tools/1password/lib/config.ts):

export const DEFAULT_CONFIG = {
  account: "praetorianlabs.1password.com",
  serviceItems: {
    "aws-dev": "op://Private/AWS Key/credential",
    "ci-cd": "op://Engineering/CI Key/credential",
  },
};

Execution Flow

sequenceDiagram
    participant LLM as LLM Context
    participant Agent
    participant Tool as Wrapper
    participant OP as 1Password CLI
    participant AWS as AWS CLI

    LLM->>Agent: "List S3 Buckets"
    Agent->>Tool: run_with_secrets("aws s3 ls")

    rect rgb(200, 255, 200)
    Note right of Tool: SECURE ENCLAVE (Child Process)
    Tool->>OP: Request "AWS_ACCESS_KEY"
    OP-->>Tool: Inject as ENV VAR
    Tool->>AWS: Execute Command
    AWS-->>Tool: Output: "bucket-a, bucket-b"
    end

    Tool-->>Agent: Return Output
    Agent-->>LLM: "Here are the buckets..."

    Note over LLM: Secret never entered Context

  1. Agent requests: run_with_secrets("aws s3 ls", { envFile: ".secrets.env" })

  2. Tool Wrapper intercepts.

  3. Wrapper executes: op run --env-file=".claude/tools/1password/secrets.env" -- aws s3 ls

  4. Security Guarantee: The secret exists only in the child process environment variables. It is never printed to stdout, never logged, and never enters the LLM context window.
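A minimal sketch of the wrapper's core, following the op run invocation shown in step 3. The argument handling is simplified and the helper name is an assumption; only command output, never the secret, is returned to the agent:

```typescript
// Hypothetical core of run_with_secrets: secrets exist only in the child process env.
import { execFileSync } from "node:child_process";

export function runWithSecrets(
  command: string[],
  envFile = ".claude/tools/1password/secrets.env"
): string {
  // `op run` resolves op:// references from the env file and injects them into the
  // child process environment; nothing is echoed back into the LLM context.
  return execFileSync("op", ["run", `--env-file=${envFile}`, "--", ...command], {
    encoding: "utf8",
  });
}

// runWithSecrets(["aws", "s3", "ls"]) → bucket listing text, with no credentials in the transcript.
```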

9.0 Horizontal Scaling Architecture

Traditional software development is constrained by human limitations. In this architecture, the constraint becomes the development ecosystem itself: developers typically work on local hardware (laptop RAM/CPU), which limits analysis to 3-5 concurrent sessions. To remove this bottleneck, we decoupled the Control Plane (the laptop) from the Execution Plane (the cloud).

  • Local: The engineer’s laptop uses DevPod to deploy a Docker instance loaded with a development environment that includes, among other things, Cursor, Claude Code, and the GitHub repository under development (lightweight).
  • Remote: The actual development environment (the “DevPod”) runs in an ephemeral Docker container on an AMI; both building and deploying occur in the cloud.
  • Bridge: A secure SSH tunnel forwards the remote Cursor terminal back to the developer’s laptop.

9.1 Devpod, Docker, and AWS AMIs

Because the heavy lifting happens in the cloud, engineers can spawn effectively unlimited parallel DevPods:

  • Isolation: Each feature or threat model runs in its own isolated container.

  • Resources: We can provision 128GB RAM instances for massive monorepo analysis, impossible on a laptop.

  • Security: Code never leaves the VPC. The laptop only sees the terminal pixels/text stream.

10.0 Roadmap: Beyond Orchestration

The current platform achieves Level 3 Autonomy (Orchestrated). Our roadmap targets Level 5 (Self-Evolving).

10.1 Heterogeneous LLM Routing

No single model excels at every task. The platform utilizes a routing matrix to send specific tasks to the models best architected to handle them. This “Heterogeneous Orchestration” optimizes for both performance and cost.

This routing is managed by a semantic decision layer that uses small, fast models as routers. These routers evaluate the user’s intent and select the appropriate specialist agent, ensuring that expensive reasoning models are reserved for logic, while high-throughput multimodal models handle visual and data-heavy tasks.

| Development Task | Optimal Model Architecture | Technical Advantage |
|---|---|---|
| Logic & Reasoning | DeepSeek-R1 / V3 | Reinforcement Learning (RL)-based chain-of-thought for complex inference. |
| Document Processing | DeepSeek OCR 2 | 10x token efficiency utilizing visual causal flow for structural preservation. |
| UI/UX & Frontend | Kimi 2.5 | Native MoonViT architecture; enables autonomous visual debugging loops. |
| Parallel Research | Kimi 2.5 Swarm | PARL-driven optimization of the critical path across up to 100 agents. |
| Massive Repository Mapping | DeepSeek-v4 Engram | O(1) constant-time lookup and tiered KV cache for million-token context. |

10.2 Self-Annealing & Auto-Correction (Q1 2026)

Current autonomous systems are brittle: when an agent fails due to ambiguity in a skill or a loophole in a hook, the human must debug the prompt engineering. We are closing this loop by enabling the platform to debug and patch itself.

The Concept:
When an agent fails a quality gate (e.g., feedback-loop-stop.sh) more than 3 times, or when an orchestrator detects a pattern of tool misuse, the system triggers a Self-Annealing Workflow.

The Mechanism:
Instead of returning the error to the user, the platform spawns a Meta-Agent (an infrastructure engineer agent) with permissions to modify the .claude/ directory.

  1. Diagnosis: The Meta-Agent reads the session transcript and the failed agent’s definition. It identifies the “Rationalization Path”—the specific chain of thought the agent used to bypass instructions (e.g., “I’ll skip the test because it’s a simple change”).

  2. Patching:

    • Skill Annealing: It modifies the relevant SKILL.md (e.g., developing-with-tdd) to add an explicit “Anti-Pattern” entry: “If you are thinking ‘this is simple enough to skip tests’, YOU ARE WRONG. Simple changes cause 40% of outages.”

    • Hook Hardening: If a hook failed to block a violation, it updates the bash script logic to catch the edge case.

    • Agent Refinement: It updates the agent’s prompt (via agent-manager update) to clarify the ambiguous instruction.

  3. Verification: It runs the pressure-testing-skill-content skill against the patched artifact to verify it now blocks the previous failure mode.

  4. Pull Request: The Meta-Agent creates a PR with the infrastructure fix, labeled [Self-Annealing], for human review.

This transforms the platform from a static set of rules into an antifragile system that gets stronger with every failure. Every time an agent hallucinates or cuts a corner, the system learns to prevent that specific behavior forever, effectively “annealing” the soft prompts into hard constraints over time.

10.3 Agent-to-Agent Negotiation (Q2 2026)

Currently, agents follow rigid JSON schemas. Future agents will negotiate API contracts dynamically:

  • “I need X, can you provide it?”

  • “No, but I can provide Y which is similar.”

  • “Agreed, proceeding with Y.”

10.4 Self-Healing Infrastructure (Q2 2026)

Agents will gain the ability to debug their own runtime environment:

  • Detecting “Context Starvation” and auto-archiving memory.

  • Identifying “Tool Hallucination” and generating new Zod schemas to fix it.

11.0 References

Anthropic Official Guidance:

Community & Open Source:

Standards & Protocols:

12.0 Conclusion

The Praetorian Development Platform achieves escape velocity not by “improving the model,” but by constraining the runtime. By architecting a system where agents are ephemeral, context is curated via gateways, and workflows are enforced by deterministic hooks, we transform AI into a deterministic component of the software supply chain.

12.1 Recap

By architecting a system where:

  1. Agents are ephemeral and stateless,

  2. Context is strictly curated via Gateways,

  3. Workflows are enforced by deterministic Kernel hooks,

  4. Tools are progressively loaded and type-safe, and

  5. Secrets never touch the context…

We transform the LLM from a “creative assistant” into a deterministic component of the software supply chain. This allows us to scale development throughput linearly with compute, untethered by the cognitive limits of human attention.

12.2 Constraint Forces Innovation

For the next 12 weeks, we are open sourcing one attack module per week as part of our “The 12 Caesars” marketing campaign. At some point, I’ll sit down again and describe our AI attack platform architecture. As you will see, we apply similar principles to that platform as we do to development. This lets us work around the constraints of our capital-light footprint, which would ordinarily limit our ability to execute. As DeepSeek is proving to the frontier labs, I’m not sure the expensive way is the best way anymore. The problem with capital is that it allows you to do a lot of stupid things very fast. We do not have that luxury. We must be clever instead.

Build the machine, that builds the machine, that enables a team, to hack all the things.

About the Authors

Nathan Sportsman

As the Founder and CEO of Praetorian, Nathan is responsible for championing the vision, maintaining the culture, and setting the direction of the company. Prior to bringing in professional management to help run the company, Nathan managed the day-to-day operations of the firm from its 2010 beginnings as a bootstrapped start-up to its current YoY hyper-growth. Since Praetorian’s founding, Nathan has successfully instilled a “customer first” mentality that has become part of the DNA of the company, which has led to unprecedented customer satisfaction reviews as reflected in a historical net promoter score of 92%. This reputation for delivering value to the customer has resulted in a three-year growth rate of 214%. Under Nathan’s leadership, the company has earned national recognition on the Inc. 5000 list 8 times in a row, the Inc Best Work Places, the Cybersecurity 500, CIO Top 20, and locally on Austin’s “Fastest 50” growing firms.
