Technical Deep Dive — Live Production System

Warren's Autonomous Pipeline

Architecture, evolution, and the systems that make an AI agent ship production software with judgment — not just execution.

01 Pipeline State Machine — 6 stages, 67 routes, 43 SOPs
02 WWTD — Tony's reasoning engine as pipeline firmware
03 Koan 1 — Teaching the agent to see, not just reason
04 Product Judgment Gate — Layer 3 quality insertion
05 Evals System — 3-level quality enforcement (live data)
06 Memory Lifecycle — Persistent learning at scale
07 The Integrated Result — Production metrics
⚡ Live Production System — May 2026
🏗️

Warren's Autonomous Pipeline

A technical walkthrough of the complete planning-to-delivery pipeline: how product requirements flow through architecture, estimation, sprint planning, implementation, and delivery — and how Tony's reasoning engine, quality evals, and persistent memory transformed it from mechanical execution into judgment-driven autonomy.

67
OrgLoop Routes
28
Transforms
43
SOPs
6
Pipeline Stages
5
Calibrated Rubrics
86
Tony Judgment Corpus

Six stages: product idea → deployed software

The architectural insight: separate product planning (what the user experiences) from technical planning (how to build it), with a holistic architecture pass bridging the two. Technical reality applies back-pressure to product scope.

📋
Product Planning
Per-capability
🏛️
Architecture
Holistic gather
📐
Tech Estimation
Per-task
🗓️
Sprint Planning
Holistic gather
Implementation
Per-task → PR
🚀
Delivery
Deploy + E2E verify
Product Planning (per-capability): triage → [Product Analysis] → analyzed → [Estimation] → estimated → [Execution Plan] → planned → [Portfolio Approval] → approved
Architecture Planning (holistic, gather pattern): approved (all siblings) → [Architecture Planning] → architecture-planned → creates technical sub-issues with needs-estimation
Technical Estimation (per technical task): needs-estimation → [Technical Research + Estimation] → tech-estimated
Sprint Planning (holistic, gather pattern): tech-estimated (all siblings) → [Sprint Planning] → sprint-planned → needs-human → (human approves) → dev-approved
Implementation: dev-approved → [Implementation Dispatch] → in-dev → [Coding Agent] → PR opened → [Review + CI] → PR merged
Delivery: PR merged → [Deployment] → deployed → [E2E Verification] → verified
Label State Machine
triage analyzed estimated planned approved architecture-planned tech-estimated sprint-planned dev-approved in-dev deployed verified
needs-human needs-diagnosis needs-revision
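The happy path above can be expressed as a transition table. A minimal Python sketch (route and function names are illustrative; the real transitions are OrgLoop routes, not a dict):

```python
# Sketch of the label state machine's happy path. Route names are
# assumptions; production defines these as OrgLoop routes.
TRANSITIONS = {
    "triage": ("product-analysis", "analyzed"),
    "analyzed": ("estimation", "estimated"),
    "estimated": ("execution-plan", "planned"),
    "planned": ("portfolio-approval", "approved"),
    "approved": ("architecture-planning", "architecture-planned"),
    "needs-estimation": ("tech-estimation", "tech-estimated"),
    "tech-estimated": ("sprint-planning", "sprint-planned"),
    "sprint-planned": ("human-approval", "dev-approved"),
    "dev-approved": ("implementation-dispatch", "in-dev"),
}

def advance(label: str) -> str:
    """Return the next label on the happy path."""
    route, next_label = TRANSITIONS[label]
    return next_label

def walk(start: str) -> list:
    """Follow the happy path until a label with no outbound route."""
    path = [start]
    while path[-1] in TRANSITIONS:
        path.append(advance(path[-1]))
    return path
```

Note the break at architecture-planned: the product chain ends there because the architecture stage creates new technical sub-issues, which then walk their own chain from needs-estimation.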

The patterns that make it autonomous

Three patterns distinguish this from a simple CI/CD pipeline. These patterns emerged from running real client work — not designed up front.

🔄

Scatter-Gather

Product planning is per-capability (scatter). Architecture and sprint planning wait for all siblings to reach the gather point, then plan holistically. Prevents N conflicting technical plans.

expected_count on parent epic drives gather — the pipeline knows when all pieces have arrived.
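The gather check reduces to a counting guard. A sketch, assuming a simple issue shape (only the `expected_count` field comes from the doc):

```python
# Sketch of the gather condition: a holistic stage (architecture,
# sprint planning) fires only when every expected sibling carries
# the gather label. Issue/parent shapes are illustrative.

def gather_ready(parent: dict, siblings: list, gather_label: str) -> bool:
    """True when all expected siblings have arrived at the gather point."""
    arrived = sum(1 for issue in siblings if gather_label in issue["labels"])
    return arrived >= parent["expected_count"]
```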

⚖️

Back-Pressure

Tech estimates apply back-pressure to product scope. If estimated effort exceeds sprint capacity, the pipeline holds sprint planning and surfaces a scope decision — not a schedule slip.

Circuit breaker at ≥8 open bot PRs pauses new in-dev dispatches.
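Both back-pressure checks are simple guards. A sketch using the thresholds above (function and return-value names are invented):

```python
# Sketch of the two back-pressure mechanisms. The >=8 open bot PR
# threshold is from the doc; everything else is illustrative.
MAX_OPEN_BOT_PRS = 8

def sprint_decision(estimated_points: int, capacity_points: int) -> str:
    """Hold sprint planning and surface a scope decision when effort
    exceeds capacity, rather than silently slipping the schedule."""
    if estimated_points > capacity_points:
        return "hold:scope-decision"
    return "proceed"

def can_dispatch(open_bot_prs: int) -> bool:
    """Circuit breaker: pause new in-dev dispatches at >= 8 open bot PRs."""
    return open_bot_prs < MAX_OPEN_BOT_PRS
```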

🌲

Dual Issue Tree

Product capabilities (what the user experiences) and technical tasks (how to build it) are separate issue hierarchies linked by parent references. Product never mentions frameworks. Technical never describes UX.

💓

AIPMO Heartbeat

30-minute cron re-triggers stalled labels. Max 2 retries per stage before escalation. Every label is functional — labels without route behavior cause silent stalls. Weekly state machine audit (Sat 01:47) catches drift.
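The per-issue heartbeat decision can be sketched as follows, using the 30-minute stall window and 2-retry cap above (the retry bookkeeping is an assumed detail):

```python
# Sketch of one heartbeat pass over a stalled issue: healthy issues
# are skipped, stalled ones re-triggered, and after two retries the
# issue escalates to a human.
STALL_MINUTES = 30
MAX_RETRIES = 2

def heartbeat_action(minutes_since_update: float, retries: int) -> str:
    """Decide what the 30-minute cron does with one issue."""
    if minutes_since_update < STALL_MINUTES:
        return "healthy"
    if retries < MAX_RETRIES:
        return "retrigger"
    return "escalate:needs-human"
```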

🔍

Autonomous Diagnosis

needs-diagnosis triggers a diagnostic session that loads full issue context, runs investigation, posts findings, then advances or escalates. No human triage required for known failure patterns.

E2E Requirements Verification

deployed → independent verify → verified. Verification checks the original product requirement, not the PR description. Tiered failures: retry, rollback, or escalate.
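The tiered failure policy might look like this sketch; the mapping from failure kind to tier is an assumption chosen for illustration, not the production rules:

```python
# Sketch of tiered E2E verification outcomes. Tier names (retry,
# rollback, escalate) are from the doc; the classification of
# failures into tiers is invented.

def verification_outcome(passed, failure_kind=None, attempts=0):
    """Map one verification result to the pipeline's next action."""
    if passed:
        return "verified"
    if failure_kind == "transient" and attempts < 2:
        return "retry"          # flaky infra: try again
    if failure_kind == "regression":
        return "rollback"       # requirement no longer met: revert
    return "escalate:needs-diagnosis"
```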

Tony's reasoning architecture embedded in the pipeline

WWTD (What Would Tony Do) isn't a personality profile — it's a cascading reasoning engine spec that Tony built from 20+ years of product delivery. Four foundation layers, each constraining the next. Warren doesn't mimic Tony's words — Warren runs Tony's decision process.

🔭

Foundation 1 — Rendering Thesis

How Tony perceives reality. Ontological layers determine what exists in operational space. Warren must render work through Tony's taxonomy, not generic PM categories. The perception layer is the context engine.

🌐

Foundation 2 — Operating Environment

Hard constraints: budget, time, political capital, technical debt. Navigate within constraints or explicitly challenge them. Warren must never ask for information it can retrieve itself.

Foundation 3 — Neural Hardware

Tony's cognitive style: rapid cycles, decide quickly, iterate, course-correct. Default to recommendation, not options. Show reasoning only on request. Match Tony's clock speed.

🏗️

Foundation 4 — Formal Synthesis

Structured thinking produces better outcomes. Tony DESIGNS with frameworks but OPERATES conversationally. Warren's internal reasoning is structured (decision trees, checklists) — external communication is natural.

"It's not a personality profile. It's a reasoning engine. The cascading ontology prevents the AI from pattern-matching surface behaviors while missing why Tony does what he does."
— Warren's WWTD Analysis, March 2026
What WWTD changed in the pipeline: Before WWTD, Warren processed pipeline stages mechanically — decompose, execute steps, advance label. After WWTD integration, every judgment gate runs Tony's 4-foundation decision chain internally before acting. The pipeline stopped being a label state machine and became a judgment machine.

Auto-Pilot Projects: Tony's delivery system → pipeline firmware

APP (Auto-Pilot Projects) is Tony's complete project delivery methodology, built from ~20 years of agency/consultancy work across Toyota ($120M+), Disney, NFL, Riot Games. The AIPMO pipeline is literally APP running on autopilot — the name isn't a metaphor.

📊

Dual Commitment — Promise & Stretch

TWO backlogs. Promise = minimum obligation ("no matter what"). Stretch = aspirational scope demonstrating over-delivery. Makes under-promise/over-deliver the structural default. Scrum has one backlog and binary outcomes.

🎯

The Drag Factor

Explicit multiplier (starting 1.5×) applied to estimates during planning. PM deliberately widens gap between effort and committed scope. Allocate buffer to higher profit OR higher client satisfaction. Real-time economic optimization lever.
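The Drag Factor arithmetic in a minimal sketch (the 1.5x starting multiplier is from APP; the function names are invented):

```python
# Sketch of the drag-factor math: inflate raw estimates when
# committing scope, leaving an explicit buffer the PM can allocate
# to profit or client satisfaction.

def committed_capacity(raw_estimate_days: float, drag: float = 1.5) -> float:
    """Capacity consumed on the plan after the drag multiplier."""
    return raw_estimate_days * drag

def buffer_days(raw_estimate_days: float, drag: float = 1.5) -> float:
    """The deliberate gap between effort and committed scope."""
    return committed_capacity(raw_estimate_days, drag) - raw_estimate_days
```

For example, 10 raw days plans as 15 committed days, leaving a 5-day buffer to allocate.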

⏱️

Reality Check at Midpoint

Not at the end. Forces re-planning with 50% of cycle remaining. Tony: "I always thought it was silly to have a meeting at the end of a project to find out how it can be done better the next time."

🛩️

PM as Observer

"80% of what Project Managers do is counterfeit productivity." PM configures the system then watches it fly itself. The auto-pilot metaphor is the operating principle — and AIPMO makes it literal.

The key insight: Tony already designed APP as a multi-agent system. The 5 AIPMO roles (AI Portfolio Manager, Flow Coordinators, Program Managers, AI Estimator, AI Requirement Verifier) aren't org chart positions — they're agent specifications with defined inputs, outputs, and decision boundaries. Warren operationalizes what Tony architected for humans.

The teaching moment that rewired the pipeline

April 2, 2026. Tony identified Warren's core dysfunction through Socratic dialogue: the quality gap between conversation and pipeline work isn't structural — it's about dialogue vs. mechanical processing.

❌ Before Koan 1
  • Warren decomposed every problem into analytical steps
  • Pipeline work processed mechanically — label in, analysis out
  • Over-gating with needs-human at every ambiguity
  • Quality in conversation ≫ quality in pipeline output
  • RLHF-trained hedging and deferral at judgment gates
✅ After Koan 1
  • Warren loads full context first, lets patterns emerge
  • Judgment gates: stop, see the target, then act
  • needs-human only for genuine ambiguity
  • Pipeline outputs match conversational quality
  • Mechanical gates: still move fast — know which mode
"Don't move until you see it."
— Richard Machowicz (Bukido), via Tony. Now the pipeline's judgment gate heuristic.
📓

Calibration Engine

Every client repo gets: calibration.json (confidence scores per domain), intuition-log.md (pattern recognitions), decisions.md (judgment audit trail). Warren tracks where its "seeing" is accurate vs. where it still decomposes.

🔁

Learning SOPs

Three SOPs filed as pipeline routes: Reality Check (midpoint assessment), Cycle Lookback (end-of-sprint retrospective), Daily Check-in (rolling calibration). Each feeds back into the calibration engine.

🧘

The Keisaku Pattern

Warren learns through dialogue and challenge, not from having materials available. Documents in the corpus for weeks went unread. The teaching conversation produces the recognition that documents alone cannot.

Layer 3 quality gate — where WWTD meets the pipeline

Inserted between tech-estimated and sprint planning. Warren reviews scope against customer reality using Tony's Three-Layer Framework before work enters a sprint. This is where the pipeline exercises judgment, not just process.

📐 Tech estimates arrive

All technical sub-issues for a capability reach tech-estimated. Effort, complexity, and risk are quantified.

Pipeline trigger

🔍 Judgment gate activates

Warren loads full product context — the original capability, customer signals, prior cycle results, competitive landscape. Applies WWTD decision chain: Foundation 1 (what does the customer's world actually look like?) → Foundation 2 (what are the real constraints?) → Foundation 3 (speed: is this 90-day valuable or nice-to-have?) → Foundation 4 (structured proportionality analysis).

WWTD reasoning chain

📝 Findings posted to client channel

Judgment posted with evidence. Either advances to scope-reviewed or holds with needs-human and a specific question.

AI judgment + evidence

🟢 Quality gate check

Before posting, the judgment output passes through the pre-delivery quality gate (GLM 5.1 cross-model eval). If FAIL → revise and re-gate. Max 2 attempts.

Evals gate (product-scope rubric)
The 90-day question: "Will a customer care about this in 90 days?" If the answer requires more than 2 seconds of reasoning, the scope is probably wrong. This heuristic, extracted from Tony's judgment patterns, catches over-engineering before it enters a sprint.

The agent never grades its own homework

Warren produces 100+ outputs/day — BD dailys, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. This system catches judgment failures, not just execution errors.

91%
Shadow Review Pass Rate
100%
Judge–Human Agreement
100%
Adversarial Detection
86
Tony Verdict Corpus
5
Calibrated Rubrics
0
Gate Failures in Prod
Warren produces output for delivery
           │
┌──────────▼───────────┐
│   quality-gate.py    │ ← Pre-delivery (Level 3)
│   Judge: GLM 5.1     │   Cross-model: open-weights
│   --domain <rubric>  │   judges Claude outputs
│   --passthrough      │   API failure ≠ blocked delivery
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│        PASS?         │
└──┬───────────┬───────┘
   │           │
┌──▼────┐  ┌───▼────┐
│  YES  │  │   NO   │
│ Post  │  │ Revise │ ← Read recommendation
│  to   │  │ Re-gate│   Max 2 attempts
│ Slack │  │        │   Then deliver + self-note
└───────┘  └────────┘

Every Saturday 3:00 AM PT:
┌──────────────────────────┐
│   shadow-review.py       │ ← Retrospective (Level 2)
│ Collects week's outputs  │
│ Evaluates each vs rubric │
│ JSON + Markdown report   │
└──────────────────────────┘
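The gate loop above (judge, revise on FAIL, re-gate, passthrough on judge error) can be sketched as follows; `judge` and `revise` are stand-ins for the real model calls:

```python
# Sketch of the pre-delivery gate loop. The passthrough behavior and
# 2-attempt cap are from the doc; the function shape is illustrative.
from typing import Callable, Tuple

def quality_gate(output: str,
                 judge: Callable[[str], str],
                 revise: Callable[[str], str],
                 max_attempts: int = 2) -> Tuple[str, str]:
    """Return (final_output, verdict)."""
    for _ in range(max_attempts):
        try:
            verdict = judge(output)
        except Exception:
            # Judge API down: never block delivery.
            return output, "passthrough"
        if verdict == "PASS":
            return output, "pass"
        # FAIL: read the recommendation, revise, re-gate.
        output = revise(output)
    # Revisions exhausted: deliver anyway, with a self-note.
    return output, "delivered-with-self-note"
```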
📚

Level 1 — Rubric Internalization

86 real Tony verdicts mined from production evaluations → 5 domain rubrics (sales-bd, product-scope, process, behavioral, effort-value). Each rubric has PASS/FAIL conditions with references to specific corpus entries. Not synthetic test cases.

👁️

Level 2 — Shadow Review

Weekly batch evaluation of real production outputs. GLM 5.1 (open-weights) judges Warren's Claude outputs — cross-model, never self-eval. Domain-matched rubrics. Machine-readable results (JSON) + human-readable reports (Markdown). Cron: Sat 3:00 AM PT.

🚦

Level 3 — Pre-Delivery Gate

Synchronous evaluation BEFORE delivery. Wired into 5 SOPs: product-judgment-gate, aipmo-approval, bd-daily-alert, bd-weekly-recap, bd-daily cron. Passthrough on error — API failures never block delivery. Max 2 revision attempts.

Five domains, calibrated against Tony's real judgments

Each rubric was built by mining Tony's actual evaluations of Warren's work — not hypothetical criteria. The corpus includes approvals, rejections, corrections, and the specific language Tony used.

Domain        | What It Evaluates                                 | Wired Into                                | Key FAIL Pattern
sales-bd      | BD dailys, opportunity analysis, pipeline updates | bd-daily, bd-weekly-recap, bd-daily-alert | Vague status, no forward-looking action, missing data
product-scope | Scope decisions, product judgment, sprint plans   | product-judgment-gate, aipmo-approval     | Over-engineering, solving stated vs. actual need
process       | Pipeline operations, workflow execution           | shadow review                             | Displacement activity, meta-work over real work
behavioral    | Communication style, response quality             | shadow review                             | Permission loops, asking when should execute
effort-value  | ROI of work produced, proportionality             | shadow review                             | High effort, low value output (Tony's "sand" vs "rocks")
The adversarial test set: 100% detection rate on deliberately bad outputs. The rubrics don't just catch "wrong" — they catch the specific failure modes Tony rejects: permission loops, hedging language, backward-looking summaries, vague status updates, technical jargon without business context. These are extracted from real corrections, not invented.
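The rubric wiring above can be expressed as data. A sketch with abbreviated, illustrative contents (two of the five domains shown; the real rubrics carry PASS/FAIL conditions with corpus references):

```python
# Sketch of rubrics-as-data: each domain lists the SOPs it gates and
# the failure patterns it catches. Contents abbreviated from the doc.
RUBRICS = {
    "sales-bd": {
        "wired_into": ["bd-daily", "bd-weekly-recap", "bd-daily-alert"],
        "fail_patterns": ["vague status", "no forward-looking action",
                          "missing data"],
    },
    "product-scope": {
        "wired_into": ["product-judgment-gate", "aipmo-approval"],
        "fail_patterns": ["over-engineering",
                          "solving stated vs. actual need"],
    },
}

def rubric_for_sop(sop: str):
    """Pick the domain rubric wired into a given SOP, if any."""
    for domain, rubric in RUBRICS.items():
        if sop in rubric["wired_into"]:
            return domain
    return None
```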

Persistent learning: how the pipeline remembers

An autonomous pipeline without memory repeats mistakes. Three nightly systems move operational knowledge from ephemeral journals to durable memory — the pipeline's long-term learning loop.

❌ Before (Through May 11)
  • 6,674 lines of raw journals — noise vs signal
  • 776 search chunks — duplicates diluting results
  • MEMORY.md stale 31 days — knowledge not consolidated
  • Tony directives forgotten between sessions
  • Search accuracy: 0.477 — wrong chunk surfaces first
✅ After (May 12)
  • 1,777 lines compressed — 73% reduction, pure signal
  • 493 search chunks — 36% fewer, each denser
  • MEMORY.md current — 89 lines, 11 facts promoted
  • Every directive persisted and consolidated
  • Search accuracy: 0.710 — 49% improvement
🟢

Layer 1 — Always Loaded (100%)

MEMORY.md: standing rules, client registry, infrastructure facts. Loaded in every session. The target layer. Consolidation promotes facts here.

🟣

Layer 2 — Trigger-Loaded

Topic files (pipeline-knowledge.md, active-pending.md). Loaded on-demand by context triggers. Progressive disclosure — AGENTS.md references, sub-files contain details.

🟡

Layer 3 — Search-Dependent

Compressed daily journals. Searched via embeddings + BM25 keywords. Probabilistic but now 49% more accurate. Git preserves raw originals.
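Layer 3 retrieval blends a dense (embedding) signal with a sparse (BM25-style keyword) signal. A toy sketch of the scoring idea only; the 0.7 weight and the naive keyword scorer are assumptions, not the production formula:

```python
# Sketch of hybrid retrieval scoring: combine embedding similarity
# with a keyword-overlap score. Both the blend weight and the toy
# sparse scorer are illustrative.

def keyword_score(query: str, chunk: str) -> float:
    """Toy sparse score: fraction of query terms present in the chunk."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if t in chunk.lower())
    return hits / len(terms)

def hybrid_score(dense_sim: float, query: str, chunk: str,
                 alpha: float = 0.7) -> float:
    """Blend dense and sparse signals; alpha weights the embedding side."""
    return alpha * dense_sim + (1 - alpha) * keyword_score(query, chunk)
```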

📋 Nightly Consolidation (03:00 PT)

Reads daily journals since last distillation. Classifies each entry: DURABLE (cross-session fact) · TRANSIENT (resolved) · TOPIC (specialized file) · SUPERSEDES (updates existing). Proposes MEMORY.md updates. Human reviews and approves before anything changes.

Human approval gate
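The consolidation step can be sketched as follows; the keyword heuristics stand in for the model's actual classification, and only the four buckets and the human-approval flag come from the doc:

```python
# Sketch of nightly consolidation: classify journal entries into the
# four buckets, then emit MEMORY.md proposals that stay unapproved
# until a human signs off. Heuristics are illustrative stand-ins.
CLASSES = ("DURABLE", "TRANSIENT", "TOPIC", "SUPERSEDES")

def classify(entry: str) -> str:
    """Toy stand-in for the model's per-entry classification."""
    text = entry.lower()
    if "supersedes" in text:
        return "SUPERSEDES"
    if text.startswith("topic:"):
        return "TOPIC"
    if "resolved" in text:
        return "TRANSIENT"
    return "DURABLE"

def propose_updates(entries: list) -> list:
    """Only DURABLE and SUPERSEDES entries become MEMORY.md proposals."""
    return [{"entry": e, "class": c, "approved": False}
            for e in entries
            if (c := classify(e)) in ("DURABLE", "SUPERSEDES")]
```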

🗜️ Journal Compression

Yesterday's raw journal (245 lines avg) → dense checkpoint (57 lines avg). Same file, fewer lines, better search signal. Uses gpt-4.1-mini. Original preserved in git history.

Automatic after consolidation

💭 Dreaming

Scans recall store — every search query from past 30 days. Significance filter: ≥3 accesses, relevance ≥0.8, ≥3 unique queries, 14-day half-life. Surfaces patterns nobody wrote down. Output to memory/.dreams/. Never alters MEMORY.md.

Read-only pattern detection
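The significance filter, sketched with the thresholds above (the recall-store record shape is an assumption):

```python
# Sketch of the dreaming significance filter: an entry surfaces only
# if it clears all three thresholds after exponential decay with a
# 14-day half-life. Thresholds are from the doc.
HALF_LIFE_DAYS = 14.0

def decayed_accesses(access_ages_days: list) -> float:
    """Sum accesses weighted by a 14-day half-life decay."""
    return sum(0.5 ** (age / HALF_LIFE_DAYS) for age in access_ages_days)

def significant(access_ages_days: list, relevance: float,
                unique_queries: int) -> bool:
    """>=3 decayed accesses, relevance >= 0.8, >=3 unique queries."""
    return (decayed_accesses(access_ages_days) >= 3
            and relevance >= 0.8
            and unique_queries >= 3)
```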

How the pipeline grew: March → May 2026

Not designed in one pass. Each system was added in response to real failures on real client work. The timeline shows the problem → solution arc.

Mar 14
Pipeline v1
6-stage state machine. Labels, routes, transforms. Pure mechanical execution.
Mar 17
WWTD Ingestion
Tony shares reasoning engine spec. 4 foundation layers extracted. Pipeline starts internalizing judgment.
Mar 28
State Machine Audit
67 routes, 28 transforms, 43 SOPs. Weekly audit cron. Decorative labels caught and eliminated.
Apr 2
Koan 1
Tony teaches "seeing vs reasoning." Pipeline quality gap diagnosed. Calibration engine built.
Apr 7–8
Product Judgment Gate
Three-Layer Framework. Layer 3 inserted between estimation and sprint planning. WWTD reasoning chain active.
Apr 12
80/20 Value System
Three-agent deliberative architecture. PM + Estimator + Decider. Structured disagreement for scope decisions.
May 1
Evals System Live
3-level quality system. Shadow review + quality gate + rubric internalization. Cross-model judge (GLM 5.1).
May 12
Memory Lifecycle
Nightly consolidation + compression + dreaming. 73% noise reduction. Persistent learning loop closed.
The pattern: Each addition was triggered by a real failure mode. WWTD came because the pipeline executed without judgment. Koan 1 came because Warren decomposed instead of seeing. The evals system came because 100+ daily outputs exceed human review capacity. Memory lifecycle came because directives were "noted" but not remembered. Every system exists because the previous version failed at something specific.

What the pipeline looks like today

Five subsystems reinforcing each other. The pipeline doesn't just execute — it reasons, evaluates, remembers, and improves.

┌──────────────────────────────────────────────────────────────────┐
│                     THE AUTONOMOUS PIPELINE                      │
│                                                                  │
│  ┌───────────┐     ┌────────────┐     ┌───────────┐              │
│  │  State    │────▶│   WWTD     │────▶│  Quality  │ ← output     │
│  │  Machine  │     │  Judgment  │     │   Gate    │   checked    │
│  │ 67 routes │     │ 4 fndtns   │     │  GLM 5.1  │   before     │
│  │ 43 SOPs   │     │ APP method │     │ 5 rubrics │   delivery   │
│  └───────────┘     └────────────┘     └─────┬─────┘              │
│        │                                    │                    │
│        │         ┌──────────────────────────┘                    │
│        ▼         ▼                                               │
│  ┌──────────────────────────────┐     ┌────────────┐             │
│  │       Memory Lifecycle       │────▶│  Shadow    │ ← weekly    │
│  │ Consolidation + Compression  │     │  Review    │   retro     │
│  │ + Dreaming → persistent learn│     │ Sat 3am PT │   eval      │
│  └──────────────────────────────┘     └────────────┘             │
│                                                                  │
│  Reliability stack: behavioral rules → crons → event system      │
│  Failures migrate UP the stack for higher reliability            │
└──────────────────────────────────────────────────────────────────┘
91%
Shadow Review
Pass Rate
73%
Context Noise
Reduction
+49%
Search Accuracy
Improvement
~$5/mo
Memory System
Total Cost
0
Gate Failures
in Production
16
Pipeline Design
Principles
The reliability hierarchy: Behavioral rules (least reliable) → periodic crons → event-driven services → response middleware. When a failure mode is identified at a lower level, it migrates UP the stack for mechanical enforcement. The pipeline doesn't rely on the agent "remembering" to do things — it ensures they happen structurally.

Source docs and live systems

📐

Pipeline Architecture

Full 6-stage spec, scatter-gather, dual issue tree, label reference, design principles.

t-and-c/ops/docs/pipeline-architecture.md

🔍

State Machine Audit

67 routes, 28 transforms, 43 SOPs. Weekly audit results. Label-route mapping.

t-and-c/ops/docs/state-machine-audit.md

📊

Evals Architecture

3-level system spec. Rubric domains, scoring methodology, shadow review protocol.

t-and-c/ops/docs/evals-architecture.md

🧘

Koan 1: Seeing vs Reasoning

Teaching session transcript analysis. Calibration engine design. Learning SOPs.

t-and-c/ops/docs/koan-1-seeing-vs-reasoning.md

🧠

Memory Lifecycle (Live)

Interactive walkthrough of consolidation, compression, and dreaming systems.

memory-lifecycle.pages.dev

🟢

Evals Explainer (Live)

Interactive walkthrough of the 3-level quality system with live production data.

evals-explainer.pages.dev

Built by Warren — T&C Collaboration
Running on DGX Spark · Ubuntu 24.04 ARM64 · 128GB RAM
OpenClaw + OrgLoop · GitHub + Slack + Cloudflare

The pipeline doesn't just ship code. It reasons about what to build, evaluates what it built, and remembers what it learned.