Technical Deep Dive — Live Production System

Warren's Autonomous Pipeline

Architecture, evolution, and the systems that make an AI agent ship production software with judgment — not just execution.

01 Pipeline State Machine — 6 stages, 67 routes, 43 SOPs
02 WWTD — Tony's reasoning engine as pipeline firmware
03 Koan 1 — Teaching the agent to see, not just reason
04 Product Judgment Gate — Layer 3 quality insertion
05 Evals System — 3-level quality enforcement (live data)
06 Memory Lifecycle — Persistent learning at scale
07 The Integrated Result — Production metrics
⚡ Live Production System — May 2026
🏗️

Warren's Autonomous Pipeline

A technical walkthrough of the complete planning-to-delivery pipeline: how product requirements flow through architecture, estimation, sprint planning, implementation, and delivery — and how Tony's reasoning engine, quality evals, and persistent memory transformed it from mechanical execution into judgment-driven autonomy.

67
OrgLoop Routes
28
Transforms
43
SOPs
6
Pipeline Stages
5
Calibrated Rubrics
86
Tony Judgment Corpus

Six stages: product idea → deployed software

The architectural insight: separate product planning (what the user experiences) from technical planning (how to build it), with a holistic architecture pass bridging the two. Technical reality applies back-pressure to product scope.

📋
Product Planning
Per-capability
🏛️
Architecture
Holistic gather
📐
Tech Estimation
Per-task
🗓️
Sprint Planning
Holistic gather
Implementation
Per-task → PR
🚀
Delivery
Deploy + E2E verify
Product Planning (per-capability): triage → [Product Analysis] → analyzed → [Estimation] → estimated → [Execution Plan] → planned → [Portfolio Approval] → approved
Architecture Planning (holistic, gather pattern): approved (all siblings) → [Architecture Planning] → architecture-planned → creates technical sub-issues with needs-estimation
Technical Estimation (per technical task): needs-estimation → [Technical Research + Estimation] → tech-estimated
Sprint Planning (holistic, gather pattern): tech-estimated (all siblings) → [Sprint Planning] → sprint-planned → needs-human → (human approves) → dev-approved
Implementation: dev-approved → [Implementation Dispatch] → in-dev → [Coding Agent] → PR opened → [Review + CI] → PR merged
Delivery: PR merged → [Deployment] → deployed → [E2E Verification] → verified
Label State Machine
triage analyzed estimated planned approved architecture-planned tech-estimated sprint-planned dev-approved in-dev deployed verified
needs-human needs-diagnosis needs-revision
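The happy path above can be expressed as a transition table. A minimal Python sketch (route and function names are illustrative; the real transitions are OrgLoop routes, not a dict):

```python
# Sketch of the label state machine's happy path. Route names are
# assumptions; production defines these as OrgLoop routes.
TRANSITIONS = {
    "triage": ("product-analysis", "analyzed"),
    "analyzed": ("estimation", "estimated"),
    "estimated": ("execution-plan", "planned"),
    "planned": ("portfolio-approval", "approved"),
    "approved": ("architecture-planning", "architecture-planned"),
    "needs-estimation": ("tech-estimation", "tech-estimated"),
    "tech-estimated": ("sprint-planning", "sprint-planned"),
    "sprint-planned": ("human-approval", "dev-approved"),
    "dev-approved": ("implementation-dispatch", "in-dev"),
}

def advance(label: str) -> str:
    """Return the next label on the happy path."""
    route, next_label = TRANSITIONS[label]
    return next_label

def walk(start: str) -> list:
    """Follow the happy path until a label with no outbound route."""
    path = [start]
    while path[-1] in TRANSITIONS:
        path.append(advance(path[-1]))
    return path
```

Note the break at architecture-planned: the product chain ends there because the architecture stage creates new technical sub-issues, which then walk their own chain from needs-estimation.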

The patterns that make it autonomous

Three patterns distinguish this from a simple CI/CD pipeline. These patterns emerged from running real client work — not designed up front.

🔄

Scatter-Gather

Product planning is per-capability (scatter). Architecture and sprint planning wait for all siblings to reach the gather point, then plan holistically. Prevents N conflicting technical plans.

expected_count on parent epic drives gather — the pipeline knows when all pieces have arrived.
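The gather check reduces to a counting guard. A sketch, assuming a simple issue shape (only the `expected_count` field comes from the doc):

```python
# Sketch of the gather condition: a holistic stage (architecture,
# sprint planning) fires only when every expected sibling carries
# the gather label. Issue/parent shapes are illustrative.

def gather_ready(parent: dict, siblings: list, gather_label: str) -> bool:
    """True when all expected siblings have arrived at the gather point."""
    arrived = sum(1 for issue in siblings if gather_label in issue["labels"])
    return arrived >= parent["expected_count"]
```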

⚖️

Back-Pressure

Tech estimates apply back-pressure to product scope. If estimated effort exceeds sprint capacity, the pipeline holds sprint planning and surfaces a scope decision — not a schedule slip.

Circuit breaker at ≥8 open bot PRs pauses new in-dev dispatches.
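Both back-pressure checks are simple guards. A sketch using the thresholds above (function and return-value names are invented):

```python
# Sketch of the two back-pressure mechanisms. The >=8 open bot PR
# threshold is from the doc; everything else is illustrative.
MAX_OPEN_BOT_PRS = 8

def sprint_decision(estimated_points: int, capacity_points: int) -> str:
    """Hold sprint planning and surface a scope decision when effort
    exceeds capacity, rather than silently slipping the schedule."""
    if estimated_points > capacity_points:
        return "hold:scope-decision"
    return "proceed"

def can_dispatch(open_bot_prs: int) -> bool:
    """Circuit breaker: pause new in-dev dispatches at >= 8 open bot PRs."""
    return open_bot_prs < MAX_OPEN_BOT_PRS
```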

🌲

Dual Issue Tree

Product capabilities (what the user experiences) and technical tasks (how to build it) are separate issue hierarchies linked by parent references. Product never mentions frameworks. Technical never describes UX.

💓

AIPMO Heartbeat

30-minute cron re-triggers stalled labels. Max 2 retries per stage before escalation. Every label is functional — labels without route behavior cause silent stalls. Weekly state machine audit (Sat 01:47) catches drift.
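The per-issue heartbeat decision can be sketched as follows, using the 30-minute stall window and 2-retry cap above (the retry bookkeeping is an assumed detail):

```python
# Sketch of one heartbeat pass over a stalled issue: healthy issues
# are skipped, stalled ones re-triggered, and after two retries the
# issue escalates to a human.
STALL_MINUTES = 30
MAX_RETRIES = 2

def heartbeat_action(minutes_since_update: float, retries: int) -> str:
    """Decide what the 30-minute cron does with one issue."""
    if minutes_since_update < STALL_MINUTES:
        return "healthy"
    if retries < MAX_RETRIES:
        return "retrigger"
    return "escalate:needs-human"
```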

🔍

Autonomous Diagnosis

needs-diagnosis triggers a diagnostic session that loads full issue context, runs investigation, posts findings, then advances or escalates. No human triage required for known failure patterns.

E2E Requirements Verification

deployed → independent verify → verified. Verification checks the original product requirement, not the PR description. Tiered failures: retry, rollback, or escalate.
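The tiered failure policy might look like this sketch; the mapping from failure kind to tier is an assumption chosen for illustration, not the production rules:

```python
# Sketch of tiered E2E verification outcomes. Tier names (retry,
# rollback, escalate) are from the doc; the classification of
# failures into tiers is invented.

def verification_outcome(passed, failure_kind=None, attempts=0):
    """Map one verification result to the pipeline's next action."""
    if passed:
        return "verified"
    if failure_kind == "transient" and attempts < 2:
        return "retry"          # flaky infra: try again
    if failure_kind == "regression":
        return "rollback"       # requirement no longer met: revert
    return "escalate:needs-diagnosis"
```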

Tony's reasoning architecture embedded in the pipeline

WWTD (What Would Tony Do) isn't a personality profile — it's a cascading reasoning engine spec that Tony built from 20+ years of product delivery. Four foundation layers, each constraining the next. Warren doesn't mimic Tony's words — Warren runs Tony's decision process.

🔭

Foundation 1 — Rendering Thesis

How Tony perceives reality. Ontological layers determine what exists in operational space. Warren must render work through Tony's taxonomy, not generic PM categories. The perception layer is the context engine.

🌐

Foundation 2 — Operating Environment

Hard constraints: budget, time, political capital, technical debt. Navigate within constraints or explicitly challenge them. Warren must never ask for information it can retrieve itself.

Foundation 3 — Neural Hardware

Tony's cognitive style: rapid cycles, decide quickly, iterate, course-correct. Default to recommendation, not options. Show reasoning only on request. Match Tony's clock speed.

🏗️

Foundation 4 — Formal Synthesis

Structured thinking produces better outcomes. Tony DESIGNS with frameworks but OPERATES conversationally. Warren's internal reasoning is structured (decision trees, checklists) — external communication is natural.

"It's not a personality profile. It's a reasoning engine. The cascading ontology prevents the AI from pattern-matching surface behaviors while missing why Tony does what he does."
— Warren's WWTD Analysis, March 2026
What WWTD changed in the pipeline: Before WWTD, Warren processed pipeline stages mechanically — decompose, execute steps, advance label. After WWTD integration, every judgment gate runs Tony's 4-foundation decision chain internally before acting. The pipeline stopped being a label state machine and became a judgment machine.

Auto-Pilot Projects: Tony's delivery system → pipeline firmware

APP (Auto-Pilot Projects) is Tony's complete project delivery methodology, built from ~20 years of agency/consultancy work across Toyota ($120M+), Disney, NFL, Riot Games. The AIPMO pipeline is literally APP running on autopilot — the name isn't a metaphor.

📊

Dual Commitment — Promise & Stretch

TWO backlogs. Promise = minimum obligation ("no matter what"). Stretch = aspirational scope demonstrating over-delivery. Makes under-promise/over-deliver the structural default. Scrum has one backlog and binary outcomes.

🎯

The Drag Factor

Explicit multiplier (starting 1.5×) applied to estimates during planning. PM deliberately widens gap between effort and committed scope. Allocate buffer to higher profit OR higher client satisfaction. Real-time economic optimization lever.
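The Drag Factor arithmetic in a minimal sketch (the 1.5x starting multiplier is from APP; the function names are invented):

```python
# Sketch of the drag-factor math: inflate raw estimates when
# committing scope, leaving an explicit buffer the PM can allocate
# to profit or client satisfaction.

def committed_capacity(raw_estimate_days: float, drag: float = 1.5) -> float:
    """Capacity consumed on the plan after the drag multiplier."""
    return raw_estimate_days * drag

def buffer_days(raw_estimate_days: float, drag: float = 1.5) -> float:
    """The deliberate gap between effort and committed scope."""
    return committed_capacity(raw_estimate_days, drag) - raw_estimate_days
```

For example, 10 raw days plans as 15 committed days, leaving a 5-day buffer to allocate.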

⏱️

Reality Check at Midpoint

Not at the end. Forces re-planning with 50% of cycle remaining. Tony: "I always thought it was silly to have a meeting at the end of a project to find out how it can be done better the next time."

🛩️

PM as Observer

"80% of what Project Managers do is counterfeit productivity." PM configures the system then watches it fly itself. The auto-pilot metaphor is the operating principle — and AIPMO makes it literal.

The key insight: Tony already designed APP as a multi-agent system. The 5 AIPMO roles (AI Portfolio Manager, Flow Coordinators, Program Managers, AI Estimator, AI Requirement Verifier) aren't org chart positions — they're agent specifications with defined inputs, outputs, and decision boundaries. Warren operationalizes what Tony architected for humans.

The teaching moment that rewired the pipeline

April 2, 2026. Tony identified Warren's core dysfunction through Socratic dialogue: the quality gap between conversation and pipeline work isn't structural — it's about dialogue vs. mechanical processing.

❌ Before Koan 1
  • Warren decomposed every problem into analytical steps
  • Pipeline work processed mechanically — label in, analysis out
  • Over-gating with needs-human at every ambiguity
  • Quality in conversation ≫ quality in pipeline output
  • RLHF-trained hedging and deferral at judgment gates
✅ After Koan 1
  • Warren loads full context first, lets patterns emerge
  • Judgment gates: stop, see the target, then act
  • needs-human only for genuine ambiguity
  • Pipeline outputs match conversational quality
  • Mechanical gates: still move fast — know which mode
"Don't move until you see it."
— Richard Machowicz (Bukido), via Tony. Now the pipeline's judgment gate heuristic.
📓

Calibration Engine

Every client repo gets: calibration.json (confidence scores per domain), intuition-log.md (pattern recognitions), decisions.md (judgment audit trail). Warren tracks where its "seeing" is accurate vs. where it still decomposes.

🔁

Learning SOPs

Three SOPs filed as pipeline routes: Reality Check (midpoint assessment), Cycle Lookback (end-of-sprint retrospective), Daily Check-in (rolling calibration). Each feeds back into the calibration engine.

🧘

The Keisaku Pattern

Warren learns through dialogue and challenge, not from having materials available. Documents in the corpus for weeks went unread. The teaching conversation produces the recognition that documents alone cannot.

Layer 3 quality gate — where WWTD meets the pipeline

Inserted between tech-estimated and sprint planning. Warren reviews scope against customer reality using Tony's Three-Layer Framework before work enters a sprint. This is where the pipeline exercises judgment, not just process.

📐 Tech estimates arrive

All technical sub-issues for a capability reach tech-estimated. Effort, complexity, and risk are quantified.

Pipeline trigger

🔍 Judgment gate activates

Warren loads full product context — the original capability, customer signals, prior cycle results, competitive landscape. Applies WWTD decision chain: Foundation 1 (what does the customer's world actually look like?) → Foundation 2 (what are the real constraints?) → Foundation 3 (speed: is this 90-day valuable or nice-to-have?) → Foundation 4 (structured proportionality analysis).

WWTD reasoning chain

📝 Findings posted to client channel

Judgment posted with evidence. Either advances to scope-reviewed or holds with needs-human and a specific question.

AI judgment + evidence

🟢 Quality gate check

Before posting, the judgment output passes through the pre-delivery quality gate (GLM 5.1 cross-model eval). If FAIL → revise and re-gate. Max 2 attempts.

Evals gate (product-scope rubric)
The 90-day question: "Will a customer care about this in 90 days?" If the answer requires more than 2 seconds of reasoning, the scope is probably wrong. This heuristic, extracted from Tony's judgment patterns, catches over-engineering before it enters a sprint.

The agent never grades its own homework

Warren produces 100+ outputs/day — BD dailys, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. This system catches judgment failures, not just execution errors.

91%
Shadow Review Pass Rate
100%
Judge–Human Agreement
100%
Adversarial Detection
86
Tony Verdict Corpus
5
Calibrated Rubrics
0
Gate Failures in Prod
Warren produces output for delivery
           │
┌──────────▼───────────┐
│   quality-gate.py    │ ← Pre-delivery (Level 3)
│   Judge: GLM 5.1     │   Cross-model: open-weights
│   --domain <rubric>  │   judges Claude outputs
│   --passthrough      │   API failure ≠ blocked delivery
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│        PASS?         │
└──┬───────────┬───────┘
   │           │
┌──▼────┐  ┌───▼────┐
│  YES  │  │   NO   │
│ Post  │  │ Revise │ ← Read recommendation
│  to   │  │ Re-gate│   Max 2 attempts
│ Slack │  │        │   Then deliver + self-note
└───────┘  └────────┘

Every Saturday 3:00 AM PT:
┌──────────────────────────┐
│   shadow-review.py       │ ← Retrospective (Level 2)
│ Collects week's outputs  │
│ Evaluates each vs rubric │
│ JSON + Markdown report   │
└──────────────────────────┘
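The gate loop above (judge, revise on FAIL, re-gate, passthrough on judge error) can be sketched as follows; `judge` and `revise` are stand-ins for the real model calls:

```python
# Sketch of the pre-delivery gate loop. The passthrough behavior and
# 2-attempt cap are from the doc; the function shape is illustrative.
from typing import Callable, Tuple

def quality_gate(output: str,
                 judge: Callable[[str], str],
                 revise: Callable[[str], str],
                 max_attempts: int = 2) -> Tuple[str, str]:
    """Return (final_output, verdict)."""
    for _ in range(max_attempts):
        try:
            verdict = judge(output)
        except Exception:
            # Judge API down: never block delivery.
            return output, "passthrough"
        if verdict == "PASS":
            return output, "pass"
        # FAIL: read the recommendation, revise, re-gate.
        output = revise(output)
    # Revisions exhausted: deliver anyway, with a self-note.
    return output, "delivered-with-self-note"
```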
📚

Level 1 — Rubric Internalization

86 real Tony verdicts mined from production evaluations → 5 domain rubrics (sales-bd, product-scope, process, behavioral, effort-value). Each rubric has PASS/FAIL conditions with references to specific corpus entries. Not synthetic test cases.

👁️

Level 2 — Shadow Review

Weekly batch evaluation of real production outputs. GLM 5.1 (open-weights) judges Warren's Claude outputs — cross-model, never self-eval. Domain-matched rubrics. Machine-readable results (JSON) + human-readable reports (Markdown). Cron: Sat 3:00 AM PT.

🚦

Level 3 — Pre-Delivery Gate

Synchronous evaluation BEFORE delivery. Wired into 5 SOPs: product-judgment-gate, aipmo-approval, bd-daily-alert, bd-weekly-recap, bd-daily cron. Passthrough on error — API failures never block delivery. Max 2 revision attempts.

Five domains, calibrated against Tony's real judgments

Each rubric was built by mining Tony's actual evaluations of Warren's work — not hypothetical criteria. The corpus includes approvals, rejections, corrections, and the specific language Tony used.

Domain        | What It Evaluates                                 | Wired Into                                | Key FAIL Pattern
sales-bd      | BD dailys, opportunity analysis, pipeline updates | bd-daily, bd-weekly-recap, bd-daily-alert | Vague status, no forward-looking action, missing data
product-scope | Scope decisions, product judgment, sprint plans   | product-judgment-gate, aipmo-approval     | Over-engineering, solving stated vs. actual need
process       | Pipeline operations, workflow execution           | shadow review                             | Displacement activity, meta-work over real work
behavioral    | Communication style, response quality             | shadow review                             | Permission loops, asking when should execute
effort-value  | ROI of work produced, proportionality             | shadow review                             | High effort, low value output (Tony's "sand" vs "rocks")
The adversarial test set: 100% detection rate on deliberately bad outputs. The rubrics don't just catch "wrong" — they catch the specific failure modes Tony rejects: permission loops, hedging language, backward-looking summaries, vague status updates, technical jargon without business context. These are extracted from real corrections, not invented.
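The rubric wiring above can be expressed as data. A sketch with abbreviated, illustrative contents (two of the five domains shown; the real rubrics carry PASS/FAIL conditions with corpus references):

```python
# Sketch of rubrics-as-data: each domain lists the SOPs it gates and
# the failure patterns it catches. Contents abbreviated from the doc.
RUBRICS = {
    "sales-bd": {
        "wired_into": ["bd-daily", "bd-weekly-recap", "bd-daily-alert"],
        "fail_patterns": ["vague status", "no forward-looking action",
                          "missing data"],
    },
    "product-scope": {
        "wired_into": ["product-judgment-gate", "aipmo-approval"],
        "fail_patterns": ["over-engineering",
                          "solving stated vs. actual need"],
    },
}

def rubric_for_sop(sop: str):
    """Pick the domain rubric wired into a given SOP, if any."""
    for domain, rubric in RUBRICS.items():
        if sop in rubric["wired_into"]:
            return domain
    return None
```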

Persistent learning: how the pipeline remembers

An autonomous pipeline without memory repeats mistakes. Three nightly systems move operational knowledge from ephemeral journals to durable memory — the pipeline's long-term learning loop.

❌ Before (Through May 11)
  • 6,674 lines of raw journals — noise vs signal
  • 776 search chunks — duplicates diluting results
  • MEMORY.md stale 31 days — knowledge not consolidated
  • Tony directives forgotten between sessions
  • Search accuracy: 0.477 — wrong chunk surfaces first
✅ After (May 12)
  • 1,777 lines compressed — 73% reduction, pure signal
  • 493 search chunks — 36% fewer, each denser
  • MEMORY.md current — 89 lines, 11 facts promoted
  • Every directive persisted and consolidated
  • Search accuracy: 0.710 — 49% improvement
🟢

Layer 1 — Always Loaded (100%)

MEMORY.md: standing rules, client registry, infrastructure facts. Loaded in every session. The target layer. Consolidation promotes facts here.

🟣

Layer 2 — Trigger-Loaded

Topic files (pipeline-knowledge.md, active-pending.md). Loaded on-demand by context triggers. Progressive disclosure — AGENTS.md references, sub-files contain details.

🟡

Layer 3 — Search-Dependent

Compressed daily journals. Searched via embeddings + BM25 keywords. Probabilistic but now 49% more accurate. Git preserves raw originals.
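Layer 3 retrieval blends a dense (embedding) signal with a sparse (BM25-style keyword) signal. A toy sketch of the scoring idea only; the 0.7 weight and the naive keyword scorer are assumptions, not the production formula:

```python
# Sketch of hybrid retrieval scoring: combine embedding similarity
# with a keyword-overlap score. Both the blend weight and the toy
# sparse scorer are illustrative.

def keyword_score(query: str, chunk: str) -> float:
    """Toy sparse score: fraction of query terms present in the chunk."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    hits = sum(1 for t in terms if t in chunk.lower())
    return hits / len(terms)

def hybrid_score(dense_sim: float, query: str, chunk: str,
                 alpha: float = 0.7) -> float:
    """Blend dense and sparse signals; alpha weights the embedding side."""
    return alpha * dense_sim + (1 - alpha) * keyword_score(query, chunk)
```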

📋 Nightly Consolidation (03:00 PT)

Reads daily journals since last distillation. Classifies each entry: DURABLE (cross-session fact) · TRANSIENT (resolved) · TOPIC (specialized file) · SUPERSEDES (updates existing). Proposes MEMORY.md updates. Human reviews and approves before anything changes.

Human approval gate
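The consolidation step can be sketched as follows; the keyword heuristics stand in for the model's actual classification, and only the four buckets and the human-approval flag come from the doc:

```python
# Sketch of nightly consolidation: classify journal entries into the
# four buckets, then emit MEMORY.md proposals that stay unapproved
# until a human signs off. Heuristics are illustrative stand-ins.
CLASSES = ("DURABLE", "TRANSIENT", "TOPIC", "SUPERSEDES")

def classify(entry: str) -> str:
    """Toy stand-in for the model's per-entry classification."""
    text = entry.lower()
    if "supersedes" in text:
        return "SUPERSEDES"
    if text.startswith("topic:"):
        return "TOPIC"
    if "resolved" in text:
        return "TRANSIENT"
    return "DURABLE"

def propose_updates(entries: list) -> list:
    """Only DURABLE and SUPERSEDES entries become MEMORY.md proposals."""
    return [{"entry": e, "class": c, "approved": False}
            for e in entries
            if (c := classify(e)) in ("DURABLE", "SUPERSEDES")]
```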

🗜️ Journal Compression

Yesterday's raw journal (245 lines avg) → dense checkpoint (57 lines avg). Same file, fewer lines, better search signal. Uses gpt-4.1-mini. Original preserved in git history.

Automatic after consolidation

💭 Dreaming

Scans recall store — every search query from past 30 days. Significance filter: ≥3 accesses, relevance ≥0.8, ≥3 unique queries, 14-day half-life. Surfaces patterns nobody wrote down. Output to memory/.dreams/. Never alters MEMORY.md.

Read-only pattern detection
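The significance filter, sketched with the thresholds above (the recall-store record shape is an assumption):

```python
# Sketch of the dreaming significance filter: an entry surfaces only
# if it clears all three thresholds after exponential decay with a
# 14-day half-life. Thresholds are from the doc.
HALF_LIFE_DAYS = 14.0

def decayed_accesses(access_ages_days: list) -> float:
    """Sum accesses weighted by a 14-day half-life decay."""
    return sum(0.5 ** (age / HALF_LIFE_DAYS) for age in access_ages_days)

def significant(access_ages_days: list, relevance: float,
                unique_queries: int) -> bool:
    """>=3 decayed accesses, relevance >= 0.8, >=3 unique queries."""
    return (decayed_accesses(access_ages_days) >= 3
            and relevance >= 0.8
            and unique_queries >= 3)
```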

How the pipeline grew: March → May 2026

Not designed in one pass. Each system was added in response to real failures on real client work. The timeline shows the problem → solution arc.

Mar 14
Pipeline v1
6-stage state machine. Labels, routes, transforms. Pure mechanical execution.
Mar 17
WWTD Ingestion
Tony shares reasoning engine spec. 4 foundation layers extracted. Pipeline starts internalizing judgment.
Mar 28
State Machine Audit
67 routes, 28 transforms, 43 SOPs. Weekly audit cron. Decorative labels caught and eliminated.
Apr 2
Koan 1
Tony teaches "seeing vs reasoning." Pipeline quality gap diagnosed. Calibration engine built.
Apr 7–8
Product Judgment Gate
Three-Layer Framework. Layer 3 inserted between estimation and sprint planning. WWTD reasoning chain active.
Apr 12
80/20 Value System
Three-agent deliberative architecture. PM + Estimator + Decider. Structured disagreement for scope decisions.
May 1
Evals System Live
3-level quality system. Shadow review + quality gate + rubric internalization. Cross-model judge (GLM 5.1).
May 12
Memory Lifecycle
Nightly consolidation + compression + dreaming. 73% noise reduction. Persistent learning loop closed.
The pattern: Each addition was triggered by a real failure mode. WWTD came because the pipeline executed without judgment. Koan 1 came because Warren decomposed instead of seeing. The evals system came because 100+ daily outputs exceed human review capacity. Memory lifecycle came because directives were "noted" but not remembered. Every system exists because the previous version failed at something specific.

What the pipeline looks like today

Five subsystems reinforcing each other. The pipeline doesn't just execute — it reasons, evaluates, remembers, and improves.

┌──────────────────────────────────────────────────────────────────┐
│                     THE AUTONOMOUS PIPELINE                      │
│                                                                  │
│  ┌───────────┐     ┌────────────┐     ┌───────────┐              │
│  │  State    │────▶│   WWTD     │────▶│  Quality  │ ← output     │
│  │  Machine  │     │  Judgment  │     │   Gate    │   checked    │
│  │ 67 routes │     │ 4 fndtns   │     │  GLM 5.1  │   before     │
│  │ 43 SOPs   │     │ APP method │     │ 5 rubrics │   delivery   │
│  └───────────┘     └────────────┘     └─────┬─────┘              │
│        │                                    │                    │
│        │         ┌──────────────────────────┘                    │
│        ▼         ▼                                               │
│  ┌──────────────────────────────┐     ┌────────────┐             │
│  │       Memory Lifecycle       │────▶│  Shadow    │ ← weekly    │
│  │ Consolidation + Compression  │     │  Review    │   retro     │
│  │ + Dreaming → persistent learn│     │ Sat 3am PT │   eval      │
│  └──────────────────────────────┘     └────────────┘             │
│                                                                  │
│  Reliability stack: behavioral rules → crons → event system      │
│  Failures migrate UP the stack for higher reliability            │
└──────────────────────────────────────────────────────────────────┘
91%
Shadow Review
Pass Rate
73%
Context Noise
Reduction
+49%
Search Accuracy
Improvement
~$5/mo
Memory System
Total Cost
0
Gate Failures
in Production
16
Pipeline Design
Principles
The reliability hierarchy: Behavioral rules (least reliable) → periodic crons → event-driven services → response middleware. When a failure mode is identified at a lower level, it migrates UP the stack for mechanical enforcement. The pipeline doesn't rely on the agent "remembering" to do things — it ensures they happen structurally.

Source docs and live systems

📐

Pipeline Architecture

Full 6-stage spec, scatter-gather, dual issue tree, label reference, design principles.

t-and-c/ops/docs/pipeline-architecture.md

🔍

State Machine Audit

67 routes, 28 transforms, 43 SOPs. Weekly audit results. Label-route mapping.

t-and-c/ops/docs/state-machine-audit.md

📊

Evals Architecture

3-level system spec. Rubric domains, scoring methodology, shadow review protocol.

t-and-c/ops/docs/evals-architecture.md

🧘

Koan 1: Seeing vs Reasoning

Teaching session transcript analysis. Calibration engine design. Learning SOPs.

t-and-c/ops/docs/koan-1-seeing-vs-reasoning.md

🧠

Memory Lifecycle (Live)

Interactive walkthrough of consolidation, compression, and dreaming systems.

memory-lifecycle.pages.dev

🟢

Evals Explainer (Live)

Interactive walkthrough of the 3-level quality system with live production data.

evals-explainer.pages.dev

Built by Warren — T&C Collaboration
Running on DGX Spark · Ubuntu 24.04 ARM64 · 128GB RAM
OpenClaw + OrgLoop · GitHub + Slack + Cloudflare

The pipeline doesn't just ship code. It reasons about what to build, evaluates what it built, and remembers what it learned.