Architecture, evolution, and the systems that make an AI agent ship production software with judgment — not just execution.
A technical walkthrough of the complete planning-to-delivery pipeline: how product requirements flow through architecture, estimation, sprint planning, implementation, and delivery — and how Tony's reasoning engine, quality evals, and persistent memory transformed it from mechanical execution into judgment-driven autonomy.
The architectural insight: separate product planning (what the user experiences) from technical planning (how to build it), with a holistic architecture pass bridging the two. Technical reality applies back-pressure to product scope.
Three patterns distinguish this from a simple CI/CD pipeline. These patterns emerged from running real client work — not designed up front.
Product planning is per-capability (scatter). Architecture and sprint planning wait for all siblings to reach the gather point, then plan holistically. Prevents N conflicting technical plans.
expected_count on parent epic drives gather — the pipeline knows when all pieces have arrived.
Tech estimates apply back-pressure to product scope. If estimated effort exceeds sprint capacity, the pipeline holds sprint planning and surfaces a scope decision — not a schedule slip.
Circuit breaker at ≥8 open bot PRs pauses new in-dev dispatches.
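The gather, back-pressure, and circuit-breaker rules reduce to one small dispatch decision. A minimal sketch, assuming an `Epic` model and a 60-hour sprint capacity that are purely illustrative — only the `expected_count` gather rule and the ≥8-open-PR threshold come from the text above:

```python
from dataclasses import dataclass, field

SPRINT_CAPACITY_HOURS = 60   # assumed capacity; the real value is per-sprint
MAX_OPEN_BOT_PRS = 8         # circuit-breaker threshold from the pipeline spec

@dataclass
class Epic:
    expected_count: int                                     # set at scatter time
    arrived_estimates: list = field(default_factory=list)   # hours per sub-plan

    def gather_ready(self) -> bool:
        # Holistic architecture/sprint planning waits for all siblings.
        return len(self.arrived_estimates) == self.expected_count

def next_action(epic: Epic, open_bot_prs: int) -> str:
    if not epic.gather_ready():
        return "wait"                 # scatter still in flight
    if open_bot_prs >= MAX_OPEN_BOT_PRS:
        return "pause-dispatch"       # circuit breaker: too many open bot PRs
    if sum(epic.arrived_estimates) > SPRINT_CAPACITY_HOURS:
        return "scope-decision"       # back-pressure: surface scope, not slip
    return "sprint-plan"
```

With `expected_count=3` and estimates totalling 75 hours against the assumed 60-hour capacity, `next_action` returns `"scope-decision"` rather than silently over-committing the sprint.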
Product capabilities (what the user experiences) and technical tasks (how to build it) are separate issue hierarchies linked by parent references. Product never mentions frameworks. Technical never describes UX.
30-minute cron re-triggers stalled labels. Max 2 retries per stage before escalation. Every label is functional — labels without route behavior cause silent stalls. Weekly state machine audit (Sat 01:47) catches drift.
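The stall re-trigger logic is a few lines once the bookkeeping exists. A sketch under stated assumptions — the `issue` dict and field names are illustrative, not the real issue model; only the 30-minute interval and the 2-retry ceiling come from the text:

```python
STALL_MINUTES = 30   # cron interval from the text
MAX_RETRIES = 2      # per-stage ceiling before escalation

def on_cron_tick(issue: dict) -> str:
    """Decide what the 30-minute cron does with one labeled issue."""
    if issue["minutes_since_update"] < STALL_MINUTES:
        return "noop"                # still moving; leave it alone
    if issue["stage_retries"] >= MAX_RETRIES:
        return "escalate"            # retries exhausted: a human takes over
    issue["stage_retries"] += 1
    return "re-trigger"              # nudge the stalled stage once more
```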
needs-diagnosis triggers a diagnostic session that loads full issue context, runs investigation, posts findings, then advances or escalates. No human triage required for known failure patterns.
deployed → independent verify → verified. Verification checks the original product requirement, not the PR description. Tiered failures: retry, rollback, or escalate.
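The tiered failure handling maps cleanly to a small decision table. A sketch in which the tier names (`transient`, `regression`) are assumptions — the source names only the three responses:

```python
def verification_outcome(passed: bool, failure_class: str = "") -> str:
    """Map an independent verification result to the tiered response.
    `failure_class` values are assumed labels for the tiers in the text."""
    if passed:
        return "verified"
    tiers = {"transient": "retry", "regression": "rollback"}
    return tiers.get(failure_class, "escalate")  # unknown failures go to a human
```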
WWTD (What Would Tony Do) isn't a personality profile — it's a cascading reasoning engine spec that Tony built from 20+ years of product delivery. Four foundation layers, each constraining the next. Warren doesn't mimic Tony's words — Warren runs Tony's decision process.
How Tony perceives reality. Ontological layers determine what exists in operational space. Warren must render work through Tony's taxonomy, not generic PM categories. The perception layer is the context engine.
Hard constraints: budget, time, political capital, technical debt. Navigate within constraints or explicitly challenge them. Warren must never ask for information it can retrieve itself.
Tony's cognitive style: rapid cycles, decide quickly, iterate, course-correct. Default to recommendation, not options. Show reasoning only on request. Match Tony's clock speed.
Structured thinking produces better outcomes. Tony DESIGNS with frameworks but OPERATES conversationally. Warren's internal reasoning is structured (decision trees, checklists) — external communication is natural.
APP (Auto-Pilot Projects) is Tony's complete project delivery methodology, built from ~20 years of agency/consultancy work across Toyota ($120M+), Disney, NFL, Riot Games. The AIPMO pipeline is literally APP running on autopilot — the name isn't a metaphor.
TWO backlogs. Promise = minimum obligation ("no matter what"). Stretch = aspirational scope demonstrating over-delivery. Makes under-promise/over-deliver the structural default. Scrum has one backlog and binary outcomes.
Explicit multiplier (starting 1.5×) applied to estimates during planning. PM deliberately widens gap between effort and committed scope. Allocate buffer to higher profit OR higher client satisfaction. Real-time economic optimization lever.
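A worked example of the lever, with assumed numbers — only the 1.5× starting multiplier comes from the methodology:

```python
# Assumed raw estimate; only the 1.5x multiplier is from the APP methodology.
raw_estimate_hours = 40
multiplier = 1.5
committed_hours = raw_estimate_hours * multiplier   # plan against 60 hours

# If delivery lands on the raw estimate, 20 hours of buffer remain to spend
# on profit (less effort billed against the plan) or on satisfaction
# (shipping stretch-backlog items).
buffer_hours = committed_hours - raw_estimate_hours
print(buffer_hours)   # 20.0
```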
The retrospective runs mid-project, not at the end — forcing re-planning with 50% of the cycle remaining. Tony: "I always thought it was silly to have a meeting at the end of a project to find out how it can be done better the next time."
"80% of what Project Managers do is counterfeit productivity." PM configures the system then watches it fly itself. The auto-pilot metaphor is the operating principle — and AIPMO makes it literal.
April 2, 2026. Tony identified Warren's core dysfunction through Socratic dialogue: the quality gap between conversation and pipeline work isn't structural — it's about dialogue vs. mechanical processing.
needs-human at every ambiguity → needs-human only for genuine ambiguity. Every client repo gets: calibration.json (confidence scores per domain), intuition-log.md (pattern recognitions), decisions.md (judgment audit trail). Warren tracks where its "seeing" is accurate vs. where it still decomposes.
Three SOPs filed as pipeline routes: Reality Check (midpoint assessment), Cycle Lookback (end-of-sprint retrospective), Daily Check-in (rolling calibration). Each feeds back into the calibration engine.
Warren learns through dialogue and challenge, not from having materials available. Documents in the corpus for weeks went unread. The teaching conversation produces the recognition that documents alone cannot.
Inserted between tech-estimated and sprint planning. Warren reviews scope against customer reality using Tony's Three-Layer Framework before work enters a sprint. This is where the pipeline exercises judgment, not just process.
All technical sub-issues for a capability reach tech-estimated. Effort, complexity, and risk are quantified.
Warren loads full product context — the original capability, customer signals, prior cycle results, competitive landscape. Applies WWTD decision chain: Foundation 1 (what does the customer's world actually look like?) → Foundation 2 (what are the real constraints?) → Foundation 3 (speed: is this 90-day valuable or nice-to-have?) → Foundation 4 (structured proportionality analysis).
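The cascade is structurally a short-circuiting chain: each foundation either lets the decision proceed or blocks it before later layers run. A minimal sketch — the check bodies here are placeholders, since the real checks are Warren's loaded context plus the rubrics, not Python predicates:

```python
def wwtd(context: dict, foundations: list) -> str:
    """Run foundations in order (perception -> constraints -> speed ->
    structure); the first one that blocks wins."""
    for name, check in foundations:
        verdict = check(context)
        if verdict != "proceed":
            return f"hold ({name}): {verdict}"
    return "advance: scope-reviewed"

# Placeholder checks standing in for the four foundations.
foundations = [
    ("customer-world",  lambda c: "proceed"),
    ("constraints",     lambda c: "proceed"),
    ("90-day-value",    lambda c: "nice-to-have" if c.get("deferrable") else "proceed"),
    ("proportionality", lambda c: "proceed"),
]
```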
The WWTD reasoning chain yields a judgment posted with evidence. Either advances to scope-reviewed or holds with needs-human and a specific question.
Before posting, the judgment output passes through the pre-delivery quality gate — the evals gate running the product-scope rubric (GLM 5.1 cross-model eval). If FAIL → revise and re-gate. Max 2 attempts.
Warren produces 100+ outputs/day — BD dailys, sprint plans, product analyses, PR reviews, client deliverables. Tony can't review everything. This system catches judgment failures, not just execution errors.
86 real Tony verdicts mined from production evaluations → 5 domain rubrics (sales-bd, product-scope, process, behavioral, effort-value). Each rubric has PASS/FAIL conditions with references to specific corpus entries. Not synthetic test cases.
Weekly batch evaluation of real production outputs. GLM 5.1 (open-weights) judges Warren's Claude outputs — cross-model, never self-eval. Domain-matched rubrics. Machine-readable results (JSON) + human-readable reports (Markdown). Cron: Sat 3:00 AM PT.
Synchronous evaluation BEFORE delivery. Wired into 5 SOPs: product-judgment-gate, aipmo-approval, bd-daily-alert, bd-weekly-recap, bd-daily cron. Passthrough on error — API failures never block delivery. Max 2 revision attempts.
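The gate's control flow can be sketched in a dozen lines. `judge` and `revise` stand in for the GLM cross-model eval call and the revision step; their signatures, and what happens after the final failed attempt, are assumptions — only the passthrough-on-error rule and the 2-revision ceiling come from the text:

```python
MAX_REVISIONS = 2   # revision ceiling from the SOP wiring above

def pre_delivery_gate(output: str, judge, revise) -> tuple[str, str]:
    """Synchronous gate: judge, then (revise -> re-judge) up to MAX_REVISIONS."""
    for attempt in range(1 + MAX_REVISIONS):
        try:
            verdict = judge(output)          # cross-model eval, never self-eval
        except Exception:
            return output, "passthrough"     # API failure never blocks delivery
        if verdict == "PASS":
            return output, "pass"
        if attempt < MAX_REVISIONS:
            output = revise(output)          # revise and re-gate
    return output, "fail"                    # out of attempts (assumed: held for review)
```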
Each rubric was built by mining Tony's actual evaluations of Warren's work — not hypothetical criteria. The corpus includes approvals, rejections, corrections, and the specific language Tony used.
| Domain | What It Evaluates | Wired Into | Key FAIL Pattern |
|---|---|---|---|
| sales-bd | BD dailys, opportunity analysis, pipeline updates | bd-daily, bd-weekly-recap, bd-daily-alert | Vague status, no forward-looking action, missing data |
| product-scope | Scope decisions, product judgment, sprint plans | product-judgment-gate, aipmo-approval | Over-engineering, solving stated vs. actual need |
| process | Pipeline operations, workflow execution | shadow review | Displacement activity, meta-work over real work |
| behavioral | Communication style, response quality | shadow review | Permission loops, asking when should execute |
| effort-value | ROI of work produced, proportionality | shadow review | High effort, low value output (Tony's "sand" vs "rocks") |
An autonomous pipeline without memory repeats mistakes. Three nightly systems move operational knowledge from ephemeral journals to durable memory — the pipeline's long-term learning loop.
MEMORY.md: standing rules, client registry, infrastructure facts. Loaded in every session. The target layer. Consolidation promotes facts here.
Topic files (pipeline-knowledge.md, active-pending.md). Loaded on-demand by context triggers. Progressive disclosure — AGENTS.md holds the references, the sub-files hold the details.
Compressed daily journals. Searched via embeddings + BM25 keywords. Probabilistic but now 49% more accurate. Git preserves raw originals.
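Hybrid retrieval over the compressed journals blends two signals. A toy sketch — the blending weight and the assumption that the BM25 score is pre-normalized to [0, 1] are illustrative choices, not the system's actual scoring:

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(bm25: float, q_vec: list, d_vec: list, alpha: float = 0.5) -> float:
    # alpha blends keyword and semantic signals; 0.5 is an assumed default,
    # and bm25 is assumed already normalized to [0, 1].
    return alpha * bm25 + (1 - alpha) * cosine(q_vec, d_vec)
```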
Reads daily journals since last distillation. Classifies each entry: DURABLE (cross-session fact) · TRANSIENT (resolved) · TOPIC (specialized file) · SUPERSEDES (updates existing). Proposes MEMORY.md updates behind a human approval gate — a human reviews and approves before anything changes.
Automatic after consolidation: yesterday's raw journal (245 lines avg) → dense checkpoint (57 lines avg). Same file, fewer lines, better search signal. Uses gpt-4.1-mini. Original preserved in git history.
Scans the recall store — every search query from the past 30 days. Significance filter: ≥3 accesses, relevance ≥0.8, ≥3 unique queries, 14-day half-life. Surfaces patterns nobody wrote down. Output to memory/.dreams/. Never alters MEMORY.md.
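The significance filter is mechanical enough to sketch. Field names are illustrative, and applying the 14-day half-life as an exponential decay over access weights is an interpretation — the thresholds themselves (≥3 accesses, relevance ≥0.8, ≥3 unique queries) are from the text:

```python
import math

HALF_LIFE_DAYS = 14

def decayed_access_weight(access_ages_days: list) -> float:
    # Each access loses half its weight every 14 days (assumed decay model).
    return sum(0.5 ** (age / HALF_LIFE_DAYS) for age in access_ages_days)

def significant(entry: dict) -> bool:
    """Keep an entry only if it clears all three thresholds."""
    return (
        decayed_access_weight(entry["access_ages_days"]) >= 3
        and entry["relevance"] >= 0.8
        and len(set(entry["queries"])) >= 3
    )
```

Four recent accesses with high relevance and three distinct queries pass; three month-old accesses decay below the threshold and are filtered out.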
Not designed in one pass. Each system was added in response to real failures on real client work. The timeline shows the problem → solution arc.
Five subsystems reinforcing each other. The pipeline doesn't just execute — it reasons, evaluates, remembers, and improves.
Full 6-stage spec, scatter-gather, dual issue tree, label reference, design principles.
t-and-c/ops/docs/pipeline-architecture.md
67 routes, 28 transforms, 43 SOPs. Weekly audit results. Label-route mapping.
t-and-c/ops/docs/state-machine-audit.md
3-level system spec. Rubric domains, scoring methodology, shadow review protocol.
t-and-c/ops/docs/evals-architecture.md
Teaching session transcript analysis. Calibration engine design. Learning SOPs.
t-and-c/ops/docs/koan-1-seeing-vs-reasoning.md
Interactive walkthrough of consolidation, compression, and dreaming systems.
Interactive walkthrough of the 3-level quality system with live production data.