Field notes from a small agentic software factory

01 — Executive brief

What this is, and why it exists

The hard part of building software with agents isn't writing code — agents write code fine. It's keeping the work coherent across weeks of sessions: each session forgets what the last one decided, the codebase grows past what one person can hold in their head, and agents confidently report "done" on things that aren't. Everything below is the practice built around those three problems.

The split: I shape the work and decide; agents execute and produce evidence; I check it; lessons get written down where the next session will find them. The rules, the loop, and the machinery exist to keep that split working under load.

The scope is small things, well-shipped: news aggregators, internal tools, small-business backends, tracking apps for a community or church or neighborhood, content sites for niche audiences — solo-founder work, up to a few thousand daily users. Not enterprise; not trying to be. Read 01–04 for the argument. Skim 05–08 for the machinery, the sources, and blocks worth copying.

None of this was invented in one prompt or copied wholesale from one book. It came out of months of shipping small projects with AI agents, watching them confidently break things, and writing down what stopped the breakage the second time around. The lineage is software craft and indie practice; each rule had to keep earning its place. — Origin note

The strategy cascade. Each layer narrows the one above it; what's learned at the bottom is fed back as edits to the layers higher up — usually a rule change, a LEARNED entry, or a skill tweak.

02 — North Stars

Thirteen steering rules

The stable rules behind every decision, grouped by what they do. They change slowly; almost everything else is downstream.

Cluster A

Shape the work before building

NS-1

Think before doing. Verify assumptions before optimizing or building deeper; polishing inside a false premise is waste.

NS-2

Prototype-first for risk. Use tracers and spikes when shape is uncertain; a prototype must harden or disappear before ship.

NS-3

Planning depth scales with risk. Appetite sets budget; risk sets ceremony. Small work stays light; irreversible work earns deeper shaping.

Cluster B

Keep the codebase honest

NS-4

Verify, don't claim. Completion requires observed proof, not model confidence.

NS-5

Depth over surface. Deep modules hide complexity behind small, stable interfaces.

NS-6

Loud failure beats silent corruption. Invalid state must become visible before it spreads.

NS-7

Code rots if you let it. Boring and legible holds up under maintenance; clever structures decay first.

Mid-arc move. When the codebase is already past head-capacity, the move isn't "refactor"; it's index-and-shrink. Dump the module map, mark which modules I can still hold, contract the rest behind a narrower interface, then add. The same applies to agent context: name what's load-bearing before adding to it.

Cluster C

Allocate human and model attention

NS-8

Review capacity is scarce. Spend review attention where it changes outcomes.

NS-9

Anchor decisions to knowledge locality. I own vision and direction; agents own tactics inside a written contract.

NS-10

Cheapest-effective per invocation. Use the cheapest model or process that can produce a correct result.

Cluster D

Compound only what earns its keep

NS-11

Outcome over output. Shipped work must move a real metric (users, revenue, an internal pipeline), not just exist.

NS-12

Routine work auto-runs. Automate repeated operations only with idempotency, observability, failure modes, and reconciliation.

NS-13

Compound the substrate. Additions extend, refine, or replace existing machinery instead of ballooning the system.

03 — Fundamentals

The lineage behind the principles

The North Stars aren't original — they're the durable parts of software craft and agentic practice, grouped the same four ways. Lineage, not literature review.

Shape the work before building

Will Cole's Invariants — think then do · don't design without strategy · don't build without a plan · do what you said · make products earn their keep.
Shape Up / 37signals — fixed appetite, variable scope; shape before build.
TDD / Beck — confirm failure for the right reason before fixing.
JTBD / Moesta — "tell me about the last time you…" beats "what do you want?" for finding the real job.

Keep the codebase honest

Pragmatic Programmer — DRY, orthogonality, tracer bullets, crash early.
Clean Code / Architecture — readability, clear boundaries, Boy Scout Rule.
Ousterhout — deep modules with small, stable interfaces.
TDD — verify behavior at boundaries.

Allocate human and model attention

Knowledge locality / PBM — decide where the relevant knowledge lives.
12-Factor Agents — instruction budgets, agent-tool boundaries.
Pocock / agentic engineering — vertical slicing, context discipline.
Review economics — agent output scales faster than human attention.

Compound only what earns its keep

Karpathy / autoresearch loops — fixed budget, ground-truth metric, learn each run.
Tidy First + Boy Scout Rule — each change reduces future cost.
Pieter Levels / indie practice — boring, durable, low-maintenance stacks.
Outcome metrics — value-correlated outcomes over flattering proxies.

04 — Operating loop

From raw idea to shipped, verified work

The path every task follows from idea to shipped work. The slope model below is the gate: a task crosses into Build only when the path is downhill — scoped, contracted, and verifiable.

The loop. Bench catches ideas without interrupting current work. Think + Shape turn a vague idea into a written contract. Build + Verify are where agents do most of the typing — but Verify means a command, a screenshot, or a log line, not a confident report. Done includes the ship boundary and the learning capture that feeds the next loop.

The slope model. Uphill: "add caching somewhere", "fix the slow page", "the agent broke something" — assumptions outweigh facts. Downhill: "add a 60-second TTL to the feed endpoint, verify with `curl -s … | jq .cache`". Build only starts once the path is downhill; until then, the task stays in Shape, even if it looks small.
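The gate can be sketched as a small predicate. The `Task` class and its field names (`scoped`, `contracted`, `verify_cmd`) are hypothetical, chosen only to mirror the three conditions named above; treat this as one possible encoding, not the factory's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative assumptions."""
    title: str
    scoped: bool = False       # the change is bounded to a named target
    contracted: bool = False   # a written shape contract exists
    verify_cmd: str = ""       # the command that will prove the result


def is_downhill(task: Task) -> bool:
    """A task may enter Build only when it is scoped, contracted, and verifiable."""
    return task.scoped and task.contracted and bool(task.verify_cmd.strip())


def next_stage(task: Task) -> str:
    """Uphill tasks stay in Shape, however small they look."""
    return "Build" if is_downhill(task) else "Shape"


vague = Task(title="fix the slow page")
shaped = Task(
    title="add a 60-second TTL to the feed endpoint",
    scoped=True,
    contracted=True,
    verify_cmd="curl -s … | jq .cache",  # elided exactly as in the text above
)

print(next_stage(vague))   # Shape
print(next_stage(shaped))  # Build
```

The point of keeping it this plain is that the gate has no "looks small" escape hatch: a missing verification command alone is enough to hold a task in Shape.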

Appetite is a budget, not an estimate. S / M / L fix how much I'll spend; scope flexes to fit.
The shaped contract is a ceiling. During Build, scope can be cut but never added without re-shaping.
"Done" means more than "output exists." It means verification ran, the learning got captured, the branch is clean, and the artifact is where it belongs.
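A minimal sketch of the last two rules, assuming scope can be represented as a set of named items and "done" as four booleans. `scope_change_allowed`, `DoneEvidence`, and the example item names are hypothetical, introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


def scope_change_allowed(contract_items: set[str], build_items: set[str]) -> bool:
    """The contract is a ceiling: Build may drop items but never add new ones."""
    return build_items <= contract_items


@dataclass
class DoneEvidence:
    """Hypothetical checklist mirroring the four conditions for "done"."""
    verification_ran: bool
    learning_captured: bool
    branch_clean: bool
    artifact_in_place: bool


def missing(evidence: DoneEvidence) -> list[str]:
    """Name every condition that is not yet met, in field order."""
    return [name for name, ok in vars(evidence).items() if not ok]


def is_done(evidence: DoneEvidence) -> bool:
    return not missing(evidence)


contract = {"ttl on feed endpoint", "cache header"}
print(scope_change_allowed(contract, {"ttl on feed endpoint"}))          # True: scope was cut
print(scope_change_allowed(contract, contract | {"admin dashboard"}))    # False: scope was added

evidence = DoneEvidence(True, False, True, True)
print(is_done(evidence), missing(evidence))  # False ['learning_captured']
```

Returning the list of missing conditions, rather than a bare boolean, keeps a premature "done" visible instead of silently rejected.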

05 — Nested authority

Factory-wide abstraction, project-local specificity

Higher layers are factory-wide and slow-changing; lower layers are project-local and concrete. Each layer inherits its parent's rules and specializes them. A conflict between a factory rule and a project's need becomes an explicit local exception with a written reason — never a silent override.

Authority stacks. The default push rule lives at the factory layer; a project that should never push to a remote sets `claude.pushPolicy = local-only` at its own layer. The agent reads the local layer first; the factory rule still loads but defers to the explicit local exception. Override is allowed; silence is not.
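One way to picture the lookup, as a sketch under stated assumptions: the two layers are plain dictionaries, the factory default value `"remote-allowed"` is a placeholder (the text does not name the default), and `resolve` and the `reasons` mapping are hypothetical names.

```python
from __future__ import annotations

# Assumed factory layer; "remote-allowed" is a placeholder default, not from the source.
FACTORY = {"claude.pushPolicy": "remote-allowed"}


def resolve(key: str, factory: dict, project: dict, reasons: dict) -> tuple[str, str]:
    """Return (value, source). A project override must carry a written reason."""
    if key in project and project[key] != factory.get(key):
        if not reasons.get(key, "").strip():
            # Override is allowed; silence is not.
            raise ValueError(f"silent override of {key!r}: add a written reason")
        return project[key], "project"
    return factory[key], "factory"


project = {"claude.pushPolicy": "local-only"}
reasons = {"claude.pushPolicy": "client data; this repo must never leave the machine"}

print(resolve("claude.pushPolicy", FACTORY, {}, {}))            # ('remote-allowed', 'factory')
print(resolve("claude.pushPolicy", FACTORY, project, reasons))  # ('local-only', 'project')

try:
    resolve("claude.pushPolicy", FACTORY, project, {})
except ValueError as err:
    print("rejected:", err)
```

The design choice worth noting is that the missing reason raises instead of falling back to the factory value, so a silent override cannot pass unnoticed.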

06 — Mechanisms

How principles become behavior

A rule that nothing enforces is just a wish. Each mechanism below turns one slice of the rules into something the agent actually does — at a specific moment, with an observable result.

Mechanism	What it enforces	When it fires	Example
Rules	NS-6, NS-8, NS-9	Every turn, ambient	"Never `git push --force`"; "Verify spec claims before building on them" — loaded into the model's context on session start.
Skills	NS-1, NS-4, NS-3	Invoked by name (`/ship`, `/shape`)	`/ship` runs verification, updates plan frontmatter, writes a retro line, refuses to continue if any step fails.
Hooks	NS-4, NS-6, NS-12	Lifecycle events (pre-commit, pre-write, stop)	A pre-commit hook blocks the commit if `git diff --cached --stat` shows staged files the agent didn't intend to stage.
Agents	NS-8, NS-9, NS-10	When a task needs a lens the main thread can't supply	A challenger agent (Codex or Gemini) reviews the plan with a different model before Build begins.
Memory	NS-1, NS-13	On session open; on lesson capture	`LEARNED.md` and the project's `CLAUDE.md` decisions log — read at session start so yesterday's correction reaches today.
Review packets	NS-4, NS-8	Plan review and large-change checkpoints	A Markdown bundle (problem, constraints, risks, acceptance) sent to Codex today and Gemini next week — same input, comparable verdicts.
Ledgers	NS-4, NS-6, NS-11	Ship boundary; async-closure resolution	`.claude/state/ship-ledger.jsonl` row written on every ship: `{ticket, slice, verified_at, verify_cmd, exit_code}` — auditable months later.
Ship boundary	NS-4, NS-11, NS-13	The moment work is marked done	`/ship` verifies, updates plan status, appends to `LEARNED.md`, files leftover TODOs as issues, cleans the branch — atomic, no manual catch-up.
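The ledger row in the table above can be sketched as an append-only JSON Lines writer. The five field names come from the table; the function names, the ticket and slice values, and the `pytest -q` command are hypothetical, and the demo writes to a temporary directory instead of a real `.claude/state/` folder.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_ship_row(ledger: Path, ticket: str, slice_name: str,
                    verify_cmd: str, exit_code: int) -> dict:
    """Append one row per ship; existing rows are never rewritten."""
    row = {
        "ticket": ticket,
        "slice": slice_name,
        "verified_at": datetime.now(timezone.utc).isoformat(),
        "verify_cmd": verify_cmd,
        "exit_code": exit_code,
    }
    ledger.parent.mkdir(parents=True, exist_ok=True)
    with ledger.open("a", encoding="utf-8") as fh:  # "a" opens in append mode
        fh.write(json.dumps(row) + "\n")
    return row


def read_rows(ledger: Path) -> list:
    """Read the ledger back, one JSON object per non-empty line."""
    with ledger.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    ledger = Path(tmp) / ".claude" / "state" / "ship-ledger.jsonl"
    append_ship_row(ledger, "T-101", "feed-cache-ttl", "pytest -q", 0)
    append_ship_row(ledger, "T-102", "feed-cache-headers", "pytest -q", 0)
    rows = read_rows(ledger)
    print(len(rows), sorted(rows[0]))
```

One line per ship keeps the audit question ("what proved this was done, and when?") answerable with a text search months later.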

The list below restates the same eight mechanisms as definitions, for readers who want the categorical shape as well as the table.

Rules

Ambient constraints and stop/ask boundaries. Always loaded; set the guardrails.

Skills

Invoked playbooks for repeatable work. Explicit, executable, fail-loud.

Hooks

Deterministic checks at lifecycle events. The machine does the checking, not the model.

Agents

Role lenses — builder, reviewer, researcher, challenger. Each with a bounded job.

Memory

Evidence, history, and learning. Informs decisions; is not authority by itself.

Review packets

A portable, cross-model review contract so a second model can audit work the same way every time.

Ledgers

Append-only audit trail for completion, closure, and learning. Schema-versioned.

Ship boundary

Where verification, learning capture, cleanup, and artifact publication all happen together — or "done" doesn't fire.
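To make the Hooks definition concrete, here is a sketch of the pre-commit check from the table, assuming the agent's plan supplies a list of intended paths. That `intended` list and the function names are hypothetical; the staged paths are read with `git diff --cached --name-only`, which lists what is in the index.

```python
import subprocess
import sys


def staged_files() -> list:
    """Paths currently staged in the index, as reported by git."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def unintended(staged: list, intended: list) -> list:
    """Staged paths that the plan never mentioned."""
    return sorted(set(staged) - set(intended))


def check(staged: list, intended: list) -> int:
    """Return a process exit code: 0 lets the commit through, 1 blocks it."""
    extra = unintended(staged, intended)
    if extra:
        print("blocked: staged but not intended:", ", ".join(extra), file=sys.stderr)
        return 1
    return 0


# Deterministic demo with fixed inputs, so no git repository is needed here.
print(check(["src/feed.py", "debug.log"], ["src/feed.py"]))      # 1
print(check(["src/feed.py"], ["src/feed.py", "README.md"]))      # 0

# Inside a real hook the call would be, for example:
#   sys.exit(check(staged_files(), intended))
```

The comparison is a plain set difference, so the check gives the same answer every time regardless of what the model believes it staged.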

07 — References

Sources and lineage

Where the ideas come from, with enough context to look up the unfamiliar. Most are useful in parts and overclaim in others; none is adopted wholesale.

Software craft

David Thomas & Andy Hunt — The Pragmatic Programmer (pragprog.com). DRY, orthogonality, tracer bullets, "crash early" — the working engineer's instinct list, still the cleanest book of its kind.

Robert C. Martin ("Uncle Bob") — Clean Code / Clean Architecture (cleancoder.com). Readability and module boundaries as first-class concerns. Polarizing in places; take the boundaries, leave the dogma.

John Ousterhout — A Philosophy of Software Design (Stanford talk). Deep modules with small, stable interfaces. The single most useful frame for keeping agent-generated code honest.

Kent Beck — Test-Driven Development by Example, Tidy First? (tidyfirst.substack.com). Red, green, refactor. "Show me the failing test first" is also how I tell agents to fix bugs. Tidy First adds the discipline of separating structural from behavioural change.

Ryan Singer / 37signals — Shape Up (basecamp.com/shapeup — full book, free). Appetite as a fixed budget, scope as the variable. The frame that made this planning loop possible.

Agentic & product practice

Will Cole — Invariants of Building Software (thetoolsofignorance.xyz/author/willcole, TFTC #465 with Marty Bent). Five strategic invariants every shipping team hits regardless of methodology — co-developed at Stack Overflow with Joel Spolsky and David Fullerton (~2017), refined later at Unchained and Zaprite. The frame under "shape before build" in this factory.

Dex Horthy / HumanLayer — 12-Factor Agents. Instruction budgets, agent-tool boundaries, the agent equivalent of the 12-Factor App rules. github.com/humanlayer/12-factor-agents.

Matt Pocock — agentic-engineering skill pack (github.com/mattpocock/skills, aihero.dev). Imports the useful half of DDD — Ubiquitous Language, ADRs — and pairs it with vertical slicing, short context-windows, and /grill-with-docs for challenging the plan against the domain language. The grill skill driving the design pass for this page is his.

Andrej Karpathy — autoresearch (github.com/karpathy/autoresearch). One editable file, a 5-minute budget per experiment, a fixed ground-truth metric, an agent that tinkers and keeps or discards each change. The shape every eval loop in this factory borrows from.

Clayton Christensen, Bob Moesta, Chris Spiek — Jobs To Be Done / Demand-Side Sales (HBR 2016, jobs-to-be-done.com). "Tell me about the last time you did X" beats "what do you want?" for finding what the work actually is. Used in shaping, not just in product discovery.

Charles Koch — Principle-Based Management (Good Profit, The Science of Success). Decision rights distributed to where the knowledge lives — after Hayek's "dispersed knowledge" thesis. The source for "I own vision; agents own tactics inside a written contract." A live application: Satoshi Pacioli Accounting (TFTC #704, Jason Hugley) running PBM as Claude skills.

Pieter Levels (@levelsio) — Nomad List, Photo AI, Hoodmaps, Remote OK. Boring, durable, low-maintenance stacks; small running things over big never-shipped things; ship-then-fix as the only credibility that matters. The patron saint of "the bottleneck is your mental context window, not your shipping speed."

08 — Copy & adapt

Blocks you can borrow

Reusable pieces, ready to adapt. Drop them into a project's CLAUDE.md, an agent system prompt, or a rules file; rename to your own. If you are starting at week 4 and the build already feels noisy, the Appetite gates and Shape contract blocks pay back fastest.

The 13 North Stars (condensed)

SHAPE BEFORE BUILDING
  NS-1   Think before doing: verify assumptions before going deeper
  NS-2   Prototype-first for risk: tracers/spikes; harden or delete
  NS-3   Planning depth scales with risk: appetite=budget, risk=ceremony

KEEP THE CODEBASE HONEST
  NS-4   Verify, don't claim: observed proof, not confidence
  NS-5   Depth over surface: deep modules, narrow interfaces
  NS-6   Loud failure beats silent corruption
  NS-7   Code rots if you let it: boring & legible over clever

ALLOCATE ATTENTION
  NS-8   Review capacity is scarce: spend it where it changes outcomes
  NS-9   Anchor decisions to knowledge locality
  NS-10  Cheapest-effective per invocation

COMPOUND WHAT EARNS ITS KEEP
  NS-11  Outcome over output: connect to real signal
  NS-12  Routine work auto-runs: idempotent, observable, reconciled
  NS-13  Compound the substrate: extend/refine/replace, don't balloon

The operating loop

Bench  → capture ideas without interrupting current work
Think  → frame the problem; name the outcome, not a proxy
Shape  → reduce risk to a builder contract  ┐
                                            ├─ SHAPE GATE
Build  → execute inside the contract        ┘
Verify → prove behavior; no claim without proof
Done   → ship boundary + learning capture + cleanup

Feedback: Retro → LEARNED → Vault / Rules / Skills → next loop

Appetite gates (fixed budget, variable scope)

S  small   → zero interaction; pick up → verify → auto-ship
M  medium  → draft plan → 1 approval → build → "ready to ship?"
L  large   → co-shape → approve → build → checkpoint → ship

Rule: appetite fixes the budget; if it feels L but scope can be
cut to M, cut it. The shaped contract is a CEILING during build.

Shape contract — minimum fields

Problem               what's broken/missing? (name the outcome)
Appetite              S / M / L
Risk score            novelty + reversibility + deps + interface (0–8)
Solution sketch       rough shape, not a detailed spec
Rabbit holes          what could balloon scope?
No-gos                what we are explicitly NOT doing
Acceptance            checkboxes + verification commands
Verified assumptions  every claim about existing code, grep-checked

  Example row:
    Claim   pipeline.ts caps stories at TARGET_STORIES=7
    Verify  rg -n "TARGET_STORIES\s*=" src/lib/pipeline.ts
    Result  src/lib/pipeline.ts:247:const TARGET_STORIES = 7;
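The Risk score and Verified assumptions fields can be sketched as two small helpers. Two assumptions are baked in and should be read as such: each of the four risk components is scored 0 to 2 (inferred from four components summing to a 0–8 range), and `verify_claim` is a plain-Python stand-in for the `rg -n` check, run here against a throwaway file and not the real `src/lib/pipeline.ts`.

```python
import re
import tempfile
from pathlib import Path


def risk_score(novelty: int, reversibility: int, deps: int, interface: int) -> int:
    """Sum four components; assumes each is scored 0, 1, or 2 (total 0 to 8)."""
    parts = (novelty, reversibility, deps, interface)
    if any(p not in (0, 1, 2) for p in parts):
        raise ValueError("each component must be 0, 1, or 2")
    return sum(parts)


def verify_claim(path: Path, pattern: str) -> list:
    """Return 'name:line:text' for each matching line, similar in spirit to `rg -n`."""
    rx = re.compile(pattern)
    lines = path.read_text(encoding="utf-8").splitlines()
    return [f"{path.name}:{n}:{text}"
            for n, text in enumerate(lines, start=1) if rx.search(text)]


print(risk_score(novelty=1, reversibility=2, deps=0, interface=1))  # 4

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pipeline.ts"  # stand-in file created for the demo
    sample.write_text("// demo stand-in\nconst TARGET_STORIES = 7;\n", encoding="utf-8")
    hits = verify_claim(sample, r"TARGET_STORIES\s*=")
    print(hits)  # ['pipeline.ts:2:const TARGET_STORIES = 7;']
    # An empty result means the claim is unverified and must not go into the contract.
    print(verify_claim(sample, r"MAX_STORIES\s*="))  # []
```

Recording the matched line alongside the claim is what turns "the agent said so" into something a reviewer can re-run.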