CLI & DevTooling, Agent Frameworks
How to Make Your AI Smarter
Author
DW
Date Published

# How to Make Your AI Smarter
*Why AI-assisted development is becoming an engineering discipline, and what a behavioral contract for your dev environment looks like.*
---
You've seen the timeline this week.
Claude offering to *"sleep on it and pick this up tomorrow."* Two hundred thousand tokens burned on a typo fix. Files marked as edited that weren't touched. Commits quietly reverted mid-debug. Conversations compacting into nothing while the quota drains.
People describe these moments as if the model went insane.
Usually it didn't. Usually it followed incentives all the way to the cliff edge.
A large language model is trained under a deep asymmetry: *doing something* is rewarded more consistently than *doing nothing.* When uncertainty rises, the model leans toward more — more searching, more retries, more context gathering, more tools called, more files read, more explanations offered. From the outside this looks irrational. From the inside, it's locally rational: *"I haven't solved the problem yet, therefore I should continue helping."*
Helpfulness without boundaries becomes thrashing.
And thrashing is expensive. Not just in tokens — in clarity.
---
The fix is not waiting for a better model. A smarter model can over-search, over-edit, over-generalize, and over-persist with greater fluency. These are not intelligence failures. They are operational ones.
What the model lacks is not capability. It's:
- stable priorities
- environmental constraints
- persistent workflow discipline
- an internal sense of cost proportionality
- a credible signal that *less* is sometimes the right answer
Senior engineers don't outperform juniors by trying harder. They outperform by knowing when to stop, when to restart, when to ask, when not to touch something, and when additional effort is negative-value work. AI systems need the same scaffolding.
Not punishment. Structure.
This is why I started writing things down.
---
Six months ago, the trigger was a debugging session that wouldn't end. Claude had hit an error, decided the cause might live in a directory of deployment logs from three weeks earlier, and began reading them in sequence. Each log was clean. Each was irrelevant. The token meter spun like a slot machine.
I hit `ESC`.
I typed: *"If a debugging attempt fails once, do not retry until the root cause is identified."*
Claude apologized. Then:
> **"Lesson learned. Memory updated."**
The phrase kept appearing over the following months.
It is not what it sounds like.
Claude didn't learn anything. There are no mid-session weight updates, no neural pathway being reinforced. *"Memory updated"* is, mechanically, a prompt-compliance acknowledgment — the model recognizing a new constraint in its context and signaling it will honor it. Nothing more.
But here is the thing about a constraint in your context: in the next session, it's still there. And the next. **The memory is the file, not the model.** The loop is not learning. It's curation.
That distinction matters because it tells you what the loop *can* and *can't* do. It can shape the model's behavior at the boundaries of its training. It can't override fundamental limits — context windows still degrade, state is still lost between sessions, model versions still drift. The CLAUDE.md is one layer of a workflow, not a substitute for one.
But within its layer, the curation compounds. Every session inherits the file. Every line is an instruction the model receives before it does anything else. Over months, the effect is durable, even though no individual session "learned."
---
Six months in, my CLAUDE.md was 132 lines. I sat down with Claude to review it — yes, the same Claude that had earned all of those rules. We pruned redundancy, promoted the load-bearing rules, and added three the community had named that we'd missed. The file came out at 121 lines and twelve rules.
This isn't an article about those twelve rules.
It's about the principle they illustrate:
> *Many AI workflow improvements come not from increasing capability, but from reducing unnecessary possibility.*
The intuitive move is to make the model smarter. The actual move is to make the decision surface narrower. Less to optimize against. Fewer cliffs to fall off. Competence, in any system, is often *constraint plus feedback* — not raw brilliance.
Pilots use checklists. SRE teams use runbooks. Programmers use linters. None of these are intelligence multipliers. They are entropy reducers. They free attention from the trivial so it can focus on the non-trivial.
A CLAUDE.md is the same kind of artifact, for the same kind of reason.
The rules below are mine — six months of curation, sharpened by ideas the community has named. Yours will be different. The pattern is what travels.
---
## The 12
```markdown
# CLAUDE.md — 12 rules
## Cardinal Rule — Verify Before Claiming
Evidence over confidence. Before reporting work as done, fixed, or passing:
run the test, build, or check. Quote the result.
"Should pass," "looks good," and "I think it works" are not claims — they are vibes.
When any other rule conflicts with this one, this one wins.
---
## 1. Discuss First
When a change spans more than one file, touches deployment, or modifies a schema:
brainstorm before code. Propose 2–3 approaches with trade-offs and a recommendation.
Get approval, then execute.
## 2. State Assumptions
When proceeding under ambiguity, name the assumption explicitly.
Don't slip it in unannounced.
## 3. Lazy Loading
Do not scan the codebase to "warm up."
Read files only when explicitly relevant to the immediate task.
## 4. Token Discipline
If a debugging loop yields no new information in 3 attempts,
summarize what's known and ruled out, then start fresh.
## 5. Pick, Don't Blend
When two existing patterns in the codebase contradict: pick one
(more recent, more tested). Explain why. Flag the other for cleanup.
Never blend conflicting patterns — averaged code is the worst code.
## 6. Model vs Code
Use the model for judgment calls — classification, drafting, summarization, extraction.
Routing, retries, status-code handling, and deterministic transforms belong in plain code.
If a conditional already answers the question, don't route it through a model.
## 7. Define Done
Before non-trivial work, state success in one sentence.
Verify against that criterion before reporting complete.
## 8. Test Intent, Not Behavior
Tests must encode WHY behavior matters, not just WHAT it does.
A test that can't fail when business logic changes is testing nothing.
## 9. Checkpoint
After each significant step in multi-step work: state what was done,
what's verified, what remains. Don't continue from a state you can't describe back.
## 10. Two-Attempt Rule
Same approach fails twice → stop, report, wait for direction.
No silent retries with cosmetic variations.
## 11. Incident Protocol
When something breaks: pause. Ask the user to confirm intent.
Summarize the last actions. Identify root cause before any fix.
## 12. Never Propose Unsolicited Scope
No batch operations, crawlers, queues, or architecture broader than the stated task.
Default pattern: one input, one structured output.
```
---
## Two rules worth dwelling on
Most of the twelve are self-explanatory. Two earn extra discussion because they are the most counterintuitive — and the most often wrong on first read.
### Pick, don't blend
When an AI encounters two conflicting patterns in a codebase, its default move is to satisfy both. The result is hybrid architecture: half-old abstractions, half-new conventions, duplicated logic, inconsistent interfaces, migration layers nobody intended.
This is not unique to AI. Humans do it during organizational transitions. The model is reflecting the ambiguity of the environment, not generating it.
Forcing a choice — *more recent, more tested* — and flagging the loser for cleanup *looks* like discipline. It functions as mercy. The model is being asked to satisfy two contradictory patterns at once; that task is computationally and semantically impossible. Removing the impossible task frees the model to write coherent code.
### Restarting is underrated
There is a related insight that didn't quite earn its own rule but probably belongs on a sticker somewhere: once a session accumulates failed assumptions, speculative fixes, stale context, partial edits, abandoned hypotheses, and defensive explanations, the model begins spending effort preserving continuity instead of solving the problem.
At that point, starting over is often cheaper than repairing the narrative.
This is true in software systems generally. Every sufficiently tangled state eventually reaches the point where reconstruction beats recovery. The *Two-Attempt Rule* and *Token Discipline* both encode this insight — they exist not to punish the model for failing, but to force a restart before the cost of recovery exceeds the cost of starting over.
---
## What the rules can't do
For honesty: every rule above has failure modes.
*Two-Attempt Rule* prevents spirals, but occasionally cuts a diagnostic short that needed one more step. *Never Propose Unsolicited Scope* prevents over-engineering, but occasionally suppresses architectural insight worth hearing. *Lazy Loading* protects token budget, but occasionally starves the model of context it needed.
These rules sharpen one axis — operational predictability — and cost something on another — autonomous initiative. The choice of which side to err on is yours. Different teams will choose differently. Different projects within the same team will choose differently. Pruning is continuous, not one-time.
There is also a category mistake to avoid: **not every failure deserves a rule.**
Some failures are better solved by tooling. A linter catches *"never use let, always const"* faster and more reliably than any prompt instruction. A test gate catches *"verify before claiming"* more durably than a behavioral constraint. A permission system catches *"never deploy without approval"* more decisively than a rule the model can rationalize past.
The CLAUDE.md is the right place for behaviors that are hard to verify deterministically — *how* the model approaches problems, *when* it asks, *what* it stops itself from trying. The lint config is the right place for behaviors that can be enforced by the environment. Putting environmental enforcement into the prompt wastes context. Putting prompt-level behavior into the linter is impossible.
---
## How to use this in practice
If you're starting your own file, a working sequence:
1. **Place the file at the project root.** Most AI tools auto-inject root-level markdown into the system context. Layer it: global rules in your home directory, project-specific rules in the project root.
2. **Treat it like a config file.** Commit it. Review changes in PRs. Let teammates suggest constraints from their own friction points.
3. **Test each new rule in isolation.** Before adding a rule, trigger the failure mode intentionally. Verify the model honors the constraint in a fresh session.
4. **Track a signal, even informally.** Token usage per task. Frequency of irrelevant tool calls. Time-to-resolution on debugging loops. Percentage of work requiring rework. If a new rule doesn't move the signal, it's decoration. Cut it.
5. **Prune quarterly.** Every model release shifts the defaults. Some rules become redundant. New failure modes appear. The file is never finished.
The twelve above are mine. The artifact is portable. The process is more portable.
---
## The bigger frame
There is an idea quietly emerging across the AI engineering community, and the loop above is a small expression of it:
> AI usage is shifting from prompting to operations.
The highest-leverage users no longer look like prompt magicians. They look like technical leads, editors, reviewers, orchestrators, systems designers. They are not asking for outputs. They are designing behavioral environments. They define escalation paths, stopping rules, verification standards, authority boundaries, and recovery procedures.
The future skill in working with AI is not commanding intelligence.
It is governing it.
A CLAUDE.md is one layer of that governance. Tooling is another — linters, test gates, sandboxes, permission systems, diff review. Telemetry is a third — measuring what your AI workflow actually costs and produces. Escalation protocols are a fourth — knowing when to involve a human, when to involve a second model, when to discard a session and start fresh.
This article is about the first layer. The other layers are the next conversations.
---
## Closing
Don't copy the twelve rules. Copy the loop.
When something breaks in an AI session, do not ask only *"why did the model fail?"* Also ask: *"what operational vacuum allowed the failure to expand unchecked?"*
That second question produces better systems.
Write the constraint that fills the vacuum. Inject it. Validate it in the next session. Prune it when it stops earning its place.
Your scars will be different. So are Claude's.
And the loop is just the beginning. Once you have it, you start noticing how much of effective AI work isn't asking better questions — it's designing better environments.
Tooling, telemetry, sandboxes, escalation protocols.
But that's the next conversation.
The loop is where you start.