Observability · 10 min read

Self-Improving Agents: How an AI Watches Its Own Work

An agent that runs forever needs an inner critic. Cashmere has one — a pulse that scores its own behavior, names its own regressions, and routes its own attention. Here's how that loop works.

[Diagram: the self-observation loop. ACT (briefing · chat · research) → OBSERVE (pulse · metrics · health) → SCORE (red · amber · green) → ADJUST (cadence · prompts · routes)]

Most AI systems fail in the same way: they get worse over time and nobody notices because the failures are subtle, qualitative, and statistical. The model starts producing slightly more formulaic responses. The retrieval starts surfacing slightly less relevant memories. The briefings start saying the same things slightly more often. None of it triggers an exception. None of it breaks a test. The system silently degrades while continuing to look fine.

For a personal agent that runs for years, this is the failure mode that matters most. Static deployments rot. The fix is not "ship perfect software." The fix is to build the observer into the agent itself.

The pulse

Cashmere has a thing called the pulse. It runs as part of the daemon loop, on its own cadence, and it does one job: it looks at recent agent activity and scores it.
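
For concreteness, the cadence gating might look something like this. This is a minimal sketch, not Cashmere's code: the intervals, the daemon_loop and run_pulse names, and the print stand-in are all assumptions.

```python
import time

TICK_INTERVAL = 60        # assumed: the daemon ticks every minute
PULSE_INTERVAL = 15 * 60  # assumed: the pulse scores every 15 minutes

def run_pulse() -> str:
    # Stand-in for the real LLM call that scores recent activity.
    return "all axes green"

def daemon_loop() -> None:
    last_pulse = 0.0  # monotonic timestamp of the last pulse run
    while True:
        # ...normal per-tick work: briefings, chat, extraction, scheduling...
        now = time.monotonic()
        if now - last_pulse >= PULSE_INTERVAL:
            assessment = run_pulse()
            print(assessment)  # real code would persist this to the health log
            last_pulse = now
        time.sleep(TICK_INTERVAL)
```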

The score isn't a single number. It's a structured assessment: what's the briefing engagement rate? Are responses being read? Is the extraction pipeline producing entities at expected rates? Are there patterns in the recent failures? Is the prod binary stale relative to the working tree? Is anything drifting?

Each axis gets a status — red, amber, or green — and a short natural-language summary. The pulse is itself an LLM call, so the summary is composed by the model with full context: it can say things like "engagement dropped because the last 12 briefings were noisy duplicates" rather than "metric X under threshold Y."

Pulse — last tick · 2026-05-02T18:24Z

  • Daemon liveness: 28 ticks in last hour. No worker errors. PID 14322 healthy.
  • Briefing engagement: 2 of last 8 briefings opened. Cadence may be too high — recommending backoff.
  • Memory extraction: 14 entities in last 24h. No empty-result regressions.
  • Deploy lag: Prod binary is 4 commits behind master. Recommending restart.
  • Persona integrity: No bare-name leaks detected in last 50 responses.
  • Tool selection: Fuzzy fallback triggered 0× this hour. Tool descriptions converging.
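
In code, an assessment like the card above might be shaped roughly like this. A minimal sketch: the field names and the Status enum are assumptions, not Cashmere's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    RED = "red"
    AMBER = "amber"
    GREEN = "green"

@dataclass
class AxisScore:
    axis: str       # e.g. "briefing_engagement"
    status: Status
    summary: str    # model-written, grounded in the actual recent content

@dataclass
class PulseAssessment:
    timestamp: str  # e.g. "2026-05-02T18:24Z"
    axes: list[AxisScore]

    def reds(self) -> list[AxisScore]:
        return [a for a in self.axes if a.status is Status.RED]
```

The structure matters less than the summary field: that's where the model's qualitative read lives, and it's what gets quoted back to the operator later.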

From observation to adjustment

Observation alone is just dashboarding. The next step is what makes a system self-improving: the pulse's outputs feed back into the agent's own behavior.

A few examples of how that loop closes in practice:

  • Briefing cadence backoff. When engagement drops, the pulse recommends slowing the cross-thread synthesis cadence. The scheduler reads that signal and adjusts. The briefings get rarer and more important (sketched after this list).
  • Persona drift detection. The pulse looks at recent responses for signs of the agent breaking character (verbatim reproductions of system-prompt fragments, leaked internal identifiers). When it spots one, the responder's prompt template is flagged for revision and an "honest fallback" string takes over until the leak is patched.
  • Tool selection learning. When the harness has to fuzzy-match a tool name — because the model called list_watchitems instead of list_watch_items — the pulse logs it and the tool description gets a short post-pulse review pass (also sketched below).
  • Memory consolidation triggers. When too many semantic facts contradict each other, the pulse recommends a deeper consolidation pass. The consolidation worker runs, prunes, and the next pulse confirms convergence.
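
Two of those adjustments are compact enough to sketch. The thresholds, function names, and difflib cutoff below are illustrative assumptions, not Cashmere's actual values:

```python
import difflib

def adjust_briefing_cadence(interval_s: float, engagement_rate: float) -> float:
    """Back off when briefings go unread; hypothetical thresholds."""
    if engagement_rate < 0.3:               # e.g. 2 of the last 8 briefings opened
        return min(interval_s * 2, 86_400)  # slow down, capped at daily
    if engagement_rate > 0.8:
        return max(interval_s / 2, 3_600)   # speed back up, floored at hourly
    return interval_s

def resolve_tool_name(requested: str, known_tools: list[str]) -> str | None:
    """Fuzzy fallback for near-miss tool names like list_watchitems."""
    if requested in known_tools:
        return requested
    matches = difflib.get_close_matches(requested, known_tools, n=1, cutoff=0.8)
    return matches[0] if matches else None  # any hit here is a signal worth logging
```

Both functions consume signals the pulse already produces (the engagement rate from its briefing axis, the fuzzy-match count from its tool axis), which is what makes the loop closed rather than merely observed.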

None of these adjustments requires a human in the loop. Some fixes, though, need a code change before they can ship — and that's where the second observer comes in.

The second loop: the founder loop

There's a second observer above the pulse — a "founder loop" that watches the pulse itself. Its job is to spot the cases where the daemon's self-correction isn't enough, and to surface those to the human.

Specifically: when the pulse stays red for the same reason across multiple ticks, that's a signal the agent can't fix the problem from inside. The fix lives one rung up the abstraction ladder — in code, in deployment, in architecture. The founder loop's job is to make that fix legible.

Concretely: it's a recurring background job that reads the pulse's recent assessments, identifies the persistently red axes, and produces a succinct human-readable digest. "Briefing engagement has been red for 7 ticks. The daemon has backed off cadence twice. The remaining cause is a duplicate-detection gap in the synthesis worker." That's a code-change recommendation, written by the agent, addressed to its operator.
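
The persistence check itself is simple. A sketch, assuming one JSON assessment per line in health.jsonl, shaped like the axis structure from earlier (both assumptions):

```python
import json
from collections import Counter
from pathlib import Path

def persistent_reds(health_log: Path, window: int = 10, threshold: int = 7) -> list[str]:
    """Return axes that were red in at least `threshold` of the last `window` ticks."""
    recent = health_log.read_text().splitlines()[-window:]
    red_counts: Counter[str] = Counter()
    for line in recent:
        assessment = json.loads(line)
        for axis in assessment["axes"]:
            if axis["status"] == "red":
                red_counts[axis["axis"]] += 1
    return [name for name, count in red_counts.items() if count >= threshold]
```

The counting is trivial. The value is in what gets attached to it: a model-written explanation of why the axis stayed red and what change would clear it.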

  • 6 pulse axes: liveness · engagement · extraction · deploy lag · persona · tool selection
  • 60s tick: the pulse runs as part of the loop, no separate process
  • LLM-scored: each axis gets a model-written summary, not just a numeric threshold

The two failure modes this catches

Building self-observation into the agent catches the two failure modes that traditional monitoring misses.

The first is silent degradation. A response that's slightly worse, a briefing that's slightly more redundant, a retrieval that's slightly less relevant — none of it crosses a numeric threshold, but it would all show up in a model-written assessment because the model can read the actual content.

The second is structural blind spots. The classic example: the pulse itself ran fine for 92 hours while it was completely failing to deliver any of its recommendations to the operator, because the Telegram nudge had been broken at the parser layer. Numeric monitoring would have shown "100% pulse success." Model-written self-observation noticed "I'm producing assessments nobody is reading" and surfaced the gap.

Self-improvement isn't an algorithm. It's the architectural property that the system can name its own failures.

A static deployment is a snapshot of correctness. A self-observing one is a process for staying correct.

Why this matters more for personal agents than any other AI

A consumer chatbot that gets slightly worse over time is fine for the vendor — most users won't notice, and the cohort that does will churn but get replaced. The product survives.

A personal agent has a population of one. The user is the one who notices. The agent's value decays directly into the user's frustration. There is no churn buffer, no aggregate to hide behind. The agent has to notice itself getting worse, because nobody else will.

That's the design constraint. Build the observer in. Let it have opinions. Let it close the loop.


Inside Cashmere: the pulse logic lives in cashmere/daemon/loop.py and the founder loop is implemented as recurring scripts in scripts/ceo-*.py. Health snapshots write to health.jsonl. The dashboard renders the latest assessment.