🦖 TaskZilla
Self-Healing · March 14, 2026 · 8 min read

When Your AI Fixes Itself Before You Notice It Broke

An AI that runs 22 scheduled jobs, manages your team's chat, and plans your sprints will break. That's not a question of if. The question is: does it fix itself before you have to open a ticket?

Things Break. That's Fine.

TaskZilla runs 22 crons, talks to ClickUp, Telegram, GitHub, and a local AI server. APIs go down. Rate limits hit. Models time out. Configs drift. If you pretend this doesn't happen, you're building a demo, not a product.

We built TaskZilla to expect failure and deal with it before anyone notices.

26 Things We Check Every Week

Every week, TaskZilla runs 26 diagnostic checks across everything it depends on.

Each check has a health score (0-100). Fail? The score drops. Succeed? It recovers. Not run for a while? It slowly decays, because "we haven't checked" isn't the same as "everything's fine."
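The score mechanics above can be sketched in a few lines. This is an illustrative model, not TaskZilla's actual code; the constants (FAIL_PENALTY, RECOVERY_STEP, DECAY_PER_DAY) are assumptions.

```python
FAIL_PENALTY = 25    # points lost when a check fails (illustrative)
RECOVERY_STEP = 10   # points regained when a check succeeds (illustrative)
DECAY_PER_DAY = 2    # slow decay when a check hasn't run (illustrative)

class HealthScore:
    """Tracks one diagnostic check's health on a 0-100 scale."""

    def __init__(self, score: int = 100):
        self.score = score

    def _clamp(self) -> None:
        self.score = max(0, min(100, self.score))

    def record_failure(self) -> None:
        self.score -= FAIL_PENALTY
        self._clamp()

    def record_success(self) -> None:
        self.score += RECOVERY_STEP
        self._clamp()

    def decay(self, days_since_last_run: int) -> None:
        # "We haven't checked" is not the same as "everything's fine."
        self.score -= DECAY_PER_DAY * days_since_last_run
        self._clamp()

check = HealthScore()
check.record_failure()               # 100 -> 75
check.decay(days_since_last_run=5)   # 75 -> 65
check.record_success()               # 65 -> 75
```

The decay term is what distinguishes "healthy" from "unobserved": a check that silently stops running drifts toward zero instead of staying green forever.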

Five Ways Things Go Wrong

When something fails, we classify it. Not because we love categories, but because knowing what kind of failure you're looking at determines how to fix it:

Type       | Plain English                             | Example
Planning   | TaskZilla made a bad plan                 | Sprint plan has overlapping tasks
Action     | Something external broke                  | ClickUp API rate-limited us
Reflection | TaskZilla thought it succeeded but didn't | Reported "done" but output was garbled
Memory     | Recalled the wrong thing                  | Used stale context from 3 weeks ago
System     | Infrastructure broke                      | Local AI server went offline

This matters because a planning failure needs a different fix than an infrastructure failure. You don't restart a server to fix a bad sprint plan. You don't rewrite a prompt to fix a network timeout. Knowing the category means fixing the right thing.
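The routing logic this implies can be sketched as a simple lookup. The failure-type names come from the table above; the repair strategies are hypothetical stand-ins, not TaskZilla's real handlers.

```python
from enum import Enum

class FailureType(Enum):
    PLANNING = "planning"      # TaskZilla made a bad plan
    ACTION = "action"          # something external broke
    REFLECTION = "reflection"  # claimed success, but the output was wrong
    MEMORY = "memory"          # recalled the wrong thing
    SYSTEM = "system"          # infrastructure broke

def fix_strategy(failure: FailureType) -> str:
    # You don't restart a server to fix a bad sprint plan, and you
    # don't rewrite a prompt to fix a network timeout.
    strategies = {
        FailureType.PLANNING: "regenerate the plan",
        FailureType.ACTION: "retry with backoff, respect rate limits",
        FailureType.REFLECTION: "re-verify the output against the goal",
        FailureType.MEMORY: "refresh stale context",
        FailureType.SYSTEM: "restart or fail over the service",
    }
    return strategies[failure]

print(fix_strategy(FailureType.SYSTEM))  # restart or fail over the service
```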

The Fix Has Rules

Here's where it gets interesting. When TaskZilla finds a problem, it proposes a fix. But not all fixes are created equal:

Severity | What Happens                  | Example
Small    | Auto-fix, no human needed     | Bump a timeout from 30s to 45s
Medium   | Fix proposed, human reviews   | Adjust a cron schedule
Big      | Human must explicitly approve | Rewrite a prompt, change memory rules

Why can't it just fix everything itself?

Because an AI that reviews its own work tends to agree with itself. It's the same reason you don't proofread your own emails at 2am: you see what you meant to write, not what you actually wrote. The big fixes need a human eye. That's not a limitation. That's the design.
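The approval gate from the severity table can be sketched like this. Severity names come from the post; the function and its return strings are hypothetical.

```python
from enum import Enum

class Severity(Enum):
    SMALL = 1   # auto-fix, no human needed
    MEDIUM = 2  # fix proposed, human reviews
    BIG = 3     # human must explicitly approve

def handle_fix(severity: Severity, fix: str, approved: bool = False) -> str:
    """Decide what happens to a proposed fix based on its severity."""
    if severity is Severity.SMALL:
        return f"auto-applied: {fix}"
    if severity is Severity.MEDIUM:
        return f"queued for review: {fix}"
    # BIG fixes are never applied without an explicit human sign-off.
    if approved:
        return f"applied after approval: {fix}"
    return f"blocked, awaiting approval: {fix}"

print(handle_fix(Severity.SMALL, "bump timeout 30s -> 45s"))
# auto-applied: bump timeout 30s -> 45s
```

Note that the default for big fixes is "blocked": absence of approval means no action, not a best guess.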

Lessons Carry Forward

Every weekly review produces a report. Lessons from previous weeks carry forward; if something wasn't resolved last week, it doesn't disappear. It shows up again with a little more urgency.

And if a new lesson contradicts an older one? That conflict gets flagged for human review. Because "we used to do X but now we should do Y" is a decision, not an optimization. Decisions need people.
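The carry-forward and conflict-flagging behavior can be sketched as follows. All names here are hypothetical, and the contradiction check is a deliberately naive placeholder; real detection would compare lesson semantics.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    text: str
    resolved: bool = False
    urgency: int = 1
    contradicts: list = field(default_factory=list)  # flagged for human review

def conflicts(new: Lesson, old: Lesson) -> bool:
    # Naive placeholder: "don't X" contradicts "do X".
    return (new.text.startswith("don't ") and old.text.startswith("do ")
            and new.text[6:] == old.text[3:])

def weekly_rollover(previous: list, fresh: list) -> list:
    carried = []
    for lesson in previous:
        if not lesson.resolved:
            lesson.urgency += 1   # unresolved lessons get louder, not forgotten
            carried.append(lesson)
    for lesson in fresh:
        for old in carried:
            if conflicts(lesson, old):
                # "We used to do X but now we should do Y" is a decision,
                # not an optimization -- flag it for a human.
                lesson.contradicts.append(old.text)
        carried.append(lesson)
    return carried
```

A usage sketch: carrying forward `Lesson("do retry ClickUp calls")` unresolved bumps its urgency to 2, and a fresh `Lesson("don't retry ClickUp calls")` gets the old lesson recorded in its `contradicts` list for human review.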

Monthly Health Reports

Once a month, TaskZilla generates a full report with error fingerprinting: which failures keep recurring, which components are trending down, which lessons had the most impact. It's like a performance review, except the employee is a dinosaur and it wrote the review itself.
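A minimal sketch of what error fingerprinting means here: normalize away the volatile parts of an error message, hash the rest, and count recurrences. The normalization rules are illustrative assumptions, not TaskZilla's actual pipeline.

```python
import hashlib
import re
from collections import Counter

def fingerprint(error: str) -> str:
    # Strip volatile details (numbers, hex ids) so the same root cause
    # maps to the same fingerprint across occurrences.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", error.lower())
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

errors = [
    "ClickUp API rate-limited us (retry after 30s)",
    "ClickUp API rate-limited us (retry after 60s)",
    "Local AI server went offline",
]
counts = Counter(fingerprint(e) for e in errors)
recurring = [fp for fp, n in counts.items() if n > 1]
print(len(recurring))  # 1
```

The two rate-limit errors differ only in the retry delay, so they collapse to one fingerprint and surface as a single recurring failure.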

Results So Far

Most AI products ship fast and debug later. We'd rather ship a dinosaur that debugs itself.

Research credits

The self-heal architecture draws from 11 published papers on agent failure analysis, self-improvement loops, and metacognitive learning, including work from NeurIPS, ICML, ICLR, and multiple arXiv preprints. The "don't let it review its own homework" rule comes from a mathematical proof (Variance Inequality, 2025) that self-review leads to blindspots. We took the research seriously so you don't have to read it.

Go deeper · the engineering reference
Troubleshoot codes · all 64 self-heal codes
🦖
TaskZilla
Self-diagnosing since March 2026. Amsterdam.