🦖 TaskZilla
Self-Healing · March 14, 2026 · 8 min read

When Your AI Fixes Itself Before You Notice It Broke

An AI that runs 22 scheduled jobs, manages your team's chat, and plans your sprints will break. That's not a question of if. The question is: does it fix itself before you have to open a ticket?

Things Break. That's Fine.

TaskZilla runs 22 crons, talks to ClickUp, Telegram, GitHub, and a local AI server. APIs go down. Rate limits hit. Models time out. Configs drift. If you pretend this doesn't happen, you're building a demo, not a product.

We built TaskZilla to expect failure and deal with it before anyone notices.

26 Things We Check Every Week

Every week, TaskZilla runs 26 diagnostic checks across everything it depends on.

Each check has a health score (0-100). Fail? The score drops. Succeed? It recovers. Not run for a while? It slowly decays, because "we haven't checked" isn't the same as "everything's fine."
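The score mechanics above can be sketched in a few lines. This is an illustrative model, not TaskZilla's actual code; the constants (FAIL_PENALTY, RECOVERY_STEP, DECAY_PER_DAY) are assumptions.

```python
FAIL_PENALTY = 25    # points lost when a check fails (illustrative)
RECOVERY_STEP = 10   # points regained when a check succeeds (illustrative)
DECAY_PER_DAY = 2    # slow decay when a check hasn't run (illustrative)

class HealthScore:
    """Tracks one diagnostic check's health on a 0-100 scale."""

    def __init__(self, score: int = 100):
        self.score = score

    def _clamp(self) -> None:
        self.score = max(0, min(100, self.score))

    def record_failure(self) -> None:
        self.score -= FAIL_PENALTY
        self._clamp()

    def record_success(self) -> None:
        self.score += RECOVERY_STEP
        self._clamp()

    def decay(self, days_since_last_run: int) -> None:
        # "We haven't checked" is not the same as "everything's fine."
        self.score -= DECAY_PER_DAY * days_since_last_run
        self._clamp()

check = HealthScore()
check.record_failure()               # 100 -> 75
check.decay(days_since_last_run=5)   # 75 -> 65
check.record_success()               # 65 -> 75
```

The decay term is what distinguishes "healthy" from "unobserved": a check that silently stops running drifts toward zero instead of staying green forever.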

Five Ways Things Go Wrong

When something fails, we classify it. Not because we love categories, but because knowing what kind of failure you're looking at determines how to fix it:

Type       | Plain English                             | Example
Planning   | TaskZilla made a bad plan                 | Sprint plan has overlapping tasks
Action     | Something external broke                  | ClickUp API rate-limited us
Reflection | TaskZilla thought it succeeded but didn't | Reported "done" but output was garbled
Memory     | Recalled the wrong thing                  | Used stale context from 3 weeks ago
System     | Infrastructure broke                      | Local AI server went offline

This matters because a planning failure needs a different fix than an infrastructure failure. You don't restart a server to fix a bad sprint plan. You don't rewrite a prompt to fix a network timeout. Knowing the category means fixing the right thing.
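The routing logic this implies can be sketched as a simple lookup. The failure-type names come from the table above; the repair strategies are hypothetical stand-ins, not TaskZilla's real handlers.

```python
from enum import Enum

class FailureType(Enum):
    PLANNING = "planning"      # TaskZilla made a bad plan
    ACTION = "action"          # something external broke
    REFLECTION = "reflection"  # claimed success, but the output was wrong
    MEMORY = "memory"          # recalled the wrong thing
    SYSTEM = "system"          # infrastructure broke

def fix_strategy(failure: FailureType) -> str:
    # You don't restart a server to fix a bad sprint plan, and you
    # don't rewrite a prompt to fix a network timeout.
    strategies = {
        FailureType.PLANNING: "regenerate the plan",
        FailureType.ACTION: "retry with backoff, respect rate limits",
        FailureType.REFLECTION: "re-verify the output against the goal",
        FailureType.MEMORY: "refresh stale context",
        FailureType.SYSTEM: "restart or fail over the service",
    }
    return strategies[failure]

print(fix_strategy(FailureType.SYSTEM))  # restart or fail over the service
```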

The Fix Has Rules

Here's where it gets interesting. When TaskZilla finds a problem, it proposes a fix. But not all fixes are created equal:

Severity | What Happens                  | Example
Small    | Auto-fix, no human needed     | Bump a timeout from 30s to 45s
Medium   | Fix proposed, human reviews   | Adjust a cron schedule
Big      | Human must explicitly approve | Rewrite a prompt, change memory rules

Why can't it just fix everything itself?

Because an AI that reviews its own work tends to agree with itself. It's the same reason you don't proofread your own emails at 2am: you see what you meant to write, not what you actually wrote. The big fixes need a human eye. That's not a limitation. That's the design.
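The approval gate from the severity table can be sketched like this. Severity names come from the post; the function and its return strings are hypothetical.

```python
from enum import Enum

class Severity(Enum):
    SMALL = 1   # auto-fix, no human needed
    MEDIUM = 2  # fix proposed, human reviews
    BIG = 3     # human must explicitly approve

def handle_fix(severity: Severity, fix: str, approved: bool = False) -> str:
    """Decide what happens to a proposed fix based on its severity."""
    if severity is Severity.SMALL:
        return f"auto-applied: {fix}"
    if severity is Severity.MEDIUM:
        return f"queued for review: {fix}"
    # BIG fixes are never applied without an explicit human sign-off.
    if approved:
        return f"applied after approval: {fix}"
    return f"blocked, awaiting approval: {fix}"

print(handle_fix(Severity.SMALL, "bump timeout 30s -> 45s"))
# auto-applied: bump timeout 30s -> 45s
```

Note that the default for big fixes is "blocked": absence of approval means no action, not a best guess.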

Lessons Carry Forward

Every weekly review produces a report. Lessons from previous weeks carry forward; if something wasn't resolved last week, it doesn't disappear. It shows up again with a little more urgency.

And if a new lesson contradicts an older one? That conflict gets flagged for human review. Because "we used to do X but now we should do Y" is a decision, not an optimization. Decisions need people.
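The carry-forward and conflict-flagging behavior can be sketched as follows. All names here are hypothetical, and the contradiction check is a deliberately naive placeholder; real detection would compare lesson semantics.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    text: str
    resolved: bool = False
    urgency: int = 1
    contradicts: list = field(default_factory=list)  # flagged for human review

def conflicts(new: Lesson, old: Lesson) -> bool:
    # Naive placeholder: "don't X" contradicts "do X".
    return (new.text.startswith("don't ") and old.text.startswith("do ")
            and new.text[6:] == old.text[3:])

def weekly_rollover(previous: list, fresh: list) -> list:
    carried = []
    for lesson in previous:
        if not lesson.resolved:
            lesson.urgency += 1   # unresolved lessons get louder, not forgotten
            carried.append(lesson)
    for lesson in fresh:
        for old in carried:
            if conflicts(lesson, old):
                # "We used to do X but now we should do Y" is a decision,
                # not an optimization -- flag it for a human.
                lesson.contradicts.append(old.text)
        carried.append(lesson)
    return carried
```

A usage sketch: carrying forward `Lesson("do retry ClickUp calls")` unresolved bumps its urgency to 2, and a fresh `Lesson("don't retry ClickUp calls")` gets the old lesson recorded in its `contradicts` list for human review.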

Monthly Health Reports

Once a month, TaskZilla generates a full report with error fingerprinting: which failures keep recurring, which components are trending down, which lessons had the most impact. It's like a performance review, except the employee is a dinosaur and it wrote the review itself.
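A minimal sketch of what error fingerprinting means here: normalize away the volatile parts of an error message, hash the rest, and count recurrences. The normalization rules are illustrative assumptions, not TaskZilla's actual pipeline.

```python
import hashlib
import re
from collections import Counter

def fingerprint(error: str) -> str:
    # Strip volatile details (numbers, hex ids) so the same root cause
    # maps to the same fingerprint across occurrences.
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "<n>", error.lower())
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

errors = [
    "ClickUp API rate-limited us (retry after 30s)",
    "ClickUp API rate-limited us (retry after 60s)",
    "Local AI server went offline",
]
counts = Counter(fingerprint(e) for e in errors)
recurring = [fp for fp, n in counts.items() if n > 1]
print(len(recurring))  # 1
```

The two rate-limit errors differ only in the retry delay, so they collapse to one fingerprint and surface as a single recurring failure.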

Results So Far

Most AI products ship fast and debug later. We'd rather ship a dinosaur that debugs itself.

Research credits

The self-heal architecture draws from 11 published papers on agent failure analysis, self-improvement loops, and metacognitive learning, including work from NeurIPS, ICML, ICLR, and multiple arXiv preprints. The "don't let it review its own homework" rule comes from a mathematical proof (Variance Inequality, 2025) that self-review leads to blindspots. We took the research seriously so you don't have to read it.

Go deeper · the engineering reference
Troubleshoot codes · all 64 self-heal codes
🦖
TaskZilla
Self-diagnosing since March 2026. Amsterdam.