AI agents have a dirty secret: they get stuck. They call the same failing tool 10 times in a row. They send empty queries. They burn through their step budget producing nothing useful. Most frameworks let this happen silently. NIOM’s ToolHealthMonitor watches every tool call and self-corrects in real time.

How it works

Every tool call → health.record({ tool, success, args, error })

Before each LLM step → health.check()
  → Warning?       → Inject corrective guidance into the prompt
  → Circuit-break? → Disable the failing tool entirely
  → Abort?         → Terminate the run early
The key insight: corrections flow through AI SDK’s native prepareStep callback. The LLM sees the guidance naturally as part of its next prompt — not as a hack bolted on after the fact.
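Concretely, that flow can be pictured as a pure mapping from a health verdict to the next step's configuration, which is then handed to `prepareStep` (AI SDK's per-step hook for adjusting the active tools and prompt). Everything below — the `Verdict` and `StepPlan` shapes and the `applyVerdict` helper — is a hypothetical sketch of the idea, not NIOM's actual API:

```typescript
// Hypothetical verdict emitted by health.check() before each LLM step.
type Verdict =
  | { level: "ok" }
  | { level: "warning"; guidance: string }
  | { level: "circuit-break"; tool: string; guidance: string }
  | { level: "abort"; reason: string };

interface StepPlan {
  activeTools: string[]; // tools the LLM may call on this step
  systemSuffix?: string; // corrective guidance appended to the prompt
  abort?: string;        // terminate the run early with this reason
}

// Map a verdict onto the next step, in the spirit of AI SDK's prepareStep.
function applyVerdict(allTools: string[], v: Verdict): StepPlan {
  switch (v.level) {
    case "ok":
      return { activeTools: allTools };
    case "warning":
      // The LLM sees the guidance as part of its next prompt.
      return { activeTools: allTools, systemSuffix: v.guidance };
    case "circuit-break":
      // The failing tool disappears from the available set.
      return {
        activeTools: allTools.filter((t) => t !== v.tool),
        systemSuffix: v.guidance,
      };
    case "abort":
      return { activeTools: [], abort: v.reason };
  }
}
```

Keeping this mapping pure means the correction logic can be unit-tested without a model in the loop; only the thin `prepareStep` wiring touches the LLM.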

The escalation chain

Problems start small and NIOM responds proportionally:
| Level | What triggers it | What NIOM does |
| --- | --- | --- |
| ⚠️ Warning | 3 consecutive failures of the same tool | Adds to prompt: “This tool has failed 3 times — try a different approach” |
| 🔌 Circuit-break | 5 consecutive failures of the same tool | Removes that tool entirely from the available set |
| 🔁 Duplicate detection | Same tool called with identical arguments | “You’ve already tried this exact call — change your approach” |
| 📭 Empty input | 2+ calls with empty required arguments | “Stop and think about what inputs you actually need” |
| 🛑 Abort | >70% failure rate across 6+ tool calls | Run terminated — “this approach isn’t working” |
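The thresholds above can be sketched as a decision function over the recorded call history. The `CallRecord` shape and the `assess` and `isDuplicate` helpers are illustrative assumptions, not NIOM's internals (empty-input tracking is omitted for brevity):

```typescript
// One entry per tool call, as captured by health.record().
interface CallRecord {
  tool: string;
  success: boolean;
  args: Record<string, unknown>;
}

type Level = "ok" | "warning" | "circuit-break" | "abort";

// Escalation thresholds as described in the table above.
function assess(history: CallRecord[], tool: string): Level {
  const total = history.length;
  const failures = history.filter((c) => !c.success).length;

  // Abort: >70% failure rate across 6+ tool calls.
  if (total >= 6 && failures / total > 0.7) return "abort";

  // Count consecutive trailing failures of this specific tool.
  let streak = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const c = history[i];
    if (c.tool !== tool) continue;
    if (c.success) break;
    streak++;
  }
  if (streak >= 5) return "circuit-break"; // disable the tool
  if (streak >= 3) return "warning";       // inject corrective guidance
  return "ok";
}

// Duplicate detection: same tool already called with identical arguments.
// (JSON.stringify is key-order sensitive; a real implementation would
// normalize the args before comparing.)
function isDuplicate(history: CallRecord[], next: CallRecord): boolean {
  const key = next.tool + JSON.stringify(next.args);
  return history.some((c) => c.tool + JSON.stringify(c.args) === key);
}
```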

What counts as a failure?

NIOM catches both hard failures and soft failures:
  • Hard failure: Tool throws an exception (network error, permission denied, timeout)
  • Soft failure: Tool returns successfully but with no useful result (0 search results, an error field in the output)
This distinction is critical. Most agents treat “0 search results” as success because no exception was thrown. NIOM recognizes it as a failure and guides the agent to try something different.
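One way to implement that distinction is to wrap each tool call and classify its result. The heuristics below — empty arrays, a null result, an `error` field — are illustrative assumptions, not NIOM's exact rules:

```typescript
interface Outcome {
  success: boolean;
  soft?: boolean;  // true when the call "succeeded" but returned nothing useful
  error?: string;  // set on hard failures (an exception was thrown)
}

// Heuristics for a result that technically succeeded but is useless.
function isSoftFailure(result: unknown): boolean {
  if (result == null) return true;
  if (Array.isArray(result) && result.length === 0) return true; // e.g. 0 search results
  if (typeof result === "object" && "error" in (result as object)) return true;
  return false;
}

// Wrap a tool call so hard and soft failures both record success: false.
async function classify(call: () => Promise<unknown>): Promise<Outcome> {
  try {
    const result = await call();
    return isSoftFailure(result)
      ? { success: false, soft: true }
      : { success: true };
  } catch (e) {
    return { success: false, error: String(e) };
  }
}
```

With this wrapper, “0 search results” feeds into the same escalation chain as a thrown exception, which is exactly what lets the monitor steer the agent away from a dead-end query.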

What it looks like in practice

| Without self-healing | With ToolHealthMonitor |
| --- | --- |
| Agent sends 10 identical blank searches | Caught after 2 — “stop and think about what you need” |
| webSearch keeps failing, wastes all steps | Circuit-broken after 5 — agent switches to fetchUrl |
| 80% of tool calls fail, run “completes” with nothing | Aborted at 70% failure rate — no wasted compute |
| User sees empty results, no explanation | Clear badges: “(empty query)”, “failed”, “circuit-broken” |
Self-healing is always active during background task execution — where there’s no human watching. For chat interactions, it provides gentle corrections without interrupting the conversation.