Troubleshooting agents

Updated June 11, 2026 10:08

When an agent run goes wrong, almost everything you'll see falls into one of four failure modes: instructions not being followed, runs that loop, runs that are slow, or tools that don't get called or return bad output. This article walks through each, with the specific checks to run before reaching out.

Beta note: This article covers the Keragon Agents early-access release. Behaviour, limits, and UI may change during the early access period; this page is updated on a rolling basis as we observe real failures with early adopters.

If a tool inside your agent can't reach an external system, the issue is usually authentication rather than the agent itself — see Troubleshooting authentication issues in agents.

Reading the error first: when a run shows an error, open it and read the message — error text that used to appear as a bare "pending" now includes context explaining what happened. Start there before working through the failure modes below.

My agent is not following my instructions

TL;DR: Almost always instruction ambiguity — make the goal, inputs, and stop condition more specific so the agent can't pick a different interpretation than you intended.

The most common reason an agent goes off-script is instruction ambiguity, not the model ignoring you. Agents read your instructions, decide on a plan, and then execute — if any step is open to interpretation, the agent will pick an answer, which may not be the one you wanted.

For a deeper guide on writing clear instructions, see Write your agent's instructions.

What to check:

Be specific about the goal. "Summarise new patient intakes" leaves a lot open. "When a new patient intake form is submitted in Healthie, post a one-line summary to the #patient-intake Slack channel with the patient's name, intake type, and submission timestamp" is unambiguous.
Spell out the inputs. If your agent needs to operate on a specific field, list it. If a tool has a required input that should always take a known value (e.g. a specific Slack channel, a specific Healthie patient list, a specific Athena Health practice ID), name that value in your instructions explicitly rather than leaving the agent to guess it at runtime.
Constrain the scope. Tell the agent what not to do. "Do not contact the patient directly", "Only act on records created in the last 24 hours", "Stop after the Slack message is sent" are all valid guardrails.
State the success criterion. "Done means the Slack message has been posted and the Healthie patient has been tagged 'notified'" gives the agent something to verify against before it stops.
Match the tools to the job. An agent will only use tools you've attached to it. If your instructions reference a step the agent has no tool for, it'll either skip the step, hallucinate the result, or get stuck — none of which you want.

Quick test: read your instructions aloud and imagine a brand-new contractor doing the work. If you'd need to answer follow-up questions, the agent will too — and it'll answer them by guessing.

My agent is stuck in a loop

TL;DR: The agent can't tell what "done" looks like, or it's retrying a config error. Add a clear stop condition and open the run history to spot repeated failing tool calls.

Agents iterate: the model decides on a step, calls a tool, reads the result, decides on the next step, and so on. Keragon enforces internal caps on how many steps an agent can take in a single run as a safety net. When the run reaches a cap, it terminates automatically with status "limit reached". The cap is a backstop, not a fix: a run that hits it has almost always gone wrong earlier.
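The iterate-until-capped pattern described above can be sketched in a few lines. This is an illustration only, not Keragon's runtime: `run_agent`, `decide_next_step`, `call_tool`, and the `MAX_STEPS` value are all hypothetical names and numbers chosen for the example.

```python
from typing import Callable, Optional

MAX_STEPS = 25  # hypothetical value; Keragon's real caps are internal


def run_agent(decide_next_step: Callable[[list], Optional[dict]],
              call_tool: Callable[[dict], dict],
              max_steps: int = MAX_STEPS) -> dict:
    """Loop: decide on a step, call the tool, read the result, repeat."""
    history = []
    for _ in range(max_steps):
        step = decide_next_step(history)  # the model picks the next step
        if step is None:                  # the model judged the task done
            return {"status": "completed", "steps": len(history)}
        result = call_tool(step)          # the result feeds the next decision
        history.append({"step": step, "result": result})
    # The cap fired: the run never reached a stop condition on its own.
    return {"status": "limit reached", "steps": len(history)}


# An agent with no stop condition always asks for another page of results.
never_done = lambda history: {"tool": "search", "page": len(history)}
print(run_agent(never_done, lambda step: {"ok": True}, max_steps=5))
# prints {'status': 'limit reached', 'steps': 5}
```

The point of the sketch is that the cap only bounds the damage. Whether the loop exits early depends entirely on the agent being able to recognise "done".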

The cap isn't the only guard. Keragon also watches for repeated actions inside a run: when the agent makes the same tool call with the same inputs about three times, it gets nudged to change course instead of repeating itself. This catches repeats even when other steps sit between them — an agent cycling between two write actions (post to the #patient-intake Slack channel, update the same Healthie record, post the same message again) gets nudged just like one hammering a single tool. The nudge breaks the repetition; it doesn't fix the cause — if the loop came from an unclear stop condition or a failing tool, work through the steps below.
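One way to picture the repeat guard is a counter keyed on the tool name plus its inputs, so that identical calls are counted even when other steps sit between them. The sketch below is an assumption about how such a check could work, not Keragon's implementation; `find_repeats`, `call_key`, the tool names, and the exact threshold of 3 are hypothetical.

```python
import json
from collections import Counter

REPEAT_THRESHOLD = 3  # the article says "about three"; exact value assumed


def call_key(tool: str, inputs: dict) -> str:
    """Canonical key so identical inputs match regardless of key order."""
    return tool + ":" + json.dumps(inputs, sort_keys=True)


def find_repeats(calls: list, threshold: int = REPEAT_THRESHOLD) -> list:
    """Return the indexes of calls at which a nudge would fire."""
    seen = Counter()
    nudges = []
    for index, (tool, inputs) in enumerate(calls):
        key = call_key(tool, inputs)
        seen[key] += 1
        if seen[key] >= threshold:
            nudges.append(index)
    return nudges


# An agent cycling between two write actions (hypothetical tool names).
post = ("slack_post", {"channel": "#patient-intake", "text": "New intake"})
update = ("healthie_update", {"record": "rec_1", "tag": "notified"})
print(find_repeats([post, update, post, update, post]))  # prints [4]
```

Because the count is per call rather than per consecutive run, the interleaved update step does not reset it. You can apply the same idea by eye when scanning the run history for a loop.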

Why agents loop:

Ambiguous instructions. When the agent can't tell what "done" looks like, it keeps trying. We've observed runs that iterated 300+ times against an unclear instruction set before the cap kicked in.
Tool returns an error but the agent retries. When a tool fails (e.g. Athena Health returns a 400, Slack returns "channel not found"), the agent often retries with slightly different inputs instead of stopping. If the underlying cause is config (wrong field name, missing scope, bad ID), every retry will fail the same way.
Missing input the agent has to guess. If a tool requires an input the agent has to fill in at runtime, and the instructions don't name it, the agent may try several values in sequence — each one a turn.
Open-ended search. Instructions like "find all the relevant records" without a clear stop condition can send the agent paging through results until the cap.

What to do when you see a loop:

Open the run history and look at the steps. The same tool being called repeatedly with small input changes is the giveaway.
Tighten the success criterion. Add a clear stop condition — for example, "Stop once the patient has been created in Healthie and the appointment confirmation has been posted to the #patient-intake Slack channel."
Name the value in your instructions wherever you know it — specific Slack channel, specific Healthie patient list, specific Athena Health practice ID. Don't make the agent rediscover what you already know.
Fix the underlying tool error if one is present — see My agent fails to call a tool below.

My agent is too slow

TL;DR: Too many tool calls, slow tool calls, or both. Open the run, count the turns, and narrow whichever tool is the bottleneck.

Agent runs are I/O-bound: most of the wall-clock time is the model waiting on tool calls, not the model thinking. A slow agent almost always means too many tool calls, slow tool calls, or both.

What to check:

How many turns is the run taking? Open the run and look at the turn count. A well-scoped agent often finishes in 2–5 turns. A run that takes 12 or more turns is probably doing too much.
Which tool is slow? If one specific tool (e.g. a large Athena Health patient search, a long Slack history fetch) is taking many seconds, the agent will inherit that latency on every call. Narrow the input — search by ID, filter by date, paginate — so the call returns faster.
Is the agent re-doing work? Look for repeated calls to the same read tool with similar inputs across turns. That usually means the agent isn't holding onto results well and is re-fetching. Tighter instructions help.
Is the agent's planning step doing more than needed? Agents plan before acting, and you'll see this step in the run history. If the plan is huge, the run will be slow. Scoping the instructions narrows the plan.

The agent runtime itself can support runs of up to two hours per activity, with internal heartbeats keeping long tool calls alive — so "slow" rarely means "stuck." But if a run takes more than a few minutes for what feels like a small task, the four bullets above are where to start.
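If you copy the steps of a slow run into a list, the first two checks above (how many turns, which tool is slow) reduce to a small tally. The sketch below assumes a hypothetical step record with `tool` and `seconds` fields; the article does not describe an export format, so treat the field names and tool names as placeholders for whatever you note down from the run view.

```python
from collections import defaultdict


def summarize_run(steps: list) -> dict:
    """Count turns and total seconds per tool, slowest tool first."""
    seconds_by_tool = defaultdict(float)
    calls_by_tool = defaultdict(int)
    for step in steps:
        seconds_by_tool[step["tool"]] += step["seconds"]
        calls_by_tool[step["tool"]] += 1
    ranked = sorted(seconds_by_tool.items(), key=lambda item: item[1], reverse=True)
    return {
        "turns": len(steps),
        # Each entry is (tool, total seconds, number of calls).
        "slowest_first": [(tool, secs, calls_by_tool[tool]) for tool, secs in ranked],
    }


# Hypothetical timings noted from a run view.
steps = [
    {"tool": "athena_patient_search", "seconds": 9.0},
    {"tool": "slack_post", "seconds": 0.5},
    {"tool": "athena_patient_search", "seconds": 8.0},
]
summary = summarize_run(steps)
print(summary["turns"])             # prints 3
print(summary["slowest_first"][0])  # prints ('athena_patient_search', 17.0, 2)
```

A tool that tops the list with many calls points to re-fetching. One that tops it with a single call points to an input that needs narrowing.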

My agent fails to call a tool

TL;DR: Three possibilities — the tool errored and the agent kept going, the agent guessed wrong inputs, or the tool wasn't called at all. Open the run view first to see which.

This section covers three failure modes that look similar from the outside.

(a) The tool was called, but it returned an error — and the agent kept going anyway.

This is the one to watch most closely during the early access period. When a Keragon tool errors (for example, Athena Health returns a 400 or Slack returns "channel not found"), Keragon's tool layer returns the error as a structured response to the agent rather than crashing the run. That's deliberate — it lets the agent recover from transient failures and try again — but it has a known downside: the agent may not fail clearly. It can continue and produce a final answer based on the error, sometimes hallucinating the data the tool was supposed to return.

What to check in the run view:

Look for an error icon on any tool call step. That's the signal — something went wrong even if the final message looks normal.
Open the failed tool-call step — it now shows the error message and the reason the call failed, not just the arguments it was sent. Read it to see exactly what went wrong (for example, Athena Health: 400 — invalid department id, or Slack: channel_not_found). If there's an error here, the call failed and the agent saw the failure but kept going.
Compare the agent's final output to the actual tool results. If the output references records or values that don't appear in any successful tool call, treat it as unreliable.
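The last check above amounts to a set difference: any value the final output mentions that no successful tool call returned. A minimal sketch, assuming you have noted the values by hand; the record shape (`values`, `error`), the tool names, and the IDs are hypothetical and not a Keragon format.

```python
def ungrounded_values(final_values: set, tool_calls: list) -> set:
    """Return values in the final output that no successful call returned."""
    returned = set()
    for call in tool_calls:
        if call.get("error"):  # failed calls contribute nothing trustworthy
            continue
        returned.update(call.get("values", []))
    return final_values - returned


# One call succeeded, one failed, and the output cites a record from neither.
tool_calls = [
    {"tool": "healthie_lookup", "values": ["pt_101"]},
    {"tool": "athena_search", "error": "400: invalid department id"},
]
print(ungrounded_values({"pt_101", "pt_202"}, tool_calls))  # prints {'pt_202'}
```

Anything the function returns is a candidate for hallucinated data and should be treated as unreliable until a successful tool call backs it up.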

What to do:

Don't trust the final output until you've confirmed every tool call succeeded. During the early access period, this is the single most important debugging habit.
Fix the underlying tool error. Most are config issues: wrong subdomain, wrong field name, missing permission scope, bad authentication (see Troubleshooting authentication issues in agents).
Add a fail-stop instruction. Something like "If any tool call returns an error, stop and report the error. Do not proceed with summary data." — this won't fix the platform behaviour but reduces the chance of hallucinated output.
Re-run after fixing. Tool errors are usually deterministic; if the same call fails twice with the same inputs, the config is wrong, not transient.

(b) The tool produced wrong output because the agent guessed the inputs.

Some tools have inputs the agent has to fill in at runtime — for example, picking a Slack channel ID, an Athena Health practice ID, or an enum value like a Healthie appointment type. If your instructions don't name those inputs explicitly, the agent has to guess from your instructions plus whatever the tool's schema tells it. Today the agent doesn't see the list of valid enum values for tools that have them, so it may pass a value the tool then rejects (or, worse, a plausible-but-wrong value the tool accepts).

What to do:

Name the value in your instructions as explicitly as you can. "Use Athena Health appointment status scheduled" is better than "use the right appointment status" — the agent will copy a literal value but has to guess at a description.
Spell out exact channel IDs, list names, practice IDs. If the agent should always write to one Slack channel, name that channel by ID; if it should always look up records in one Healthie patient list, name the list ID. Anything you can hard-code in the instructions is one fewer thing the agent can get wrong.
Re-check the run to see which inputs the agent chose. If it picked something unexpected, you've found your fix.

(c) The tool was never called at all.

Less common, but worth checking: if the agent was supposed to call a tool and didn't, the issue is almost always upstream of the tool itself. Either the agent never had the tool available, or it decided the tool wasn't needed. The most common causes are: the tool isn't attached to the agent, the tool's description doesn't match the language in your instructions, or the instructions never explicitly tell the agent to use it. The last two are "not following instructions" problems; see My agent is not following my instructions above.

What's next

Submit feedback, support questions, or bug reports: reach out to our customer support and product team to share feedback, ask a question, or report a bug. Reports are prioritized during early access.
