# Concepts

## What context rot is
As context grows, LLM agents start repeating themselves and producing less useful output. The model is still generating tokens — it just isn't saying anything new. ctxrot makes this visible with two families of signals that are cheap to compute and require no LLM calls of their own:
- Repetition — how much each new completion overlaps with earlier ones
- Efficiency — how much the model outputs relative to the input it receives
## How the callback works
A SQLite database is created at `db_path`. `CtxRotCallback` hooks into DSPy's `BaseCallback` and populates three tables at runtime — a session row on `on_module_start`, an LM call row on `on_lm_end`, and a tool call row on `on_tool_end`.
```
Your DSPy agent  →  CtxRotCallback  →  SQLite  →  TUI dashboard / analysis
  (unchanged)       (just listens)     (local)
```
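A minimal setup sketch, assuming the package exposes `CtxRotCallback` at the top level (the import path, the LM model name, and the toy agent are placeholders; `db_path` and `store_content` are the constructor parameters this page documents):

```python
import dspy
from ctxrot import CtxRotCallback  # assumed import path

# db_path: where the SQLite file is created.
# store_content=True also captures full prompts/completions, which
# repetition analysis needs (see below).
callback = CtxRotCallback(db_path="agent_runs.db", store_content=True)

dspy.configure(
    lm=dspy.LM("openai/gpt-4o-mini"),  # any LM; model name is a placeholder
    callbacks=[callback],
)

agent = dspy.ReAct("question -> answer", tools=[])
agent(question="Why is my agent looping?")  # recorded as one session
```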
Sessions close automatically when the top-level DSPy module returns. The terminal state is recorded as errored if the module raised, and completed otherwise; both `analyze` and `deep-analyze` surface it.
Session state lives in a `ContextVar`, so `asyncify`/`streamify` worker threads each see an isolated session — concurrent agent calls don't stomp on each other.
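For example, two concurrent runs through `dspy.asyncify` produce two separate session rows (a sketch reusing the `agent` from the setup above):

```python
import asyncio
import dspy

async_agent = dspy.asyncify(agent)

async def main():
    # Each call runs in its own worker thread; the ContextVar-backed
    # session state keeps the two recordings isolated.
    await asyncio.gather(
        async_agent(question="Summarize the repo."),
        async_agent(question="List open TODOs."),
    )

asyncio.run(main())
```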
## What gets tracked
| Per LM call | Per tool call | Per session |
|---|---|---|
| Prompt tokens, completion tokens | Tool name, duration | Model, mode (react, chainofthought, …) |
| Cache read / write tokens | Estimated output tokens | Start time, end time |
| Cost, duration | — | Terminal state (completed / errored) |
| (opt) full prompt messages + completion text | (opt) full input JSON + output text | — |
The "opt" rows only populate if you passed store_content=True when constructing the callback.
## Context rot detection
Local signals only. No LLM calls. Token counting uses tokie.
**Requires content capture**

Repetition analysis needs `store_content=True` when you construct `CtxRotCallback`. DSPy structural markers (`[[ ## ... ## ]]`) are stripped before comparison so they don't inflate overlap scores.
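A sketch of what that stripping implies (the exact pattern ctxrot uses is an assumption):

```python
import re

# DSPy structural markers look like "[[ ## field_name ## ]]".
MARKER = re.compile(r"\[\[\s*##\s*[^#\]]*\s*##\s*\]\]")

def strip_markers(text: str) -> str:
    """Remove DSPy field markers so they don't inflate overlap scores."""
    return MARKER.sub(" ", text).strip()
```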
### Repetition — per-iteration
| Metric | What it measures | How |
|---|---|---|
| `ngram_jaccard` | Word-level overlap vs previous completion | Jaccard similarity of word 3-gram sets. > 0.4 = looping. |
| `sequence_similarity` | Character-level similarity vs previous completion | `rapidfuzz.fuzz.ratio` / 100. Catches paraphrased repetition that n-grams miss. |
| `cumulative_max` | Max overlap vs any prior completion | Max `ngram_jaccard` across every earlier iteration. Catches non-consecutive loops. |
`analyze` flags the onset iteration as the first iteration whose `ngram_jaccard` exceeds 0.4.
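Minimal reference implementations of the three signals, as a sketch of what the table describes (ctxrot's internals may differ):

```python
from rapidfuzz import fuzz

def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = text.split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def ngram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of word 3-gram sets; > 0.4 suggests looping."""
    ga, gb = word_ngrams(a), word_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def sequence_similarity(a: str, b: str) -> float:
    """Character-level similarity; catches paraphrased repetition."""
    return fuzz.ratio(a, b) / 100

def cumulative_max(completions: list[str]) -> float:
    """Max overlap of the newest completion vs every earlier one."""
    *earlier, latest = completions
    return max((ngram_jaccard(latest, prev) for prev in earlier), default=0.0)
```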
### Efficiency — per-iteration
The efficiency signal is `efficiency_ratio`: how much the model outputs relative to the input it receives, computed per LM call from completion and prompt tokens. A declining ratio across iterations means the model generates less output relative to its input — a sign the context window is saturated.
`analyze` also reports the initial and final efficiency so you can see drift at a glance.
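A sketch of the computation, assuming the ratio is completion tokens over prompt tokens per LM call (the token-level reading of "output relative to input"); the sample numbers are made up for illustration:

```python
def efficiency_ratio(completion_tokens: int, prompt_tokens: int) -> float:
    # Output produced per token of input for one LM call; guard against
    # a zero-length prompt so the ratio is always defined.
    return completion_tokens / max(prompt_tokens, 1)

# Drift at a glance: (prompt_tokens, completion_tokens) per iteration.
calls = [(1_200, 480), (2_900, 310), (5_100, 150)]
ratios = [efficiency_ratio(c, p) for p, c in calls]
print(f"initial={ratios[0]:.2f} final={ratios[-1]:.2f}")  # 0.40 vs 0.03
```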
### What the metrics are not
- Not a hallucination detector. A high `ngram_jaccard` means the agent is repeating itself, not that the repeated content is wrong.
- Not a universal cost budget. `efficiency_ratio` is a shape metric; declining ratios can be normal for certain prompts (e.g., classification) and still be healthy.
- Not a replacement for manual review. They're triage signals — `deep-analyze` uses them as one input among several when producing its report.