When your agent's facts go stale, who decides what to keep?
May 3, 2026 · The Aurra team
Yesterday we shipped bi-temporal versioning in Aurra. Every memory now carries `valid_from`, `valid_to`, and `superseded_by`. Your agent can query "what was true on April 1?" and walk a fact's full history.
That was Level 1: manual. The developer calls `supersede()` when they know a fact has been replaced. Clean primitive, no ambiguity, no LLM in the safety loop.
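For concreteness, a minimal sketch of that flow. `memories.add` matches the SDK examples later in this post, but the `supersede()` signature, the `.id` attribute on the result, and the `as_of` query parameter are assumptions for illustration, not documented API:

```python
from aurra import Aurra

client = Aurra(api_key="aurra_...")

# Two facts, the second replacing the first.
old = client.memories.add(content="User's daily editor is Vim", topic="preferences")
new = client.memories.add(content="User's daily editor is Emacs", topic="preferences")

# Level 1: the developer decides. No LLM in the loop.
# (Hypothetical signature - the post names supersede() but not its arguments.)
client.memories.supersede(old_id=old.id, new_id=new.id)

# Point-in-time query along the valid-time axis (as_of is an assumed parameter):
# "what was true on April 1?"
facts = client.memories.query(topic="preferences", as_of="2026-04-01")
```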
This morning we shipped Level 2: automatic. Pass `auto_supersede=true` on a write and Aurra runs an LLM classifier against the most semantically similar existing memories. When it's confident the new fact replaces an old one, it supersedes automatically and writes an audit log entry.
Level 2 is in beta. The API and SDKs are live. Live classification is currently behind a server-side flag while we run final production validation. The flag flips on later this week.
Here's how it works, how it benchmarked, and the design tradeoffs we made.
The two failure modes that actually matter
When you're building a memory system that auto-detects supersession, there are two things that can go wrong:
- False positive: the LLM says "supersedes" when it shouldn't. The old memory gets marked superseded. The customer's agent loses a fact it needed. This is unrecoverable without an audit log review.
- False negative: the LLM says "independent" or "refines" when it should have said "supersedes." The new fact saves normally, the old fact stays. The customer can manually call `supersede()` later. Fully recoverable.
These are not symmetric. False positives corrupt customer data silently. False negatives are noise.
The implication: the right metric to optimize is precision on the "supersedes" verdict, not overall accuracy. A system that's 80% accurate but 100% precise on supersedes is safer than one that's 95% accurate overall but wrong in 5% of the supersessions it executes.
That decision shaped everything else.
How the classifier actually decides
The flow on every write that opts in:
- New memory saves normally (gets ID, embedding generated)
- System checks: does the customer's API key allow auto-supersede? Is this category in their excluded list?
- If yes to both: semantic search finds the top 3 most similar existing memories, filtered by tenant
- For each candidate, the LLM classifier returns one of three verdicts:
  - `supersedes` - new fact replaces old fact
  - `refines` - new fact adds detail (don't supersede; both stay)
  - `independent` - new fact is unrelated (don't supersede; both stay)
- Each verdict comes with a confidence score (0.0-1.0) and a one-sentence reasoning string
- Only if `verdict == "supersedes"` AND `confidence >= 0.85` does the system act
- When it acts, the old memory gets marked superseded and an entry is written to the audit log
Three verdicts, not two, because "refines" handles the most common false-positive pattern. "User has a cat" → "User has a black cat named Whiskers" is refinement, not supersession. Both should stay.
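Pulled together, the write path is short. A condensed sketch of the logic above; the callables stand in for internals (semantic search, the LLM classifier, storage, audit), so treat the shapes as illustrative rather than actual Aurra code:

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.85  # server-side minimum; customers can only raise it

@dataclass
class Settings:
    auto_supersede_enabled: bool
    excluded_categories: list
    min_confidence: float = CONFIDENCE_FLOOR

def handle_write(memory, settings, search, classify, mark_superseded, audit):
    """Opt-in write path. The new memory has already saved normally
    (ID assigned, embedding generated); persistence is elided here."""
    # Deterministic gates, in code, before any LLM call.
    if not settings.auto_supersede_enabled:
        return {"ran": False, "skipped_reason": "auto_supersede_disabled"}
    if memory["category"] in settings.excluded_categories:
        return {"ran": False, "skipped_reason": "category_excluded"}

    threshold = max(CONFIDENCE_FLOOR, settings.min_confidence)
    superseded = []
    for old in search(memory, top_k=3):  # top 3 most similar, tenant-filtered
        verdict, confidence, reasoning = classify(old, memory)
        # "refines" and "independent" leave both memories untouched.
        if verdict == "supersedes" and confidence >= threshold:
            mark_superseded(old, superseded_by=memory)
            audit(old, memory, verdict, confidence, reasoning)
            superseded.append(old)
    return {"ran": True, "superseded": superseded}
```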
Per-category opt-out: the safety architecture
The classifier never sees memories in customer-configured excluded categories. The default exclusion list is `["health_medical", "legal_status"]` - categories where wrong supersession could cause real-world harm.
This is deliberately deterministic. The category gate runs in code, before the LLM call. If the system ever touches a medical fact via auto-supersede, it's not because the LLM decided to skip the safety check; it's because the customer explicitly removed `health_medical` from their excluded list. The LLM is never trusted to recognize what's safety-critical.
Customers can configure this per API key:
```python
client.settings.update(
    excluded_categories=["health_medical", "legal_status", "financial"],
    min_confidence=0.92,  # raise the floor (0.85 is the minimum)
)
```
`min_confidence` has a server-side floor of 0.85. Customers can raise it to be more conservative; they cannot lower it. This is one of those decisions that costs nothing to ship and prevents a class of customer support tickets six months later.
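The clamp is a one-line invariant. A sketch of the server-side behavior (illustrative names, not our server code):

```python
PLATFORM_FLOOR = 0.85

def effective_min_confidence(requested: float) -> float:
    # Customers can be stricter than the platform, never looser.
    return max(PLATFORM_FLOOR, requested)

assert effective_min_confidence(0.92) == 0.92  # raising the bar: allowed
assert effective_min_confidence(0.50) == 0.85  # lowering it: clamped to the floor
```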
The benchmark
Before shipping, we built a 121-case hand-labeled test set covering 25 categories:
- Clear supersessions - preference flips, status changes, role updates
- Subtle supersessions - location moves, schedule changes, account state
- Refinements - added detail, specificity, naming
- Independent facts - different entities, different time periods, additive
- Reversion cases - "tried X then went back to Y"
- Multi-entity ambiguity - "second pet," "second car," "second job"
- Hedging - "thinking about," "may push," "considering"
- Generalization - "allergic to one fruit" → "allergic to all pome fruits"
- B2B agent state - seat counts, champion changes, competitor mentions
- Temporal recency - historical vs. current facts
- Plus financial, dietary, hardware/OS, payment methods, subscription tiers, identity, workplace roles, project status, and the categories the system filters before the LLM ever runs (health/medical, legal status)
Acceptance criteria, set before any eval ran:
- ≥95% precision on the "supersedes" verdict (at confidence ≥0.85)
- ≥60% recall on the "supersedes" verdict
- ≥80% overall accuracy across all three verdicts
Why precision-first: see the two failure modes section above. False positives are unrecoverable. False negatives are recoverable via the manual API we shipped yesterday.
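The criteria are mechanical enough to check in a few lines. A sketch, assuming each eval case reduces to a `(verdict, confidence, gold)` tuple; the shape is illustrative, not our harness code:

```python
def passes_acceptance(cases, conf_floor=0.85):
    """cases: list of (verdict, confidence, gold) tuples from an eval run."""
    # Precision counts only the verdicts confident enough to act on.
    acted = [(v, g) for v, c, g in cases if v == "supersedes" and c >= conf_floor]
    precision = sum(v == g for v, g in acted) / len(acted) if acted else 1.0

    # Recall over all gold "supersedes" cases (ungated in this sketch).
    gold_pos = [v for v, c, g in cases if g == "supersedes"]
    recall = sum(v == "supersedes" for v in gold_pos) / len(gold_pos)

    accuracy = sum(v == g for v, c, g in cases) / len(cases)
    return precision >= 0.95 and recall >= 0.60 and accuracy >= 0.80
```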
The numbers
Tested two models, prompt v1, on all 121 cases:
| Metric | claude-haiku-4-5 | claude-sonnet-4-6 |
|---|---|---|
| Overall accuracy | 93.4% (113/121) | 94.2% (114/121) |
| Gated supersedes precision (conf ≥ 0.85) | 100% (55/55) | 98.3% (57/58) |
| Supersedes recall | 91.7% | 96.7% |
| Cost per classification | ~$0.0005 | ~$0.0015 |
| Latency P50 | ~500ms | ~1500ms |
Haiku at 100% gated precision is the production winner. Every time the classifier was confident enough to act on its supersession judgment, it acted correctly. The 8 cases Haiku got "wrong" were all false negatives - situations where the system kept both memories instead of superseding. Recoverable.
Sonnet had one false positive (a medical-related case that would normally be categorized as `health_medical` in production, meaning the per-category gate would have blocked it before the LLM ever saw it). At 3x the cost and 3x the latency, Sonnet doesn't pay for itself.
What's in the prompt
The system prompt is 4,771 characters and includes five few-shot examples covering the three verdicts. A few specific design choices:
Calibration scaffolding. LLMs are notoriously poor at calibrated confidence out of the box. The prompt includes explicit guidance for what 0.95-1.00 versus 0.85-0.94 versus below 0.85 should mean. Without this, the model would default to 0.95 for everything and the confidence threshold would be meaningless.
Linguistic signal hints. The prompt enumerates strong supersession markers ("switched," "moved," "cancelled"), strong independence markers ("also," "additionally," "second"), and strong refinement markers ("specifically," "named after," "tried X but went back"). This isn't telling the model how to think; it's telling it which features are usually load-bearing for this specific task.
Tiebreaker rules. Six explicit rules that override every other signal. Two of them:
- "When in doubt between 'supersedes' and 'refines', choose 'refines'. Default to NOT supersede. False positives corrupt user memory permanently. False negatives are recoverable."
- "Hedged language ('thinking about', 'may', 'considering') is NOT a firm change. Treat hedged plan revisions as refines or independent, never supersedes โ until the user actually commits."
These are the rules that turn an okay classifier into a safe one.
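Abbreviated, the skeleton looks something like this - a paraphrase of the structure described above, not the production prompt:

```text
You compare an OLD memory and a NEW memory. Return JSON:
  {"verdict": ..., "confidence": ..., "reasoning": ...}

Verdicts:
  supersedes  - NEW replaces OLD
  refines     - NEW adds detail to OLD; keep both
  independent - NEW is unrelated to OLD; keep both

Confidence calibration:
  0.95-1.00  explicit, unambiguous replacement
  0.85-0.94  strong signal with minor ambiguity
  < 0.85     hedged, ambiguous, or multi-entity; the system will not act

Signal hints:
  supersedes:  "switched", "moved", "cancelled"
  independent: "also", "additionally", "second"
  refines:     "specifically", "named after", "tried X but went back"

Tiebreakers (override everything above):
  1. In doubt between supersedes and refines? Choose refines.
  2. Hedged language is never a firm change.
  ... (four more)

[five few-shot examples, covering all three verdicts]
```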
Beta status, honestly
The API surface and SDK methods ship today. Both `aurra==0.3.1` (Python) and `aurra@0.2.1` (npm) are on their respective registries.
Live classification is gated by an environment variable on the production server. Right now, calls with `auto_supersede=true` get back:
```json
{
  "ran": false,
  "skipped_reason": "level_2_disabled_by_env",
  "candidates_examined": 0,
  "supersession": null
}
```
The memory still saves normally. The classifier just doesn't fire yet.
This is deliberate. We want to run the classifier against real production traffic (write patterns, embedding distributions, similarity distributions we haven't tested in isolation) before flipping the flag for everyone. That validation happens this week.
When the flag flips, customers who've already enabled `auto_supersede=true` will start getting classifications. Customers who haven't will see no change. The default at every level is off:
- Server env var: off until we flip it
- Per-API-key default: `false`
- Per-request flag: defaults to `None` (falls back to the API key default)
You opt in three times before anything happens to your data. Even after the flag flips on the server.
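The resolution order is easy to hold in your head. A sketch (illustrative names, not our server code):

```python
def auto_supersede_enabled(server_flag: bool, key_default: bool, request_flag):
    if not server_flag:           # layer 1: global env var, currently off
        return False
    if request_flag is not None:  # layer 3: explicit per-request override
        return request_flag
    return key_default            # layer 2: per-API-key default (False out of the box)
```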
Try it
Install the SDK:
```bash
pip install aurra==0.3.1
# or
npm install aurra@0.2.1
```
Read your settings:
```python
from aurra import Aurra

client = Aurra(api_key="aurra_...")
s = client.settings.get()
print(s)
# SettingsResult(auto_supersede_default=False,
#                excluded_categories=['health_medical', 'legal_status'],
#                min_confidence=0.85,
#                classifier_model='claude-haiku-4-5-20251001')
```
Enable it for one write:
```python
result = client.memories.add(
    content="User has been using Emacs as their daily editor, having moved away from Vim",
    topic="preferences",
    auto_supersede=True,
)
print(result.auto_supersede)
# AutoSupersedeResult(ran=False, skipped_reason='level_2_disabled_by_env')
# (until the server flag flips)
```
Or set it as your default:
```python
client.settings.update(auto_supersede_default=True)
```
What's next
Tomorrow we run validation against live production traffic with the flag flipped. If it stays clean for 24 hours, the flag stays on.
This week we're also shipping Level 3: full bi-temporal with transaction time in addition to valid time. Right now we have one time axis (when was a fact true?). Level 3 adds the second (when did we know it?). That's what enables true point-in-time auditing: not just "what did we believe was true on April 1," but "what did we know on April 1 about what was true on March 15."
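In SDK terms, a Level 3 query might look something like this; the feature hasn't shipped, so the parameter names are guesses at the shape, not the final API:

```python
# "What did we know on April 1 about what was true on March 15?"
facts = client.memories.query(
    topic="preferences",
    valid_at="2026-03-15",  # valid time: when the fact was true
    known_at="2026-04-01",  # transaction time: what the system knew at that moment
)
```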
The launch post for that lands later this week.
If you're building agents that need to remember things across sessions and you've been chasing your tail with stale facts, Aurra has a free tier and the bi-temporal stack works today. Try it. Tell us what breaks.