
Read Pipeline

When you call memory.retrieve(), the SDK searches everything it knows about an agent and returns the facts most relevant to your query - ranked, scored, and formatted as a string you can paste directly into an LLM prompt.

You don't need to understand the internals to use it. Just call retrieve() and use result.context. This page explains what happens under the hood for when you want to tune behavior or debug results.

flowchart LR
    A["Query"] --> B["Plan"]
    B --> C["Retrieve\n(3 signals)"]
    C --> D["Enhance"]
    D --> E["Rerank"]
    E --> F["Format"]
    F --> G["RetrieveResult"]

Overview

Every memory.retrieve(agent_id, query) call runs five stages:

  1. Plan - Figures out what to search for. Detects greetings, aggregation patterns, and broad queries. Entity resolution runs deterministically in parallel.
  2. Retrieve - Searches for matching facts using three methods in parallel: meaning similarity, keyword matching, and relationship graph traversal.
  3. Enhance - Expands context by following entity relationships to find related facts that weren't directly matched.
  4. Rerank - An LLM re-evaluates the top results and reorders them by actual relevance to your query.
  5. Format - Compresses the ranked facts into a clean, token-budgeted context string with facts, patterns, and conversation snippets.

Stage 1: Deterministic Planner

In plain English: Before searching, the pipeline analyzes your query to figure out what kind of search to run. It detects greetings (skip), aggregation queries ("who are my friends?"), broad requests ("tell me everything"), and identifies which entities are mentioned. All of this is deterministic — same query always produces the same plan, zero LLM calls.

The planner produces a RetrievalPlan using regex pattern matching and schema lookups. No LLM is involved.

Why no LLM in the planner?

Prior to v0.13.0, the planner called an LLM for query reformulation and entity extraction. This introduced non-determinism: the same query against the same memory could return different facts between runs, because cloud LLM APIs are not deterministic even at temperature=0 (documented behavior — batching, fp16 rounding, GPU routing all introduce variation). v0.13.0 replaces the LLM planner with a fully deterministic implementation. The query goes straight to semantic search unchanged, and entity extraction is handled by a dedicated deterministic resolver.

What the Planner Decides

| Field | Description | Example |
|---|---|---|
| strategy | Retrieval strategy | "multi_signal" (default) or "skip" (for greetings) |
| similarity_query | Query for semantic search (always the original) | "where do I live?" (passed through unchanged) |
| pattern_queries | SQL LIKE patterns for aggregation | ["person:%"] (from "who are my friends?") |
| broad_query | Whether to expand graph scope | true for "tell me everything about..." |
| reason | Explanation of the strategy | "deterministic", "aggregation", "broad", "greeting" |

Entity Resolution

When you ask "Where does Carlos live?", the pipeline needs to figure out that "Carlos" means the entity person:carlos in the database. It uses two deterministic methods:

  1. Deterministic resolution (primary) — Matches words in the query against known entity aliases (MemoryEntityAlias), display names (MemoryEntity.display_name), and entity_key slugs. Fast (< 10ms), reliable, zero LLM cost. For example, "Onde o Carlos mora?" ("Where does Carlos live?") deterministically resolves to person:carlos via slug match.

  2. Query expansion (alias priming) — expand_query() resolves aliases and fetches 1-hop KG neighbors, adding related entities.

Both sources are unified before the graph gate. If either source identifies an entity, the graph traversal runs.

The trace step "retrieval" includes an entities_sources breakdown showing which entities came from each source (deterministic, expansion).

Aggregation Detection

For queries like "who are my friends?" or "list my projects", the planner matches keywords against schema prefixes and generates SQL LIKE patterns (e.g., person:%). This only triggers when the prefix actually exists in the user's schema.

Skip Strategy

For greetings and casual messages ("hi", "oi", "bom dia"), the planner returns strategy: "skip" via regex matching, short-circuiting the pipeline. No database queries, no LLM calls, instant response.
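Taken together, the greeting, aggregation, and broad checks amount to a small deterministic function. The sketch below is illustrative only: the regex patterns, the keyword-to-prefix map, and the plan fields are assumptions, not the SDK's actual implementation.

```python
import re

# Illustrative sketch of the deterministic planner: regex checks plus a
# keyword -> schema-prefix map. Patterns and mappings are assumed, not real.
GREETING_RE = re.compile(
    r"^\s*(hi|hello|hey|oi|ola|bom dia|boa tarde)\s*[!.]?\s*$", re.IGNORECASE
)
BROAD_RE = re.compile(r"tell me everything|everything about", re.IGNORECASE)
AGGREGATION_RULES = {"friends": "person", "projects": "project"}  # assumed map

def plan(query: str, schema_prefixes: set) -> dict:
    """Build a RetrievalPlan-like dict with zero LLM calls."""
    if GREETING_RE.match(query):
        return {"strategy": "skip", "reason": "greeting"}
    pattern_queries = [
        f"{prefix}:%"
        for keyword, prefix in AGGREGATION_RULES.items()
        # only trigger when the prefix actually exists in the user's schema
        if keyword in query.lower() and prefix in schema_prefixes
    ]
    return {
        "strategy": "multi_signal",
        "similarity_query": query,  # always the original, unchanged
        "pattern_queries": pattern_queries,
        "broad_query": bool(BROAD_RE.search(query)),
        "reason": "aggregation" if pattern_queries else "deterministic",
    }
```

Because every branch is a pure string check, the same query always yields the same plan.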

Anaphora Resolution (Caller's Responsibility)

If your query contains pronouns ("Where does she live?"), the Arandu SDK does not resolve them. Pronoun resolution depends on conversation context (short-term memory), which is the caller's domain. Resolve pronouns before calling retrieve():

# The caller (your agent) resolves "she" → "Ana" using conversation context
resolved_query = "Where does Ana live?"  # not "Where does she live?"
result = await memory.retrieve(agent_id="user_123", query=resolved_query)

Neuroscience parallel

The planner mirrors retrieval cues in cognitive psychology. When you try to remember something, your brain doesn't do an exhaustive search — it uses contextual cues to narrow down the search space. The planner identifies entities and detects query patterns as cues that guide the retrieval signals.

Walkthrough: full query lifecycle

Query: "Onde o Marcos Tavares mora?" ("Where does Marcos Tavares live?")

Stage 1 — Planning (deterministic):

Greeting check: no match → proceed
Aggregation check: no match → no pattern queries
Broad check: no match → broad_query = false
Deterministic resolver: "marcos tavares" → person:marcos_tavares (slug match)
similarity_query = "Onde o Marcos Tavares mora?" (original, unchanged)

Stage 2 — Multi-signal retrieval (parallel):

Semantic: embedding("Onde o Marcos Tavares mora?") → top match: "Marcos Tavares lives in Porto Alegre" (0.91)
Keyword: "marcos" + "tavares" → matches 4 facts about Marcos
Graph: BFS from person:marcos_tavares → finds facts via entity links + relationships

Stage 3 — Enhancement:

Spreading activation: from seed "lives in Porto Alegre" → finds related facts:
  "Marcos Tavares is married to Carolina" (via entity hop)
  "Carolina is an architect" (via 2-hop)

Stage 4 — Reranking (multiplicative blend):

"Marcos Tavares lives in Porto Alegre" → formula=0.91, reranker=1.0 → final=0.91
"Marcos Tavares is a product manager at Vertix" → formula=0.65, reranker=0.2 → final=0.28
"Carolina is an architect" → formula=0.30, reranker=0.0 → final=0.09 → filtered (< 0.15)

Stage 5 — Formatting:

Known facts:
- Marcos Tavares lives in Porto Alegre

Result: 1 highly relevant fact, 210ms, 800 tokens.


Stage 2: Multi-Signal Retrieval

In plain English: The pipeline searches for relevant facts using three different methods at the same time - like searching by meaning, by exact words, and by connections between entities. This catches facts that any single method alone would miss.

Three independent signals run in parallel via asyncio.gather(), each finding candidates from a different angle:

flowchart TD
    P["RetrievalPlan"] --> S["Semantic Search\n(pgvector cosine)"]
    P --> K["Keyword Search\n(SQL ILIKE)"]
    P --> G["Graph Traversal\n(BFS 2-hop)"]
    S --> M["Merge & Rank\n(dedup + weighted scoring)"]
    K --> M
    G --> M

Signal 1: Semantic Search

Uses pgvector cosine similarity to find facts whose embeddings are close to the query embedding.

  • Embeds the query (passed through unchanged from the planner)
  • Searches the MemoryFact table with HNSW index
  • Returns top-N candidates above min_similarity threshold
  • Filters: agent_id, active facts (valid_to IS NULL), confidence ≥ min_confidence

This is the primary signal - it finds facts that are semantically similar to the query, even if they don't share exact keywords.
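Under the hood, pgvector's cosine distance reduces to the standard cosine similarity below. This toy version (plus the min_similarity filter) shows the math; the real search runs inside Postgres against an HNSW index.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors (toy version of
    what pgvector computes inside Postgres)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def passes_threshold(score: float, min_similarity: float = 0.20) -> bool:
    """Candidates below min_similarity (default 0.20) are dropped."""
    return score >= min_similarity
```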

Signal 2: Keyword Search

SQL ILIKE matching on fact_text for exact or partial keyword hits.

  • Extracts significant words (> 2 characters) from the query
  • Matches against fact text (up to 5 keywords)
  • Score = fraction of query words found in the fact

This complements semantic search by catching exact matches that embedding similarity might miss (e.g., proper nouns, technical terms, abbreviations).
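The keyword score described above (fraction of significant query words found in the fact text) can be sketched as follows; the exact tokenization is an assumption.

```python
def keyword_score(query: str, fact_text: str, max_keywords: int = 5) -> float:
    """Fraction of significant query words (> 2 chars, up to 5) found in
    the fact text, matched case-insensitively (a sketch of SQL ILIKE)."""
    words = [w for w in query.lower().split() if len(w) > 2][:max_keywords]
    if not words:
        return 0.0
    haystack = fact_text.lower()
    return sum(1 for w in words if w in haystack) / len(words)
```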

Signal 3: Graph Retrieval

Traverses entity relationships to find facts connected to the query entities.

  • Starts from entities identified by the planner
  • BFS traversal up to 2 hops through MemoryEntityRelationship
  • Hop decay: Hop 1 facts receive full score (1.0×). Hop 2 facts receive 0.5× penalty. This prevents distant facts from dominating the candidate pool.
  • Facts are fetched via entity links (MemoryFactEntityLink), not just the primary entity_key. This means a fact "Clara left Vertix" (primary subject: Clara) is also found when querying about Vertix - because the fact has a secondary entity link to Vertix.
  • Scoring formula: edge_strength × recency_factor × edge_recency_factor × query_bonus × hop_decay
  • query_bonus: 1.5× when the entity name appears in the query text
  • Fallback: if the entity links table is empty (pre-migration), retrieval falls back to direct entity_key matching

Graph retrieval excels at finding contextual facts. When you ask about a person, it also finds facts about their workplace, their relationships, and their projects.
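The scoring formula from the bullets above, written out. All factors except the query bonus and hop decay are assumed to be precomputed floats in [0, 1]:

```python
def graph_score(
    edge_strength: float,
    recency_factor: float,
    edge_recency_factor: float,
    hop: int,
    entity_in_query: bool,
) -> float:
    """edge_strength x recency x edge_recency x query_bonus x hop_decay."""
    hop_decay = 1.0 if hop == 1 else 0.5       # hop-2 facts get a 0.5x penalty
    query_bonus = 1.5 if entity_in_query else 1.0
    return edge_strength * recency_factor * edge_recency_factor * query_bonus * hop_decay
```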

Walkthrough: cross-entity retrieval via entity links

Query: "O que aconteceu com a Vertix?" ("What happened with Vertix?")

Without entity links (old behavior):

Graph starts from organization:vertix
Searches MemoryFact WHERE entity_key = 'organization:vertix'
Finds: only facts where Vertix is the PRIMARY subject
Misses: "Clara Rezende left Vertix" (entity_key = person:clara_rezende)

With entity links (current behavior):

Graph starts from organization:vertix
Searches MemoryFactEntityLink WHERE entity_key = 'organization:vertix'
Finds fact_ids linked to Vertix, regardless of primary subject:
  → "Clara Rezende left Vertix" (primary: Clara, link: Vertix) ✅
  → "Vertix received Series A of R$ 20M" (primary: Vertix) ✅
  → "Ricardo Gomes is co-founder of Vertix" (primary: Ricardo, link: Vertix) ✅
  → "Vertix signed contract with Ambev" (primary: Vertix) ✅

Result: 4 facts found vs 2 without links. The query about Vertix surfaces facts from Clara, Ricardo, and the Ambev deal - all linked to Vertix but not primarily about Vertix.
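The difference between the two behaviors can be reproduced with a toy in-memory version. The data layout and helper names below are illustrative; the real lookup is a SQL query against MemoryFactEntityLink.

```python
# Toy fact store mirroring the walkthrough above.
facts = {
    1: {"entity_key": "person:clara_rezende", "text": "Clara Rezende left Vertix"},
    2: {"entity_key": "organization:vertix", "text": "Vertix received Series A of R$ 20M"},
    3: {"entity_key": "person:ricardo_gomes", "text": "Ricardo Gomes is co-founder of Vertix"},
    4: {"entity_key": "organization:vertix", "text": "Vertix signed contract with Ambev"},
}
# (fact_id, entity_key) pairs: every entity a fact mentions, not just its subject.
entity_links = [
    (1, "person:clara_rezende"), (1, "organization:vertix"),
    (2, "organization:vertix"),
    (3, "person:ricardo_gomes"), (3, "organization:vertix"),
    (4, "organization:vertix"), (4, "organization:ambev"),
]

def facts_by_primary(entity_key: str) -> list:
    """Old behavior: only facts whose primary subject is the entity."""
    return [fid for fid, f in facts.items() if f["entity_key"] == entity_key]

def facts_by_links(entity_key: str) -> list:
    """Current behavior: any fact linked to the entity."""
    return sorted({fid for fid, key in entity_links if key == entity_key})
```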

Merge & Rank

After all three signals return, results are merged:

  1. Deduplicate by fact ID (same fact may appear in multiple signals)
  2. Apply recency decay - Exponential decay with configurable half-life (recency_half_life_days, default 14)
  3. Apply confidence decay - Older facts with lower confidence are penalized
  4. Compute combined score - Weighted sum:

score = (
    score_weights["semantic"]   * semantic_score +    # default 0.70
    score_weights["recency"]    * recency_score +     # default 0.20
    score_weights["importance"] * importance_score    # default 0.10
)

Reranker blends with these weights

By default, enable_reranker=True - the LLM reranker uses a multiplicative blend with the formula score computed from these weights. The formula score remains important because the reranker can only dampen or boost it, never zero it out (except through the min_reranker_score veto described in Stage 4). Set enable_reranker=False to rely on these weights alone for final ranking.
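The recency decay applied in step 2 is a standard exponential half-life curve. A minimal sketch (the exact decay shape is inferred from the half-life description):

```python
def recency_score(age_days: float, half_life_days: float = 14.0) -> float:
    """Exponential decay: a fact loses half its recency score every
    half_life_days (default 14). age_days counts from created_at."""
    return 0.5 ** (age_days / half_life_days)
```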

Complete Score Breakdown

Each fact gets scored on multiple dimensions. You can inspect these in fact.scores to understand why a fact ranked where it did:

| Key | Source | Range | Description |
|---|---|---|---|
| semantic | Semantic search | 0.0 - 1.0 | Cosine similarity between query and fact embeddings. Primary retrieval signal. |
| keyword | Keyword search | 0.0 - 1.0 | Fraction of query words found in the fact text. Complements semantic for exact matches. |
| recency | Merge & Rank | 0.0 - 1.0 | Exponential decay from created_at, half-life = recency_half_life_days (default 14). |
| importance | Dynamic importance | 0.0 - 1.0 | Raw importance value from the database. When informed extraction is active, new facts receive an initial importance based on their semantic importance_category (e.g., biographical_milestone gets a higher value than conversational). Otherwise, starts at 0.5. Evolves over time via the background importance job (retrieval frequency, recency of use, user corrections, pattern membership). |
| confidence | Merge & Rank | 0.0 - 1.0 | Effective confidence after temporal decay. Present in the scores dict for debugging, but NOT part of the weighted formula (score_weights only uses semantic, recency, importance). The base confidence is assigned by the LLM during extraction (typically 0.95 for assertive statements). It decays over time and is used as a filter (min_confidence). |
| reranker | Reranking | 0.0 - 1.0 | LLM-based relevance score. Only present when enable_reranker=True. Continuous float returned by the reranker LLM. |

Additional signals computed during enhancement (not in score_weights but affect final score):

| Key | Source | Description |
|---|---|---|
| pattern | Enhancement | Additive boost for facts with high reinforcement_count (up to +0.10). |
| graph | Graph traversal | Score from BFS 2-hop entity relationship traversal. |

Configuration

| Parameter | Default | Description |
|---|---|---|
| topk_facts | 20 | Maximum facts to return |
| topk_events | 8 | Maximum events to consider |
| min_similarity | 0.20 | Minimum cosine similarity for semantic results |
| min_confidence | 0.55 | Minimum fact confidence |
| recency_half_life_days | 14 | Half-life for recency decay |
| score_weights | See above | Weights for each scoring signal |
| min_score | 0.15 | Minimum final score for returned facts |
| enable_reranker | True | Whether to use LLM reranking |

Neuroscience parallel

Multi-signal retrieval mirrors spreading activation in semantic networks (Collins & Loftus, 1975). When you think of "doctor", activation spreads to related concepts ("hospital", "medicine", "appointment") through associative links. Similarly, graph retrieval spreads from query entities along relationship edges, while semantic search activates facts through embedding proximity.


Stage 3: Enhancement

In plain English: After finding the initial results, the pipeline follows connections to discover related facts. If you ask about a person, it might also pull in facts about their workplace, projects, or team - things you didn't directly ask about but that add useful context.

Spreading Activation

Starting from the top-K seed facts, the pipeline expands context by following entity relationships:

  • For each seed fact, find its entity's relationships
  • Traverse relationships for N hops (spreading_activation_hops, default 2). Set to 0 to disable spreading entirely.
  • Apply decay factor per hop (spreading_decay_factor, default 0.50). Hop 1 uses the factor directly; Hop 2 uses the factor squared (compounded decay).
  • Return up to spreading_facts_per_entity additional facts per entity (default 3), applied in both Hop 1 and Hop 2.

This catches important context that wasn't directly matched. If you ask "what does Rafael do?", spreading activation might surface facts about his workplace, team, and projects.
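The per-hop decay compounds geometrically, so with the default factor of 0.50 a hop-2 fact carries at most a quarter of its seed's weight. A sketch:

```python
def spreading_decay(hop: int, decay_factor: float = 0.50) -> float:
    """Hop 1 uses the factor directly; hop 2 uses the factor squared."""
    return decay_factor ** hop

def spread_score(seed_score: float, hop: int, decay_factor: float = 0.50) -> float:
    """Score assigned to a fact reached by spreading activation."""
    return seed_score * spreading_decay(hop, decay_factor)
```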

When does spreading activation matter?

Spreading has the most impact with 20+ entities and cross-domain relationships (e.g., people → projects → clients → technologies). With small datasets (< 15 entities), the semantic, keyword, and graph signals already cover the full fact space - spreading may return candidates but they'll be deduplicated against existing results. The trace fields spreading_candidates_returned and spreading_candidates_unique let you confirm whether spreading is contributing new facts for your dataset.

Pattern Signal

Facts with a high reinforcement_count (incremented by NOOP decisions in write) get an additive score boost:

  • High reinforcement count → up to 0.10 extra score
  • Captures frequently mentioned, well-established facts
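A hedged sketch of the boost: only the +0.10 cap comes from the description above; the mapping from reinforcement_count to boost is an assumption.

```python
def pattern_boost(reinforcement_count: int, cap: float = 0.10, scale: int = 10) -> float:
    """Additive boost capped at +0.10. The linear ramp (full boost at
    `scale` reinforcements) is an assumed shape, not the SDK's."""
    return min(cap, cap * reinforcement_count / scale)
```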

Configuration

| Parameter | Default | Description |
|---|---|---|
| spreading_activation_hops | 2 | Maximum hops from seed facts. Set to 0 to disable spreading. |
| spreading_decay_factor | 0.50 | Score decay per hop. Hop 1 = factor, Hop 2 = factor² |
| spreading_facts_per_entity | 3 | Max facts fetched per entity in both Hop 1 and Hop 2 |
| spreading_max_related_entities | 5 | Max KG-related entities to explore in Hop 1 |

Stage 4: Reranking (Optional)

In plain English: The previous stages find relevant facts, but their ranking is based on math (similarity scores, keyword overlap). The reranker asks an LLM: "Given what this person is asking, which of these facts are actually most useful?" This produces a smarter final ranking.

When enable_reranker=True, the top candidates are reranked by an LLM that considers query intent:

  • The reranker evaluates an expanded candidate pool (40 by default) rather than only the final topk_facts (default 20). This ensures that semantically relevant facts beyond the initial top-20 reach the reranker. Increase topk_facts to expand reranker coverage.
  • Respects the semantic meaning of the query (not just keyword overlap)
  • Can promote facts that are indirectly relevant but important
  • Graceful degradation: if the reranker fails or exceeds reranker_timeout_sec (default 5.0s), the original ranking is preserved
  • Timeout is enforced via asyncio.wait_for - the LLM call is cancelled if it exceeds the configured timeout
  • Uses the same LLM provider configured for the client (no separate provider needed)

The reranker is the most expensive stage but provides the highest quality improvement for complex queries.

Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| reranker_weight | float | 0.70 | Weight of the reranker score in the multiplicative blend |
| min_reranker_score | float | 0.10 | Minimum reranker score; facts below this are eliminated |
| reranker_timeout_sec | float | 5.0 | Timeout for the reranker LLM call (seconds) |

Reranker veto: min_reranker_score

When enable_reranker=True, any fact that receives a reranker score below min_reranker_score (default 0.10) is eliminated from results (final_score set to 0.0). This gives the reranker veto power over completely irrelevant facts - even if the formula score is high (e.g., graph BFS gives 0.80 to a distant, unrelated fact). When enable_reranker=False, this setting has no effect. Tune it: config_overrides={"min_reranker_score": 0.05} for more permissive results, 0.20 for stricter filtering.

Multiplicative blend scoring

The reranker does NOT replace the formula score. It uses a multiplicative blend:

final_score = formula_score × (floor + reranker_weight × reranker_score)

where floor = 1 - reranker_weight. With the default reranker_weight=0.70, a fact with formula=0.9 and reranker=0.0 gets final = 0.9 × (0.30 + 0) = 0.27 (not 0.0). A fact with formula=0.9 and reranker=1.0 gets final = 0.9 × (0.30 + 0.70) = 0.90. The reranker can boost or dampen facts but cannot zero out a fact with strong retrieval signals through the blend alone. However, min_reranker_score IS an exception - facts scoring below it are set to 0.0 regardless of their formula score. The scores dict preserves both formula (pre-reranker) and reranker (LLM score) for debugging.
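Both rules (the blend and the veto) fit in a few lines:

```python
def blend(
    formula_score: float,
    reranker_score: float,
    reranker_weight: float = 0.70,
    min_reranker_score: float = 0.10,
) -> float:
    """Multiplicative blend with the min_reranker_score veto."""
    if reranker_score < min_reranker_score:
        return 0.0  # veto: eliminated regardless of formula score
    floor = 1.0 - reranker_weight  # 0.30 with the default weight
    return formula_score * (floor + reranker_weight * reranker_score)
```

With the veto disabled (min_reranker_score=0.0), blend(0.9, 0.0) returns about 0.27, matching the worked example above; with the default veto it returns 0.0.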

Walkthrough: how the reranker blend works

Query: "Qual o time de futebol do Bruno Almeida?" ("What is Bruno Almeida's football team?")

Pre-reranker candidates (formula scores):

1. "Bruno Almeida runs marathons" → formula = 0.45 (semantic: "sports" similarity)
2. "Bruno Almeida developed an ML model" → formula = 0.38
3. "Bruno Almeida works at Orion Tech" → formula = 0.35

Reranker scores (LLM evaluation):

1. "Bruno Almeida runs marathons" → reranker = 0.0 (marathon ≠ football)
2. "Bruno Almeida developed an ML model" → reranker = 0.0 (irrelevant)
3. "Bruno Almeida works at Orion Tech" → reranker = 0.0 (irrelevant)

Multiplicative blend (weight=0.70, floor=0.30):

1. final = 0.45 × (0.30 + 0.70 × 0.0) = 0.45 × 0.30 = 0.135 → filtered (< 0.15)
2. final = 0.38 × 0.30 = 0.114 → filtered
3. final = 0.35 × 0.30 = 0.105 → filtered

Result: 0 facts returned. Correct - there's no information about Bruno's football team in memory. (Note: with the default min_reranker_score = 0.10, these reranker = 0.0 facts would be vetoed straight to 0.0; the arithmetic above shows that even without the veto, they fall below min_score.)

Compare with a relevant query - "O que o Bruno Almeida desenvolveu?" ("What did Bruno Almeida develop?"):

"Bruno developed an ML model for fraud detection" → formula=0.92, reranker=1.0
final = 0.92 × (0.30 + 0.70 × 1.0) = 0.92 × 1.0 = 0.92 ✅


Stage 5: Formatting

In plain English: The pipeline takes the ranked facts and organizes them into a clean, ready-to-use string for your LLM prompt. Facts come first as a bullet list, followed by observed patterns (meta-observations), and relevant conversation snippets -- all within a token budget so you don't blow up your prompt. The format is designed for direct LLM consumption: no timestamps on facts, no entity prefixes, no confidence scores -- just clean, readable information.

Context Compression

Facts are organized within a token budget (context_max_tokens) into a clean format with three sections:

context_max_tokens is a proportional budget, not a hard cap

The context_max_tokens parameter controls the relative size of the output context, but the actual token count may exceed the configured value. The pipeline guarantees a minimum context for core facts and uses the parameter as a proportional budget across tiers. Treat it as a target, not a strict limit. For example, setting context_max_tokens=100 may produce ~240 tokens due to minimum guarantees for the hot tier.

| Section | Output Label | Budget | Content |
|---|---|---|---|
| Facts | Known facts: | hot + warm budget (80%) | Clean bullet list of relevant facts, ordered by score. No timestamps, no entity prefixes. |
| Patterns | Observed patterns: | cold budget (20%), up to 3 | Meta-observation titles (insights, patterns, trends). |
| Events | Relevant conversations: | remaining budget + 400 token overflow | Recent conversation snippets, up to 300 chars each, with dates. |

Configuration

| Parameter | Default | Description |
|---|---|---|
| context_max_tokens | 2000 | Token budget for the formatted context (a proportional target, not a hard cap) |
| hot_tier_ratio | 0.50 | Share of budget for top facts |
| warm_tier_ratio | 0.30 | Share of budget for supporting facts |
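A sketch of the proportional split. The hot-tier minimum here is an assumed stand-in for the "minimum context guarantee" mentioned above; the real value isn't documented on this page.

```python
def tier_budgets(
    context_max_tokens: int = 2000,
    hot_tier_ratio: float = 0.50,
    warm_tier_ratio: float = 0.30,
    hot_floor: int = 200,  # assumed minimum guarantee for core facts
) -> dict:
    """Split the token budget into hot/warm/cold tiers proportionally."""
    hot = max(round(context_max_tokens * hot_tier_ratio), hot_floor)
    warm = round(context_max_tokens * warm_tier_ratio)
    cold = round(context_max_tokens * (1.0 - hot_tier_ratio - warm_tier_ratio))
    return {"hot": hot, "warm": warm, "cold": cold}
```

Facts draw on hot + warm (80% by default) and patterns on cold (20%), matching the section table above. With a small context_max_tokens, the assumed hot floor explains why output can exceed the configured value.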

Output Format

The context string is formatted for direct injection into LLM prompts. The format is clean and free of internal metadata -- no timestamps on facts, no entity prefixes, no confidence scores:

Known facts:
- Lives in Sao Paulo
- Works at Acme Corp as a backend engineer
- Wife's name is Ana

Observed patterns:
- Regularly discusses work-life balance topics

Relevant conversations:
- (2026-03-28) Hey, just wanted to share that I got promoted to tech lead!
- (2026-03-25) Had a great weekend at the beach with Ana...

RetrieveResult

result = await memory.retrieve(agent_id="user_123", query="...")

# Pre-formatted context (ready for LLM prompts)
print(result.context)

# Individual facts with scores
for fact in result.facts:
    print(f"[{fact.score:.2f}] {fact.entity_name}: {fact.fact_text}")
    print(f"  Scores: {fact.scores}")  # {"semantic": 0.85, "recency": 0.72, ...}

# Pipeline stats
print(f"Candidates evaluated: {result.total_candidates}")
print(f"Duration: {result.duration_ms:.0f}ms")

Pipeline Diagram (Complete)

flowchart TD
    Q["User query"] --> AG["Deterministic Planner\n(regex + schema)"]
    AG -->|skip| SKIP["Return empty\n(greeting/casual)"]
    AG -->|multi_signal| PAR["Parallel retrieval"]
    PAR --> SEM["Semantic Search\n(pgvector cosine)"]
    PAR --> KW["Keyword Search\n(SQL ILIKE)"]
    PAR --> GR["Graph Traversal\n(BFS 2-hop)"]
    SEM --> MERGE["Merge & Rank\n(dedup + weighted scoring)"]
    KW --> MERGE
    GR --> MERGE
    MERGE --> SA["Spreading Activation\n(expand context along edges)"]
    SA --> RR{"Reranker\nenabled?"}
    RR -->|yes| RERANK["LLM Rerank"]
    RR -->|no| FMT["Format & Compress"]
    RERANK --> FMT
    FMT --> RES["RetrieveResult"]

Neuroscience parallel

The tiered compression (facts/patterns/events) mirrors levels of activation in working memory. In Cowan's embedded-process model, a small number of items are in the focus of attention (top-ranked facts), surrounded by activated long-term memory (patterns and meta-observations), with the rest of long-term memory available but not active (conversation snippets). The token budget acts as the capacity limit of working memory.