Building Memory for AI Agents: Short-Term Context, Long-Term Storage, and Episodic Recall

A stateless agent that forgets every conversation is useful for one-shot tasks. An agent that remembers — who the user is, what was discussed last week, what decisions were made — is useful as a collaborator. The gap between the two is memory architecture, and most teams implement it as an afterthought. This post covers the patterns that make memory work reliably.

The Four Types of Agent Memory

Memory in AI agents maps loosely to cognitive memory categories, but the implementation details are what matter for production systems:

In-context (working) memory: Everything currently in the model's context window. Fast, no retrieval step, but limited by context size and lost when the session ends.
External short-term memory: A database-backed store of recent conversation history, retrieved and injected into context at the start of each turn. Survives session boundaries; scoped to a conversation.
Long-term semantic memory: A vector store of facts, preferences, and summarized past interactions, retrieved by relevance. Survives indefinitely; scoped to a user or entity.
Episodic memory: A structured log of past interactions — what happened, when, and what the outcome was. Enables the agent to reason about past events and avoid repeating mistakes.

In-Context Memory: Managing the Window

The simplest form of memory is including the full conversation history in every prompt. This works well for short conversations but breaks down as the context grows: latency increases, cost rises, and the model attends less reliably to early messages.

The two practical strategies for managing in-context history:

Sliding window: keep only the last N turns in context. Simple, but loses important early context — a user's initial goals mentioned at turn 1 may not appear in the window by turn 20.
Progressive summarization: when history exceeds a token threshold, summarize the oldest N turns into a compact summary and replace them. The summary is included at the top of context; recent turns are included verbatim. This preserves key facts while controlling window size.

python

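async def compress_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Summarize the older half of the history once it exceeds max_tokens.

    token_count, format_messages, and llm are application-level helpers
    that are assumed to be defined elsewhere.
    """
    # Nothing to do if we are under budget or there is no older half to split off
    if token_count(messages) <= max_tokens or len(messages) < 2:
        return messages

    # Split into old (to compress) and recent (to keep verbatim)
    midpoint = len(messages) // 2
    old_messages, recent_messages = messages[:midpoint], messages[midpoint:]

    summary = await llm.acomplete(
        "Summarize the key facts, decisions, and context from this conversation:\n\n"
        + format_messages(old_messages)
    )

    # The summary goes at the top of context; recent turns follow verbatim
    return [
        {"role": "system", "content": f"[Earlier conversation summary]:\n{summary}"},
        *recent_messages,
    ]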
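
For comparison, the sliding-window strategy described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes each message is a dict with a "role" key, counts every non-system message as one turn, and keeps system messages regardless of age (a choice made here so that instructions are not dropped). The name sliding_window and the max_messages parameter are hypothetical.

```python
def sliding_window(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep all system messages plus the last `max_messages` other messages."""
    if max_messages <= 0:
        raise ValueError("max_messages must be positive")

    # Assumption: system messages carry instructions and are always kept.
    pinned = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]

    # Everything older than the window is simply dropped, which is exactly
    # how early user goals get lost with this strategy.
    return pinned + dialogue[-max_messages:]


history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(20)
]

window = sliding_window(history, max_messages=4)
print([m["content"] for m in window])
# ['You are a helpful assistant.', 'message 16', 'message 17', 'message 18', 'message 19']
```

Note that "message 0", where a user would typically state their goal, is gone from the window. That loss is the trade-off that progressive summarization is meant to address.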

Long-Term Semantic Memory

Long-term memory stores facts and preferences about a user or entity that should persist across sessions and be retrieved when relevant. A common implementation uses a vector store and follows four steps:

1. Extract: after each conversation, run an extraction pass to identify facts worth remembering, such as stated preferences, decisions made, and key facts about the user.
2. Store: embed each fact and store it in a vector database keyed by user ID.
3. Retrieve: at the start of each new conversation, embed the user's first message and retrieve the top-N most relevant memories.
4. Inject: include retrieved memories in the system prompt as context before the conversation begins.

python

import hashlib
import json

# Memory extraction after conversation ends
EXTRACTION_PROMPT = """
Review this conversation and extract facts worth remembering about the user.
Focus on: preferences, stated goals, important decisions, constraints, and personal context.
Return a JSON array of short fact strings. Only include genuinely useful persistent facts.
Example: ["Prefers Python over JavaScript", "Working on a healthcare startup", "Has a 50K/month budget"]
"""

async def extract_and_store_memories(user_id: str, messages: list[dict]):
    raw = await llm.acomplete(EXTRACTION_PROMPT + format_messages(messages))
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError:
        # The model did not return valid JSON; skip storage rather than crash
        return

    for fact in facts:
        embedding = await embed(fact)
        # Use a stable content hash for the ID. The built-in hash() is salted
        # per process for strings, so the same fact would get a different ID
        # after a restart and the upsert would create duplicates.
        fact_id = hashlib.sha256(fact.encode("utf-8")).hexdigest()[:16]
        await vector_store.upsert(
            collection="user_memories",
            id=f"{user_id}:{fact_id}",
            vector=embedding,
            payload={"user_id": user_id, "fact": fact, "created_at": now()},
        )

async def retrieve_relevant_memories(user_id: str, query: str, top_k: int = 5) -> list[str]:
    results = await vector_store.search(
        collection="user_memories",
        query_vector=await embed(query),
        filter={"user_id": user_id},
        limit=top_k,
    )
    return [r.payload["fact"] for r in results]
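
The fourth step, injection, is plain string assembly. The sketch below is one possible shape, assuming the memories arrive as a list of short fact strings like those returned by the retrieval step. The function name build_system_prompt and the wording of the injected preamble are illustrative assumptions, not a required format.

```python
def build_system_prompt(base_prompt: str, memories: list[str]) -> str:
    """Append retrieved memories to the system prompt as a bulleted block."""
    if not memories:
        # No stored memories: leave the prompt untouched.
        return base_prompt

    fact_lines = "\n".join(f"- {fact}" for fact in memories)
    return (
        f"{base_prompt}\n\n"
        "Known facts about this user from earlier sessions "
        "(these may be outdated; defer to what the user says now):\n"
        f"{fact_lines}"
    )


# Hard-coded here for illustration; in a real flow these would come from
# the retrieval step.
memories = ["Prefers Python over JavaScript", "Working on a healthcare startup"]
prompt = build_system_prompt("You are a helpful coding assistant.", memories)
print(prompt)
```

Telling the model that memories may be stale is a design choice made in this sketch: retrieved facts are treated as hints, so a user who has changed their mind is not overridden by an old record.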

Episodic Memory: Remembering What Happened

Episodic memory is a structured log of past interactions — not just facts extracted from conversations, but what happened, when, and what the outcome was. It enables agents to say 'last time you asked me to do X, the result was Y' and to avoid repeating failed approaches.

Store episode records: each completed task or conversation gets a structured record with timestamp, summary of what was done, outcome (success/failure), and any artifacts produced.
Retrieve by recency and relevance: for a new task, retrieve episodes that are both recent and semantically similar to the current task.
Include failure episodes: failed past attempts are as valuable as successes. An agent that knows 'approach X was tried on this problem and failed because Y' avoids repeating the mistake.
Scope appropriately: episodic memory is most valuable scoped to an entity (user, project, codebase) rather than globally. Global episodic memory becomes noisy quickly.
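
The points above can be sketched as a record type plus a scoring function. This is a minimal in-memory illustration under stated assumptions: the Episode fields, the 30-day half-life, and the choice to multiply cosine similarity by an exponential recency decay are all hypothetical, and the two-dimensional embeddings are toy values standing in for real ones.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Episode:
    entity_id: str            # scope: user, project, or codebase
    summary: str              # what was done
    outcome: str              # "success" or "failure"
    timestamp: datetime
    embedding: list[float]    # embedding of the summary
    artifacts: list[str] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(ep: Episode, query: list[float], now: datetime, half_life_days: float = 30.0) -> float:
    """Relevance times recency: an episode loses half its weight every half-life."""
    age_days = max((now - ep.timestamp).total_seconds() / 86400, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return max(cosine(ep.embedding, query), 0.0) * recency


def retrieve_episodes(episodes, entity_id, query, now, top_k=3):
    # Scope to one entity first, then rank; failures are not filtered out.
    scoped = [e for e in episodes if e.entity_id == entity_id]
    return sorted(scoped, key=lambda e: score(e, query, now), reverse=True)[:top_k]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
episodes = [
    Episode("proj-1", "Schema migration with online DDL timed out", "failure",
            now - timedelta(days=2), [1.0, 0.0]),
    Episode("proj-1", "Schema migration during maintenance window", "success",
            now - timedelta(days=300), [1.0, 0.0]),
    Episode("proj-1", "Wrote release notes", "success",
            now - timedelta(days=1), [0.0, 1.0]),
    Episode("proj-2", "Schema migration on another project", "success",
            now - timedelta(days=1), [1.0, 0.0]),
]

top = retrieve_episodes(episodes, "proj-1", query=[1.0, 0.0], now=now, top_k=2)
print([(e.summary, e.outcome) for e in top])
```

In this toy data the recent failure ranks first, the old but relevant success second, and the episode from the other project is never considered because of the entity scope.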

Memory Hygiene: What Gets Forgotten

Indefinitely growing memory without a pruning strategy degrades retrieval quality over time. Old, outdated, or contradicted facts pollute the memory store and reduce the relevance of retrieved context.

TTL-based expiry: set a time-to-live on episodic memories. Conversations from 6 months ago are rarely relevant; facts about a user's current project may be.
Contradiction detection: when storing a new fact, check if it contradicts an existing one. 'User prefers Python' and 'User is learning Rust and prefers it now' should not coexist. Update, do not append.
Confidence scoring: tag memories with a confidence score based on how explicitly the fact was stated. 'I only use PostgreSQL' is high confidence; an inferred preference is lower confidence. Retrieve high-confidence memories preferentially.
User control: for consumer-facing applications, give users the ability to view and delete their stored memories. Where GDPR applies, this is backed by law: users have the right to access their personal data and to have it erased.
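
A small in-memory sketch shows how these hygiene rules might fit together. It makes one large simplifying assumption: each memory carries a hypothetical key naming the slot it fills (for example "preferred_language"), so a contradicting fact is detected simply because it targets the same key. Real contradiction detection usually needs a semantic comparison, which this sketch does not attempt. The Memory and MemoryStore names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Memory:
    key: str                          # hypothetical slot name
    fact: str
    confidence: float                 # 1.0 = stated explicitly, lower = inferred
    created_at: datetime
    ttl: Optional[timedelta] = None   # None means the memory never expires


class MemoryStore:
    def __init__(self) -> None:
        self._by_key: dict[str, Memory] = {}

    def remember(self, memory: Memory) -> None:
        # Update, do not append: a new fact for the same key replaces the old one.
        self._by_key[memory.key] = memory

    def prune_expired(self, now: datetime) -> int:
        expired = [
            k for k, m in self._by_key.items()
            if m.ttl is not None and now - m.created_at > m.ttl
        ]
        for k in expired:
            del self._by_key[k]
        return len(expired)

    def top(self, n: int, min_confidence: float = 0.0) -> list[Memory]:
        # Retrieve high-confidence memories preferentially.
        kept = [m for m in self._by_key.values() if m.confidence >= min_confidence]
        return sorted(kept, key=lambda m: m.confidence, reverse=True)[:n]

    def forget(self, key: str) -> bool:
        # User control: delete a stored memory on request.
        return self._by_key.pop(key, None) is not None


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
store = MemoryStore()
store.remember(Memory("preferred_language", "User prefers Python", 0.9,
                      now - timedelta(days=200)))
store.remember(Memory("preferred_language", "User is learning Rust and prefers it now", 0.9,
                      now - timedelta(days=1)))
store.remember(Memory("database", "Uses only PostgreSQL", 1.0,
                      now - timedelta(days=10)))
store.remember(Memory("last_session", "Discussed deployment options", 0.6,
                      now - timedelta(days=200), ttl=timedelta(days=180)))

removed = store.prune_expired(now)
print(removed, [m.fact for m in store.top(2)])
# 1 ['Uses only PostgreSQL', 'User is learning Rust and prefers it now']
```

The Python preference never surfaces because the Rust fact overwrote it, and the 200-day-old session note is pruned by its 180-day TTL.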

Tip: Start with in-context sliding window memory. Add vector-backed long-term memory only when users explicitly encounter the problem it solves ('the assistant forgot who I am'). Build episodic memory only when you have multi-step agents that need to learn from past failures. Premature memory complexity is a common trap.