Module 7: Orchestration and Memory

Long-Term Memory and Write Policies

Thread memory and workflow state handle the current session. But some facts are worth keeping longer: the user's preferred coding style, stable architectural decisions about the repository, debugging patterns that recur across sessions. These are long-term memories: facts that persist across conversations and improve the system over time.

Long-term memory is also where things go wrong. Without write policies, the system stores everything: stale preferences, one-time context mistaken for stable facts, and (most dangerously) sensitive data like credentials and PII. This lesson covers both the opportunity and the risks. We'll build a long-term memory layer with explicit write policies, measure its impact, and add the safeguards that prevent memory from becoming a liability.

What you'll learn

  • Build a long-term memory layer that stores facts across sessions using Mem0
  • Implement write policies that control what gets stored, when, and under what conditions
  • Identify and prevent context rot in persistent memory: stale facts, irrelevant retrievals, and uncontrolled writes
  • Add PII and credential filtering before any memory write
  • Measure whether long-term memory improves benchmark quality without introducing regressions

Concepts

Long-term memory — facts, preferences, and patterns that persist across conversation sessions. Long-term memory is the third memory layer after thread memory (current conversation) and workflow state (current task). Examples: "this repository uses pytest for testing," "the user prefers type annotations," "the auth module was refactored in March and the old patterns are deprecated." These facts help the system avoid re-learning what it already knows.

Memory write policy — a set of rules that govern when facts get promoted to long-term storage. Without a write policy, you're letting the model decide what's worth remembering on every turn. That leads to memory pollution: stale facts, duplicate entries, trivial observations that consume retrieval capacity without adding value. A write policy makes memory promotion explicit and reviewable.

Context rot (in memory) — the degradation of system quality caused by accumulated stale, irrelevant, or incorrect memories. We first introduced context rot in Module 1 as the general problem of context window content degrading over time. In the memory context, context rot takes specific forms:

  • Stale memories: facts that were true when stored but aren't anymore. "The API uses v2 authentication," except it was upgraded to v3 last month.
  • Irrelevant retrievals: the memory store returns facts that match the query semantically but don't help the current task. "The user asked about Python error handling last week" retrieved for a question about Go error handling.
  • Uncontrolled writes: the system stores something on every turn, and the memory store fills with low-value observations that dilute the useful facts during retrieval.

Context rot is insidious because it doesn't cause obvious failures. The system just gets gradually worse. Answers become less relevant, context windows fill with unhelpful history, and retrieval quality degrades because the memory store has too many low-quality entries competing with the good ones.

Memory retrieval — the process of finding relevant long-term memories for the current context. Memory retrieval uses the same techniques as document retrieval (embedding similarity, keyword matching), but with an additional challenge: the memory store grows continuously, and its quality depends entirely on write policies. A clean memory store with 50 high-quality entries retrieves better than a polluted store with 5,000 entries.

PII (Personally Identifiable Information) — data that can identify a specific person: names, email addresses, phone numbers, government IDs. PII in memory creates compliance risk (GDPR, CCPA) and security risk (data exposure). Any memory system that stores user interactions must filter PII before storage.

Problem-to-Tool Map

Problem class | Symptom | Cheapest thing to try first | Tool or approach
System re-learns known facts every session | User repeats the same preferences or context | Manual notes file | Long-term memory with write policy
Memory store fills with junk | Retrieval returns irrelevant memories | Store everything and hope | Curated write policy with usefulness criteria
Stored facts become stale | System uses outdated information | No expiration | Staleness checks and expiration policy
Sensitive data in memory | PII or credentials stored in memory layer | Trust the model to be careful | Pre-write filter for PII and credentials
Too many memories dilute retrieval | The right memory is in the store but gets outranked | Increase retrieval limit | Prune low-value entries, improve write selectivity

Walkthrough

Default: Mem0

Why this is the default: Mem0 provides a dedicated memory layer that you can evaluate separately from the rest of the system. It handles embedding, storage, and retrieval, letting you focus on the harder problem: deciding what to store.

Portable concept underneath: Memory is a retrieval-and-lifecycle problem, not a hidden state blob. Whatever tool you use, you're making the same decisions: what to store, when to store it, how to retrieve it, and when to expire it.

Closest alternatives and when to switch:

  • Zep: use when temporal awareness or graph-structured memory becomes important, when you need to reason about when things happened or how facts relate to each other over time.
  • Letta: use when you want memory as the central design principle of your agent, with persistent agent state and self-editing memory as first-class features.
  • Framework-only session history: if your application doesn't truly need cross-session memory, don't add it. Thread memory from the previous lesson may be sufficient.

Setting up Mem0

pip install mem0ai
# memory/long_term.py
"""Long-term memory layer using Mem0.

Stores facts that persist across sessions with write policies
to control what gets stored and when.
"""
from mem0 import Memory

# Initialize Mem0.
# By default, Memory() uses OpenAI for embeddings (requires OPENAI_API_KEY).
# To use a different embedding provider, pass an embedder config:
#   memory = Memory(embedder={"provider": "ollama", "config": {"model": "nomic-embed-text"}})
# See https://docs.mem0.ai/open-source/python-quickstart for provider options.
# In production, configure a persistent backend (PostgreSQL, etc.)
memory = Memory()


def store_memory(
    content: str,
    user_id: str,
    metadata: dict | None = None,
) -> dict:
    """Store a fact in long-term memory.

    This function should only be called after write policy checks pass.
    """
    result = memory.add(
        content,
        user_id=user_id,
        metadata=metadata or {},
    )
    return result


def recall_memories(
    query: str,
    user_id: str,
    limit: int = 5,
) -> list[dict]:
    """Retrieve relevant memories for a query.

    Returns memories ranked by relevance, with metadata
    including when they were stored and their confidence.
    """
    response = memory.search(query, user_id=user_id, limit=limit)
    # Mem0's search() returns {"results": [...]} — extract the list.
    return response.get("results", []) if isinstance(response, dict) else response


def list_memories(user_id: str) -> list[dict]:
    """List all memories for a user. Useful for debugging and auditing."""
    response = memory.get_all(user_id=user_id)
    # get_all() also returns {"results": [...]}.
    return response.get("results", []) if isinstance(response, dict) else response


def delete_memory(memory_id: str) -> None:
    """Delete a specific memory. Used for expiration and PII cleanup."""
    memory.delete(memory_id)

Memory write policies

This is the most important section of this lesson. The write policy determines what long-term memory becomes, and getting it wrong creates context rot.

The hard rule: do not let the model write arbitrary long-term memory on every turn. Memory promotion should be explicit or at least policy-driven.

Here are four write policies, ordered from most conservative to most permissive:

# memory/write_policy.py
"""Memory write policies.

Controls what gets promoted to long-term storage. Each policy
implements a should_store() function that returns True/False
with a reason.
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WriteDecision:
    """Result of a write policy check."""
    should_store: bool
    reason: str
    confidence: float  # 0-1


def policy_save_after_success(
    fact: str,
    task_outcome: str,
    confidence: float,
) -> WriteDecision:
    """Store memories only after the task succeeded.

    Rationale: if the system gave a correct answer, the facts
    it used to get there are likely worth remembering. If the
    answer was wrong, the facts may be wrong too.
    """
    if task_outcome == "fully_correct" and confidence > 0.7:
        return WriteDecision(
            should_store=True,
            reason="Task succeeded with high confidence",
            confidence=confidence,
        )
    return WriteDecision(
        should_store=False,
        reason=f"Task outcome '{task_outcome}' or confidence {confidence:.2f} below threshold",
        confidence=confidence,
    )


def policy_save_after_user_confirmation(
    fact: str,
    user_confirmed: bool,
) -> WriteDecision:
    """Store memories only when the user explicitly confirms.

    Rationale: the user is the authority on what's worth remembering.
    "Remember that I prefer pytest over unittest" is explicit. An
    offhand mention of a testing preference is not.
    """
    if user_confirmed:
        return WriteDecision(
            should_store=True,
            reason="User explicitly confirmed this should be remembered",
            confidence=0.95,
        )
    return WriteDecision(
        should_store=False,
        reason="No explicit user confirmation",
        confidence=0.0,
    )


def policy_save_after_repeated_evidence(
    fact: str,
    occurrence_count: int,
    min_occurrences: int = 3,
) -> WriteDecision:
    """Store facts only after they've appeared multiple times.

    Rationale: if a fact comes up in three separate sessions,
    it's likely stable and worth remembering. One-time context
    is probably not.
    """
    if occurrence_count >= min_occurrences:
        return WriteDecision(
            should_store=True,
            reason=f"Fact appeared {occurrence_count} times (threshold: {min_occurrences})",
            confidence=min(0.95, 0.5 + occurrence_count * 0.1),
        )
    return WriteDecision(
        should_store=False,
        reason=f"Only {occurrence_count}/{min_occurrences} occurrences so far",
        confidence=0.3,
    )


def policy_expire_stale(
    memory_age_days: int,
    last_retrieved_days: int,
    max_age_days: int = 90,
    max_unused_days: int = 30,
) -> WriteDecision:
    """Check whether an existing memory should be expired.

    Two staleness signals:
    - Age: memories older than max_age_days are candidates for expiration
    - Disuse: memories not retrieved in max_unused_days are likely irrelevant
    """
    if memory_age_days > max_age_days:
        return WriteDecision(
            should_store=False,
            reason=f"Memory is {memory_age_days} days old (max: {max_age_days})",
            confidence=0.8,
        )
    if last_retrieved_days > max_unused_days:
        return WriteDecision(
            should_store=False,
            reason=f"Memory unused for {last_retrieved_days} days (max: {max_unused_days})",
            confidence=0.7,
        )
    return WriteDecision(
        should_store=True,
        reason="Memory is still fresh and recently used",
        confidence=0.9,
    )

In practice, you'll combine these policies. A reasonable starting combination: store after user confirmation OR after repeated evidence, and expire based on age and disuse. The save_after_success policy is a good default for automated workflows where the eval harness provides task outcomes.
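One way to OR-combine write-time policies is a small helper that approves a store if any policy approves, keeping the highest-confidence approval as the recorded reason. A minimal sketch (the WriteDecision here mirrors the dataclass above so the snippet is self-contained):

```python
from dataclasses import dataclass


@dataclass
class WriteDecision:
    """Mirrors the dataclass defined in memory/write_policy.py."""
    should_store: bool
    reason: str
    confidence: float


def combine_policies(decisions: list[WriteDecision]) -> WriteDecision:
    """OR-combination: store if any policy approves.

    Keeps the highest-confidence approval as the recorded reason, so the
    stored memory's metadata explains why it was promoted.
    """
    approvals = [d for d in decisions if d.should_store]
    if approvals:
        best = max(approvals, key=lambda d: d.confidence)
        return WriteDecision(True, best.reason, best.confidence)
    return WriteDecision(False, "No write policy approved this fact", 0.0)
```

Note that expiration stays separate: policy_expire_stale runs on a maintenance pass over existing memories, not at write time.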

Context rot: how memory goes wrong

Context rot in memory is gradual and hard to detect. Here are the three main failure modes and how to monitor for them:

Stale memories. A memory says "the API uses v2 auth" but the codebase was updated to v3. The system retrieves the stale memory and gives advice based on outdated information. The answer might look reasonable (it's internally consistent with the memory) but it's wrong.

Mitigation: The expiration policy catches memories by age. But age alone isn't enough. A fact can go stale in a week if the codebase changes fast. For critical facts (API versions, configuration patterns, dependency versions), tag memories with a "verify before use" flag and include a freshness check in the specialist prompt.
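That flag can be an ordinary metadata field checked at retrieval time. A minimal sketch (the verify_before_use and stored_at field names are illustrative, not part of Mem0's schema):

```python
from datetime import datetime, timedelta, timezone


def needs_verification(metadata: dict, max_fresh_days: int = 7) -> bool:
    """Decide whether a retrieved memory should be re-verified before use.

    Two triggers: an explicit 'verify_before_use' flag set at write time
    for volatile facts (API versions, dependency versions), or an age
    beyond max_fresh_days. Expects 'stored_at' as a timezone-aware ISO
    timestamp.
    """
    if metadata.get("verify_before_use"):
        return True
    stored_at = datetime.fromisoformat(metadata["stored_at"])
    age = datetime.now(timezone.utc) - stored_at
    return age > timedelta(days=max_fresh_days)
```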

Irrelevant retrievals. The memory store contains "the user asked about Python error handling last Tuesday," a meta-observation rather than a useful fact. When the user asks about Go error handling, semantic similarity pulls this irrelevant memory into context, wasting tokens and potentially confusing the specialist.

Mitigation: Write policies should filter meta-observations and one-time context. A useful memory is a fact ("this repo uses custom exception classes in lib/errors.py"), not an observation ("the user seemed interested in error handling"). The repeated-evidence policy helps here because meta-observations rarely repeat exactly.
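A crude pre-filter can catch many meta-observations before they ever reach the write policy. A sketch (the marker list is illustrative and should be tuned against your own traffic):

```python
# Phrases that usually signal a session-bound observation, not a durable fact.
META_MARKERS = (
    "the user asked",
    "the user seemed",
    "the user mentioned",
    "we discussed",
    "last week",
    "last tuesday",
    "earlier today",
)


def looks_like_meta_observation(candidate: str) -> bool:
    """Heuristic filter: reject observations about the conversation itself."""
    lowered = candidate.lower()
    return any(marker in lowered for marker in META_MARKERS)
```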

Uncontrolled writes. Without write policies, the system stores something on every turn. After 100 sessions, the memory store has thousands of entries, most of them low-value. Retrieval quality degrades because the useful facts are buried under noise.

Mitigation: This is exactly what write policies prevent. Start with the most conservative policy (user confirmation only) and relax it incrementally based on measured retrieval quality.
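The repeated-evidence policy needs an occurrence count from somewhere. A minimal tracker that normalizes case and whitespace before counting (in production this state would live in persistent storage, not an in-memory dict):

```python
from collections import Counter


class FactTracker:
    """Counts how often a (normalized) fact has been observed across sessions."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def observe(self, fact: str) -> int:
        """Record one occurrence and return the running total for this fact."""
        key = " ".join(fact.lower().split())
        self.counts[key] += 1
        return self.counts[key]
```

Feed the returned count into policy_save_after_repeated_evidence as occurrence_count.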

Security: filtering PII and credentials before storage

Any memory system that stores user interactions must filter sensitive data before writes. This isn't optional; it's a security and compliance requirement, not a convenience feature.

# memory/pii_filter.py
"""Pre-write filter for PII and credentials.

Scans content before it enters long-term memory and blocks
or redacts sensitive data. This is a defense-in-depth measure:
the write policy should also avoid storing sensitive content,
but the filter catches what the policy misses.
"""
import re


# Patterns for common sensitive data
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "phone": re.compile(r"\b(?:\+?1[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
}

CREDENTIAL_PATTERNS = {
    "api_key": re.compile(r"\b(?:sk-|pk-|api[_-]?key[=:\s]+)[A-Za-z0-9_-]{20,}\b", re.IGNORECASE),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9_.-]+\b"),
    "password_assignment": re.compile(r"(?:password|passwd|pwd)\s*[=:]\s*\S+", re.IGNORECASE),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def scan_for_sensitive_data(content: str) -> list[dict]:
    """Scan content for PII and credential patterns.

    Returns a list of findings with type, pattern matched, and location.
    """
    findings = []

    for name, pattern in {**PII_PATTERNS, **CREDENTIAL_PATTERNS}.items():
        for match in pattern.finditer(content):
            findings.append({
                "type": "pii" if name in PII_PATTERNS else "credential",
                "pattern": name,
                "match": match.group()[:20] + "..." if len(match.group()) > 20 else match.group(),
                "position": match.start(),
            })

    return findings


def filter_before_storage(content: str) -> tuple[str, list[dict]]:
    """Filter sensitive data from content before memory storage.

    Returns (filtered_content, findings).
    If findings are present, the content was modified.
    """
    findings = scan_for_sensitive_data(content)

    if not findings:
        return content, []

    filtered = content
    all_patterns = {**PII_PATTERNS, **CREDENTIAL_PATTERNS}
    # pattern.sub() replaces every occurrence globally, so one pass per
    # distinct matched pattern is enough; no positional bookkeeping needed.
    for pattern_name in {finding["pattern"] for finding in findings}:
        filtered = all_patterns[pattern_name].sub(
            f"[REDACTED_{pattern_name.upper()}]", filtered
        )

    return filtered, findings


def safe_memory_store(
    content: str,
    user_id: str,
    store_fn: callable,
    metadata: dict | None = None,
) -> dict:
    """Store memory only after PII/credential filtering.

    This wraps the actual store function with a safety layer.
    If sensitive data is found, it's redacted before storage
    and the findings are logged.
    """
    filtered_content, findings = filter_before_storage(content)

    if findings:
        # Log the filtering event (without the sensitive data)
        print(f"  PII filter: {len(findings)} sensitive items redacted before storage")
        for f in findings:
            print(f"    - {f['type']}: {f['pattern']}")

    result = store_fn(
        content=filtered_content,
        user_id=user_id,
        metadata={
            **(metadata or {}),
            "pii_filtered": len(findings) > 0,
            "pii_findings_count": len(findings),
        },
    )

    return result
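As a standalone sanity check, the redaction behavior can be exercised with a single simplified email pattern (illustrative only; the real filter iterates over all PII and credential patterns):

```python
import re

# Simplified email pattern for illustration.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def redact_emails(text: str) -> str:
    """Replace every email address with a redaction placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


print(redact_emails("Contact alice@example.com for access"))
# prints: Contact [REDACTED_EMAIL] for access
```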

Wire the filter into the memory pipeline so it runs on every write, regardless of the write policy:

# memory/long_term.py (updated store_memory)

from memory.pii_filter import safe_memory_store
from memory.write_policy import (
    policy_save_after_success,
    policy_save_after_user_confirmation,
    policy_save_after_repeated_evidence,
    WriteDecision,
)


def store_memory_with_policy(
    content: str,
    user_id: str,
    write_decision: WriteDecision,
    metadata: dict | None = None,
) -> dict | None:
    """Store memory only if write policy approves, with PII filtering.

    The pipeline: write policy check -> PII filter -> store.
    """
    if not write_decision.should_store:
        return None

    result = safe_memory_store(
        content=content,
        user_id=user_id,
        store_fn=store_memory,
        metadata={
            **(metadata or {}),
            "write_reason": write_decision.reason,
            "write_confidence": write_decision.confidence,
        },
    )

    return result

Integrating memory with the orchestration pipeline

Long-term memory adds a retrieval step at the start of the pipeline. Before routing, fetch relevant memories and include them in the context:

# orchestration/graph.py (add memory retrieval node)

from memory.long_term import recall_memories


def retrieve_long_term_context(state: dict) -> dict:
    """Retrieve relevant long-term memories for the current question.

    The retrieved context is stored in state["long_term_context"].
    Downstream, call_specialist() appends this to the specialist's
    system prompt so the model can use cross-session facts when reasoning.
    See the specialist-design lesson for that wiring.
    """
    user_id = state.get("user_id", "default")
    # recall_memories() already unwraps Mem0's {"results": [...]} envelope
    # and returns a plain list, so no further extraction is needed here.
    memories = recall_memories(
        query=state["question"],
        user_id=user_id,
        limit=5,
    )

    if memories:
        memory_context = "Relevant context from previous sessions:\n"
        for mem in memories:
            memory_context += f"- {mem.get('memory', mem.get('text', ''))}\n"
        state["long_term_context"] = memory_context
    else:
        state["long_term_context"] = ""

    return state

# Add the memory retrieval node before routing:
# graph.set_entry_point("retrieve_memories")
# graph.add_node("retrieve_memories", retrieve_long_term_context)
# graph.add_edge("retrieve_memories", "router")

Measuring long-term memory's impact

Long-term memory should improve quality on questions where stored facts are relevant. Create a benchmark that tests this:

{"id": "ltm-001", "question": "What testing framework does this project use?", "gold_answer": "pytest with conftest.py fixtures", "memory_relevant": true, "stored_fact": "This project uses pytest with shared fixtures in conftest.py"}
{"id": "ltm-002", "question": "How is authentication handled?", "gold_answer": "JWT tokens via the auth middleware", "memory_relevant": true, "stored_fact": "Authentication uses JWT tokens validated in middleware/auth.py"}
{"id": "ltm-003", "question": "What does the sort function on line 42 do?", "gold_answer": "...", "memory_relevant": false}

Run the benchmark with and without memory:

# Without long-term memory
python harness/run_harness.py --pipeline orchestrated --no-ltm
python harness/graders/answer_grader.py harness/runs/no-ltm-latest.jsonl

# With long-term memory (pre-seed the relevant facts)
python harness/run_harness.py --pipeline orchestrated --with-ltm
python harness/graders/answer_grader.py harness/runs/with-ltm-latest.jsonl

# Compare
python harness/compare_runs.py \
    harness/runs/no-ltm-latest-graded.jsonl \
    harness/runs/with-ltm-latest-graded.jsonl
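To see whether the relevant subset actually improved, split the graded results by the memory_relevant flag. A sketch (it assumes each graded record carries 'memory_relevant' and 'correct' booleans; adjust the field names to match your grader's output):

```python
import json


def subset_accuracy(graded_lines: list[str]) -> dict:
    """Accuracy for memory-relevant vs. memory-irrelevant questions.

    Assumes each graded JSONL record carries a 'memory_relevant' bool
    and a 'correct' bool (field names here are illustrative).
    """
    buckets: dict[bool, list[bool]] = {True: [], False: []}
    for line in graded_lines:
        record = json.loads(line)
        buckets[record["memory_relevant"]].append(record["correct"])
    return {
        "memory_relevant": sum(buckets[True]) / len(buckets[True]) if buckets[True] else None,
        "memory_irrelevant": sum(buckets[False]) / len(buckets[False]) if buckets[False] else None,
    }
```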

What to look for:

  • Memory-relevant questions: Accuracy should improve. The system has the answer in context before document retrieval even runs.
  • Memory-irrelevant questions: Accuracy should stay the same or slightly improve (the stored context may provide helpful background). If it drops, memories are adding noise, so tighten the retrieval limit or improve write selectivity.
  • Retrieval quality: Check whether memory retrievals displace useful document retrievals. If the context budget fills with memories, there's less room for evidence.
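For the last point, one simple guard is a hard token cap on memory context. A sketch using a rough four-characters-per-token estimate (swap in a real tokenizer for precise budgeting):

```python
def cap_memory_context(memories: list[str], max_tokens: int) -> list[str]:
    """Keep top-ranked memories until a token cap is reached.

    Uses a crude 4-characters-per-token estimate, which is fine for
    budgeting but not for billing. Assumes `memories` is already
    ranked by relevance, so truncation drops the weakest entries.
    """
    kept: list[str] = []
    used = 0
    for memory in memories:
        cost = max(1, len(memory) // 4)
        if used + cost > max_tokens:
            break
        kept.append(memory)
        used += cost
    return kept
```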

Exercises

  1. Set up Mem0 and store 10 facts about your target repository (testing framework, architecture patterns, key configuration decisions). Query the memory store with 5 questions and check whether the right facts are retrieved.

  2. Implement the four write policies. Test each one with a set of candidate facts and verify that the policies produce sensible store/skip decisions. Which policy is most conservative? Which would you use as a default?

  3. Run the PII filter on 10 sample memory candidates, including at least 2 that contain email addresses and 1 that contains an API key pattern. Verify that sensitive data is redacted before storage.

  4. Integrate long-term memory into the orchestration pipeline. Run the benchmark with and without memory on 20 questions (10 memory-relevant, 10 not). Does memory improve the relevant subset without degrading the rest?

  5. Simulate context rot: store 50 memories (half useful, half trivial observations). Run the benchmark and compare retrieval quality against a clean store with only the 25 useful memories. How much does the noise affect results?

Completion checkpoint

You have:

  • A long-term memory layer that stores and retrieves facts across sessions
  • Write policies that control memory promotion (at least two policies implemented and tested)
  • PII and credential filtering that runs before every memory write
  • Evidence from the benchmark showing that memory improves quality on relevant questions
  • A context rot simulation showing the impact of memory pollution on retrieval quality

Reflection prompts

  • What's the riskiest memory your system could store? How would your write policy and PII filter handle it?
  • If you had to choose one write policy for production, which would it be and why? What does that choice trade off?
  • Context rot is gradual. How would you set up monitoring to detect it before it noticeably degrades quality?

Connecting to the project

We've built three memory layers (thread, workflow, and long-term) and added write policies and security filtering to keep them useful and safe. Combined with the orchestration and specialist design from earlier in this module, the system now has multi-agent coordination with memory that persists across sessions.

The system is also more complex than it's ever been. More agents, more state, more retrieval sources. That complexity comes with a cost: latency, token usage, and the maintenance burden of keeping everything aligned. Module 8 addresses that cost directly. We'll look at optimization techniques — distillation, fine-tuning, and the broader optimization taxonomy — but only for systems that are stable and measurable. The evals, traces, and benchmarks we built in Module 6 ensure we can tell whether optimization actually helps. We won't optimize until we can prove the need.

What's next

Optimization Ladder. At this point you have a real system, not a demo. The next lesson helps you decide which optimizations are justified and which are just expensive distractions.

References

Start here

  • Mem0 documentation — the memory layer we use in this lesson, with guides for setup, storage backends, and retrieval configuration

Build with this

  • Zep documentation — alternative memory layer with temporal awareness and graph-structured memory
  • Letta documentation — memory-first agent framework where persistent state is the core design principle


Glossary
API (Application Programming Interface)Foundational terms
A structured way for programs to communicate. In this context, usually an HTTP endpoint you call to interact with an LLM.
AST (Abstract Syntax Tree)Foundational terms
A tree representation of source code structure. Used by parsers like Tree-sitter to understand code as a hierarchy of functions, classes, and statements. You'll encounter this more deeply in the Code Retrieval module, but the concept appears briefly in retrieval fundamentals.
BM25 (Best Match 25)Foundational terms
A classical ranking function for keyword search. Scores documents by term frequency and inverse document frequency. Often competitive with or complementary to vector search.
ChunkingFoundational terms
Splitting a document into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality. Split at the wrong place and your retrieval will return half a function or the end of one paragraph glued to the start of another.
Context engineeringFoundational terms
The discipline of selecting, packaging, and budgeting the information a model sees at inference time. Prompts, retrieved evidence, tool results, memory, and state are all parts of context. Context engineering is arguably the core skill of AI engineering. Bigger context windows are not a substitute for better context selection.
Context rotFoundational terms
Degradation of output quality caused by stale, noisy, or accumulated context. Symptoms include stale memory facts, conflicting retrieved evidence, bloated prompt history, and accumulated instructions that contradict each other. A form of technical debt in AI systems.
Context windowFoundational terms
The maximum number of tokens an LLM can process in a single request (input + output combined).
EmbeddingFoundational terms
A fixed-length numeric vector representing a piece of text. Used for similarity search: texts with similar meanings have nearby embeddings.
EndpointFoundational terms
A specific URL path that accepts requests and returns responses (e.g., POST /v1/chat/completions).
GGUFFoundational terms
A file format for quantized models used by llama.cpp and Ollama. When you see a model name like qwen2.5:7b-q4_K_M, the suffix indicates the quantization scheme. GGUF supports mixed quantization (different precision for different layers) and is the most common format for local inference.
HallucinationFoundational terms
When a model generates content that sounds confident but isn't supported by the evidence it was given, or fabricates details that don't exist. Not the same as "any wrong answer"; a model that misinterprets ambiguous instructions gave a bad answer but didn't hallucinate. Common causes: weak prompt, missing context, context rot, model limitation, or retrieval failure.
InferenceFoundational terms
Running a trained model to generate output from input. What happens when you call an API. Most AI engineering work is inference-time work: building systems around models, not training them. Use "inference," not "inferencing."
JSON (JavaScript Object Notation)Foundational terms
A lightweight text format for structured data. The lingua franca of API communication.
Lexical searchFoundational terms
Finding items by matching keywords or terms. Includes BM25, TF-IDF (Term Frequency–Inverse Document Frequency), and simple keyword matching. Returns exact term matches, not semantic similarity.
LLM (Large Language Model)Foundational terms
A neural network trained on large text corpora that generates text by predicting the next token. The core technology behind AI engineering; every tool, pattern, and pipeline in this curriculum runs on top of one.
MetadataFoundational terms
Structured information about a document or chunk (file path, language, author, date, symbol type). Used for filtering retrieval results.
Neural networkFoundational terms
A computing system loosely inspired by biological neurons, built from layers of mathematical functions that transform inputs into outputs. LLMs are a specific type of neural network (transformers) trained on text. You don't need to understand neural network internals to do AI engineering, but knowing the term helps when reading external resources.
Reasoning modelFoundational terms
A model optimized for complex multi-step planning, math, and logic (e.g., o3, o4-mini). Slower and more expensive but better on hard problems. Sometimes called "LRM" (large reasoning model), but "reasoning model" is the more consistent term across provider docs.
RerankingFoundational terms
A second-pass scoring step that re-orders retrieved results using a more expensive model. Improves precision after an initial broad retrieval.
SchemaFoundational terms
A formal description of the shape and types of a data structure. Used to validate inputs and outputs.
SLM (small language model)Foundational terms
A compact model (typically 1-7B parameters) that runs on consumer hardware with lower cost, latency, and better privacy (e.g., Phi, small Llama variants, Gemma). The right choice when privacy, offline operation, predictable cost, or low latency matter more than peak capability.
System promptFoundational terms
A special message that sets the model's behavior, role, and constraints for a conversation.
TemperatureFoundational terms
A parameter controlling output randomness. Lower values produce more deterministic output; higher values produce more varied output. Does not affect the model's intelligence.
TokenFoundational terms
The basic unit an LLM processes. Not a word. Tokens are sub-word fragments. "unhappiness" might be three tokens: "un", "happi", "ness". Token count determines cost and context window usage.
Top-kFoundational terms
The number of results returned from a retrieval query. "Top-5" means the five highest-scoring results.
Top-p (nucleus sampling)Foundational terms
An alternative to temperature for controlling output diversity. Selects from the smallest set of tokens whose cumulative probability exceeds p.
Vector searchFoundational terms
Finding items by proximity in embedding space (nearest neighbors). Returns "similar" results, not "exact match" results.
vLLM (virtual LLM)
An inference serving engine (not a model) that hosts open-weight models behind an OpenAI-compatible HTTP endpoint. Infrastructure layer, not model layer. Relevant when moving from hosted APIs to self-hosting.

Weights
The learned parameters inside a model. Changed during training, fixed during inference.

Workhorse model
A general-purpose LLM optimized for speed and broad capability (e.g., GPT-4o-mini, Claude Haiku, Gemini Flash). The default for most tasks. When someone says "LLM" without qualification, they usually mean this.
Benchmark and Harness terms

Baseline
The first measured performance of your system on a benchmark. Everything else is compared against this. Without a baseline, you can't tell whether a change helped.

Benchmark
A fixed set of questions or tasks with known-good answers, used to measure system performance over time.

Run log
A structured record (typically JSONL) of every system run: what input was given, what output was produced, what tools were called, how long it took, and what it cost. The raw data that evals, telemetry, and cost analysis are built from.
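A minimal JSONL run-log writer, assuming one flat record per run. The field names are illustrative, not a standard:

```python
import json
import time

def log_run(path, record):
    """Append one run record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_run("runs.jsonl", {
    "ts": time.time(),
    "input": "Where is the auth middleware registered?",
    "output": "In app/server.py, register_middleware().",
    "tools_called": ["grep_search", "read_file"],
    "latency_s": 2.4,
    "cost_usd": 0.0031,
})
```

Because each line is independent JSON, the file can be appended to safely and parsed line-by-line later by evals and cost analysis.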
Agent and Tool Building terms

A2A (Agent-to-Agent protocol)
An open protocol for peer-to-peer agent collaboration. Agents discover each other's capabilities and delegate or negotiate tasks as equals. Different from MCP (which connects agents to tools, not to other agents) and from handoffs (which transfer control within one system).

Agent
A system where an LLM decides which tools to call, observes results, and iterates until a task is complete. Agent = model + tools + control loop.

Control loop
The code that manages the agent's cycle: send prompt, check for tool calls, execute tools, append results, repeat or finish.
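The cycle can be sketched in a few lines. Here `model` is any callable returning either a tool request or a final answer; the dict shapes are stand-ins for a real chat API's response format, not any specific provider's:

```python
def run_agent(model, tools, task, max_steps=5):
    """Minimal control loop: prompt, check for tool calls, execute, repeat."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        if "final" in reply:               # model is done: return its answer
            return reply["final"]
        result = tools[reply["tool"]](**reply["args"])  # execute requested tool
        messages.append({"role": "tool", "content": str(result)})  # feed result back
    return "max steps exceeded"

# Scripted fake model: first requests a tool, then answers.
replies = iter([{"tool": "add", "args": {"a": 2, "b": 3}}, {"final": "2 + 3 = 5"}])
print(run_agent(lambda msgs: next(replies), {"add": lambda a, b: a + b}, "what is 2+3?"))
# → 2 + 3 = 5
```

The `max_steps` cap is the part beginners forget: without it, a confused model can loop on tool calls indefinitely.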
Handoff
Passing control from one agent or specialist to another within an orchestrated system.

MCP (Model Context Protocol)
An open protocol for exposing tools, resources, and prompts to AI applications in a standardized way. Connects agents to capabilities (tools and data), not to other agents.

Tool calling / function calling
The model's ability to request execution of a specific function with structured arguments, rather than just generating text.
Code Retrieval terms

Context compilation / context packing
The process of selecting and assembling the smallest useful set of evidence for a specific task. Not "dump everything retrieved into the prompt."

Grounding
Tying model assertions to specific evidence. A grounded answer cites what it found; an ungrounded answer asserts without evidence.

Hybrid retrieval
Combining multiple retrieval methods (e.g., vector search + keyword search + metadata filters) and merging or reranking the results.
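One common way to merge ranked lists from different retrievers is reciprocal rank fusion: items ranked high in any list accumulate score. A sketch with made-up result lists:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked result lists; items ranked high in any list score well."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "session.py", "utils.py"]
keyword_hits = ["login.md", "auth.py", "config.py"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# 'auth.py' comes first: it appears near the top of both lists
```

The constant k dampens the advantage of rank-1 items; 60 is a conventional default, not a tuned value.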
Knowledge graph
A data structure that stores entities and their relationships explicitly (e.g., "function A calls function B," "module X imports module Y"). Useful for traversal and dependency reasoning. One retrieval strategy among several, often overused when simpler metadata or adjacency tables would suffice.

RAG (Retrieval-Augmented Generation)
A pattern where the model's response is grounded in retrieved external evidence rather than relying solely on its training data.

Symbol table
A mapping of code identifiers (functions, classes, variables) to their locations and metadata.
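For Python source, a minimal symbol table can be built with the standard-library `ast` module (Tree-sitter is the language-agnostic equivalent). A sketch covering only top-level names and line numbers:

```python
import ast

def build_symbol_table(source):
    """Map function and class names to the line numbers where they are defined."""
    table = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            table[node.name] = node.lineno
    return table

code = """class Cache:
    def get(self, key):
        return None

def main():
    pass
"""
print(build_symbol_table(code))  # {'Cache': 1, 'main': 5, 'get': 2}
```

A production symbol table would also record the enclosing scope, signature, and file path, so retrieval can answer "where is X defined?" without re-parsing.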
Tree-sitter
An incremental parsing library that builds ASTs for source code. Used in this curriculum for code-aware chunking and symbol extraction.
RAG and Grounded Answers terms

Context pack
A structured bundle of evidence assembled for a specific task, with metadata about provenance, relevance, and token budget.

Evidence bundle
A collection of retrieved items grouped for a specific sub-task, with enough metadata to evaluate whether the evidence is relevant and sufficient.

Retrieval routing
Deciding which retrieval strategy or method to use for a given query. Different questions need different retrieval methods.
Observability and Evals terms

Eval
A structured test that measures system quality. Not the same as training. Evals measure; they don't change the model.

Harness (AI harness / eval harness)
The experiment and evaluation framework around your model or agent. It runs benchmark tasks, captures outputs, logs traces, grades results, and compares system versions. It turns ad hoc "try it and see" into repeatable, comparable experiments. Typically includes: input dataset, prompt and tool configuration, model/provider selection, execution loop, logging, grading, and artifact capture.
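Reduced to its skeleton, a harness runs each benchmark task through the system, grades the output, and reports a score. Everything here (the `toy_system` stand-in, the field names) is illustrative, assuming exact-match grading:

```python
def run_benchmark(system, benchmark):
    """Run each task through the system and grade with exact match."""
    results = []
    for case in benchmark:
        output = system(case["input"])
        results.append({"input": case["input"], "output": output,
                        "pass": output == case["expected"]})
    score = sum(r["pass"] for r in results) / len(results)
    return score, results

benchmark = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
toy_system = lambda q: {"2+2": "4", "capital of France": "Lyon"}.get(q, "")
score, results = run_benchmark(toy_system, benchmark)
print(score)  # 0.5: one of two cases passed
```

A real harness adds what this sketch omits: trace logging per case, model/provider configuration, and comparison against a stored baseline.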
LLM-as-judge
Using a language model to evaluate or grade the output of another model or system. Useful for scaling evaluation beyond manual review, but requires rubric quality, judge consistency checks, and human spot-checking. Not a replacement for exact-match checks where they apply.

OpenTelemetry (OTel)
An open standard for collecting and exporting telemetry data (traces, metrics, logs). Vendor-agnostic.

RAGAS
A specific eval framework for retrieval-augmented generation. Measures metrics like faithfulness, relevance, and context precision. One tool example, not a foundational concept. Learn the metrics first, then the tool.

Span
A single operation within a trace (e.g., one tool call, one retrieval query). Traces are made of spans.

Telemetry
Structured data about system behavior: what happened, when, how long it took, what it cost. Includes traces, metrics, and events.

Trace
A structured record of one complete run through the system, including all steps, tool calls, and decisions.
Orchestration and Memory terms

Long-term memory
Persistent facts that survive across conversations. Requires write policies to manage what gets stored, updated, or deleted.
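A write policy can start as a plain gating function run before any memory write. This is a hypothetical sketch: block credential-shaped strings, skip trivia and duplicates. A real policy would add PII detection, freshness checks, and TTLs:

```python
import re

# Patterns for obviously secret-shaped content. Illustrative, not exhaustive.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # API-key-shaped strings
    re.compile(r"(?i)password\s*[:=]\s*\S+"),  # inline passwords
]

def should_store(fact, existing):
    """Gate a candidate fact before it is promoted to long-term memory."""
    if len(fact.strip()) < 10:                 # too trivial to keep
        return False
    if any(p.search(fact) for p in SECRET_PATTERNS):
        return False                           # never persist credentials
    if fact in existing:                       # no duplicate entries
        return False
    return True

memory = {"this repository uses pytest for testing"}
print(should_store("the user prefers type annotations", memory))        # True
print(should_store("password: hunter2", memory))                        # False
print(should_store("this repository uses pytest for testing", memory))  # False
```

The point is that promotion becomes explicit and reviewable: every rejected write is a decision you can log and audit, instead of letting the model decide on every turn.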
Orchestration
Explicit control over how tasks are routed, delegated, and synthesized across multiple agents or specialists.

Router
A component that decides which specialist or workflow path to use for a given query.
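The simplest router is a keyword table; production routers often use an LLM classifier instead. The rules and specialist names below are made up for illustration:

```python
def route(query):
    """Pick a specialist for a query by keyword match; fall back to a default."""
    rules = [
        (("test", "pytest", "coverage"), "test_generation"),
        (("docs", "readme", "documentation"), "documentation_lookup"),
    ]
    q = query.lower()
    for keywords, specialist in rules:
        if any(word in q for word in keywords):
            return specialist
    return "code_search"  # default specialist

print(route("add pytest coverage for the auth module"))  # test_generation
print(route("where is retry logic implemented?"))        # code_search
```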
Specialist
An agent or workflow tuned for a narrow task (e.g., "code search," "documentation lookup," "test generation"). Specialists are composed by an orchestrator.

Thread memory
Conversation state that persists within a single session or thread.

Workflow memory
Intermediate state that persists within a multi-step task but doesn't survive beyond the workflow's completion.
Optimization terms

Catastrophic forgetting
When fine-tuning causes a model to lose capabilities it had before training. The model gets better at the fine-tuned task but worse at tasks it previously handled. PEFT methods like LoRA reduce this risk by freezing original weights.

Distillation
Training a smaller (student) model to reproduce the behavior of a larger (teacher) model on a specific task.

DPO (Direct Preference Optimization)
A method for preference-based model optimization that's simpler than RLHF, training the model directly on preference pairs without a separate reward model.

Fine-tuning
Updating a model's weights on task-specific data to change its behavior permanently. An umbrella term that includes SFT, instruction tuning, RLHF, DPO, and other techniques. See the fine-tuning landscape table in Lesson 8.3 for how these relate.
Full fine-tuning
Updating all of a model's parameters during training, as opposed to PEFT methods that update only a small subset. Requires significantly more GPU memory and compute. Produces the most thorough adaptation but carries higher risk of catastrophic forgetting.

Inference server
Software (like vLLM or Ollama) that hosts a model and serves inference requests.

Instruction tuning
A specific application of SFT where the training data consists of instruction-response pairs. This is how base models become chat models: the technique is SFT, the data format is instructions. Not a separate technique from SFT.

LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that trains small adapter matrices instead of updating all model weights. Dramatically reduces GPU memory and compute requirements.

Parameter count
The number of learned weights in a model, commonly expressed in billions (e.g., "7B" = 7 billion parameters). Determines memory requirements (roughly 2 bytes per parameter at FP16) and broadly correlates with capability, though training quality and architecture matter as much as size. See Model Selection and Serving for sizing guidance.
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods (including LoRA) that fine-tune a small subset of parameters instead of the full model.

Preference optimization
Training methods (RLHF, DPO) that use human or automated preference signals to improve model behavior. "This output is better than that output" rather than "this is the correct output."

QLoRA (Quantized LoRA)
LoRA applied to a quantized (compressed) base model. Further reduces memory requirements, enabling fine-tuning on consumer hardware.

Quantization
Reducing the precision of model weights (e.g., FP16 → INT4) to shrink memory usage and increase inference speed at some quality cost. A 7B model at FP16 needs ~14 GB VRAM; quantized to 4-bit, it fits in ~4 GB. Common formats include GGUF (llama.cpp/Ollama), GPTQ and AWQ (vLLM/HuggingFace). See Model Selection and Serving for format details and tradeoffs.
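The memory arithmetic behind those figures is easy to reproduce. This counts weight storage only, in decimal gigabytes; activations and the KV cache add more on top:

```python
def model_memory_gb(params_billions, bits_per_weight):
    """Approximate weight memory: parameters x bits per weight, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(round(model_memory_gb(7, 16), 1))  # 14.0 (FP16; matches the ~14 GB figure)
print(round(model_memory_gb(7, 4), 1))   # 3.5 (4-bit; consistent with ~4 GB)
```

Quantized formats also carry per-block scale metadata, which is why real 4-bit files run slightly larger than this back-of-the-envelope number.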
Overfitting
When a model memorizes training examples instead of learning generalizable patterns. The model performs well on training data but poorly on new inputs. Detected by monitoring validation loss alongside training loss.

RLHF (Reinforcement Learning from Human Feedback)
A training method that uses human preference signals to improve model behavior through a reward model. More complex than DPO (requires training a separate reward model) but offers more control over the optimization objective.

SFT (Supervised Fine-Tuning)
Fine-tuning using input-output pairs where the desired output is known. The most common fine-tuning approach.

TRL (Transformer Reinforcement Learning)
A Hugging Face library for training language models with reinforcement learning, SFT, and other optimization methods.
Cross-cutting terms

Consumer chat app
The browser or desktop product meant for human conversation (ChatGPT, Claude, HuggingChat). Useful for experimentation, but not the same as API access.

Developer platform
The provider's API, billing, API-key, and developer-docs surface. This is what you need for this learning path.

Hosted API
The provider runs the model for you and you call it over HTTP.

Local inference
You run the model on your own machine.

Provider
The company or service that hosts a model API you call from code.

Prompt caching
Reusing computation from repeated prompt prefixes to reduce latency and cost on subsequent requests with the same prefix.

Rate limiting
Constraints on how many API requests you can make per unit of time. An operational concern that affects system design and cost.

Token budget
The maximum number of tokens you allocate for a specific part of the context (e.g., "retrieval evidence gets at most 4K tokens"). A context engineering tool for preventing any single component from dominating the context window.
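Enforcing a token budget can be as simple as greedy packing. In this sketch, word count stands in for a real tokenizer, and the evidence strings are made up:

```python
def pack_to_budget(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily add chunks until the token budget is spent.

    Chunks are assumed pre-sorted by relevance; word count approximates
    real token counting here.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # skip chunks that would blow the budget
        packed.append(chunk)
        used += cost
    return packed

evidence = [
    "def authenticate(user): checks the session token against the store",
    "README section describing the overall architecture of the service",
    "unrelated changelog entry about dependency version bumps in March",
]
print(pack_to_budget(evidence, budget_tokens=20))  # keeps only the first two
```

Swapping `count_tokens` for a real tokenizer is the only change needed to make the budget exact rather than approximate.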