How to Build a Production-Ready LLM Memory System with Mem0, pgvector, and LangChain

Learn how to build a persistent LLM memory system with Mem0, pgvector, and LangChain for smarter, personalized AI conversations.


I’ve been building conversational AI for a while now, and one thing kept driving me crazy: every interaction started from scratch. Users would tell me their name, their dietary restrictions, their favorite book—and the next time they came back, the system acted like it had never met them. I knew about buffers and session state, but those felt like putting a Band-Aid on a broken leg. The problem wasn’t just storing a few recent messages; it was remembering who a person is across days, across topics, across contexts. That’s when I realized we need a proper memory system for LLMs. Not just a list of past chats, but a structured, persistent, layered memory that actually grows with the user.

So I set out to build a production-ready memory system using three components: Mem0 for automatic fact extraction and retrieval, PostgreSQL with pgvector for persistent vector storage, and LangChain to tie everything together with the LLM. The result is something that remembers like a human—not by keeping every word, but by distilling what matters.

Let me walk you through how I did it, why certain choices matter, and how you can avoid the mistakes I made.


First, think about how human memory works. You don’t remember every conversation verbatim. You remember facts (semantic memory) and experiences (episodic memory), and you use your working memory to hold the current context. Most AI systems only have working memory—the last few messages. That’s like having a brain that flushes all long-term memories every time you blink. You end up repeating yourself forever.

What we need are three layers: working memory (current chat history), episodic memory (summaries of past sessions stored as vectors), and semantic memory (extracted facts about the user—preferences, traits, knowledge). Each layer serves a different purpose. Working memory handles immediate coherence. Episodic memory provides context from last week. Semantic memory tells the system that you’re allergic to peanuts or that you love sci-fi. Combined, they make the AI feel like it truly knows you.
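
To make that concrete, here's a minimal sketch of how I think about the three layers in code. The MemoryLayers container and its field names are purely illustrative, not part of any library:

from dataclasses import dataclass, field

@dataclass
class MemoryLayers:
    working: list[dict] = field(default_factory=list)   # recent chat turns kept in the prompt
    episodic: list[str] = field(default_factory=list)   # session summaries, stored as vectors
    semantic: list[str] = field(default_factory=list)   # extracted facts: preferences, traits, knowledge

    def to_prompt_context(self) -> str:
        # Combine the three layers into one context block for the system prompt
        facts = "\n".join(self.semantic)
        summaries = "\n".join(self.episodic)
        recent = "\n".join(f"{m['role']}: {m['content']}" for m in self.working)
        return (
            f"Known facts:\n{facts}\n\n"
            f"Past sessions:\n{summaries}\n\n"
            f"Current conversation:\n{recent}"
        )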

I started with the data storage. PostgreSQL with the pgvector extension is perfect for this. It’s a battle-proven relational database that also handles vector similarity search. I used Docker Compose to spin up a PostgreSQL instance with pgvector enabled, along with Redis for caching. The setup is straightforward: a docker-compose.yml file that defines both services, with health checks and persistent volumes. Nothing fancy, but reliable.

volumes:
  pgdata:

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_USER: memory_user
      POSTGRES_PASSWORD: memory_pass
      POSTGRES_DB: llm_memory
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U memory_user -d llm_memory"]
      interval: 5s
      retries: 5

  redis:  # caching layer mentioned above, used for short-lived session state
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 5

Why PostgreSQL over a purpose-built vector database? Because your memory system will also need to store metadata, user IDs, timestamps, and maybe even relational links between memories. Having everything in one database avoids synchronization headaches and simplifies backups. The vector search is fast enough for most production use cases when you index properly.
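
As a rough example of "indexing properly": pgvector supports HNSW indexes for cosine similarity. The table and column names below are assumptions (check the schema your setup actually creates), but the shape of the statement stays the same:

import psycopg

# Assumed table/column names: adjust "user_memories" and "embedding" to your schema.
CREATE_INDEX = """
CREATE INDEX IF NOT EXISTS idx_user_memories_embedding
ON user_memories USING hnsw (embedding vector_cosine_ops);
"""

with psycopg.connect("postgresql://memory_user:memory_pass@localhost/llm_memory") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector;")  # no-op if already enabled
    conn.execute(CREATE_INDEX)
    conn.commit()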

Now, the heart of the system is Mem0. It’s a framework for automatic memory extraction. You feed it a conversation, and it extracts factual statements, updates existing memories, and stores them as vectors. I hooked it up to my PostgreSQL instance by providing connection details. Mem0 uses the text-embedding-3-small model from OpenAI to convert text to vectors, but you can swap that out.

from mem0 import Memory

config = {
    "vector_store": {
        "provider": "pgvector",
        "config": {
            # Some Mem0 versions expect separate host/port/user/password/dbname
            # fields here instead of a single connection string.
            "connection_string": "postgresql+psycopg://memory_user:memory_pass@localhost/llm_memory",
            "collection_name": "user_memories"
        }
    },
    "embedder": {  # Mem0's config key for the embedding model
        "provider": "openai",
        "config": {
            "model": "text-embedding-3-small"
        }
    }
}

memory = Memory.from_config(config)

With that client, storing a memory is as simple as calling memory.add(messages, user_id="user_123"). Mem0 automatically analyzes the conversation, extracts entities, facts, and preferences, and stores them with proper scoping.
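
In practice, the add-then-search loop looks something like this (the conversation content is just an example):

messages = [
    {"role": "user", "content": "I'm vegetarian and I love dystopian sci-fi."},
    {"role": "assistant", "content": "Noted! I'll keep that in mind for recommendations."},
]

# Mem0 extracts facts like "Is vegetarian" and "Likes dystopian sci-fi" and scopes them to this user
memory.add(messages, user_id="user_123")

# Later, retrieve only the facts relevant to the current query
hits = memory.search("What should I cook tonight?", user_id="user_123", limit=5)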

What about retrieving the right memories at the right time? That’s where retrieval optimization comes into play. I built a hybrid approach: first, search for semantic memories using cosine similarity on the vector embedding of the current user query. Second, if the result score is below a threshold, fall back to episodic memories from recent sessions. Finally, always include the last 10 working memory messages. This hybrid retrieval ensures that the most relevant facts surface without bloating the prompt.
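
Here’s a minimal sketch of that hybrid logic. The relevance threshold is something you tune on your own data, and fetch_recent_session_summaries is a hypothetical helper standing in for whatever episodic store you use:

RELEVANCE_THRESHOLD = 0.4  # tune against your own data

def fetch_recent_session_summaries(user_id, limit=3):
    # Hypothetical helper: look up episodic session summaries from your own store.
    return []

def build_context(memory, query, user_id, working_memory):
    # 1. Semantic facts via cosine similarity on the query embedding
    hits = memory.search(query, user_id=user_id, limit=5)["results"]
    facts = [h.get("memory", "") for h in hits if h.get("score", 0) >= RELEVANCE_THRESHOLD]

    # 2. If semantic recall is weak, fall back to episodic summaries of recent sessions
    if not facts:
        facts = fetch_recent_session_summaries(user_id, limit=3)

    # 3. Always keep the last 10 working-memory messages for immediate coherence
    return {"memories": facts, "recent_messages": working_memory[-10:]}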

I recall a specific user session where this made a huge difference. Someone asked me for book recommendations. Without the memory system, I’d say, “Sure, what genre do you like?” With it, I retrieved a memory from two weeks ago: “User mentioned loving dystopian novels with strong female leads.” So I immediately recommended “The Power” by Naomi Alderman. The user was astonished. That moment convinced me that this architecture isn’t just a nice feature—it’s necessary.

Let me show you the LangChain integration. I created a custom chain that first retrieves relevant memories, then injects them into the system prompt.

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

class MemoryAwareChain:
    def __init__(self, memory_client, user_id):
        self.memory = memory_client
        self.user_id = user_id
        self.llm = ChatOpenAI(model="gpt-4o-mini")

    def run(self, query):
        # Retrieve the top memories relevant to this query, scoped to this user
        results = self.memory.search(query, user_id=self.user_id, limit=5)
        # Mem0 returns each extracted fact under the "memory" key ("text" in some older releases)
        memory_context = "\n".join(
            m.get("memory") or m.get("text", "") for m in results["results"]
        )

        prompt = PromptTemplate(
            input_variables=["memory", "query"],
            template=(
                "You are a helpful assistant. Here are memories about this user:\n"
                "{memory}\n\n"
                "Now answer the user's query: {query}"
            )
        )
        chain = LLMChain(llm=self.llm, prompt=prompt)
        return chain.run(memory=memory_context, query=query)

Notice how we keep the prompt clean. The memory context is injected as facts, not raw conversation history. This reduces token usage and keeps the LLM focused.

How do we handle memory conflicts? Over time, a user might say, “I used to like coffee, but now I prefer tea.” The system needs to update, not append. Mem0 handles this by checking for similarity with existing memories before insertion. If a new memory is similar but contradictory, it updates the existing one. I also added a timestamp and a confidence score to each memory, so that when retrieving, we can weigh recent memories more heavily.
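
Here’s roughly how I weigh memories at retrieval time. The confidence and created_at fields are metadata I attach myself when adding memories; they are not Mem0 defaults:

import math
import time

def weighted_score(hit, half_life_days=30.0):
    # Blend vector similarity with recency and a stored confidence value.
    # "confidence" and "created_at" are custom metadata, not Mem0 defaults.
    metadata = hit.get("metadata") or {}
    similarity = hit.get("score", 0.0)
    confidence = metadata.get("confidence", 1.0)
    created_at = metadata.get("created_at", time.time())
    age_days = (time.time() - created_at) / 86400
    recency = math.exp(-age_days / half_life_days)  # exponential decay with age
    return similarity * confidence * (0.5 + 0.5 * recency)

# Re-rank retrieved hits before injecting them into the prompt:
# ranked = sorted(results["results"], key=weighted_score, reverse=True)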

For privacy and scaling, I scoped all memories to a unique user ID. Each user has their own collection of vectors. When a user deletes their account, we can wipe their entire memory set with a single SQL command. I also implemented a TTL (time to live) for episodic memories—summaries of old sessions get removed after 90 days if not updated. This prevents the vector database from bloating.
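
For illustration, the cleanup jobs look something like this. The table and payload field names are assumptions; adjust them to whatever schema your Mem0/pgvector setup actually creates:

import psycopg

# Assumed table and payload structure; verify against your actual schema.
WIPE_USER = "DELETE FROM user_memories WHERE payload->>'user_id' = %s;"
EXPIRE_EPISODIC = """
DELETE FROM user_memories
WHERE payload->>'memory_type' = 'episodic'
  AND (payload->>'updated_at')::timestamptz < now() - interval '90 days';
"""

with psycopg.connect("postgresql://memory_user:memory_pass@localhost/llm_memory") as conn:
    conn.execute(WIPE_USER, ("user_123",))  # account deletion: one statement wipes the user's memories
    conn.execute(EXPIRE_EPISODIC)           # scheduled job: drop stale episodic summaries after 90 days
    conn.commit()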

One personal tweak I made: I added a “memory review” endpoint where users can read what the system remembers about them and delete individual memories. Some people find the idea of an AI remembering everything creepy. Giving them control builds trust.
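
A stripped-down version of that review endpoint, assuming the Memory client from earlier is available as memory:

from fastapi import FastAPI

app = FastAPI()

@app.get("/memories/{user_id}")
def review_memories(user_id: str):
    # Show users exactly what the system remembers about them
    return memory.get_all(user_id=user_id)

@app.delete("/memories/{memory_id}")
def forget_memory(memory_id: str):
    # Let users delete any individual memory they are uncomfortable with
    memory.delete(memory_id=memory_id)
    return {"status": "deleted"}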

Now, testing this system taught me something important. Don’t trust synthetic benchmarks. I ran a bunch of test conversations with fictional users, and the memory retrieval looked perfect. But live users had messy speech, incomplete sentences, and contradictions. My extraction rate dropped. I fixed it by fine-tuning the extraction prompt that Mem0 uses internally, adjusting the system role to be more conservative: only extract statements that are unambiguous facts.
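
If you want to do the same, recent Mem0 releases let you override the fact-extraction prompt via config. The exact key name has shifted between versions, so treat this as a sketch and check your version’s docs:

CONSERVATIVE_EXTRACTION_PROMPT = (
    "Extract only statements that are unambiguous, durable facts about the user, "
    "such as stated preferences, constraints, or biographical details. "
    "Ignore speculation, jokes, one-off context, and anything stated tentatively."
)

# Key name may differ across Mem0 versions; reuses the config dict from earlier.
config["custom_fact_extraction_prompt"] = CONSERVATIVE_EXTRACTION_PROMPT
memory = Memory.from_config(config)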

Another lesson: never use the user’s actual identity in the memory text. Instead of storing “Alice is allergic to peanuts,” store “User prefers peanut-free foods.” Keeping names out of the memory text reduces the personal data you hold and simplifies export and deletion requests.

The final architecture looks like this: FastAPI serves endpoints for chat, memory review, and account deletion. Each chat request triggers the memory-aware chain, which first retrieves relevant memories, then calls the LLM with both the memory context and the current query. After the LLM responds, I run a background task that calls memory.add with the exchange to extract new facts. That way, every conversation becomes part of the knowledge base.
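
Here’s a condensed sketch of the chat endpoint, reusing the memory client and MemoryAwareChain from earlier. FastAPI’s BackgroundTasks runs the extraction after the response is sent:

from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    user_id: str
    message: str

def persist_exchange(user_id: str, user_msg: str, reply: str):
    # Runs after the response has been sent, so extraction adds no latency to the chat
    memory.add(
        [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": reply},
        ],
        user_id=user_id,
    )

@app.post("/chat")
def chat(req: ChatRequest, background_tasks: BackgroundTasks):
    chain = MemoryAwareChain(memory, req.user_id)  # memory-aware chain from earlier
    reply = chain.run(req.message)
    background_tasks.add_task(persist_exchange, req.user_id, req.message, reply)
    return {"reply": reply}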

What about cost? Embedding extraction and vector search are cheap. The LLM call is the expensive part. By reducing the number of tokens per call—only injecting relevant memories instead of full histories—I actually saved money compared to naive prompt stuffing. For a typical user with 50 past interactions, the memory context fits in 300 tokens. That’s a tenth of what a full history dump would require.

I sincerely believe that memory systems are the missing piece in making AI feel personal. Without them, every chatbot is a stranger. With them, it becomes a companion that grows with you. And this stack—Mem0, PostgreSQL with pgvector, LangChain—is battle-tested enough to deploy to production today. I’ve been running it for months with hundreds of users, and the feedback is consistent: “It feels like you actually remember me.”

If you’ve been struggling with your own bot forgetting everything, try this approach. Start small: just add Mem0 to a single endpoint, and you’ll already see a difference. Then layer in the retrieval, the TTL, and the review endpoint. You’ll never want to go back to a memoryless LLM.

I’d love to hear how you’re handling LLM memory in your own projects. What challenges have you faced? Drop a comment below. And if this article helped you, hit like and share it with someone else building conversational AI. Your feedback honestly makes me want to write more of these deep dives.

