
What is retrieval-augmented generation (RAG)?

TL;DR

Retrieval-augmented generation is a pattern where an LLM answers questions by first retrieving relevant content from a proprietary source, then generating a response that is grounded in that retrieved content. RAG solves the most common enterprise LLM problem: the base model does not know the company's internal documents, policies, or historical records. Rather than retraining the model, the system retrieves from a vector index at query time and injects the results into the prompt.

The short version

  • RAG grounds LLM answers in proprietary content by retrieving relevant chunks at query time.
  • The core components are an embedding model, a vector store, a retriever, and an LLM.
  • RAG beats fine-tuning when data changes often or citations matter; most enterprise deployments start here.

The longer explanation

The problem RAG solves

The base LLM does not know your company. It has not read your contracts, your compliance guidance, your historical case resolutions, or your product documentation. Out of the box, asking it "what is our policy on X" yields a plausible-sounding answer that is often wrong. RAG fixes this by retrieving the actual policy at query time and giving it to the model as context.

The architecture

Four components matter:

  1. Embedding model. Converts text into a numerical vector that captures semantic meaning. Popular choices include OpenAI's embedding models, open-source alternatives such as BGE and E5, and domain-tuned embeddings for specialized content.
  2. Chunking strategy. Documents are broken into chunks before embedding. The chunking policy (size, overlap, boundary detection) materially affects retrieval quality.
  3. Vector store. Stores embeddings and supports similarity search. Common choices: Pinecone (managed), Weaviate, Qdrant, pgvector (a Postgres extension), Milvus. Our TWSS CS Agent uses Postgres + pgvector in most deployments for ease of operation.
  4. Retriever and re-ranker. A retriever returns candidate chunks based on vector similarity; an optional re-ranker (often a smaller LLM or a cross-encoder) scores the candidates and picks the best for the final prompt.
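The four components above can be sketched end to end in a few dozen lines. This is a minimal illustration, not production code: `embed` is a stand-in for a real embedding model (a plain bag-of-words count vector), the "vector store" is a Python list, and the chunk sizes are toy values.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    # Fixed-size word windows with overlap; real systems also respect
    # sentence and section boundaries.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# "Vector store": chunks are embedded once, at index time.
document = ("Refunds are issued within 14 days of a return. Shipping is "
            "free over 50 dollars. Support is available on weekdays.")
index = [(c, embed(c)) for c in chunk(document, size=8, overlap=2)]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

# Retrieved chunks are injected into the prompt at query time.
top = retrieve("when are refunds issued")
prompt = ("Answer using only this context:\n" + "\n".join(top)
          + "\n\nQ: when are refunds issued")
```

In a real deployment the indexing step runs offline whenever documents change, while `retrieve` runs per query, and a re-ranker would score the top candidates before they reach the final prompt.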

RAG versus fine-tuning

RAG wins when:

  • The source content changes often.
  • Different users should see different subsets (row-level security).
  • Citations and source attribution are required.
  • The base model already has the skill; it just needs the facts.

Fine-tuning wins when:

  • The model needs a new capability that cannot be expressed as retrieval.
  • The task has a stylistic or structural pattern the model struggles to produce reliably.
  • Latency or cost at inference is a priority and you can bake the behavior into the weights.

Most enterprise deployments are RAG-first. Fine-tuning, when it happens, is layered on top after the retrieval-grounded version is in production.

Common failure modes

  • Poor chunking. Chunks that split sentences or paragraphs mid-thought produce incoherent retrieval. Chunking strategy is the single biggest lever after embedding choice.
  • Stale index. The vector store is not updated when the source content changes; the system answers from yesterday's policy.
  • Weak retrieval on multi-hop questions. The question requires synthesizing two separate documents; vanilla RAG retrieves each but does not reason across them. Solutions include query rewriting and multi-step retrieval.
  • Evaluation blind spots. Without grounding evaluation in production, hallucination creeps in silently as the corpus or query distribution shifts.
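For the multi-hop failure mode, the query-rewriting idea can be sketched as follows. Everything here is illustrative: `decompose` is a hardcoded stand-in for the LLM call that would rewrite the question, and `retrieve` fakes the vector-store lookup with keyword matching.

```python
def decompose(question: str) -> list[str]:
    # Hypothetical stand-in for an LLM call along the lines of
    # "Rewrite this question as independent lookup queries."
    return [
        "what is the refund window",
        "what is the return shipping policy",
    ]

def retrieve(query: str) -> list[str]:
    # Stand-in for a vector-store similarity search.
    corpus = {
        "refund": "Refunds are issued within 14 days of a return.",
        "shipping": "Return shipping is free for orders over 50 dollars.",
    }
    return [doc for key, doc in corpus.items() if key in query]

def multi_hop_context(question: str) -> list[str]:
    # Retrieve per sub-query, then merge with de-duplication so the
    # final prompt contains evidence from every document involved.
    seen, merged = set(), []
    for sub in decompose(question):
        for chunk in retrieve(sub):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

context = multi_hop_context(
    "Can I return an order and get refunded, and who pays shipping?")
```

Vanilla single-shot retrieval over the original question would likely surface only one of the two documents; decomposing first gives the model both pieces of evidence to reason across.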

How Thoughtwave approaches this

Our TWSS CS Agent, TWSS Finance AI/ML, and our self-hosted TWSS Commercial Credit AI all use RAG as a foundational layer, adapted to the client's data-residency and audit requirements. For deeper context, see our AI & Generative AI service and the accelerators portfolio.

Frequently asked questions

When should I use RAG versus fine-tuning?
RAG when your data changes often, when you need citations, or when you need different users to see different subsets of content. Fine-tuning when you need to teach the model a new skill or style that retrieval cannot express. In practice, most enterprise deployments start with RAG and add fine-tuning only when a specific task cannot be handled by retrieval alone.
What is a vector database and do I need one?
A vector database stores embeddings (numerical representations of text or other content) and supports fast similarity search. Examples include Pinecone, Weaviate, Qdrant, and Postgres with pgvector. You need one for any non-trivial RAG deployment. The right choice depends on scale, hosting posture (managed vs self-hosted), and whether you are co-locating with an existing database.
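As a concrete illustration of the pgvector option, a similarity search is a single SQL query. The table and column names below are hypothetical; `<=>` is pgvector's cosine-distance operator.

```python
# Hypothetical schema: chunks(content text, embedding vector(1536)).
# The query orders rows by cosine distance to the query embedding and
# keeps the k nearest; parameters are bound by the database driver.
query = """
SELECT content
FROM chunks
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
# With a driver such as psycopg, roughly:
#   cur.execute(query, {"query_embedding": qvec, "k": 5})
```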
How do we stop RAG from hallucinating?
Three layers. Retrieval quality (better chunking, better embeddings, re-ranking) so the right content reaches the model. Prompt design that instructs the model to refuse when the retrieved context does not support the answer. Evaluation that measures grounding — whether the answer is actually supported by the cited sources — and fails builds when grounding drops.
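A miniature sketch of the second and third layers: a prompt template that instructs the model to refuse, and a deliberately crude grounding check based on term overlap. Real grounding evaluation uses an LLM judge or an NLI model; the helper names and stop-word list here are illustrative.

```python
# Layer 2: the prompt tells the model to refuse rather than guess.
GROUNDED_PROMPT = (
    "Answer using only the context below. If the context does not "
    "contain the answer, reply exactly: "
    "\"I don't know based on the provided sources.\"\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Layer 3: a toy grounding metric -- the fraction of content words in
# the answer that appear in the retrieved context.
def grounding_score(answer: str, context: str) -> float:
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}
    terms = [w for w in answer.lower().split() if w not in stop]
    if not terms:
        return 1.0
    hits = sum(1 for w in terms if w in context.lower())
    return hits / len(terms)

context = "Refunds are issued within 14 days of a return."
grounded = grounding_score("Refunds are issued within 14 days", context)
ungrounded = grounding_score("Refunds take 90 business days", context)
assert grounded > ungrounded  # an eval suite would gate on a threshold
```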
Can RAG work with source citations?
Yes. Good RAG implementations include source citations in the response, with the retrieved chunks surfaced to the user. This is especially important in regulated industries where an auditor needs to verify that a specific answer came from a specific document. Our TWSS CS Agent and TWSS Finance AI/ML both include citation surfacing as a core feature.
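Citation surfacing can be as simple as carrying a source identifier with every retrieved chunk and appending a numbered source list to the response. The document names below are hypothetical.

```python
# Each retrieved chunk keeps its provenance alongside its text.
retrieved = [
    {"text": "Refunds are issued within 14 days of a return.",
     "source": "refund-policy.pdf", "page": 3},
    {"text": "Return shipping is free for orders over 50 dollars.",
     "source": "shipping-faq.md", "page": 1},
]

def render_with_citations(answer: str, chunks: list[dict]) -> str:
    # Append a numbered source list so an auditor can trace the answer
    # back to specific documents and pages.
    cites = "\n".join(
        f"[{i + 1}] {c['source']}, p. {c['page']}"
        for i, c in enumerate(chunks)
    )
    return f"{answer}\n\nSources:\n{cites}"

out = render_with_citations("Refunds are issued within 14 days.", retrieved)
```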

Ramesh Thumu

Founder & President, Thoughtwave Software

Reviewed by Thoughtwave Editorial

Last updated April 22, 2026