The short version
- RAG grounds LLM answers in proprietary content by retrieving relevant chunks at query time.
- The core components are an embedding model, a chunking strategy, a vector store, and a retriever, all feeding context to an LLM.
- RAG beats fine-tuning when data changes often or citations matter; most enterprise deployments start here.
The longer explanation
The problem RAG solves
The base LLM does not know your company. It has not read your contracts, your compliance guidance, your historical case resolutions, or your product documentation. Out of the box, asking it "what is our policy on X" yields a plausible-sounding answer that is wrong. RAG fixes this by retrieving the actual policy at query time and giving it to the model as context.
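The query-time flow can be sketched in a few lines. This is a toy illustration: the corpus, the policy chunks, and the bag-of-words "embedding" below are stand-ins for a real embedding model and vector store, and `answer_with_rag` is a hypothetical helper, not a library API.

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy embedding: lowercase bag-of-words counts. A real system would
    # call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative corpus of policy chunks, "embedded" at index time.
corpus = [
    "Expense policy: meals over 50 USD require manager approval.",
    "Travel policy: book flights at least 14 days in advance.",
    "Security policy: rotate credentials every 90 days.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

def answer_with_rag(question: str) -> str:
    # Retrieve the most similar chunk and inject it into the prompt, so
    # the LLM answers from the actual policy instead of guessing.
    q = embed(question)
    best_chunk = max(index, key=lambda item: cosine(q, item[1]))[0]
    return (f"Context:\n{best_chunk}\n\n"
            f"Question: {question}\nAnswer using only the context above.")

prompt = answer_with_rag("What is the approval policy for meals?")
```

The prompt that reaches the LLM now contains the real expense policy, which is the entire trick: generation is unchanged, only the context differs.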
The architecture
Four components matter:
- Embedding model. Converts text into a numerical vector that captures semantic meaning. Popular choices include OpenAI's embedding models, open-source alternatives such as BGE and E5, and domain-tuned embeddings for specialized content.
- Chunking strategy. Documents are broken into chunks before embedding. The chunking policy (size, overlap, boundary detection) materially affects retrieval quality.
- Vector store. Stores embeddings and supports similarity search. Common choices: Pinecone (managed), Weaviate, Qdrant, pgvector (a Postgres extension), Milvus. Our TWSS CS Agent uses Postgres + pgvector in most deployments for ease of operation.
- Retriever and re-ranker. A retriever returns candidate chunks based on vector similarity; an optional re-ranker (often a smaller LLM or a cross-encoder) scores the candidates and picks the best for the final prompt.
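The retrieve-then-rerank pattern from the last bullet can be sketched as below. The scores are toy stand-ins: real systems use approximate-nearest-neighbor search in the vector store for stage one and a trained cross-encoder for stage two; here a unique-term-overlap score plays the cross-encoder's role, and the spammy chunk is contrived to show why the second stage helps.

```python
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real retriever uses dense embeddings.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def dot(a: Counter, b: Counter) -> int:
    return sum(a[t] * b[t] for t in a)

def retrieve(query: str, index, k: int = 3):
    # Stage 1: cheap similarity search over the full index.
    q = embed(query)
    ranked = sorted(index, key=lambda item: dot(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def rerank(query: str, candidates, top_n: int = 1):
    # Stage 2: a costlier scorer applied only to the k candidates.
    # Unique-term overlap stands in for a cross-encoder here.
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    def score(chunk: str) -> int:
        return len(q_terms & set(re.findall(r"[a-z0-9]+", chunk.lower())))
    return sorted(candidates, key=score, reverse=True)[:top_n]

corpus = [
    "Refunds are issued within 14 days of a returned item.",
    "Refunds refunds refunds are great refunds always refunds.",  # spammy chunk
    "Shipping takes 3 to 5 business days.",
]
index = [(c, embed(c)) for c in corpus]

query = "When are refunds issued for returned items?"
candidates = retrieve(query, index, k=2)
best = rerank(query, candidates)[0]
```

Stage one over-rewards the repetitive chunk because raw similarity counts term frequency; the re-ranker's different scoring surfaces the genuinely relevant chunk. That division of labor, broad cheap recall followed by narrow expensive precision, is the point of the two-stage design.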
RAG versus fine-tuning
RAG wins when:
- The source content changes often.
- Different users should see different subsets (row-level security).
- Citations and source attribution are required.
- The base model already has the skill; it just needs the facts.
Fine-tuning wins when:
- The model needs a new capability that cannot be expressed as retrieval.
- The task has a stylistic or structural pattern the model struggles to produce reliably.
- Latency or cost at inference is a priority and you can bake the behavior into the weights.
Most enterprise deployments are RAG-first. Fine-tuning, when it happens, is layered on top after the retrieval-grounded version is in production.
Common failure modes
- Poor chunking. Chunks that split sentences or paragraphs mid-thought produce incoherent retrieval. Chunking strategy is the single biggest lever after embedding choice.
- Stale index. The vector store is not updated when the source content changes; the system answers from yesterday's policy.
- Weak retrieval on multi-hop questions. The question requires synthesizing two separate documents; vanilla RAG retrieves each but does not reason across them. Solutions include query rewriting and multi-step retrieval.
- Evaluation blind spots. Without ongoing groundedness evaluation in production, hallucination creeps in silently as the corpus or the query distribution shifts.
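The first failure mode above, poor chunking, is concrete enough to sketch. A minimal mitigation is to split on sentence boundaries and pack sentences into size-bounded chunks with a sentence of overlap, so no chunk cuts a thought mid-sentence. The function, its parameters, and the regex boundary detection are illustrative defaults, not a production chunker.

```python
import re

def chunk_text(text: str, max_chars: int = 200, overlap_sents: int = 1):
    # Split on sentence boundaries, then pack sentences into chunks of at
    # most max_chars, carrying overlap_sents trailing sentences into the
    # next chunk for context continuity. Parameters are illustrative.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) + 1 for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sents:]  # keep overlap sentences
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = ("Alpha one two three four. Beta five six seven eight. "
        "Gamma nine ten eleven twelve. Delta thirteen fourteen fifteen.")
chunks = chunk_text(text, max_chars=60, overlap_sents=1)
```

Every chunk ends at a sentence boundary and each chunk repeats the previous chunk's last sentence, which is what keeps retrieval coherent when a relevant fact straddles a chunk boundary.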
How Thoughtwave approaches this
Our TWSS CS Agent, TWSS Finance AI/ML, and our self-hosted TWSS Commercial Credit AI all use RAG as a foundational layer, adapted to the client's data-residency and audit requirements. For deeper context, see our AI & Generative AI service and the accelerators portfolio.