Retrieval-augmented generation (RAG) addresses a core limitation of large language models: they only know what was in their training data. A RAG system retrieves relevant passages from external sources (support docs, knowledge bases, regulations, product specs) and passes them to the model alongside the user's question, so answers are grounded in current, citable material.
The typical pipeline has four stages: ingest (documents are chunked, embedded, and stored in a vector database), retrieve (the query is embedded and matched against the index), re-rank (the top candidates are reordered by relevance to the query), and generate (the model answers from the retrieved context, often with citations). Chunking granularity and embedding model choice are fixed at ingest and limit accuracy at every later stage: a passage that is split badly or embedded poorly cannot be retrieved, re-ranked, or cited.
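The four stages can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the bag-of-words `embed`, the in-memory list standing in for a vector database, the term-coverage `rerank`, and every function name here are hypothetical stand-ins. A real system would likely use an embedding model, a vector store, and a trained re-ranker, and would send the final prompt to a language model.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split text into fixed-size word windows (a stand-in for real chunking)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents, max_words=40):
    """Ingest: chunk each document and store chunk text with its vector."""
    index = []
    for doc_id, text in documents.items():
        for n, piece in enumerate(chunk(text, max_words)):
            index.append({"id": f"{doc_id}#{n}", "text": piece, "vec": embed(piece)})
    return index


def retrieve(index, query, k=3):
    """Retrieve: embed the query and return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return ranked[:k]


def rerank(candidates, query):
    """Re-rank: reorder by the share of distinct query terms each chunk contains.

    Placeholder for a stronger relevance model such as a cross-encoder.
    """
    query_terms = set(embed(query))

    def coverage(entry):
        if not query_terms:
            return 0.0
        return len(query_terms & set(entry["vec"])) / len(query_terms)

    return sorted(candidates, key=coverage, reverse=True)


def build_prompt(passages, query):
    """Generate (input side): assemble the context the model would answer from."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


docs = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}
query = "How long do refunds take?"
index = ingest(docs)
passages = rerank(retrieve(index, query, k=2), query)
print(build_prompt(passages, query))
```

Each chunk keeps an id so the generated answer can cite its source. The `max_words` setting is the chunking granularity knob: smaller windows make matches more specific but can cut a fact off from its surrounding context.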
RAG fits when data changes faster than you can retrain, when citations and traceability matter (legal, healthcare, finance), and when a bounded document set covers most queries. Tasks that need the model to reason over data rather than recall it are usually better served by fine-tuning or an agent with tool use. In production, evaluation is the make-or-break work: measure retrieval precision and answer faithfulness separately, on representative queries, so that a wrong answer can be traced to either a retrieval miss or a generation error.
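Keeping the two measurements separate might look like the sketch below. It assumes a small hand-labeled test set in which each case lists the retrieved chunk ids, the ids a reviewer marked relevant, the retrieved passage text, and the generated answer; those field names are hypothetical. `precision_at_k` is the standard definition, while `token_support` is only a crude lexical proxy for faithfulness, used here to keep the example self-contained. In practice faithfulness is more often judged by an entailment model, an LLM grader, or human review.

```python
import re


def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / k


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_support(answer, passages):
    """Crude faithfulness proxy (assumption): share of distinct answer tokens
    that also appear somewhere in the retrieved passages."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(_tokens(p) for p in passages))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


def evaluate(cases, k=3):
    """Average each metric on its own so retrieval and generation failures
    show up in different numbers."""
    if not cases:
        return {"precision_at_k": 0.0, "token_support": 0.0}
    n = len(cases)
    return {
        "precision_at_k": sum(
            precision_at_k(c["retrieved_ids"], c["relevant_ids"], k) for c in cases
        ) / n,
        "token_support": sum(
            token_support(c["answer"], c["passages"]) for c in cases
        ) / n,
    }


cases = [
    {
        "retrieved_ids": ["refunds#0", "shipping#0"],
        "relevant_ids": {"refunds#0"},
        "passages": ["Refunds are issued within 14 days of a return request."],
        "answer": "Refunds take 14 days",
    }
]
print(evaluate(cases, k=2))
```

A low `precision_at_k` with high support suggests the retriever is surfacing the wrong chunks; the reverse suggests the model is drifting away from the context it was given.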