Large language models are neural networks trained on enormous text corpora — books, articles, code, and web content. They learn statistical patterns of language and predict the next token in a sequence, which lets them generate coherent text, answer questions, summarize documents, and follow instructions when prompted or fine-tuned correctly.

Modern LLMs are built on the transformer architecture introduced in the 2017 paper "Attention Is All You Need." Training produces a fixed set of weights: billions of numbers that encode everything the model knows. At inference time, you send the model tokens and it returns more tokens, one at a time, each chosen from a probability distribution over the entire vocabulary.
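
To make the "probability distribution over the vocabulary" concrete, here is a minimal sketch of a single decoding step. The four-word vocabulary and the scores are made up for illustration; in a real model the scores (logits) come out of the transformer's forward pass, and the `temperature` parameter is one common sampling knob rather than something the text above prescribes.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Pick one token from the vocabulary according to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and scores, invented for this example.
vocab = ["cat", "dog", "car", "the"]
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
token = sample_next_token(vocab, logits, temperature=0.7, rng=random.Random(0))
```

Generation repeats this step: the chosen token is appended to the input and the model scores the vocabulary again. Lower temperatures concentrate probability on the top-scoring token, and higher ones spread it out.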

Models like Claude, GPT, Gemini, and Llama vary along three axes that matter for production: capability, latency, and cost per token. Smaller open-weight models run cheaply on your own GPUs but plateau on complex reasoning. Frontier closed models reason well across long contexts but charge per token and impose rate limits. The right choice depends on whether you need general-purpose intelligence, fast responses, or tight cost control, and many production systems route between several models depending on the task.
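
Routing between models can be sketched as a small selection function over those three axes. Everything in the catalog below is an assumption for illustration: the model names are placeholders, and the capability ratings, latencies, and prices are invented numbers, not measurements of any real model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelOption:
    name: str                  # placeholder label, not a real model ID
    capability: int            # assumed rating, 1 (basic) to 5 (strongest)
    latency_ms: int            # assumed typical response latency
    cost_per_1k_tokens: float  # assumed price in USD

# Hypothetical catalog; replace with figures you have measured yourself.
CATALOG = [
    ModelOption("small-open-weights", 2, 150, 0.0002),
    ModelOption("mid-tier", 3, 400, 0.002),
    ModelOption("frontier", 5, 1200, 0.015),
]

def route(min_capability, max_latency_ms, catalog=CATALOG):
    """Return the cheapest model that meets the capability and latency needs."""
    eligible = [
        m for m in catalog
        if m.capability >= min_capability and m.latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise ValueError("no model satisfies the requested constraints")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)

# A quick classification task tolerates a weaker model but needs speed.
fast_choice = route(min_capability=1, max_latency_ms=200)
# A long reasoning task needs capability and can wait.
deep_choice = route(min_capability=4, max_latency_ms=2000)
```

Picking the cheapest eligible model is only one possible policy. A system that cares more about quality than cost could rank by capability instead.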

LLMs are the foundation of most current AI applications, but they are not a complete system. They have no persistent memory between calls, no native access to your data, and no way to take actions in the world. To make them useful in production, teams pair them with retrieval-augmented generation (RAG) for fresh facts, with agents for multi-step workflows, and with fine-tuning when consistent style or domain behavior matters more than general capability.
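
The retrieval half of RAG can be illustrated with a toy sketch: find the most relevant stored text and place it in the prompt so the model sees facts it was never trained on. This version ranks documents by simple word overlap, which is a stand-in for the embedding-based search most real systems use. The documents and the prompt wording are invented for the example, and no model is actually called.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    scored = [(len(q_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(question, documents, k=2):
    """Assemble a prompt that carries the retrieved context with the question."""
    context = retrieve(question, documents, k)
    context_block = "\n".join(f"- {doc}" for doc in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {question}"
    )

# Hypothetical knowledge base, invented for illustration.
DOCS = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
]

prompt = build_prompt("How long is the refund window?", DOCS, k=1)
```

The resulting prompt string is what would be sent to the model. Because the context travels with every call, it also works around the lack of memory between calls.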

When evaluating an LLM for a specific use case, look past the marketing benchmarks. What matters is how the model behaves on your prompts and your data, and whether its hallucination rate fits your tolerance, measured with an evaluation harness you actually run, not vibes.
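
An evaluation harness does not have to be elaborate. The sketch below assumes each test case is a prompt paired with a check function, and it uses a hypothetical `fake_model` with canned answers as a stand-in for a real model call. The cases themselves are invented examples, not a recommended test set.

```python
def run_eval(model_fn, cases):
    """Run every case through model_fn and report the pass rate and failures."""
    if not cases:
        raise ValueError("at least one test case is required")
    failures = []
    for prompt, check in cases:
        output = model_fn(prompt)
        if not check(output):
            failures.append((prompt, output))
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}

# Stand-in for a real model call; swap in your own client code here.
def fake_model(prompt):
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris is the capital.",
    }
    return canned.get(prompt, "I am not sure.")

# Hypothetical cases: each pairs a prompt with a pass/fail check on the output.
CASES = [
    ("What is 2 + 2?", lambda out: out.strip() == "4"),
    ("Capital of France?", lambda out: "paris" in out.lower()),
    ("Who wrote Hamlet?", lambda out: "shakespeare" in out.lower()),
]

report = run_eval(fake_model, CASES)
```

Running the same cases against each candidate model gives a like-for-like comparison, and the `failures` list shows exactly which prompts need attention.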