A large language model (LLM) is a neural network trained to predict the next token—often a subword unit—in a sequence, given the preceding tokens. Repeat that objective across billions of tokens scraped from books, code, forums, and licensed corpora, and you obtain a system that can generate fluent prose, summarize documents, translate between languages, and even mimic dialogue styles. The surface behavior can feel startlingly “human,” but the training target was never “be insightful” or “be honest.” It was: assign high probability to continuations that resemble the training distribution. That distinction matters for every downstream use.
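The training target can be made concrete with a toy sketch. The bigram "corpus" below is invented for illustration, and counting stands in for a neural network: the point is only that the objective is to assign high probability to continuations that resemble the training distribution.

```python
from collections import Counter, defaultdict

# Toy stand-in for the training objective: estimate P(next token | previous token)
# by counting bigrams in a tiny invented "corpus". A real LLM learns these
# probabilities with a neural network over long contexts, but the target has
# the same shape: high probability for continuations seen in training.
corpus = "the cat sat on the mat the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("the"))  # "cat" is the most likely continuation
```

Nothing here is "insightful" or "honest"; the model simply reproduces the statistics of what it saw.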

If you prefer a hub-style overview before this linear explainer, open the Language & multimodal AI topic page; it links back here and to safety and deployment themes in one pass.

Autoregression: one step at a time

At generation time, models typically sample or search through the space of possible next tokens, append the choice, feed the extended context back in, and repeat. That feedback loop means errors can compound: an early plausible-but-wrong choice can steer the rest of the answer into confident nonsense. This is one structural reason LLMs can “hallucinate” facts even when individual layers implement sophisticated attention patterns over context.
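The sample-append-repeat loop can be sketched in a few lines. The `model` here is a hand-written lookup table invented for illustration (it conditions only on the last token; a real transformer conditions on the whole context):

```python
import random

# Minimal autoregressive loop. `model` is a hypothetical lookup table
# standing in for a neural network's next-token distribution.
model = {
    "<s>":  {"the": 1.0},
    "the":  {"cat": 0.6, "mat": 0.4},
    "cat":  {"sat": 0.7, "ate": 0.3},
    "mat":  {"</s>": 1.0},
    "sat":  {"</s>": 1.0},
    "ate":  {"</s>": 1.0},
}

def generate(max_steps=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_steps):
        dist = model[tokens[-1]]                        # condition on context
        choices, weights = zip(*dist.items())
        tok = rng.choices(choices, weights=weights)[0]  # sample one token
        tokens.append(tok)                              # append, then repeat
        if tok == "</s>":
            break
    return tokens

print(generate())
```

Note that each sampled token becomes part of the input for the next step, which is exactly why one bad early choice can steer everything after it.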

Tokenization: the hidden interface

Models do not see “words” the way dictionaries do. Text is segmented into subword tokens via algorithms such as Byte-Pair Encoding or SentencePiece. Rare words become several tokens; typos can explode the token count; some languages tokenize less efficiently than others in a given vocabulary. That matters for cost (you pay per token), for fairness (uneven tokenization can skew performance across languages), and for attacks (unusual Unicode can tokenize in ways that slip past string-level defenses). When you evaluate a model, note the tokenizer and vocabulary size—they are part of the system, not peripheral plumbing.

Decoding: greedy search is not the whole story

At each step, you could take the single highest-probability token (greedy decoding), but that often produces repetitive or brittle text. Sampling with temperature flattens or sharpens the distribution: high temperature increases diversity at the cost of coherence; low temperature approaches greedy behavior. Top-k and nucleus (top-p) sampling truncate the tail of unlikely tokens to reduce degenerate loops. These knobs change fluency and factuality profiles without changing weights at all—they are part of the inference-time policy that product teams must own.
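Both knobs fit in a short function. The logit values below are invented for illustration; the mechanics of temperature scaling and nucleus truncation are as described above.

```python
import math
import random

# Temperature plus nucleus (top-p) sampling over a hypothetical logit vector.
def sample(logits, temperature=1.0, top_p=1.0, seed=0):
    rng = random.Random(seed)
    # Temperature rescales logits before the softmax: <1 sharpens toward
    # greedy behavior, >1 flattens toward uniform diversity.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {tok: math.exp(v) / z for tok, v in scaled.items()}
    # Nucleus truncation: keep the smallest high-probability set whose
    # cumulative mass reaches top_p, discarding the unlikely tail.
    kept, mass = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    toks, weights = zip(*kept)
    return rng.choices(toks, weights=weights)[0]

logits = {"Paris": 4.0, "Lyon": 2.0, "banana": -3.0}
print(sample(logits, temperature=0.7, top_p=0.9))
```

Note that the weights never change here: the same model yields different fluency and factuality profiles purely from these inference-time settings.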

Context windows and the illusion of memory

Transformers attend within a finite context window—thousands to millions of tokens in frontier systems, depending on architecture and engineering. Anything outside that window is inaccessible unless re-injected explicitly (for example, via retrieval or tool calls). So when a chatbot “remembers” earlier turns in a session, that is session state managed by the application, not a magical long-term autobiographical memory inside the weights.
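The application-side bookkeeping behind that “memory” is easy to illustrate. The window size and whitespace token count below are deliberately crude stand-ins (real systems budget with the actual tokenizer), but the shape is the same: keep what fits, drop or re-inject the rest.

```python
# Application-side "memory": only the most recent messages that fit the
# finite context window are re-sent to the model each turn.
CONTEXT_WINDOW = 8  # hypothetical budget, in tokens

def fit_context(history, window=CONTEXT_WINDOW):
    """Keep the most recent messages whose total token count fits the window."""
    kept, used = [], 0
    for msg in reversed(history):   # walk newest to oldest
        cost = len(msg.split())     # crude token count, for illustration only
        if used + cost > window:
            break                   # older turns fall out of the window
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["hi there", "how can I help", "what is a context window",
           "a finite span of tokens"]
print(fit_context(history))
```

Everything the function drops is invisible to the model unless the application retrieves and re-injects it.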

Emergence without intention

Scale changes behavior. Small models struggle with arithmetic or code; larger ones show surprisingly strong performance on tasks that do not appear as explicit labels in training data. Researchers debate how to interpret “emergence”: some argue it reflects measurement thresholds; others point to genuine qualitative shifts in internal computation. Either way, emergence does not imply intent. Capabilities can arise as byproducts of compressing statistical regularities in text.

Fine-tuning, instructions, and preference optimization

Base models trained only on next-token prediction are awkward assistants: they continue text rather than follow instructions. Supervised fine-tuning on demonstration data teaches format and task shape; preference optimization (often involving human or model-based rankings) nudges outputs toward helpfulness and away from disallowed content within the limits of the reward proxy. Each stage introduces new objectives and new ways to game them—which is why alignment discussions on this site return to specification and oversight. For a dedicated treatment, read Alignment: Goals, Feedback, and Failure Modes after this piece.
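The preference-optimization stage is often built on a pairwise comparison loss. The sketch below is a toy Bradley-Terry-style objective of the kind used to train reward models (the scalar scores are hypothetical; real pipelines compute them from model outputs):

```python
import math

# Toy pairwise preference loss: -log sigmoid(score_chosen - score_rejected).
# The loss is small when the preferred response scores higher, large when
# the model prefers the rejected one. This is a proxy, and proxies can be
# gamed, which is the root of many alignment failure modes.
def preference_loss(score_chosen, score_rejected):
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # preferred answer scored higher: low loss
print(preference_loss(0.0, 2.0))  # preferred answer scored lower: high loss
```

Optimizing this proxy only rewards whatever the rankings happened to capture, which is why the reward can diverge from what humans actually wanted.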

Fluency versus reliability

Human readers often conflate articulate writing with correctness. LLMs optimize for plausibility conditioned on prompt and prior tokens, not for truth in the world. Retrieval-augmented generation, citation-aware training, and external tools can reduce—but not eliminate—failure modes. For high-stakes domains (medicine, law, security), treat outputs as drafts that require verification, not as authorities.
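Retrieval-augmented generation, at its simplest, means fetching a relevant document and prepending it to the prompt. The sketch below scores documents by word overlap purely for illustration; production systems use embedding similarity, and the documents here are invented.

```python
# Minimal retrieval-augmented generation sketch: pick the document that
# overlaps most with the query, then ground the prompt in it. Real systems
# use embedding similarity rather than raw word overlap.
def retrieve(query, docs):
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def build_rag_prompt(query, docs):
    context = retrieve(query, docs)
    return f"Context: {context}\nQuestion: {query}\nAnswer:"

docs = ["The capital of France is Paris.",
        "Transformers attend within a context window."]
print(build_rag_prompt("What is the capital of France?", docs))
```

Grounding narrows the space of plausible continuations toward the supplied evidence, but the model can still paraphrase it wrongly—hence “reduce, not eliminate.”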

Responsible framing

Anthropomorphic interfaces (“the model thinks…”) can help novices build intuition, but they also encourage over-trust. Prefer precise language: the system sampled a continuation from a distribution learned by minimizing a prediction loss, under a given prompt and decoding strategy. That may sound clinical, but it keeps the accountability chain visible: data, design, deployment policy, and human oversight—not a synthetic soul.

Continue with Alignment: Goals, Feedback, and Failure Modes to see how training objectives connect to societal risk, and keep Safety & alignment hub open if you are building a reading list for policy colleagues.