“LLMs, how do they work?”
It may seem like a strange question to ask. After all, large language models (LLMs) have become so ubiquitous so quickly that it’s hard to find someone who isn’t interacting with one regularly. They form the cornerstone of modern AI, powering everything from consumer-facing chatbots to advanced analysis tools. They can write poetry, answer complex questions, and even build software.
But how do they work?
I believe it’s critical to develop a strong intuition for how LLMs operate in order to work with them effectively. Unfortunately, most people are quickly deterred by all the complex math that usually accompanies any such explanation. However, just as you don’t need to understand exactly how a car’s engine works to be a skilled driver, or know the details of Google’s algorithm to craft an effective query, you also don’t need to grok transformer models in order to be productive with ChatGPT. What you do need is an understanding of how the system behaves as a whole.
It’s not as complicated as you might think. See if you can complete this sentence: The cat sat on the _____.
Did you think of “mat,” “windowsill,” or maybe “keyboard”? Believe it or not, this post is mostly about designing a system that can do the same. By the end, I hope you’ll see how a simple idea like word prediction can scale up to create AIs capable of engaging in complex conversations, answering questions, and even writing code.
And we’ll only use as much math as you’d be comfortable discussing at a dinner party.[1]
This post is based on my talk, “How to Succeed in AI (Without Really Crying).”
Probability
To understand how LLMs work, we need to start with probability.
I know, you’re already bored. I love statistics, and I’m already bored. But at their core, LLMs are nothing more than fancy probability engines.[2]
Probability is a tool for quantifying randomness and uncertainty. This probabilistic nature is the source of an LLM’s power… and its unpredictability. It’s what makes novel, creative, actionable responses possible, and it’s also what makes LLMs so difficult to train and debug.
Whether you’re a layperson, practitioner, or researcher, the entire process of working with LLMs is an exercise in shaping their latent probability distributions so that they give you the outputs you want.
Therefore, there are three key concepts that, if understood intuitively, will give you a firm grasp on how LLMs work:
- Probability Distributions
- Conditional Probability Distributions
- Sampling from Probability Distributions
The statistics nerds among you may argue that we should be talking about “joint probability distributions”.
The statistics nerds among you are welcome to write their own blog posts.
Probability Distributions
Let’s dive right into probability with something that’s familiar to most people: flipping a coin.
When you flip a fair coin, there’s a 50% chance it will land on heads and a 50% chance it will land on tails. This simple scenario is nonetheless a complete example of a probability distribution. Let’s break it down:
- There are two possible outcomes: heads or tails.
- Each outcome has an equal likelihood of occurring.
- The probabilities of all possible outcomes add up to 100%.
We can visualize the distribution of outcomes like this:
This distribution tells us everything we need to know about a coin toss before it happens. Note that it doesn’t tell us what will happen on any particular flip, but rather what to expect over many flips. Another way of saying this is that on any flip, the likelihood of heads is equal to the likelihood of tails. For our purposes today, that likelihood — or relative chance of an outcome — is what we’re most interested in.
But what if there are more than two outcomes?
Consider a 6-sided die. Each side has a lower absolute probability of coming up than either side of a coin — a 16.67% chance, to be precise — but all six sides are equally likely. Therefore, from a probability perspective, a six-sided die is sort of like a scaled-up coin toss: it represents a distribution of equally likely outcomes.
We can push this even further by considering a roulette wheel, which has 38 outcomes, each one just as (un)likely as any other.
All of the distributions we’ve discussed so far are called uniform probability distributions, and if you look at their charts, you can see why: since every outcome is equally likely, the distributions are flat.
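For the code-inclined, here’s a minimal sketch of these uniform distributions in Python. The dictionary representation is just one convenient choice for illustration, not anything special:

```python
# A probability distribution is just a mapping from outcomes to likelihoods.
# These are all uniform: every outcome is equally likely, and the
# probabilities add up to 1 (i.e., 100%).
coin = {"heads": 1/2, "tails": 1/2}
die = {face: 1/6 for face in range(1, 7)}
roulette = {pocket: 1/38 for pocket in range(38)}  # simplified American wheel

for name, dist in [("coin", coin), ("die", die), ("roulette", roulette)]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9  # the outcomes cover all possibilities
    print(f"{name}: {len(dist)} outcomes, each with probability {1 / len(dist):.4f}")
```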
Uniform distributions are an easy and important way to understand the nature of probability, but we all know that LLMs aren’t just a giant roulette wheel. We need more powerful tools to understand them.
Let’s take a step towards complexity by considering probability models that aren’t uniformly distributed. One of the most familiar is the normal distribution, or bell curve. Consider the following chart of the distribution of adult human heights:
The peak of the curve represents the average height, about 5’5”. Heights close to the average are most common, which is why the curve is highest in the middle. As we move away from the average, the likelihood of seeing those heights decreases, which is why the curve tapers off and forms the “bell” shape that gives it its colloquial name.
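In code, a bell curve needs surprisingly little to describe it. Here’s a minimal sketch using Python’s statistics module; the mean and spread below are rough, illustrative guesses rather than real survey data:

```python
from statistics import NormalDist

# A rough model of adult heights, in inches: mean 65" (5'5"), spread ~3.5".
heights = NormalDist(mu=65, sigma=3.5)

# The curve peaks at the mean and tapers off symmetrically on both sides.
for h in [58, 62, 65, 68, 72]:
    print(f'{h}": relative likelihood {heights.pdf(h):.3f}')
```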
Conditional Probability Distributions
All models are wrong, but some are useful.
— George Box
So is this bell curve a “good” model? It has some nice properties, but it’s far from perfect. For example, it suggests that the most likely height for a randomly selected person to have is 5’5”. But if you met a 6-year-old child who was 5’5”, would you think they were completely average? Of course not. And so there’s clearly something wrong with our model.
Real-world probabilities often depend on additional factors. For example, if we know someone’s age, gender, or other demographic information, our assessment of their probable height could change dramatically: the distribution of heights of 6-year-old boys is markedly different from that of middle-aged women. This rich family of related probabilities is known as conditional probability distributions, meaning that they reflect additional information or knowledge beyond the base, or naive, distribution.
Here, for example, are the conditional height distributions for adult men and women:
You can see that there are two distributions, one for each piece of conditional knowledge. If we know someone’s gender, we can use the corresponding distribution to make more accurate predictions.
Similarly, here are conditional height distributions for 10-year-olds and 50-year-olds. You can imagine a whole continuum of corresponding distributions, one for every other age.
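Here’s the same idea as a minimal sketch: one distribution per piece of conditional knowledge. The parameters are rough guesses for illustration only:

```python
from statistics import NormalDist

# One height model (in inches) per condition; the numbers are illustrative.
height_given = {
    "adult man": NormalDist(mu=69, sigma=3.0),
    "adult woman": NormalDist(mu=64, sigma=2.8),
    "10-year-old": NormalDist(mu=55, sigma=2.5),
    "50-year-old": NormalDist(mu=66, sigma=3.8),
}

# Knowing more about someone changes which distribution applies.
print(height_given["adult man"].pdf(69))    # 5'9" is very typical for adult men...
print(height_given["10-year-old"].pdf(69))  # ...and wildly unusual for 10-year-olds
```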
Conditional distributions allow us not only to model outcomes based on our empirical observations of the world, but to incorporate other findings into those models in a precise way. What’s especially interesting is that the conditional factors do not have to have a causal relationship with the observed outcomes; they only need to be correlated with them.
To illustrate this important concept, consider the distribution of heights of professional basketball players. The distribution of heights, conditional on the knowledge that someone is a professional basketball player, is quite different from the distribution of heights in the general population.
The peak has shifted significantly to the right, indicating that professional basketball players are, on average, much taller than the general population. But here’s the crucial point: being tall doesn’t cause someone to become a professional basketball player, nor does being a professional basketball player cause someone to grow taller. There’s simply a strong correlation between being tall and being a professional basketball player.
This non-causal yet highly informative relationship is key to understanding how LLMs work. These models don’t understand causality in the way humans do. Instead, they excel at recognizing and leveraging correlations in data. When an LLM generates text, it’s not reasoning about cause and effect; it’s making predictions based on patterns and correlations it has observed in its training data.
For instance, if an LLM has been trained on a dataset that includes many descriptions of basketball players, it might learn to associate words like “player,” “NBA,” or “court” with a higher likelihood of words related to tallness. This doesn’t mean the model understands why basketball players tend to be tall; it just knows that these concepts frequently co-occur.
Sampling
The last thing we need to understand before we move on to language is sampling.
Let’s go back to our roulette wheel for a moment. When you play roulette, you’re using a ball to sample outcomes from the probability distribution of the wheel. Each spin is an independent event that produces an outcome based on the underlying probabilities. To sample digitally, we replace the ball with a random number generator.
Sampling is how we turn our probability distributions into actual outcomes. It’s the bridge between our model of the world (the distribution) and events in the world (specific outcomes). Given some understanding of the relative likelihoods of different outcomes, we can produce novel outcomes from the distribution that satisfy its rules and constraints.
Importantly, sampling allows us to generate outcomes that reflect the overall structure of the distribution, even if we’ve never seen that exact outcome before. For instance, if we sample heights from our earlier distribution, we might get a height of 5’11” - a specific value that may not have been in our original dataset, but one that fits the pattern we’ve modeled.
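Sampling is also only a line or two of code. Here’s a minimal sketch that draws outcomes from a weighted distribution and from our rough height model (both sets of numbers are illustrative):

```python
import random
from statistics import NormalDist

# Draw ten coin flips: each call turns the distribution into concrete outcomes.
print(random.choices(["heads", "tails"], weights=[0.5, 0.5], k=10))

# The same idea for heights: every draw is a plausible value, even if that
# exact number never appeared in any dataset we measured.
heights = NormalDist(mu=65, sigma=3.5)
print([round(h, 1) for h in heights.samples(5)])
```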
This process of sampling is crucial for LLMs. When generating text, these models don’t simply choose the most probable word every time. Instead, they sample from their probability distributions, which allows for creativity and variability in their responses. For now, it’s important to note that sampling lets you convert a distribution into an outcome, a concept we’ll explore further when we dive into how LLMs generate text.
Training
Before we dive into language models, let’s briefly touch on what it means to “train” a model for a probability distribution. For our purposes, think of training as the process of tweaking a mathematical formula to make it fit a set of observed outcomes as closely as possible.
One of the reasons the normal distribution is so useful is that its model only requires two parameters: the average (mean) height and how much heights typically vary from this average (standard deviation). With these two numbers, we can recreate that familiar bell curve.
But what about more intricate distributions, like our conditional probabilities? Well, it gets a bit more complicated, but the core idea is the same: we’re trying to create a mathematical model that can accurately represent the distribution we see in our data. For now, just know that it’s possible to build these models, even for very complex distributions, and training is the process of solving for their parameters.
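For our simple height model, “training” really is just solving for those two parameters from data. A minimal sketch, with made-up observations:

```python
from statistics import NormalDist, mean, stdev

# "Training" the height model = fitting its two parameters to observed data.
observed_heights = [61, 63, 64, 65, 65, 66, 67, 68, 70, 71]  # illustrative only

model = NormalDist(mu=mean(observed_heights), sigma=stdev(observed_heights))
print(model.mean, model.stdev)  # the fitted parameters (for an LLM, these would be its "weights")
```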
Language Models
We’ve spent considerable time building an intuition for probability distributions, conditional probabilities, and sampling. Now, let’s apply these concepts to the core of Large Language Models: modeling language itself.
Distributions of Words
Just as we modeled the distribution of heights in a population, we can model the distribution of words in a language. At first, this might seem strange - words aren’t numbers like heights, after all. But remember, probability distributions are simply about the likelihood of different outcomes, and words are just another type of outcome we can measure.
Suppose we took a large corpus of text data and made a graph of every word that appeared in it against its normalized frequency of appearance, ordered by that frequency. We’d get something like this:
This is a probability distribution of words! Just like our height distribution, it shows us the relative likelihood of different outcomes. However, there’s a crucial difference: while heights formed a continuous distribution, words are discrete entities. There’s no such thing as a word that’s halfway between ‘cat’ and ‘dog’. In this sense, our word distribution is more like our roulette wheel: each word is a distinct possibility with its own probability of occurrence.
In this distribution, you’ll notice:
- Common words like “is,” “the,” and “a” are the most likely to appear.
- Everyday nouns and verbs like “street,” “yellow,” and “climb” occupy the middle ground.
- There is a long tail of rare or specialized words like “oxidize” or “peripatetic.”
However, we can’t just sample from this distribution and generate intelligible prose. Iterated draws from this distribution are vastly more likely to generate a “sentence” like “a a the a yellow run a the catalyst a is the the street” than anything resembling Shakespeare.
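You can see this for yourself with a tiny experiment. Here’s a minimal sketch that builds a naive word distribution from a toy corpus and samples from it with no context at all; the corpus is obviously made up:

```python
import random
from collections import Counter

# Build a naive (unconditional) word distribution from a tiny "corpus".
corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)
words, weights = list(counts), list(counts.values())

# Each word is drawn independently, with no regard for context,
# which is exactly why the result reads like gibberish.
print(" ".join(random.choices(words, weights=weights, k=10)))
```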
Conditional Distributions
Remember how our height predictions improved when we considered additional factors like age or profession? The same principle applies to words, but to an even greater degree. The probability of a word appearing is heavily dependent on the words, syntax, and semantics that come before it. This is where conditional probability becomes crucial in language modeling.
Let’s go back to the simple example we started this post with:
The cat sat on the _____
Given this context, you can make a pretty good guess about what the next word could be:
- “mat” is highly probable
- “roof” is likely
- “piano” is also possible, though less common
- “myrmidon” is extremely improbable
- “the” wouldn’t even make sense grammatically
Obviously, this is a very different distribution than the “naive” or unconditional distribution of words. Producing these conditional distributions is the heart of language modeling and the core internal operation of an LLM. A properly trained model can output a distribution like this one for any provided context, or “prompt.” As the prompt evolves, so too does the model’s assessment of conditional likelihoods.
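To see the difference between the naive and conditional views concretely, here’s a minimal sketch that counts which words follow a particular phrase in a toy corpus (all sentences invented for illustration):

```python
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the cat sat on the roof",
    "the dog sat on the mat",
    "the cat slept on the sofa",
]

# Naive distribution: raw word frequencies across the whole corpus.
naive = Counter(word for line in corpus for word in line.split())

# Conditional distribution: which words follow the phrase "sat on the"?
following = Counter()
for line in corpus:
    words = line.split()
    for i in range(len(words) - 3):
        if words[i:i + 3] == ["sat", "on", "the"]:
            following[words[i + 3]] += 1

print(naive.most_common(3))     # dominated by "the"
print(following.most_common())  # [("mat", 2), ("roof", 1)]: a very different picture
```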
Generating Text
Now that we understand how individual words can be modeled as probability distributions and even account for context, how do we use this to generate coherent text? This is where sampling comes into play.
Sampling from an LLM is similar to drawing values from the simpler probability distributions we saw earlier, with a catch: we don’t want to pick just one word; we want to generate an entire sentence or paragraph! To do this, we sample iteratively from a conditional distribution of words, adding the result of each draw to the context for the next draw.[3]
The LLM nerds among you may notice I haven’t mentioned “tokens.”
In practice, modern LLMs don’t work directly with whole words, but rather with tokens that represent groups of characters, including punctuation. There’s a variety of reasons for this, including efficiency of encoding and flexibility in handling rare words and misspellings, but the core principles of building and sampling from a distribution remain the same. Anywhere I refer to “words” in this post, you can mentally substitute “tokens” if you prefer.
Here is a simple description of the process:
- The LLM starts with an initial context (which could be empty or provided by a prompt).
- Based on this context, it calculates the conditional probability distribution for the next word.
- It samples a word from this distribution.
- It adds this word to the context and repeats the process.[4]
Let’s illustrate this by continuing our previous example with the cat. The initial context is:
The cat sat on the _____
Suppose our model samples “roof” from the distribution we proposed earlier. Now our context becomes:
The cat sat on the roof _____
The model would then calculate a new probability distribution for the next word. This distribution might favor words like “and,” “of,” or “watching.” Let’s say it chooses “of.” The updated context is:
The cat sat on the roof of _____
We repeat the process, computing a new conditional distribution for this context. This time it might heavily favor words like “the,” “a,” or “her.” Each choice influences the next, and so on.
And that’s really how LLMs work! An iterative process of sampling and updating the context is fundamentally how LLMs generate text. It’s analogous to repeatedly sampling heights from our height distribution, but with each sample influencing the distribution for the next one.
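Here’s the whole loop as a minimal sketch. The “model” is a hard-coded toy that only knows about our cat; a real LLM computes a conditional distribution over its entire vocabulary for any context:

```python
import random

def next_word_distribution(context):
    """A toy stand-in for the model: context in, conditional distribution out."""
    table = {
        "The cat sat on the": {"mat": 0.5, "roof": 0.3, "piano": 0.2},
        "The cat sat on the roof": {"and": 0.4, "of": 0.4, "watching": 0.2},
        "The cat sat on the roof of": {"the": 0.7, "a": 0.2, "her": 0.1},
    }
    return table.get(context)

def generate(context, steps=4):
    for _ in range(steps):
        dist = next_word_distribution(context)
        if dist is None:  # the toy model has run out of things to say
            break
        word = random.choices(list(dist), weights=list(dist.values()), k=1)[0]  # sample
        context = context + " " + word  # the sample becomes part of the next context
    return context

print(generate("The cat sat on the"))
```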
However, this approach introduces a significant challenge: compounding errors. Once the model makes a “mistake” or chooses an unlikely word, that choice becomes part of the context for all future words. This can cause the model to veer into increasingly improbable territory, potentially devolving into gibberish after a few words or sentences.
Early language models often struggled with this issue. As they started to drift away from highly probable word sequences, they would tip increasingly into a low-probability, high-entropy regime. In a sense, language models are self-reinforcing: the more they favor a certain style, topic, or format, the more likely they are to continue doing so. Conversely, the more they veer into nonsense, the more likely nonsense becomes.
This self-reinforcing nature has interesting implications. For instance, once a model outputs a specific idea or format, it can be difficult to tell it to stop doing that.[5] In fact, telling a model NOT to think of something almost always makes it output that very thing. It’s the digital equivalent of the classic “don’t think of an elephant” thought experiment.
The sampling process necessarily introduces an element of randomness, which is crucial for creativity and diversity in the outputs. If the model always chose the most probable word, its outputs would be repetitive and unnatural. The degree of randomness in sampling can be adjusted through a parameter that is often called “temperature”:
- Low temperature: The model is more likely to choose high-probability words. This results in more predictable, potentially more coherent, but possibly less creative text.
- High temperature: This introduces more randomness, allowing the model to more frequently choose lower-probability words. This can lead to more creative but potentially less coherent outputs.
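One common way temperature is applied is to divide the model’s raw scores by the temperature before converting them into probabilities. A minimal sketch, with invented scores:

```python
import math
import random

def sample_with_temperature(logits, temperature):
    """Scale raw scores by 1/temperature, softmax them, then sample."""
    scaled = {w: s / temperature for w, s in logits.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {w: math.exp(s) / total for w, s in scaled.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

logits = {"mat": 2.0, "roof": 1.0, "piano": 0.2}  # invented scores
print([sample_with_temperature(logits, 0.2) for _ in range(5)])  # low T: almost always "mat"
print([sample_with_temperature(logits, 2.0) for _ in range(5)])  # high T: much more variety
```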
Modern LLMs have become much better at maintaining coherence over longer stretches of text, thanks to advances in model architecture, training techniques, and the sheer scale of the models. However, the fundamental challenge of compounding errors remains, and it’s one of the reasons why LLMs can sometimes produce outputs that start strong but become increasingly nonsensical or off-topic as they continue.
From Chance to Chat
Chat interfaces have become the dominant way for most people to interact with LLMs, capturing the public imagination and showcasing these models’ capabilities. But how do we get from generating individual words to engaging in full-fledged conversations? The answer lies in cleverly applying the principles we’ve discussed so far.
Here’s how it works:
- When you start a chat, your initial message becomes the first piece of context.
- The model generates a response based on this context, just as we described earlier.
- For your next message, the model doesn’t just look at what you’ve just said. Instead, it considers everything that’s been said so far - your initial message, its first response, and your new message.
- This process repeats for each turn of the conversation. The context grows longer, incorporating each new message and response.
This approach allows the model to maintain consistency and context throughout a conversation. It can refer back to earlier parts of the chat, answer follow-up questions, and generally behave in a way that feels more like a coherent dialogue than isolated text generation.
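Mechanically, a chat loop can be as simple as the following sketch. The llm_generate() function is a hypothetical stand-in for the word-by-word sampling described above:

```python
conversation = []

def llm_generate(context):
    """Hypothetical stand-in: sample a response from the model, word by word."""
    return "(a sampled response would go here)"

def chat(user_message):
    conversation.append(("User", user_message))
    # The entire transcript so far becomes the context for the next response.
    context = "\n".join(f"{speaker}: {text}" for speaker, text in conversation)
    reply = llm_generate(context)
    conversation.append(("Assistant", reply))
    return reply

chat("What should I name my cat?")
chat("Something to do with roofs, please.")  # the first exchange is still in context
```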
However, this method also introduces some challenges:
- Context Length Limits: LLMs have a maximum amount of text they can process at once (often referred to as the “context window”). For very long conversations, the earliest parts might get cut off when this limit is reached.
- Computational Cost: As the conversation grows, generating each new response requires processing more and more text, which can slow down the model’s responses and increase computational costs.
- Consistency vs. Creativity: The model might become overly constrained by the conversation history, potentially leading to less diverse or creative responses over time.
Despite these challenges, this simple yet effective approach to chat is what powers the conversational AI interfaces we interact with daily. By treating the entire conversation as a growing context for probabilistic text generation, LLMs can engage in surprisingly coherent and context-aware dialogues.
The Company Words Keep
We’ve seen how LLMs generate text by iteratively sampling from probability distributions. But where do these distributions come from? How does the model know which words are likely to follow others?
Earlier, we touched briefly on training, the exercise of discovering the intricate distributions that allow an LLM to predict the next word with such nuance.
My colleague Adam (who is, incidentally, the only person still reading this post) has an excellent way of capturing the intuition behind training:
“You know a word by the company it keeps.”
This means that an LLM’s understanding of a word is entirely based on how that word appears in relation to other words. Surprisingly, at no point does the model explicitly learn a word’s definition, etymology, or any other intrinsic property.[6] The goal of training is to build a sophisticated model of these latent relationships to make accurate predictions about which words are likely to appear next.
To illustrate this principle in a simple sense, consider the word “bank.” In isolation, it could refer to a financial institution or the side of a river. During training, the model might encounter sentences like:
- “He deposited money in the bank.”
- “The river overflowed its banks after heavy rain.”
- “The bank approved her loan application.”
- “We had a picnic on the grassy bank by the stream.”
How can an LLM learn to distinguish between these meanings? Well, pretty much the same way you do.
Over billions of examples, the model builds a nuanced understanding of how the word “bank” relates to other words. It learns that when “bank” appears near words like “money,” “deposit,” or “loan,” it’s likely referring to a financial institution. When it’s near words like “river,” “stream,” or “grassy,” it’s more likely referring to a riverside. This understanding is encoded in the parameters of the model’s implicit probability distribution, and those parameters are often referred to as “weights.”
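Here’s a crude, minimal sketch of the “company it keeps” idea: counting which words show up near “bank” in a handful of made-up sentences. Real training is vastly more sophisticated, but the spirit is similar:

```python
from collections import Counter

sentences = [
    "he deposited money in the bank",
    "the river overflowed its banks after heavy rain",
    "the bank approved her loan application",
    "we had a picnic on the grassy bank by the stream",
]

# Count the "company" that the word "bank" keeps.
company = Counter()
for sentence in sentences:
    words = sentence.split()
    if any(word.startswith("bank") for word in words):
        company.update(word for word in words if not word.startswith("bank"))

# Neighbors like "money", "loan", "river", and "stream" (plus filler words)
# are exactly the signal a model uses to tell the two senses of "bank" apart.
print(company.most_common())
```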
This “company it keeps” principle is crucial. The model doesn’t have explicit definitions or rules about what words mean. Instead, it builds a rich, multidimensional model of how words relate to each other in various contexts.
The actual mathematics of how training works is beyond the scope of this post (and, probably, most dinners you’ll attend). But conceptually, you can think of it as the model adjusting its internal parameters to better predict the next word in a sequence, given all the words that came before it. It does this over and over, for billions of examples, gradually refining its ability to capture the patterns and relationships in language.
What emerges from this process is not a set of rules or definitions, but a vast, interconnected web of probabilities. Given any sequence of words, the model can use this web to calculate the probability distribution of what might come next. This is why models require extraordinary amounts of data, compute, and time to train - they’re building an incredibly complex probabilistic model of language itself.
Understanding training in this way helps explain some of the quirks and limitations of LLMs:
- Correlation, not causation: LLMs excel at recognizing patterns and correlations in language, but they don’t understand causality. This is why they can sometimes produce outputs that seem logical but are factually incorrect.
- Bias in, bias out: If the training data contains biases or inaccuracies, these will be reflected in the model’s outputs. The model doesn’t have a way to fact-check its training data.
- Hallucination: When asked about topics it hasn’t seen much of in its training data, an LLM might generate plausible-sounding but incorrect information. This is because it’s trying to produce probable sequences of words based on limited relevant context.
- Difficulty with explicit rules: Because LLMs learn implicitly from patterns rather than explicit rules, they can sometimes struggle with tasks that require strict adherence to specific formats or guidelines.
By understanding LLMs as probability engines trained on vast amounts of text data, we can better appreciate both their capabilities and their limitations. This perspective is crucial for using them effectively and responsibly in real-world applications.
Thinking with Probabilities
Now that we understand LLMs as fancy probability engines, let’s explore how this perspective can help us use them more effectively. A lot of common LLM techniques are really just clever ways of nudging these probability distributions. Here are a few examples of ideas and techniques you may have heard of, and how they are actually all just playing with probability:
Talk like a pirate: It’s the classic “hello world” of proving your LLM works: getting it to talk like a pirate. By now you should realize that the model doesn’t have a separate “pirate mode” - it’s just shifting its word probabilities to favor “Arrr” and “matey” over more standard English.
Prompt engineering: In general, prompt engineering is all about putting the model in a better “probability regime.” When we craft a good prompt, we’re not just asking a clear question - we’re subtly shaping the likelihood of different kinds of responses. This is why prompts that work well for LLMs might look different from how we’d phrase things to a human or even to a search engine.
Chain-of-thought: One of the most powerful techniques in using LLMs is as simple as asking the model to “think step by step.” But why does this work? Remember, our LLMs are making probabilistic leaps from input to output. Sometimes, the leap from question to answer is just too big - the correct answer might be logical, but not probable given the input. By asking for step-by-step reasoning, we’re allowing the model to make a series of smaller, more probable jumps. Each step flows more naturally from the last, leading to a better final answer.
Fine-tuning: Sometimes, we want to push our models even further in a particular direction. That’s where fine-tuning comes in. Fine-tuning is like giving the model a specialized crash course. We start with a model that has broad knowledge (it’s seen tons of text on all sorts of topics), and then we show it a bunch of examples in our area of interest. This nudges the model’s entire probability distribution, making it more likely to use certain words or concepts by default.
RAG (Retrieval-Augmented Generation): This powerful technique has a simple but effective idea: before the model generates a response, we fetch some relevant information and add it to the input. This biases the model’s output probabilities towards using this specific, relevant information. It’s a bit like giving the model a cheat sheet for the particular question you’re asking.
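As a minimal sketch, RAG can be as simple as string concatenation; retrieve() and llm_generate() below are hypothetical placeholders for a real search step and a real model call:

```python
def retrieve(question):
    """Hypothetical retrieval step: look up text relevant to the question."""
    return "Company handbook: refunds are processed within 14 business days."

def llm_generate(prompt):
    """Hypothetical model call: sample a response conditioned on the prompt."""
    return "(a sampled response would go here)"

question = "How long do refunds take?"
prompt = (
    "Use the following context to answer the question.\n\n"
    f"Context: {retrieve(question)}\n\n"
    f"Question: {question}"
)
print(llm_generate(prompt))  # the retrieved text now biases the output probabilities
```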
Translation: Using statistics and correlative probabilities to model the relationship between words in different languages is not new; in fact, about a decade ago it provided a revolutionary step forward in high-quality machine translation. As considerably more powerful general-purpose models, LLMs inherit this ability to model and predict words across languages. You now know enough to think of this probabilistically: given a sentence and an instruction to translate it, a properly-trained model should determine that the most probable outcome is the translation.
Tipping your LLM: Consider the trick of saying “I’ll tip you $20” to an AI assistant. It doesn’t work because the model actually expects payment. Rather, the promise of a reward puts the model into a state where it’s more likely to produce high-effort, high-quality responses: it has learned that contexts involving rewards often come with expectations of better performance.
ReAct agents: This idea of guiding the model’s reasoning process is also behind more complex systems like ReAct agents. These are setups where we give the model a specific format to follow, usually involving steps like “Think, Act, Observe.” By being precise about what we expect, we make it more likely for the model to use tools effectively or to check its own work.
Code generation: When it comes to generating specific types of content, like code, we can push this idea of biasing probabilities even further. When we tell a model to “write Python code,” we’re not activating some separate coding module. Instead, we’re shifting the model into a state where it’s much more likely to produce text that looks like Python - lots of indentation, specific keywords, that sort of thing.
Structured output generation: For highly structured outputs like JSON, some systems even artificially limit which tokens (chunks of text) the model is allowed to produce. This ensures the output follows the correct format, essentially forcing the model to color within the lines we’ve drawn.
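A toy version of that token-limiting idea: zero out the probability of any token that would break the required format, renormalize, and sample from what’s left. The probabilities and the validity check below are invented for illustration:

```python
import random

def sample_constrained(probs, is_valid):
    """Keep only format-legal tokens, renormalize, then sample."""
    allowed = {token: p for token, p in probs.items() if is_valid(token)}
    total = sum(allowed.values())
    weights = [p / total for p in allowed.values()]
    return random.choices(list(allowed), weights=weights, k=1)[0]

# Suppose we need the output to start like JSON.
next_token_probs = {'{"name":': 0.4, "Sure, here": 0.5, "[": 0.1}
print(sample_constrained(next_token_probs, lambda t: t.startswith(("{", "[", '"'))))
```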
Recitation: When the public first became aware of LLMs, there was a sustained and false belief that the models somehow maintained a copy of the entire internet, which was remixed or regurgitated on demand. Perhaps this was easier for some people to believe than models being capable of synthesizing novel outputs. The most common evidence for this belief was that models could perfectly recite known documents, like the first three paragraphs of Alice in Wonderland. By now, I hope you appreciate that for a sufficiently trained model, this is neither surprising nor particularly impressive. After all, the most probable response to “What are the first three paragraphs of Alice in Wonderland?” is, of course, the first three paragraphs of Alice in Wonderland.
All of these techniques, from simple prompt tweaks to complex system designs, are really just ways of playing with probabilities. We’re constantly asking ourselves: how can we make the output we want more likely? How can we guide the model towards better reasoning, more accurate information, or more useful formats?
Coda
We’ve journeyed from coin flips to complex language models, all through the lens of probability. I hope that you have developed a solid intuition for how LLMs actually work:
- They’re built on sophisticated probability distributions of language.
- They generate text by iteratively sampling from these distributions.
- Their “knowledge” is really just a vast web of word relationships and correlations.
Understanding LLMs as probability engines rather than knowledge databases is crucial for using them effectively and responsibly. It helps us set realistic expectations, interpret their outputs appropriately, and design better ways of leveraging their capabilities.
As we continue to develop and refine these models, keeping this probabilistic perspective in mind will be key. It reminds us that while LLMs are incredibly powerful tools that can revolutionize how we interact with information and solve problems, they’re fundamentally playing a very advanced game of “what word comes next?”
They’re not magic, they’re not sentient, and they’re definitely not going to rise up and kill us all.
Probably.
Footnotes
1. Granted, if you regularly discuss math at your dinner parties, this post might not be for you.
2. with really fancy marketing.
3. The complexity of producing a new probability distribution for every word is why LLM inference is expensive and time-consuming.
4. At some point it decides to stop, but the details of that are way beyond what we’re covering here.
5. My kingdom for a way to prevent LLMs from resorting to bullet points all the time.
6. It’s quite likely that a dictionary would be included in a model’s training data. However, it would not receive any special attention or processing, though of course the close proximity of a word and its dictionary definition would result in a much stronger relationship between the two.