Modern large language models (like ChatGPT) learn to generate new samples by modeling the data distribution of natural text. However, the underlying methodology has remained largely stagnant over the last century: although different architectures have been developed, models are all based on autoregressive modeling (i.e. next token prediction). In this blog post, I will talk about our work on Score Entropy Discrete Diffusion models, an alternative probabilistic modeling technique that achieves highly competitive performance (at the scale of GPT-2) while introducing distinct algorithmic benefits. Our empirical results challenge the longstanding dominance of autoregressive modeling and can potentially pave the way for an alternative class of language models built from radically different principles.
At the core of modern generative AI systems lies the field of probabilistic modeling. The idea behind probabilistic modeling is surprisingly straightforward and consists of three steps:
1. Collect a dataset of samples from the real world.
2. Learn a parameterized density $p_\theta$ (e.g. a neural network) that approximates the distribution the data was drawn from.
3. Use $p_\theta$ for downstream tasks, most importantly generating new samples.
For a data domain such as natural language (which this blog post focuses on), things seem like they should be easy. In particular, our data is discrete, meaning that the support of our distribution is a finite set; in principle, one could simply assign (and store) a probability for every element of that set.
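To make this concrete, here is a toy sketch (purely illustrative, not from our work): when the support is tiny, "learning the density" amounts to nothing more than counting and normalizing.

```python
from collections import Counter

# Toy dataset over a tiny discrete support: tabulating the density is trivial.
data = ["cat", "dog", "cat", "bird", "cat", "dog"]
counts = Counter(data)
probs = {token: count / len(data) for token, count in counts.items()}
print(probs)  # {'cat': 0.5, 'dog': 0.333..., 'bird': 0.166...}
```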
Unfortunately, real world datasets are much more complex than this. For something like natural language, we are working with something resembling the Library of Babel: every possible combination of characters must be accounted for, so the support is astronomically large. This makes things quite a bit more challenging.
To overcome this challenge, we use a parameterized model (like a neural network) to map values to probabilities, rather than storing each probability explicitly. Since neural networks output arbitrary real numbers, we exponentiate and normalize to obtain a valid distribution:

$$p_\theta(x) = \frac{e^{f_\theta(x)}}{Z_\theta}, \qquad Z_\theta = \sum_{x} e^{f_\theta(x)}.$$

But, for this type of parameterization (known as an energy based model), computing the normalizing constant $Z_\theta$ requires summing over the entire, astronomically large support, which is intractable.
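To see why this is hopeless, here is a rough sketch (with a hypothetical stand-in energy function `f_theta`): the sum defining $Z_\theta$ ranges over every possible sequence, and the number of sequences grows exponentially with length.

```python
import itertools
import math

# Hypothetical stand-in for a neural "energy" network that scores a token sequence.
f_theta = lambda seq: 0.01 * sum(seq)

vocab_size, seq_len = 4, 3  # tiny enough to enumerate: 4**3 = 64 sequences
Z = sum(math.exp(f_theta(seq)) for seq in itertools.product(range(vocab_size), repeat=seq_len))
print(Z)

# For a GPT-2 scale setup (~50k tokens, 1024 positions) the same sum would have
# 50000**1024 terms, so Z can never be computed directly.
```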
Luckily for us, there is a way to make this problem tractable. In almost all practical applications, our data support decomposes as a sequence of tokens $x = x_1 x_2 \cdots x_d$, so by the chain rule of probability we can factorize

$$p_\theta(x) = p_\theta(x_1)\, p_\theta(x_2 \mid x_1) \cdots p_\theta(x_d \mid x_1 \cdots x_{d-1}),$$

where each factor is a distribution over a single token and is therefore easy to normalize. This motivates autoregressive modeling: a neural network directly learns each next-token conditional, and generation proceeds token by token.
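As a minimal sketch of what this buys us (the `next_token_probs` model here is a hypothetical stand-in, not any particular architecture), generation reduces to a loop of cheap, per-token sampling steps:

```python
import torch

def sample_autoregressively(next_token_probs, bos_id, max_len):
    """next_token_probs(prefix) returns a normalized distribution over the next token."""
    seq = [bos_id]
    for _ in range(max_len):
        probs = next_token_probs(seq)                   # shape (vocab_size,), sums to 1
        seq.append(torch.multinomial(probs, 1).item())  # sample x_i ~ p(x_i | x_<i)
    return seq

# Toy stand-in model: a uniform distribution over a 10-token vocabulary.
print(sample_autoregressively(lambda seq: torch.full((10,), 0.1), bos_id=0, max_len=5))
```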
Autoregressive modeling is perhaps the simplest workable approach imaginable. Yet, despite the numerous advances in the field of generative modeling, no competitive alternative has emerged for discrete data. Ultimately, the core obstacle is the discrete nature of the data itself: rather than simplifying our space, this structure breaks the fundamentals of calculus that our neural network training relies on.
To illustrate this point, consider a classic GAN for image generation. A generator network maps random noise to an image, while a discriminator network is trained to tell generated images apart from real ones; the generator improves by backpropagating the discriminator's feedback through the generated image.
Now, suppose the GAN generates discrete text instead. While one could define a generator and discriminator using similar principles, a fundamental problem arises. Unlike images, which are continuous and thus allow gradients to be backpropagated, text consists of discrete values that can't be differentiated. As such, computing the gradient learning signal for the generator becomes a cumbersome process (often relying on crude approximations and/or noisy estimators).
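The non-differentiability is easy to see in a couple of lines (a generic PyTorch illustration, not tied to any particular GAN):

```python
import torch

logits = torch.randn(5, requires_grad=True)  # generator's scores over a 5-token vocabulary
token = torch.argmax(logits)                 # committing to a discrete token

# The chosen token is an integer index with no gradient history, so a discriminator's
# feedback on it cannot be backpropagated into the generator's parameters.
print(token, token.requires_grad)            # e.g. tensor(3) False
```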
Other approaches largely run into similar problems, leaving autoregressive modeling as the only viable contender. Without fundamental advancements to the core probabilistic framework, generative models for discrete data have instead progressed by repeatedly iterating on the neural network architecture. In particular, modern methods largely revolve around the transformer architecture.
As the field of autoregressive transformers reaches a natural maturity, many have started to believe that more data, better hyperparameters, and larger architectures are all you need for AGI. Yet, there are still fundamental shortcomings of the autoregressive framework itself: generation must proceed one token at a time from left to right, which makes sampling slow and hard to control, and errors compound during sampling, typically requiring corrective tricks such as distribution annealing (e.g. nucleus sampling).
These shortcomings motivate us to define an alternative probabilistic model, which we coin the Score Entropy Discrete Diffusion (SEDD) model.
We saw that the main failure of energy based models was the computation of the normalizing constant $Z_\theta$. To sidestep this, we can instead model the ratios $\frac{p(y)}{p(x)}$ between pairs of elements: the normalizing constant appears in both the numerator and the denominator, so it cancels out. These ratios collectively form a quantity known as the concrete score, a discrete analogue of the score function $\nabla_x \log p(x)$ that underlies continuous diffusion models.
Importantly, we write our neural network approximation as $s_\theta(x)_y \approx \frac{p(y)}{p(x)}$: given an input sequence $x$, the network outputs one positive number for each "neighboring" sequence $y$, i.e. each sequence obtained by changing a single position of $x$.
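Concretely, the network can be thought of as mapping a sequence of token ids to a grid of positive ratio estimates, one for every (position, replacement token) pair. Here is a shape-level sketch with hypothetical names; it is not the architecture we actually use:

```python
import torch
import torch.nn as nn

class ConcreteScoreNet(nn.Module):
    """Toy stand-in: token ids -> one positive ratio estimate per (position, replacement)."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):                 # x: (batch, seq_len) token ids
        h = self.embed(x)
        return torch.exp(self.out(h))     # (batch, seq_len, vocab_size), strictly positive

net = ConcreteScoreNet(vocab_size=50)
ratios = net(torch.randint(0, 50, (2, 8)))  # estimates of p(y)/p(x) for single-token edits
print(ratios.shape)                         # torch.Size([2, 8, 50])
```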
With our neural network $s_\theta$ and the true ratios $\frac{p(y)}{p(x)}$, we can generalize the well-known cross entropy loss to define the score entropy:

$$\mathcal{L}_{\mathrm{SE}}(x) = \sum_{y \neq x} w_{xy} \left( s_\theta(x)_y - \frac{p(y)}{p(x)} \log s_\theta(x)_y + K\!\left(\frac{p(y)}{p(x)}\right) \right),$$

where the $w_{xy} \ge 0$ are weights and $K(a) = a(\log a - 1)$ is a constant (in $\theta$) that keeps the loss nonnegative, with the minimum attained exactly when $s_\theta(x)_y$ matches the true ratio.
Score entropy generalizes cross entropy by allowing one to learn arbitrary positive quantities (like our ratios), not just normalized probabilities.
In practice, we can combine the score entropy loss for all values of $x$ by drawing $x$ from our dataset and averaging, giving a single training objective for $s_\theta$.
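Here is a minimal sketch of the loss for a single input, assuming (unrealistically, as the next paragraph explains) that the true ratios are available; the function name is mine, not from the released code:

```python
import torch

def score_entropy(s_xy, true_ratios, weights=None):
    """
    Score entropy for one input x over its neighbors y != x.
      s_xy:        positive model outputs s_theta(x)_y
      true_ratios: the true ratios p(y)/p(x) for the same neighbors
    """
    if weights is None:
        weights = torch.ones_like(s_xy)
    K = true_ratios * (torch.log(true_ratios) - 1.0)  # theta-independent term keeping the loss >= 0
    return (weights * (s_xy - true_ratios * torch.log(s_xy) + K)).sum()

# Sanity check: the loss is zero exactly when the model outputs the true ratios.
ratios = torch.tensor([0.5, 2.0, 1.0])
print(score_entropy(ratios.clone(), ratios))                 # ~0
print(score_entropy(torch.tensor([1.0, 1.0, 1.0]), ratios))  # > 0
```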
While score entropy is a nice theoretical construct, it requires too much prior information to work as written. In particular, if we already knew the true ratios $\frac{p(y)}{p(x)}$, there would be nothing left to learn in the first place.
Following standard techniques in score matching, we can derive an equivalent objective, the denoising score entropy, that can be computed without knowing the true ratios. Suppose $x$ is produced by perturbing a clean data point $x_0$ with a known noising process $p(\cdot \mid x_0)$. Then, up to a constant independent of $\theta$, the loss becomes

$$\mathbb{E}_{x_0 \sim p_0,\; x \sim p(\cdot \mid x_0)} \left[ \sum_{y \neq x} w_{xy} \left( s_\theta(x)_y - \frac{p(y \mid x_0)}{p(x \mid x_0)} \log s_\theta(x)_y \right) \right].$$

In particular, this replaces the unknown ratio $\frac{p(y)}{p(x)}$ with the transition ratio $\frac{p(y \mid x_0)}{p(x \mid x_0)}$, which we can evaluate in closed form because we chose the noising process ourselves.
While the previous section showed that we can use score entropy to learn the concrete score from data (corresponding to learning the probabilistic model), we still don’t know how to generate samples. To do this, we will borrow the core idea from diffusion models: progressively corrupt data into noise, and generate by learning to reverse this corruption.
To do this, we must first define what “adding noise” to a discrete text sample means. While in continuous space such a perturbation arises naturally from adding Gaussian noise, in a discrete space the data is instead forced to “jump” directly between different elements.
Mathematically, such evolutions are known as discrete diffusion processes. We can describe such a process with a differential equation that evolves our “clean” data distribution $p_0$ over time:

$$\frac{d p_t}{d t} = Q\, p_t,$$

where $Q$ is a transition rate matrix. Remember that, since our support is discrete, $p_t$ is just a (very large) vector of probabilities, and the entry $Q(y, x)$ controls the rate at which probability mass jumps from element $x$ to element $y$.
In particular, we see that as $t$ grows, $p_t$ converges to the stationary distribution of the process; for the simplest choice of $Q$ this is the uniform distribution, which would mean any two elements are equally likely. In particular, since we work with sequences, we choose to perturb each position independently, so a single jump corrupts only one token rather than destroying the information in the sequence entirely.
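As a quick illustration of what this noising looks like on an actual token sequence, here is a toy corruption routine (an illustrative schedule; the processes in the paper are specified through the rate matrix $Q$ rather than hard-coded like this):

```python
import math
import torch

def corrupt(tokens, t, vocab_size):
    # Each position independently jumps to a uniformly random token with probability
    # 1 - exp(-t): clean data at t = 0, (near-)uniform noise as t grows large.
    replace = torch.rand(tokens.shape) < (1 - math.exp(-t))
    noise = torch.randint(0, vocab_size, tokens.shape)
    return torch.where(replace, noise, tokens)

x0 = torch.tensor([5, 12, 7, 7, 3])
for t in [0.1, 1.0, 10.0]:
    print(t, corrupt(x0, t, vocab_size=50).tolist())  # more positions jump as t grows
```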
Importantly, as with continuous diffusions, this corruption process can be reversed in time, and the reversal is itself another discrete diffusion. Here, we can reconstruct the jumping strategy but in reverse, seeing that this process intuitively pushes towards higher probability regions:

$$\bar{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)}\, Q(x, y).$$

The reverse rates are just the forward rates rescaled by the ratios $\frac{p_t(y)}{p_t(x)}$, which are exactly the concrete scores that score entropy lets us learn from data.
With such a learned neural network $s_\theta(x)_y \approx \frac{p_t(y)}{p_t(x)}$, we can plug the estimated ratios into the reverse process and simulate it, starting from pure noise and ending with (approximate) samples from the data distribution.
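A rough sketch of such a sampler is below, assuming a uniform forward process (so the forward jump rates are constant) and a simple Euler-style discretization; the samplers we actually use are more careful, and every name here is hypothetical:

```python
import torch

def reverse_sample(score_model, seq_len, vocab_size, num_steps, T=10.0):
    """score_model(x, t) returns estimated ratios p_t(y)/p_t(x), shape (seq_len, vocab_size)."""
    x = torch.randint(0, vocab_size, (seq_len,))    # start from (near-)uniform noise at time T
    dt = T / num_steps
    for step in range(num_steps):
        t = T - step * dt
        ratios = score_model(x, t)                  # positive ratio estimates for single-token edits
        probs = dt * ratios                         # jump probabilities over this small time step
        probs.scatter_(1, x[:, None], 0.0)          # a "jump" to the current token is not a jump
        stay = (1.0 - probs.sum(dim=1, keepdim=True)).clamp(min=0.0)
        probs.scatter_(1, x[:, None], stay)         # leftover mass: keep the current token
        x = torch.multinomial(probs, 1).squeeze(1)  # resample every position in parallel
    return x

# Toy stand-in score model that treats every jump as equally plausible.
dummy_score = lambda x, t: torch.ones(x.shape[0], 50)
print(reverse_sample(dummy_score, seq_len=8, vocab_size=50, num_steps=64).tolist())
```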
When compared with autoregressive models, our approach can use complete global context during generation, resulting in more coherent overall samples. In particular, autoregressive models like GPT-2 generate strictly left to right, so early mistakes compound and sample quality tends to degrade without annealing tricks such as nucleus sampling.
Furthermore, we can trade off compute for quality, something that isn’t possible for autoregressive models (which must take exactly one step per token), enabling us to generate good samples with far fewer iterations. Graphing generation quality against the number of sampling steps shows this tradeoff directly.
One other cool thing is that we can prompt our generation similarly to autoregressive models. In fact, our setup is even more flexible, since we can prompt from any position, not just left to right. The procedure is very simple: given prompt tokens at some fixed set of positions, we hold those positions fixed and only update the remaining positions during the reverse diffusion.
While naive, this is actually principled! Suppose we are given an input $x$ that agrees with the prompt tokens, and consider any other candidate $y$ that also agrees with them. When we form the ratio of conditional probabilities $\frac{p(y \mid \text{prompt})}{p(x \mid \text{prompt})}$, the probability of the prompt cancels from the numerator and denominator, leaving the unconditional ratio $\frac{p(y)}{p(x)}$. In particular, the conditional and unconditional scores are the same, so we can reuse our base score network without any retraining.
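As a sketch of how little machinery this needs (all names hypothetical; `reverse_step` stands in for any single reverse-diffusion update, such as the one sketched earlier):

```python
import torch

def prompted_reverse_sample(reverse_step, prompt_ids, prompt_positions,
                            seq_len, vocab_size, num_steps):
    """Run the usual reverse sampler, re-clamping the prompted positions after every update."""
    x = torch.randint(0, vocab_size, (seq_len,))
    x[prompt_positions] = prompt_ids              # plant the prompt into the initial noise
    for step in range(num_steps):
        x = reverse_step(x, step)
        x[prompt_positions] = prompt_ids          # the prompt stays fixed throughout
    return x

# Toy usage with a stand-in step that resamples everything uniformly: positions 2 and 3
# always hold the prompt, whether they sit at the start, middle, or end of the sequence.
toy_step = lambda x, step: torch.randint(0, 50, x.shape)
out = prompted_reverse_sample(toy_step, prompt_ids=torch.tensor([7, 9]),
                              prompt_positions=torch.tensor([2, 3]),
                              seq_len=8, vocab_size=50, num_steps=16)
print(out.tolist())
```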
When we test a variety of prompting schemes, our model generates high quality text, matching the best autoregressive sampling strategies under both standard (left to right) and nonstandard (fill in the middle) prompting. Note that neither of these prompting strategies is explicitly learned, making these results zero-shot.
| Method | MAUVE Score (↑) |
|---|---|
| GPT-2 w/ nucleus sampling | 0.955 |
| SEDD w/ standard prompting | 0.957 |
| SEDD w/ infilling prompting | 0.942 |
For language models, one important indicator of performance is perplexity, which is a function of the model’s log probability. Lower perplexity is strongly associated with better performance (most notably, it improves predictably as model size grows), so it’s natural to want to measure it for our generative model. For us, although we learn the concrete score rather than the probability itself, we can still lower bound the model’s log probability with a diffusion-style evidence lower bound (ELBO).
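Concretely, for a sequence $x$ of $d$ tokens the standard definition is

$$\mathrm{PPL}(x) = \exp\!\Big(-\tfrac{1}{d}\log p_\theta(x)\Big), \qquad \log p_\theta(x) \ge \mathrm{ELBO}(x) \;\Longrightarrow\; \mathrm{PPL}(x) \le \exp\!\Big(-\tfrac{1}{d}\,\mathrm{ELBO}(x)\Big),$$

so plugging our lower bound on the log probability into the definition yields an upper bound on perplexity; this is why the SEDD numbers in the table below carry ≤ signs.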
For our models, we find that our approach challenges autoregressive models on likelihoods for the first time; for example, our perplexity upper bounds beat GPT-2’s exact perplexities on a majority of the test datasets:
| Method | LAMBADA PPL | WikiText2 PPL | PTB PPL | WikiText103 PPL | 1BW PPL |
|---|---|---|---|---|---|
| GPT-2 Small | 45.04 | 42.43 | 138.43 | 41.60 | 75.20 |
| SEDD Small | ≤50.92 | ≤41.84 | ≤114.24 | ≤40.62 | ≤79.29 |
| GPT-2 Medium | 35.66 | 31.80 | 123.14 | 31.39 | 55.72 |
| SEDD Medium | ≤42.77 | ≤31.04 | ≤87.12 | ≤29.98 | ≤61.19 |
This blog post has presented our recent work on score entropy discrete diffusion models, an alternative probabilistic technique for discrete data that challenges the longstanding dominance of autoregressive models. Our work is part of the burgeoning field of diffusion models and, in particular, directly generalizes prior work on score matching to discrete spaces. While our work “pieces together the puzzle” for this type of approach (enabling, for the first time, highly competitive results on large-scale tasks), I would be remiss if I didn’t mention other discrete diffusion methods that laid out a sizeable part of the theoretical framework we build on.
If you are interested in more mathematical details or getting started with SEDD, our paper is out on arXiv and our code can be found on GitHub.