Speculative Decoding @ ICML

Decode faster from existing off-the-shelf auto-regressive models, without retraining, while guaranteeing the exact same output distribution.

Last year my collaborators, Matan Kalman and Yossi Matias, and I published a cool paper introducing a generalization of speculative execution to the stochastic setting, which we call speculative sampling. Speculative sampling is applicable in general, but we also introduced a simple technique that applies it to decoding from auto-regressive models, like Transformers, which we call speculative decoding. Speculative decoding enables faster inference from LLMs without a trade-off - i.e. we decode faster while guaranteeing exactly the same output distribution, so there are no quality sacrifices. The speedup comes from putting existing spare compute capacity to work. We put the paper on arXiv last year, but I only just gave a short talk explaining the main ideas at ICML, which you can find here.
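
To make the "exact same output distribution" claim concrete, here is a minimal sketch (in NumPy, with illustrative names that are not from the paper's code) of the acceptance rule at the heart of speculative sampling: the draft model proposes a token, and the target model either accepts it or resamples from a corrected residual distribution, so the final sample is always distributed exactly according to the target model.

```python
# Minimal sketch of the speculative sampling acceptance rule, assuming we
# already have next-token distributions for the same prefix from a small
# "draft" model (q) and the large "target" model (p). Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, drafted_token):
    """Accept or reject one token proposed by the draft model.

    p, q: target / draft next-token probability vectors for the same prefix.
    drafted_token: the token index that was sampled from q.
    Returns (token, accepted). The returned token is distributed exactly
    according to p, which is what preserves the target model's outputs.
    """
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[drafted_token] / q[drafted_token]):
        return drafted_token, True
    # On rejection, resample from the residual distribution
    # norm(max(0, p - q)), which corrects for the draft model's bias.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False
```

In the full decoding loop, the draft model proposes several tokens ahead and the target model scores all of them in a single parallel pass, so the accepted tokens come at roughly the cost of one target-model step.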
