Multi-token prediction in Gemma 4

Why speculative decoding?

The technical reality is that standard LLM inference is bound by memory bandwidth, creating a significant latency bottleneck. The processor spends most of its time moving billions of parameters from VRAM to the compute units just to generate a single token. This leaves compute underutilized and latency high, especially on consumer-grade hardware.

Speculative decoding decouples token generation from verification. By pairing a heavy target model (e.g., Gemma 4 31B) with a lightweight drafter (the MTP model), we can put that idle compute to work: the drafter proposes multiple future tokens in less time than the target model takes to process a single one, and the target model then verifies all of the proposed tokens in parallel.

How speculative decoding works

Standard large language models generate text autoregressively, producing exactly one token at a time. Although effective, this process dedicates the same amount of computation to predicting an obvious continuation (like predicting “words” after “Actions speak louder than…”) as it does to solving a complex logic puzzle.
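
To make the baseline concrete, here is a minimal sketch of one-token-at-a-time decoding. The `model` callable is a hypothetical stand-in for a Gemma 4 forward pass: it takes a token list and returns NumPy-like per-position next-token logits.

    def generate_autoregressive(model, tokens, num_new_tokens):
        """Standard decoding: every new token costs one full forward pass."""
        tokens = list(tokens)
        for _ in range(num_new_tokens):
            logits = model(tokens)                   # one full pass over the weights...
            tokens.append(int(logits[-1].argmax()))  # ...to produce a single token
        return tokens

Every iteration streams the full set of model weights through the processor, which is exactly the memory-bandwidth cost described above.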

MTP mitigates this inefficiency through speculative decoding, a technique introduced by Google researchers in Fast Inference from Transformers via Speculative Decoding. The drafter proposes a short run of tokens, and the target model checks the entire run in a single forward pass. If the target model’s predictions match the draft, it accepts the whole sequence, and that same pass even yields one extra token in the process. This means your application can emit the entire drafted sequence plus a token in the time it normally takes to generate a single one.
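
The sketch below shows this draft-and-verify loop under greedy decoding. As above, `target` and `drafter` are hypothetical callables standing in for the Gemma 4 model and its MTP drafter, each mapping a token list to NumPy-like per-position next-token logits; the drafter is assumed to be small enough that k of its passes cost less than one target pass.

    def generate_speculative(target, drafter, tokens, k=4):
        """One speculative step: draft k tokens cheaply, then verify them
        in a single target forward pass with greedy acceptance."""
        tokens = list(tokens)
        assert tokens, "requires a non-empty prompt"

        # 1. The drafter proposes k tokens autoregressively (cheap passes).
        draft = list(tokens)
        for _ in range(k):
            draft.append(int(drafter(draft)[-1].argmax()))

        # 2. One target pass scores the context plus all k drafted tokens;
        #    logits[j] is the target's prediction for position j + 1.
        logits = target(draft)

        # 3. Accept the longest prefix of the draft the target agrees with.
        n = len(tokens)
        for i in range(k):
            target_token = int(logits[n + i - 1].argmax())
            if target_token != draft[n + i]:
                # Mismatch: keep the i accepted tokens plus the target's fix.
                return draft[:n + i] + [target_token]

        # 4. Every drafted token matched, and the same pass yields one
        #    bonus token for free.
        return draft + [int(logits[-1].argmax())]

Even in the worst case, where no drafted token is accepted, the target pass still emits one corrected token, so the loop never falls below standard decoding speed; when the draft matches, a single target pass yields up to k + 1 tokens.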

Unlocking faster AI from the edge to the workstation

For developers, inference speed is often the primary bottleneck for production deployment. Whether you’re building coding assistants, autonomous agents that require rapid multi-step planning, or responsive mobile applications that run entirely on-device, every millisecond counts.

By pairing a Gemma 4 model with its corresponding drafter, developers can achieve:

  • Improved responsiveness: Drastically reduce latency for near-real-time chat, immersive voice applications, and agent workflows.
  • Supercharged local development: Run our 26B MoE and 31B Dense models on PCs and consumer GPUs at unprecedented speed, powering seamless, complex offline coding and agent workflows.
  • Improved on-device performance: Maximize the usability of our E2B and E4B models on edge devices by generating output faster, which in turn preserves valuable battery life.
  • No quality degradation: Because the primary Gemma 4 model performs the final verification, the output is identical to what it would have generated on its own, just delivered significantly faster.
