Google has unveiled a significant upgrade to its Gemma 4 AI models, promising enhanced local processing power. The models can now generate text noticeably faster by drafting likely future tokens ahead of time and verifying them in bulk, cutting the number of slow, sequential decoding passes.
This advancement hinges on Multi-Token Prediction (MTP), an experimental feature that speeds up token generation through speculative decoding. In traditional autoregressive decoding, each token depends on all the tokens before it, so the model must run one full forward pass per token. MTP instead drafts several plausible next tokens cheaply and lets the main model check them all at once, collapsing many sequential steps into a single pass.
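The draft-then-verify loop at the heart of speculative decoding can be shown with a toy sketch. The two "models" below are stand-in functions invented for illustration (they are not Gemma internals); the point is the control flow: the drafter proposes a few tokens, the main model verifies them in one pass, and any mismatch is replaced with the main model's own token, so the final output is identical to plain step-by-step decoding.

```python
# Toy sketch of greedy speculative decoding. target_next and draft_next are
# hypothetical stand-ins for a large model and a small drafter.

def target_next(token: int) -> int:
    # "Large" model: deterministic next-token rule (stand-in for a real LLM).
    return (token * 3 + 1) % 11

def draft_next(token: int) -> int:
    # "Small" drafter: agrees with the target most of the time, but not always.
    return (token * 3 + 1) % 11 if token % 4 != 0 else (token + 1) % 11

def speculative_generate(prompt: int, n_tokens: int, k: int = 4) -> list[int]:
    out = [prompt]
    while len(out) <= n_tokens:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        draft = []
        t = out[-1]
        for _ in range(k):
            t = draft_next(t)
            draft.append(t)
        # 2) Target verifies the proposals (here token by token; a real model
        #    scores all k positions in a single batched forward pass).
        t = out[-1]
        accepted = []
        for d in draft:
            correct = target_next(t)
            if d == correct:
                accepted.append(d)   # proposal matches: keep it for free
                t = d
            else:
                accepted.append(correct)  # mismatch: take the target's token and stop
                break
        out.extend(accepted)
    return out[:n_tokens + 1]
```

Because every accepted token is either confirmed or corrected by the main model, the output is guaranteed to match what the main model would have produced alone; the drafter only changes how fast you get there.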
While the technology behind Gemma 4 shares similarities with Google's Gemini AI, it is designed to run efficiently on consumer-grade GPUs. This shift toward local processing opens up new possibilities for users who want more privacy and control over their data without relying on cloud services.
MTP optimizes this process by using lightweight drafters that share the main model's key-value (KV) cache, avoiding redundant attention computations. Because each verification pass now scores several tokens instead of one, the model weights are streamed from VRAM to the compute units far less often per generated token, and that memory traffic, not arithmetic, is the usual bottleneck in token-by-token decoding.
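A back-of-envelope calculation makes the memory-bandwidth argument concrete. The figures below are illustrative assumptions, not measured Gemma numbers: single-token decoding must stream roughly the full weight set from VRAM on every step, so throughput is capped near bandwidth divided by model size, and verifying k drafted tokens per pass amortizes that cost.

```python
# Rough throughput model for memory-bound decoding (illustrative numbers only).

def tokens_per_second(model_gb: float, bandwidth_gbps: float,
                      tokens_per_pass: int = 1) -> float:
    # Each forward pass streams the full weight set from VRAM once,
    # so pass latency is approximately model size over memory bandwidth.
    seconds_per_pass = model_gb / bandwidth_gbps
    return tokens_per_pass / seconds_per_pass

# Hypothetical 9 GB of weights on a GPU with 300 GB/s of memory bandwidth:
base = tokens_per_second(9.0, 300.0)       # one token per pass: ~33 tokens/s
spec = tokens_per_second(9.0, 300.0, 4)    # four verified per pass: ~133 tokens/s
```

Real speedups are smaller, since some drafted tokens are rejected and the drafter itself costs time, but the calculation shows why batching verification into one pass pays off on bandwidth-limited consumer hardware.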







