Another day, another groundbreaking AI from Google. This time, we’re looking at DiffusionGemma, a new member of the Gemma 4 open model family that promises to revolutionize local AI performance.
Most language models are autoregressive: they generate text left to right, one token at a time. DiffusionGemma works differently. It produces entire blocks of text in parallel, much as an image generation model denoises a picture out of static. This approach makes generation faster and more efficient on local hardware like gaming GPUs and Nvidia DGX systems.
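To make the difference concrete, here is a toy Python sketch. It is not DiffusionGemma's actual algorithm. The "model" is a stand-in that already knows the answer, and the names `toy_denoise` and `toy_autoregressive` are made up for illustration. It only shows the shape of the two loops: one model call per token versus a handful of passes that each fill in several positions of a block at once.

```python
import random

MASK = "_"  # placeholder for a position that is still "noise"


def toy_autoregressive(target):
    """Left-to-right generation: one step per token (here, per character)."""
    out, history = [], []
    for ch in target:
        out.append(ch)  # a real model would predict this token
        history.append("".join(out))
    return history


def toy_denoise(target, steps=4, seed=0):
    """Block generation: start fully masked, reveal several positions per step.

    Assumption: a real diffusion model predicts the tokens itself; this toy
    simply copies them from `target` to keep the example self-contained.
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)
    hidden = list(range(len(target)))
    rng.shuffle(hidden)  # positions are filled in no particular order
    per_step = -(-len(target) // steps)  # ceiling division
    history = ["".join(block)]
    for _ in range(steps):
        for pos in hidden[:per_step]:
            block[pos] = target[pos]
        hidden = hidden[per_step:]
        history.append("".join(block))
    return history


if __name__ == "__main__":
    text = "hello world"
    print(len(toy_autoregressive(text)), "sequential steps")
    for snapshot in toy_denoise(text, steps=4):
        print(snapshot)
```

Running it prints the block sharpening over four passes, while the autoregressive loop needs one step for every character.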
With 26 billion parameters, of which only 3.8 billion are activated during inference, DiffusionGemma is built to fit into high-end GPU memory while staying computationally efficient. It can spit out around 700 tokens per second on an RTX 5090 and over 1,000 tokens per second on a single Nvidia H100 AI accelerator, four times the output of similarly sized autoregressive models.
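For a rough sense of what those figures mean, here is some back-of-the-envelope Python. The inputs are the numbers quoted above. The implied autoregressive baseline is an assumption: it simply divides each quoted throughput by four, and the helper names are made up for this sketch.

```python
# Figures quoted in the article
TOTAL_PARAMS_B = 26.0   # billions of parameters
ACTIVE_PARAMS_B = 3.8   # billions activated during inference
RTX_5090_TPS = 700      # tokens per second, approximate
H100_TPS = 1000         # tokens per second, quoted as "over 1,000"
SPEEDUP = 4             # "four times" similarly sized autoregressive models


def active_fraction(active_b, total_b):
    """Share of the model's parameters used for any one forward pass."""
    return active_b / total_b


def seconds_for(tokens, tokens_per_second):
    """Time to generate a response of `tokens` tokens at a steady rate."""
    return tokens / tokens_per_second


def implied_baseline(tokens_per_second, speedup=SPEEDUP):
    """Assumed autoregressive throughput if the speedup applies to this figure."""
    return tokens_per_second / speedup


if __name__ == "__main__":
    print(f"active share: {active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B):.1%}")
    print(f"1,000 tokens on RTX 5090: {seconds_for(1000, RTX_5090_TPS):.2f} s")
    print(f"1,000 tokens on H100: {seconds_for(1000, H100_TPS):.2f} s")
    print(f"implied baseline on H100: {implied_baseline(H100_TPS):.0f} tok/s")
```

In other words, roughly one parameter in seven does work on any given pass, and a 1,000-token answer lands in about a second on an H100.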
This shift in text generation methodology is expected to help most with non-linear tasks, where the output is not written strictly from start to finish, such as inline editing, molecular sequencing, and mathematical graphing. Google has also tuned DiffusionGemma to solve complex puzzles like Sudoku, which demonstrates its ability to continuously self-correct large sets of tokens and makes it a formidable contender in the AI landscape.







