Run 12B LLMs at 120 Tokens/Second on Consumer-Grade 12GB GPUs

TL;DR

A 12B-parameter model can run at 120 tokens/second on a consumer 12GB GPU. The setup combines Google's Quantization-Aware Training (QAT) checkpoints, quantized GGUF files, and speculative decoding in llama.cpp, which makes fast local LLM inference accessible and affordable.

Context

Running large, highly accurate models typically requires expensive, enterprise-grade hardware, which is a high barrier to entry for individual developers and small teams. At 16-bit precision, the weights of a 12B-parameter model alone take about 24 GB, twice the memory of a 12GB card. The challenge was to reach production-level inference speeds for a model over 10B parameters on a widely available consumer GPU, where such a model would normally not fit or would run too slowly for practical applications.

The approach

This setup starts with Google's Gemma 4 12B model, specifically the checkpoint trained with Quantization-Aware Training (QAT). QAT simulates low-precision arithmetic during training, so the weights lose little accuracy when they are later stored in a lower-precision format. Using tools from Unsloth, the model is converted to GGUF, the file format that llama.cpp loads.
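To see why quantization is the step that makes a 12GB card viable, here is a back-of-the-envelope sketch of weight memory at different precisions. It assumes a nominal 12e9 parameters and counts weights only. Real GGUF quantization types store per-block scales and therefore use somewhat more than their nominal bit width, and the KV cache and the draft model need memory on top of this.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumption: nominal parameter count taken from the model name.
N_PARAMS = 12e9

for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
    print(f"{label}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions, the 16-bit weights need about 22 GiB, while 4-bit weights need under 6 GiB, which leaves room on a 12GB card for the cache and a small draft model.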

The key performance gain comes from speculative decoding, sometimes described as Multi-Token Prediction. This technique uses a smaller, faster 'draft' model (in this case, Google's QAT assistant model) to generate several tokens in advance. The larger 12B model then verifies the whole draft in a single forward pass and keeps the longest prefix it agrees with, rather than generating every token one by one. llama.cpp manages the entire inference process, reaching 120 tokens per second on a standard 12GB GPU.
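The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of greedy speculative decoding, not llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the 12B model and the assistant model, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List

# A "model" here maps a token context to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target checks each position. A real engine scores all k
    #    positions in one batched forward pass; the loop is for clarity.
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposal[i])

    # 3. Every draft token matched, so the target's pass yields a bonus token.
    accepted.append(target(context + proposal))
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_tokens]


def greedy(target: Model, prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: plain one-token-at-a-time decoding with the target only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


# Hypothetical toy models: the draft agrees with the target except after token 5.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 5 else toy_target(ctx)


print(generate(toy_target, toy_draft, [1], 20) == greedy(toy_target, [1], 20))
```

The point of the comparison at the end is that, with greedy decoding, the speculative output is identical to what the target model would produce alone. The draft model only changes how many target passes are needed, not which tokens come out.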

Why it worked

The result came not from a single technique but from three optimizations that reinforce each other. QAT kept the model's accuracy high after quantization. GGUF provided an efficient, hardware-agnostic format for execution. Finally, speculative decoding traded a small amount of extra computation in the draft model for fewer forward passes in the large model; the size of that saving depends on how often the large model accepts the drafted tokens. Together, quantization addresses the memory bottleneck and speculative decoding addresses the per-token compute bottleneck that typically prevent large models from running effectively on consumer hardware.
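The trade-off behind speculative decoding can be made concrete with a short calculation. The sketch below uses the standard simplified model from the speculative decoding literature, which assumes every drafted token is accepted independently with the same probability `alpha`. Both `alpha` and the draft length `k` are illustrative parameters here, not measurements from this setup.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    alpha: probability that the target accepts a drafted token (assumed
           independent and identical across positions).
    k:     number of tokens the draft model proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, 4):.2f} tokens per target pass")
```

With an acceptance rate of 0 the function returns 1, which is ordinary decoding. As `alpha` approaches 1 it approaches `k + 1`. The wall-clock speedup is lower than this ratio because the draft model's own passes are not free.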

Apply it yourself

To replicate this for your own projects, find a base model with a QAT-trained variant and a corresponding smaller draft model that shares its tokenizer, since the large model can only verify tokens from a matching vocabulary. Convert both models to GGUF, then use a framework like llama.cpp to load them and configure speculative decoding. This approach lets you test and deploy powerful models locally without investing in costly cloud instances or dedicated AI hardware.
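As a starting point, the sketch below assembles a `llama-server` command line with a draft model attached. The GGUF file names are hypothetical placeholders, and the numeric values are illustrative defaults rather than the settings used in this setup. The flags (`-m`, `-md`, `-ngl`, `-ngld`, `--draft-max`, `--draft-min`, `-c`) are the ones found in recent llama.cpp builds as far as I know; option names have changed between versions, so check `llama-server --help` on your build before relying on them.

```python
import shlex
from typing import List


def llama_server_command(
    target_gguf: str,
    draft_gguf: str,
    draft_max: int = 8,
    draft_min: int = 1,
    ctx_size: int = 8192,
) -> List[str]:
    """Build an argv list for llama-server with speculative decoding enabled."""
    return [
        "llama-server",
        "-m", target_gguf,               # main (target) model
        "-md", draft_gguf,               # draft model used for speculation
        "-ngl", "99",                    # offload all target layers to the GPU
        "-ngld", "99",                   # offload all draft layers to the GPU
        "--draft-max", str(draft_max),   # most tokens to draft per round
        "--draft-min", str(draft_min),   # fewest tokens to draft per round
        "-c", str(ctx_size),             # context window size
    ]


# Hypothetical file names: substitute the GGUF files you actually converted.
cmd = llama_server_command("target-12b-qat-q4_0.gguf", "draft-qat-q4_0.gguf")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the server. Building the command as a list instead of a single string avoids shell-quoting problems with file paths that contain spaces.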