Written by Max Zeshut
Founder at Agentmelt
A technique that reduces the precision of an AI model's numerical weights—typically from 32-bit floating point to 8-bit or 4-bit integers—to shrink the model's memory footprint and increase inference speed. Quantization makes it possible to run large language models on smaller GPUs, edge devices, and consumer hardware. A 70B-parameter model that normally requires 140GB of GPU memory can run in 35GB with 4-bit quantization, with only a 2–5% drop in quality for most tasks.
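The precision reduction described above can be sketched in a few lines. This is a minimal, hypothetical example of symmetric per-tensor 8-bit quantization using NumPy, not the exact scheme any particular library uses (real implementations typically quantize per-channel or per-group and handle outliers more carefully):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 with a single symmetric scale factor."""
    scale = np.max(np.abs(weights)) / 127.0  # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; with round-to-nearest,
# the per-weight reconstruction error is at most scale / 2.
```

The same idea extends to 4-bit quantization (mapping to the range ±7), which is where the 140GB-to-35GB reduction in the example above comes from: each weight shrinks from 16 bits to 4.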