Written by Max Zeshut
Founder at Agentmelt
A technique that reduces the numerical precision of an AI model's weights—typically from 16-bit floating point to 8-bit or 4-bit integers—to shrink model size and speed up inference with minimal quality loss. A 70B-parameter model that requires 140GB of GPU memory at full precision fits in 35GB at 4-bit quantization, enabling deployment on a single consumer GPU instead of a multi-GPU server. Quantization is the key enabler for self-hosted and edge-deployed AI agents.
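The core idea can be sketched in a few lines. Below is a minimal illustration of symmetric per-tensor int8 quantization with NumPy, not a production inference kernel (real deployments use per-channel or group-wise schemes such as GPTQ or AWQ); the function names are hypothetical:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Symmetric quantization: map float weights onto the int8 range [-127, 127].
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights at inference time.
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

# int8 storage is 1 byte per weight vs. 4 for float32: a 4x reduction.
# The same arithmetic gives the article's numbers: 70B params * 2 bytes
# (fp16) = 140GB, versus 70B * 0.5 bytes (4-bit) = 35GB.
reduction = w.nbytes / q.nbytes
max_error = np.max(np.abs(w - w_hat))
```

The rounding error per weight is bounded by half the scale step, which is why quality loss stays small when the weight distribution is well covered by the quantized range.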
Example: a company self-hosts a 70B model for on-premise document analysis. At full precision, they would need 4× A100 GPUs ($60K+); with 4-bit quantization, the model runs on a single A100 with less than 2% quality degradation on their document extraction benchmarks.