Written by Max Zeshut
Founder at Agentmelt
Running AI models directly on local devices (phones, laptops, IoT hardware, on-premise servers) rather than sending requests to cloud-hosted APIs. Edge inference eliminates network latency, works offline, and keeps sensitive data on-device—critical for healthcare agents handling patient data, voice agents needing sub-100ms responses, and manufacturing agents operating in facilities without reliable internet. Trade-offs include limited model capacity (only smaller models fit on local hardware) and higher upfront hardware requirements.
A voice agent in a dental office runs speech-to-text and response generation on a local GPU server. Patient conversations never leave the building, latency is under 200ms, and the agent works even when the internet goes down.
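The offline-first behavior described above can be sketched in a few lines. This is an illustrative Python sketch, not a real implementation: the `local_infer` and `cloud_infer` stubs are hypothetical placeholders standing in for an actual on-device model (loaded via a runtime such as llama.cpp or ONNX Runtime) and a cloud API call.

```python
def local_infer(prompt: str) -> str:
    # Hypothetical stub for a real on-device model call
    # (e.g., via llama.cpp or ONNX Runtime). Runs entirely
    # on local hardware, so the prompt never leaves the device.
    return f"[local] reply to: {prompt}"


def cloud_infer(prompt: str) -> str:
    # Hypothetical stub for a cloud-hosted API call, which adds
    # network latency and sends data off-device. Here it simulates
    # the internet being down.
    raise ConnectionError("network unreachable")


def infer(prompt: str, allow_cloud: bool = False) -> str:
    """Offline-first routing: prefer the local model, and fall
    back to the cloud only when explicitly allowed."""
    try:
        return local_infer(prompt)
    except Exception:
        if allow_cloud:
            return cloud_infer(prompt)
        raise


# The local path succeeds, so nothing is sent off-device and the
# agent keeps working even with no internet connection.
print(infer("When is my next appointment?"))
```

The key design choice is that the cloud fallback is opt-in (`allow_cloud=False` by default), so a privacy-sensitive deployment like the dental-office agent never silently routes patient data off-site.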