Groq's Language Processing Unit (LPU) is a custom silicon architecture purpose-built for LLM inference. Where GPU-based serving typically delivers 30-100 tokens/second on a 70B model, Groq's LPUs deliver 500-1,000+ tokens/second, a 10-20x speedup. For real-time applications such as voice agents, live chat, and interactive coding, that difference fundamentally changes the user experience.
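The impact of decode throughput on perceived latency is easy to quantify. A minimal sketch using the throughput figures above; the 300-token reply length is an illustrative assumption:

```python
def generation_time(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a reply of `tokens` at a steady decode rate."""
    return tokens / tokens_per_second

# Illustrative reply length for a conversational turn.
reply_tokens = 300

gpu_time = generation_time(reply_tokens, 40)   # typical GPU: 30-100 tok/s
lpu_time = generation_time(reply_tokens, 750)  # Groq LPU: 500-1,000+ tok/s

print(f"GPU: {gpu_time:.1f}s, LPU: {lpu_time:.2f}s")  # GPU: 7.5s, LPU: 0.40s
```

At GPU rates a medium-length reply takes several seconds to finish streaming; at LPU rates it completes in well under a second, which is what makes voice and live-chat use cases feel instantaneous.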
Groq runs open-weight models (Llama, Mixtral, Qwen, DeepSeek) on its custom LPU hardware. The API is OpenAI-compatible, so integration is a drop-in replacement: point an existing OpenAI client at Groq's base URL and swap the API key. The LPU uses a single-core, software-scheduled design that eliminates the batching overhead and memory contention that cap GPU throughput. The tradeoff is that Groq can only host models with published weights; you can't fine-tune models or run closed ones.
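A minimal sketch of the drop-in integration, using only the standard library so the request shape is visible. The endpoint is Groq's documented OpenAI-compatible base URL; the model id is an example and should be checked against Groq's current model list:

```python
import json
import os
import urllib.request

# Standard OpenAI-style chat-completions payload. The model id below
# is an example; consult Groq's model list for current ids.
payload = {
    "model": "llama-3.1-70b-versatile",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}

def groq_chat(payload: dict) -> dict:
    """POST to Groq's OpenAI-compatible endpoint. Requires GROQ_API_KEY."""
    req = urllib.request.Request(
        "https://api.groq.com/openai/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Only send the request when a key is actually configured.
if os.environ.get("GROQ_API_KEY"):
    reply = groq_chat(payload)
    print(reply["choices"][0]["message"]["content"])
```

In practice, code already written against the OpenAI SDK usually needs only the `base_url` and `api_key` changed; the request and response shapes stay the same.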
Llama 3.1 70B costs $0.59 per million input tokens and $0.79 per million output tokens; Llama 3.1 8B is $0.05/$0.08; Mixtral 8x7B is $0.24 for both. A free tier is available for experimentation, and paid usage is pay-as-you-go with no contract.
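To estimate spend from the per-million-token rates above, a minimal sketch (the token volumes in the example are illustrative):

```python
# Per-million-token prices (input, output) in USD, from the list above.
PRICES = {
    "llama-3.1-70b": (0.59, 0.79),
    "llama-3.1-8b": (0.05, 0.08),
    "mixtral-8x7b": (0.24, 0.24),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume on a given model."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# One million tokens each way on Llama 3.1 70B:
print(round(cost_usd("llama-3.1-70b", 1_000_000, 1_000_000), 2))  # 1.38
```

The same workload on Llama 3.1 8B costs roughly a tenth as much, which is why latency-sensitive apps often route simple turns to the smaller model.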
Groq suits builders of voice AI, real-time chat applications, and live coding assistants: anywhere latency matters more than model choice. It is widely used for conversational AI prototypes.