What it's used for

Groq delivers ultra-low-latency LLM inference using custom-designed Language Processing Unit (LPU) hardware that generates tokens significantly faster than GPU-based alternatives. Where a typical GPU inference provider might deliver 30-80 tokens/sec, Groq routinely achieves 300-800+ tokens/sec for supported models.

Real-time chat — build conversational AI that responds almost instantly, with time-to-first-token under 100ms
Agent workflows — run multi-step LLM chains where each step completes in milliseconds, dramatically reducing end-to-end latency
Batch processing — process large volumes of text (summarization, extraction, classification) at speeds that make real-time pipelines feasible
Voice AI — power speech-to-text-to-LLM-to-speech pipelines where low latency is critical for natural conversation
OpenAI-compatible API — drop-in replacement for OpenAI SDK with zero code changes required

Developers building latency-sensitive applications choose Groq because speed is its core differentiator. When your AI assistant needs to feel instant, or when you are chaining 5+ LLM calls in an agent loop, Groq's LPU advantage compounds at every step.

Groq supports popular open models including Llama 3, Mixtral, Gemma, and Whisper (for speech-to-text). The model selection is curated rather than exhaustive, focusing on models that benefit most from LPU acceleration.

Getting started

Create an account — sign up at console.groq.com and generate an API key from the API Keys section.
Install the Groq SDK:
```
pip install groq
```

Run your first query:

from groq import Groq

client = Groq(api_key='your-api-key')
response = client.chat.completions.create(
    model='llama3-70b-8192',
    messages=[{'role': 'user', 'content': 'Explain photosynthesis'}],
    temperature=0.7
)
print(response.choices[0].message.content)

Or use the OpenAI SDK — point it at Groq's endpoint:

from openai import OpenAI

client = OpenAI(
    api_key='your-groq-key',
    base_url='https://api.groq.com/openai/v1'
)

Use Whisper for speech-to-text:

with open('audio.mp3', 'rb') as f:
    transcription = client.audio.transcriptions.create(
        model='whisper-large-v3',
        file=f
    )

Pricing: Very competitive. Llama 3 8B: $0.05/M tokens. Llama 3 70B: $0.59/M tokens. Mixtral 8x7B: $0.24/M tokens. Free tier includes generous rate limits (30 requests/min on most models) — enough for development and prototyping. Full pricing details.

Tip: Groq's speed advantage is most noticeable on longer outputs. For short completions (< 50 tokens), the latency difference vs. GPU providers is minimal. For maximum throughput, use streaming mode and start processing tokens as they arrive.

Groq

What it's used for

Getting started

Commonly paired with

No case studies yet

AI leaders using Groq

Jonathan Ross

Related tools in General

Need a Groq expert?