Groq

Groq

Ultra-fast LPU inference chips

General Infrastructure

What it's used for

Groq delivers ultra-low-latency LLM inference using custom-designed Language Processing Unit (LPU) hardware that generates tokens significantly faster than GPU-based alternatives. Where a typical GPU inference provider might deliver 30-80 tokens/sec, Groq routinely achieves 300-800+ tokens/sec for supported models.

  • Real-time chat — build conversational AI that responds almost instantly, with time-to-first-token under 100ms
  • Agent workflows — run multi-step LLM chains where each step completes in milliseconds, dramatically reducing end-to-end latency
  • Batch processing — process large volumes of text (summarization, extraction, classification) at speeds that make real-time pipelines feasible
  • Voice AI — power speech-to-text-to-LLM-to-speech pipelines where low latency is critical for natural conversation
  • OpenAI-compatible API — drop-in replacement for OpenAI SDK with zero code changes required

Developers building latency-sensitive applications choose Groq because speed is its core differentiator. When your AI assistant needs to feel instant, or when you are chaining 5+ LLM calls in an agent loop, Groq's LPU advantage compounds at every step.

Groq supports popular open models including Llama 3, Mixtral, Gemma, and Whisper (for speech-to-text). The model selection is curated rather than exhaustive, focusing on models that benefit most from LPU acceleration.

Getting started

  1. Create an account — sign up at console.groq.com and generate an API key from the API Keys section.
  2. Install the Groq SDK:
    pip install groq
  3. Run your first query:
    from groq import Groq
    
    client = Groq(api_key='your-api-key')
    response = client.chat.completions.create(
        model='llama3-70b-8192',
        messages=[{'role': 'user', 'content': 'Explain photosynthesis'}],
        temperature=0.7
    )
    print(response.choices[0].message.content)
  4. Or use the OpenAI SDK — point it at Groq's endpoint:
    from openai import OpenAI
    
    client = OpenAI(
        api_key='your-groq-key',
        base_url='https://api.groq.com/openai/v1'
    )
  5. Use Whisper for speech-to-text:
    with open('audio.mp3', 'rb') as f:
        transcription = client.audio.transcriptions.create(
            model='whisper-large-v3',
            file=f
        )

Pricing: Very competitive. Llama 3 8B: $0.05/M tokens. Llama 3 70B: $0.59/M tokens. Mixtral 8x7B: $0.24/M tokens. Free tier includes generous rate limits (30 requests/min on most models) — enough for development and prototyping. Full pricing details.

Tip: Groq's speed advantage is most noticeable on longer outputs. For short completions (< 50 tokens), the latency difference vs. GPU providers is minimal. For maximum throughput, use streaming mode and start processing tokens as they arrive.

No case studies yet

Be the first to share a Groq case study and get discovered by clients.

Submit a case study

Thought leaders

AI leaders using Groq

Follow for insights, tutorials, and thought leadership

Related tools in General

Need a Groq expert?

Submit a brief and we'll match you with vetted specialists who have proven Groq experience.

Submit a brief — it's free