Replicate

Run any open model via API

What it's used for

Replicate lets you run open-source AI models via a simple API without setting up any GPU infrastructure. With one API call, you can run Llama, Stable Diffusion, Whisper, and thousands of other community-contributed models — each packaged as a versioned, reproducible container.

  • Instant model access — run any model from the Replicate model catalog with a single API call, no GPU setup needed
  • Image & video generation — generate images with SDXL, Flux, and other models, with built-in support for ControlNet, LoRA, and img2img workflows
  • Custom model deployment — package your own fine-tuned models using Cog (open-source) and deploy them with automatic GPU scaling
  • Streaming & webhooks — stream LLM output token-by-token and receive async webhook notifications when long-running predictions complete
  • Model fine-tuning — fine-tune SDXL and language models directly on Replicate with training API endpoints
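
As a rough illustration of the webhook flow mentioned in the list above, the sketch below parses the body of a webhook notification. It assumes the body is a JSON prediction object with `id`, `status`, `output`, and `error` fields; the function name `summarize_prediction` and the example payload are hypothetical, and a real handler would also need to authenticate the incoming request.

```python
import json

# Assumption: these are the states in which a prediction has stopped running.
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def summarize_prediction(body: str) -> dict:
    """Parse a webhook request body and return a small summary dict."""
    prediction = json.loads(body)
    status = prediction.get("status")
    return {
        "id": prediction.get("id"),
        "status": status,
        "done": status in TERMINAL_STATUSES,
        # Output is only passed through once the prediction has succeeded.
        "output": prediction.get("output") if status == "succeeded" else None,
        "error": prediction.get("error"),
    }


# Hypothetical payload, shaped like a prediction object.
example_body = json.dumps({
    "id": "example-id",
    "status": "succeeded",
    "output": ["https://example.com/out.png"],
    "error": None,
})
print(summarize_prediction(example_body))
```

Keeping the parsing in a plain function like this makes it easy to call from whichever web framework receives the webhook.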

Developers, product teams, and AI hobbyists use Replicate because it shortens the path from "I want to try this model" to a working API call. The community model catalog means someone has often already packaged the model you need.

Replicate is especially popular for image and video generation use cases, where the model ecosystem moves fast and teams need to experiment with new architectures (Flux, Stable Video Diffusion) without rebuilding inference infrastructure each time.

Getting started

  1. Create an account — sign up at replicate.com and get your API token from account settings.
  2. Install the Python client:
    pip install replicate
  3. Set your API token:
    export REPLICATE_API_TOKEN=r8_your_token_here
  4. Run a model — generate an image with one call:
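    import replicate

    # Versions are identified by an ID string, not a "latest" tag; copy the
    # current version ID from the model's page and pin it here.
    output = replicate.run(
        'stability-ai/sdxl:<version-id>',
        input={'prompt': 'An astronaut riding a horse on Mars'}
    )
    print(output)  # The generated image output (typically one or more URLs)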
  5. Run an LLM — stream text generation:
    import replicate

    # Print each chunk of generated text as soon as it arrives
    for event in replicate.stream(
        'meta/llama-2-70b-chat',
        input={'prompt': 'Explain quantum computing simply'}
    ):
        print(str(event), end='')
  6. Deploy your own model — install Cog and push your model:
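    brew install cog  # Installs the Cog CLI on macOS (Docker is also required); see the Cog README for other platforms
    cog init  # Creates predict.py and cog.yaml
    cog login  # Authenticates with Replicate before pushing
    cog push r8.im/your-username/your-model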
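
For a sense of what a single API call involves without the client library, here is a minimal sketch that builds (but does not send) the HTTP request. It assumes the endpoint `POST https://api.replicate.com/v1/models/{owner}/{name}/predictions` with a bearer token, which may only apply to some models; the helper name `build_prediction_request` is hypothetical.

```python
import json
import os
import urllib.request

# Assumption: base URL of the Replicate HTTP API.
API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(model: str, prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts a prediction."""
    owner, name = model.split("/", 1)
    payload = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/models/{owner}/{name}/predictions",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Falls back to the placeholder token from step 3 if the variable is unset.
token = os.environ.get("REPLICATE_API_TOKEN", "r8_your_token_here")
req = build_prediction_request(
    "meta/llama-2-70b-chat", "Explain quantum computing simply", token
)
print(req.get_method(), req.full_url)
# Sending it would look like: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch safe to run without a valid token.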

Pricing: Pay per second of compute. Costs vary by model and hardware; the rates quoted here ($0.00115/sec for a T4 GPU, $0.0023/sec for an A40, $0.003/sec for an A100) are indicative and may be out of date. No minimum spend. See the pricing page on replicate.com for current rates.
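
To make per-second billing concrete, the sketch below estimates a bill from the rates quoted in the pricing note above. The rates are copied from that note purely for illustration and may not match current pricing; the function name `estimate_cost` and the example workload are hypothetical.

```python
# Per-second rates in USD, copied from the pricing note above (illustrative only).
RATES_PER_SECOND = {"T4": 0.00115, "A40": 0.0023, "A100": 0.003}


def estimate_cost(hardware: str, seconds: float, runs: int = 1) -> float:
    """Estimated cost in USD for `runs` predictions of `seconds` each."""
    if seconds < 0 or runs < 0:
        raise ValueError("seconds and runs must be non-negative")
    return RATES_PER_SECOND[hardware] * seconds * runs


# Hypothetical workload: 1,000 image generations at about 8 seconds each on an A40.
print(f"${estimate_cost('A40', seconds=8, runs=1000):.2f}")  # $18.40
```

An estimate like this covers only the time a prediction spends running; any setup or idle time that is billed would add to it.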

Tip: Use the Explore page to discover popular models and check their cold-start times. For production use, consider Replicate Deployments to keep models warm with dedicated hardware and guaranteed capacity.
