Baseten

ML model deployment platform

What it's used for

Baseten is a model deployment platform that makes it straightforward to serve ML models — both LLMs and traditional ML models — as production-grade, auto-scaling API endpoints. Its open-source packaging format, Truss, provides a standardized way to bundle models with their dependencies, pre/post-processing logic, and GPU requirements.

  • One-command deployment: package a model in Truss format and run truss push to get a live API endpoint
  • Auto-scaling: endpoints scale up under load and scale to zero when idle, so you pay only for time the model is actually running
  • GPU provisioning: Baseten allocates the GPU (T4, A100, or H100) that matches the requirements declared in your Truss
  • Pre-built model library: deploy popular models (Llama, Whisper, Stable Diffusion) from the Baseten model library without writing any packaging code
  • Chains: orchestrate multi-model pipelines (e.g., transcription followed by summarization) as a single deployable unit
  • Streaming & async: support for streaming LLM responses and asynchronous batch processing
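As a rough illustration of the streaming item above, the sketch below shows a model class whose predict method yields output in chunks instead of returning one final value. It assumes that Truss treats a generator returned from predict as a streamed response; verify that against the Truss documentation. The token generator here is a stand-in invented for this example, not a real model.

```python
from typing import Callable, Iterator, Optional


class Model:
    """Sketch of a streaming model class (same shape as model/model.py)."""

    def __init__(self, **kwargs):
        self._generate: Optional[Callable[[str], Iterator[str]]] = None

    def load(self):
        # Stand-in for a real model: upper-cases the prompt word by word.
        # A real implementation would load weights here and stream tokens.
        def fake_generate(prompt: str) -> Iterator[str]:
            for word in prompt.split():
                yield word.upper() + " "

        self._generate = fake_generate

    def predict(self, request: dict) -> Iterator[str]:
        prompt = request.get("prompt", "")
        # Yield each chunk as soon as it is ready so the caller can
        # start consuming output before generation has finished.
        for chunk in self._generate(prompt):
            yield chunk
```

Calling model.predict({...}) on a loaded instance returns an iterator, so a local caller can loop over the chunks or join them into a single string.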

ML engineers and platform teams use Baseten when they need a simpler alternative to Kubernetes-based model serving (like KServe or Seldon). It is especially popular with teams that have many models to deploy and want a consistent packaging and deployment workflow across all of them.

Baseten differs from pure API providers (Together, Fireworks) in that it supports any model type, not just LLMs. You can deploy computer vision models, audio models, recommendation systems, and traditional ML models alongside your LLM endpoints.

Getting started

  1. Create an account — sign up at baseten.co and get your API key from the settings page.
  2. Install Truss:
    pip install truss
  3. Create a Truss — initialize a new model package:
    truss init my-model
    This creates a directory with model/model.py and config.yaml.
  4. Define your model — edit model/model.py:
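    class Model:
        def __init__(self, **kwargs):
            # The pipeline is created in load(), not here, so construction stays cheap.
            self._model = None

        def load(self):
            # Runs once at startup. The Llama 2 weights are gated on Hugging Face,
            # so the deployment needs access to them.
            from transformers import pipeline
            self._model = pipeline('text-generation', model='meta-llama/Llama-2-7b-hf', device=0)

        def predict(self, request):
            # request is the JSON body sent to the endpoint, parsed into a dict.
            prompt = request.get('prompt', '')
            return self._model(prompt, max_new_tokens=100)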
  5. Deploy — push to Baseten's infrastructure:
    truss push my-model
    Baseten builds your container, provisions a GPU, and returns an endpoint URL.
  6. Call your endpoint:
    import requests
    response = requests.post(
        'https://app.baseten.co/models/YOUR_MODEL_ID/predict',
        headers={'Authorization': 'Api-Key YOUR_KEY'},
        json={'prompt': 'Hello world'},
    )
    response.raise_for_status()  # fail loudly on auth or server errors
    print(response.json())
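The call in step 6 can be wrapped in a small helper so the API key stays out of source code. This is a minimal sketch using only the Python standard library. The URL pattern is copied from step 6 and should be checked against the endpoint URL that truss push prints for your deployment. The environment variable name BASETEN_API_KEY and both function names are hypothetical, chosen for this example.

```python
import json
import os
import urllib.request

# URL pattern taken from step 6 above; confirm it for your deployment.
PREDICT_URL = "https://app.baseten.co/models/{model_id}/predict"


def build_predict_request(model_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to a model's predict endpoint."""
    return urllib.request.Request(
        PREDICT_URL.format(model_id=model_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_model(model_id: str, payload: dict, timeout: float = 60.0):
    """Send the request and return the decoded JSON response."""
    # BASETEN_API_KEY is a hypothetical variable name used for this sketch.
    api_key = os.environ["BASETEN_API_KEY"]
    request = build_predict_request(model_id, api_key, payload)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending makes the helper easy to check without network access. A generous timeout is a reasonable default because the first request after a scale-to-zero period has to wait for the model to start.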

Pricing: Billed per second of GPU time. Approximate rates: T4 ~$0.60/hr, A10G ~$1.10/hr, A100 40GB ~$3.15/hr. With scale-to-zero there are no charges while an endpoint is idle. The free tier includes starter credits. See Baseten's pricing page for current rates.
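To see what per-second billing with scale-to-zero means in practice, the sketch below estimates cost from active GPU seconds. It uses the approximate hourly rates quoted above, which may be out of date. The function name and the GPU labels used as dictionary keys are made up for this example.

```python
# Approximate hourly rates quoted above, in USD; check current pricing.
HOURLY_RATE_USD = {
    "T4": 0.60,
    "A10G": 1.10,
    "A100-40GB": 3.15,
}


def estimate_cost(gpu: str, active_seconds: float) -> float:
    """Estimated cost in USD for the given number of active GPU seconds."""
    if active_seconds < 0:
        raise ValueError("active_seconds must be non-negative")
    per_second = HOURLY_RATE_USD[gpu] / 3600
    return per_second * active_seconds


if __name__ == "__main__":
    # An A10G endpoint that is busy 90 minutes a day for 30 days.
    busy = estimate_cost("A10G", 90 * 60 * 30)
    # The same GPU kept running around the clock for 30 days.
    always_on = estimate_cost("A10G", 24 * 3600 * 30)
    print(f"scale-to-zero: ${busy:.2f}, always-on: ${always_on:.2f}")
```

At these rates, 45 active hours on an A10G come to about $49.50, against about $792 for keeping the same GPU up for the whole month. The estimate ignores cold-start time and any minimum billing increments, which are not covered here.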

Tip: Use Baseten's model library to deploy popular open models without writing any Truss code, which is handy for quick testing. When you need customization, fork the Truss and modify the predict function to add your own pre/post-processing.
