Modal

Serverless GPU cloud for ML

General Infrastructure

What it's used for

Modal is a serverless GPU cloud that lets you run Python functions on cloud GPUs without managing any infrastructure. You define your environment and dependencies entirely in code, and Modal handles container building, GPU scheduling, auto-scaling, and cold-start optimization behind the scenes.

  • Serverless GPU functions — decorate any Python function with @app.function(gpu='A100') and it runs on cloud GPUs with zero infra management
  • Fine-tuning jobs — run model fine-tuning with automatic checkpointing and volume mounts for datasets
  • Batch inference — process thousands of inputs in parallel with automatic scaling and pay only for active compute time
  • Model serving — deploy models as web endpoints with @app.cls that stay warm and autoscale based on traffic
  • Scheduled jobs — run recurring ML pipelines (retraining, embeddings updates) on a cron schedule
  • Custom containers — define Docker-like environments in Python with modal.Image — no Dockerfile needed

ML engineers, indie hackers, and AI startups favor Modal for its developer experience: the feedback loop from local code to running on an H100 is typically seconds rather than hours. It removes the DevOps burden of maintaining Kubernetes clusters, Docker registries, and GPU orchestration.

Modal is particularly strong for bursty workloads that do not justify reserved GPU instances. You pay per second of compute and nothing while no containers are running, so a workload that runs intermittently can cost far less than an always-on instance billed around the clock.

Getting started

  1. Install the Modal package:
    pip install modal
  2. Authenticate — run the setup command which opens your browser for login:
    modal setup
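  3. Write a GPU function. Create a Python file (e.g., app.py):
    import modal
    
    app = modal.App('my-ml-app')
    # Container image defined in Python: a Debian base plus the libraries the function imports
    image = modal.Image.debian_slim().pip_install('torch', 'transformers')
    
    @app.function(gpu='A100', image=image)
    def generate(prompt: str):
        # Imported inside the function so transformers is only required in the container
        from transformers import pipeline
        # Llama 2 is gated on Hugging Face: the container needs a token with access to it
        pipe = pipeline('text-generation', model='meta-llama/Llama-2-7b-hf', device=0)
        return pipe(prompt, max_new_tokens=100)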
  4. Run it. The command is issued from your machine, while the function body executes on a cloud GPU:
    modal run app.py::generate --prompt "Once upon a time"
  5. Deploy as an endpoint — add a web endpoint decorator and deploy:
    modal deploy app.py

Pricing: Pay per second of GPU time. As listed here, an A100 40GB costs ~$1.10/hr and an H100 ~$3.95/hr; rates change, so check Modal's pricing page for current figures. The free tier includes $30/month in compute credits, enough for significant experimentation. There is no charge while no containers are running.
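To make the per-second billing concrete, here is a small back-of-the-envelope calculation using the hourly rates quoted above. The workload (20 minutes of A100 time per day for 30 days) is an invented example, and the rates are the approximate figures from this page rather than authoritative prices.

```python
A100_HOURLY = 1.10  # approximate rate quoted on this page, USD per hour
H100_HOURLY = 3.95  # approximate rate quoted on this page, USD per hour


def on_demand_cost(gpu_seconds: float, hourly_rate: float) -> float:
    """Cost in USD when billed per second of active GPU time."""
    return gpu_seconds * hourly_rate / 3600


# Hypothetical workload: 20 minutes of GPU time per day, for 30 days.
active_seconds = 20 * 60 * 30          # 36,000 seconds, i.e. 10 hours
always_on_seconds = 24 * 3600 * 30     # the same GPU left running all month

bursty = on_demand_cost(active_seconds, A100_HOURLY)
always_on = on_demand_cost(always_on_seconds, A100_HOURLY)

print(f"per-second billing: ${bursty:.2f}")     # per-second billing: $11.00
print(f"always-on instance: ${always_on:.2f}")  # always-on instance: $792.00
```

Under these assumptions the bursty workload fits within the $30 monthly free credits, while the same GPU kept running all month would cost roughly 70 times more.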

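Tip: Use modal.Volume to persist model weights and datasets across function calls, so large models are not re-downloaded every time a container cold-starts. For endpoints that need low-latency responses, keep at least one container running with @app.cls(keep_warm=1); recent Modal releases name this parameter min_containers. A container kept warm this way is billed while it waits.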
