What it's used for

vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs. Its PagedAttention algorithm stores the KV cache in fixed-size blocks, which reduces wasted GPU memory and lets more requests share a GPU. The project reports 2-4x higher throughput than earlier serving systems at comparable latency, and vLLM has become a common choice for self-hosted LLM deployments.

- PagedAttention: efficient KV cache management for higher throughput
- OpenAI-compatible API: drop-in replacement for OpenAI's API format
- Continuous batching: maximizes GPU utilization across concurrent requests
- Wide model support: Llama, Mistral, Qwen, Gemma, and 50+ architectures
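
To make the PagedAttention idea more concrete, here is a toy sketch of block-based KV cache bookkeeping. It is not vLLM's implementation: the class `PagedKVCache`, its method names, and the block size of 16 tokens are all hypothetical, chosen only to show how a per-sequence block table lets memory be allocated one block at a time instead of reserving space for the maximum sequence length up front.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM setting)


@dataclass
class PagedKVCache:
    """Toy allocator: maps each sequence id to a list of physical block ids."""

    num_blocks: int
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)
    seq_lens: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every physical block starts out free.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(17):
    cache.append_token("request-a")
print(len(cache.block_tables["request-a"]))  # 17 tokens occupy 2 blocks
```

Because a sequence only ever wastes part of its last block, freed blocks can be handed to other requests immediately, which is the property that allows more concurrent requests to fit in the same GPU memory.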

ML engineers use vLLM when they need to serve open-source models in production at scale, and many AI companies and cloud providers use it as their default serving engine.

Getting started

Install vLLM:
```
pip install vllm
```

Start a server:

```
vllm serve meta-llama/Llama-3.2-3B-Instruct
```

Call the OpenAI-compatible API:

```
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Llama-3.2-3B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```
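
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are made up for this example, and the response parsing assumes the usual OpenAI chat-completions shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: same address as the curl call
MODEL = "meta-llama/Llama-3.2-3B-Instruct"


def build_chat_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the model's reply. Since the API follows OpenAI's format, an OpenAI client library pointed at the same base URL is another option.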

vLLM is free and open source under the Apache 2.0 license. Production use generally requires an NVIDIA GPU with CUDA support.
