What is vLLM?

AI Infrastructure

Last updated: July 30, 2026

vLLM is a high-throughput, open-source library for LLM inference and serving, built around PagedAttention for efficient GPU memory management.

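vLLM uses PagedAttention to store each request's attention key-value (KV) cache in fixed-size blocks, much as an operating system manages virtual memory in pages. This cuts the memory lost to fragmentation and over-reservation, so more concurrent requests fit on a GPU and throughput rises. It also supports continuous batching, tensor parallelism, and many open-weights models, which makes it a popular choice for self-hosting.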
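The paging idea can be illustrated with a toy allocator. This is a minimal sketch in plain Python, not vLLM's implementation: the class name `PagedKVAllocator`, its methods, and the block size of 16 tokens are all hypothetical choices made for illustration. It shows only the bookkeeping, where each request holds a table of block ids and a new block is claimed only when the previous one is full.

```python
BLOCK_SIZE = 16  # tokens per block; an illustrative value, not a vLLM default


class PagedKVAllocator:
    """Toy block-table bookkeeping for a paged KV cache (hypothetical names)."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token, claiming a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)
```

Because blocks are claimed one at a time and need not be contiguous, a short request never reserves space for a maximum-length sequence, and the blocks released by `free` can be reused immediately by any other request.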

Related Terms

Model Serving, Inference, GPU Cluster
