What is vLLM?
AI Infrastructure
A high-throughput open-source library for serving LLMs that uses PagedAttention for efficient GPU memory management.
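vLLM uses PagedAttention to manage the attention key-value (KV) cache in fixed-size blocks, much as an operating system manages virtual memory in pages. This reduces memory fragmentation and lets more concurrent requests fit on each GPU, which improves throughput. It also supports continuous batching, tensor parallelism, and many open-weights models, making it a popular self-hosting choice.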
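The paging idea can be illustrated with a toy block allocator. This is a minimal sketch of the concept only, not vLLM's implementation: the class `ToyBlockAllocator`, its methods, and the block size of 16 tokens are all hypothetical names and values chosen for illustration. Each sequence holds a table of block ids and takes a new block only when its current blocks are full, so memory is reserved on demand rather than for the maximum possible length.

```python
# Toy illustration of paged KV-cache bookkeeping. NOT vLLM's actual code;
# all names and the block size are assumptions made for this sketch.

BLOCK_SIZE = 16  # tokens per block (assumed value)


class ToyBlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # no blocks yet, or all current blocks full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


if __name__ == "__main__":
    alloc = ToyBlockAllocator(num_blocks=4)
    for _ in range(17):          # 17 tokens need two 16-token blocks
        alloc.append_token("a")
    for _ in range(5):           # 5 tokens need one block
        alloc.append_token("b")
    print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]), len(alloc.free_blocks))
    alloc.free("a")              # blocks become reusable by other requests
    print(len(alloc.free_blocks))
```

Because blocks go back to a shared pool as soon as a request finishes, a scheduler that admits new requests between decoding steps (continuous batching) can reuse them right away.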