What is PagedAttention?
AI Infrastructure
A memory management technique that partitions the KV cache into fixed-size pages (blocks) so that GPU memory is used efficiently during LLM inference.
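PagedAttention, introduced by the vLLM project, treats the key-value (KV) cache the way an operating system treats virtual memory. Each request's cache is split into fixed-size pages that need not be contiguous in GPU memory, and a per-request block table maps logical pages to physical ones. Because pages are allocated on demand as tokens are generated, memory is not reserved up front for a maximum sequence length: external fragmentation is eliminated and waste is limited to the unfilled part of each request's last page. Pages can also be shared across concurrent requests, for example when several requests have the same prompt prefix. The vLLM authors report 2-4x higher serving throughput than earlier systems such as FasterTransformer and Orca at comparable latency.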
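The bookkeeping behind this idea can be illustrated with a small sketch. The code below is a toy model, not vLLM's implementation or API: the names BlockAllocator, Sequence, append_token, and fork are hypothetical, and the "blocks" are plain integer IDs rather than GPU tensors. It shows only on-demand page allocation, a per-sequence block table, and reference-counted sharing with copy-on-write.

```python
class BlockAllocator:
    """Toy pool of fixed-size KV-cache blocks (hypothetical, not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size          # tokens per block
        self.free = list(range(num_blocks))   # physical block IDs not in use
        self.refcount = {}                    # physical block ID -> number of users

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One request: a block table mapping logical blocks to physical blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # index = logical block, value = physical block ID
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partially filled block is shared, so take
                # a private block before writing (a real system would also
                # copy the cached keys and values into it).
                new_block = self.allocator.allocate()
                self.allocator.release(last)
                self.block_table[-1] = new_block
        self.num_tokens += 1

    def fork(self):
        # A new sequence with the same prefix shares every existing block.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):            # 6 tokens occupy 2 blocks of 4 tokens each
    parent.append_token()
child = parent.fork()         # shares both blocks, no new memory used
child.append_token()          # triggers copy-on-write on the last block only
print(parent.block_table, child.block_table, len(allocator.free))
```

In this sketch the parent and child keep sharing the full first block, and only the partially filled last block is duplicated when the child writes to it. That is the kind of sharing the paragraph above describes, under the stated simplifications.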