What is PagedAttention?

AI Infrastructure

Last updated: July 30, 2026

A memory management technique that partitions the KV cache into fixed-size pages (called blocks in vLLM) so that GPU memory is used efficiently during inference.

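PagedAttention, introduced by the vLLM project, manages the key-value (KV) cache the way an operating system manages virtual memory. Each request's cache is split into fixed-size blocks that need not be contiguous in GPU memory, and a per-request block table maps logical token positions to physical blocks. This removes external fragmentation, limits wasted space to the last partially filled block of each sequence, and lets concurrent requests share blocks, for example when they have a common prompt prefix. The vLLM authors report serving throughput gains of 2-4x over systems that reserve one contiguous KV cache region per request; the actual gain depends on the model, workload, and sequence lengths.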
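The bookkeeping behind this idea can be sketched in a few lines of Python. This is a minimal illustration and not vLLM's actual code: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE`, and the block size of 16 tokens, are assumptions made for the example. It tracks only block IDs and reference counts and does not store or copy any real key or value tensors.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per block; assumed value for illustration


class BlockAllocator:
    """Hypothetical pool of fixed-size physical KV cache blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))   # IDs of unused physical blocks
        self.refcount = [0] * num_blocks      # how many sequences use each block

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV cache blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        # Another sequence starts referencing an existing block.
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


@dataclass
class Sequence:
    """Hypothetical per-request state: a block table plus a token count."""

    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or there is none yet): take any free block.
            self.block_table.append(self.allocator.allocate())
        elif self.allocator.refcount[self.block_table[-1]] > 1:
            # Copy-on-write: the partially filled last block is shared, so
            # switch to a private block (a real system would copy its data).
            old = self.block_table[-1]
            self.block_table[-1] = self.allocator.allocate()
            self.allocator.release(old)
        self.num_tokens += 1

    def locate(self, token_index: int) -> tuple[int, int]:
        # Translate a logical token position to (physical block, offset).
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def fork(self) -> Sequence:
        # A new sequence that shares every block of this one.
        shared = [self.allocator.share(b) for b in self.block_table]
        return Sequence(self.allocator, shared, self.num_tokens)

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table.clear()
        self.num_tokens = 0
```

In this sketch a sequence of 20 tokens occupies two blocks, wasting at most the unused slots of the second one. A forked sequence adds no memory cost until it writes to a shared, partially filled block.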

Related Terms

vLLM, Inference, Model Serving
