ShipSquad

What is PagedAttention?

AI Infrastructure


A memory management technique that partitions the KV cache into fixed-size blocks (pages) so that GPU memory is used efficiently during LLM inference.

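PagedAttention, introduced by the vLLM project, manages the key-value (KV) cache the way an operating system manages virtual memory. Each request's cache is split into fixed-size blocks that need not be contiguous in GPU memory, and a per-request block table maps logical blocks to physical ones. Blocks are allocated on demand as tokens are generated, which nearly eliminates external fragmentation and confines wasted space to a request's last, partially filled block. Requests that share a prefix, such as several samples drawn from one prompt, can point to the same physical blocks and copy a block only when one of them writes to it. The vLLM authors report 2-4x higher serving throughput at comparable latency than the serving systems they compared against, which reserve contiguous KV cache memory for each request.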
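To make the block-table idea concrete, here is a minimal Python sketch of the bookkeeping only. It is not vLLM's implementation: the names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical, the block size of 16 tokens is an illustrative choice, and no real key/value tensors or GPU memory are involved.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, an assumption)


class BlockAllocator:
    """Hypothetical pool of fixed-size physical blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.refcount = [0] * num_blocks     # how many sequences use each block

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        # Another sequence now references the same physical block.
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Maps one request's logical token positions onto physical blocks."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or none exists yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partially filled block is shared, so take
                # a private block (a real system would also copy the KV data).
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def locate(self, position):
        # Translate a token position into (physical block id, offset in block).
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def fork(self):
        # A new sequence that shares every existing block with this one.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def free(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

In this sketch a sequence never reserves memory for tokens it has not produced, so the only unused space is the tail of its last block, and a forked sequence costs no extra blocks until it writes into a shared one.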
