What is a KV Cache?
A memory cache that stores the attention key and value tensors computed for previous tokens, so they are not recomputed at every step of autoregressive generation.
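The KV cache holds the attention keys and values of already-processed tokens in every layer, so at each decoding step the model computes a query, key, and value only for the new token and attends over the cached entries. Its size grows linearly with sequence length and batch size, which often makes it one of the largest consumers of GPU memory during inference, alongside the model weights, and a key optimization target for serving systems.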
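The mechanism can be illustrated with a minimal single-head sketch in pure Python. It is not taken from any particular serving system: the names `KVCache`, `decode_step`, `decode_step_no_cache`, and `kv_cache_bytes` are hypothetical, the projection matrices are plain nested lists, and the size formula assumes one key tensor and one value tensor per layer with a fixed element width.

```python
import math


def project(x, w):
    """Multiply row vector x by matrix w (a list of rows)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Hypothetical per-layer, per-head cache: one key and one value per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x, wq, wk, wv, cache):
    """With a cache: project only the new token, then attend over the cache."""
    cache.append(project(x, wk), project(x, wv))
    return attend(project(x, wq), cache.keys, cache.values)


def decode_step_no_cache(xs, wq, wk, wv):
    """Baseline: re-project every token seen so far at each step."""
    keys = [project(x, wk) for x in xs]
    values = [project(x, wv) for x in xs]
    return attend(project(xs[-1], wq), keys, values)


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Assumed layout: K and V per layer, each [batch, kv_heads, seq_len, head_dim]."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)
```

Both decoding functions return the same attention output for the newest token; the cached version simply skips re-projecting earlier tokens. For the memory estimate, illustrative dimensions of 32 layers, 32 KV heads, head dimension 128, a 4096-token sequence, batch size 1, and 2 bytes per element give `kv_cache_bytes(32, 32, 128, 4096, 1) == 2 * 1024**3`, that is 2 GiB for a single sequence.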