What is Tensor Parallelism?
Splitting individual model layers across multiple GPUs to serve models that exceed single-GPU memory.
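Tensor parallelism shards each layer's weight matrices across GPUs, so every GPU computes its slice of the same layer at the same time and the partial results are combined with a collective operation such as all-gather or all-reduce. Combined with pipeline parallelism, which assigns different groups of layers to different GPUs, it enables serving models with hundreds of billions of parameters across multi-GPU nodes.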
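To make the idea concrete, here is a minimal sketch that simulates sharding one weight matrix on a single machine with NumPy. It is not how any particular serving framework implements tensor parallelism: the function names `column_parallel_matmul` and `row_parallel_matmul` and the `num_shards` parameter are hypothetical, each list element stands in for the work one GPU would do, and the concatenation and sum stand in for the all-gather and all-reduce collectives a real system would run between devices.

```python
import numpy as np


def column_parallel_matmul(x, w, num_shards):
    """Split w by columns; each simulated GPU produces a slice of the output."""
    w_shards = np.split(w, num_shards, axis=1)
    # Each product below would run on a separate GPU holding only its shard.
    partial_outputs = [x @ shard for shard in w_shards]
    # Concatenating the slices stands in for an all-gather across GPUs.
    return np.concatenate(partial_outputs, axis=-1)


def row_parallel_matmul(x, w, num_shards):
    """Split w by rows and x by columns; each simulated GPU produces a partial sum."""
    x_shards = np.split(x, num_shards, axis=-1)
    w_shards = np.split(w, num_shards, axis=0)
    partial_sums = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    # Summing the partial results stands in for an all-reduce across GPUs.
    return sum(partial_sums)


# Toy sizes chosen for illustration only (assumed, not from any real model).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of 4 activations, hidden size 8
w = rng.standard_normal((8, 16))   # one layer's weight matrix

reference = x @ w                  # what a single GPU would compute
col_out = column_parallel_matmul(x, w, num_shards=4)
row_out = row_parallel_matmul(x, w, num_shards=4)
```

Both sharded versions reproduce the single-device result up to floating-point rounding. In this sketch each simulated GPU holds only a quarter of `w`, which is the memory saving that lets a model too large for one GPU be served across several.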