What is Tensor Parallelism?

AI Infrastructure

Last updated: August 1, 2026

Splitting individual model layers across multiple GPUs to serve models that exceed single-GPU memory.

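Tensor parallelism shards each layer's weight matrices across GPUs. Every GPU holds a slice of every layer and computes its share of that layer's output at the same time as the others. The partial results are then combined with collective operations such as all-gather or all-reduce over a fast interconnect. Combined with pipeline parallelism, which assigns different layers to different GPUs, it enables serving models with hundreds of billions of parameters across multi-GPU nodes.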
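The sketch below simulates tensor parallelism for linear layers in plain Python. It is illustrative only, not how any particular serving framework implements it. List indices stand in for GPUs, and the hypothetical helpers `all_gather` and `all_reduce` stand in for the collectives a real system would issue between devices. It assumes every sharded dimension divides evenly by the device count. The sketch shows the two common sharding patterns: column-parallel, where each device owns a block of output features, and row-parallel, where each device owns a block of input features. It also shows how the two patterns are paired in an MLP block.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (m x k) @ (k x n) -> (m x n) matrix multiply."""
    return [[sum(row[i] * b[i][j] for i in range(len(b))) for j in range(len(b[0]))]
            for row in a]


def relu(m: Matrix) -> Matrix:
    """Element-wise activation, so it can be applied to each shard independently."""
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(w: Matrix, devices: int) -> List[Matrix]:
    """Column shards: device p owns a contiguous block of columns (assumes even split)."""
    step = len(w[0]) // devices
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(devices)]


def split_rows(w: Matrix, devices: int) -> List[Matrix]:
    """Row shards: device p owns a contiguous block of rows (assumes even split)."""
    step = len(w) // devices
    return [w[p * step:(p + 1) * step] for p in range(devices)]


def all_gather(parts: List[Matrix]) -> Matrix:
    """Hypothetical stand-in for all-gather: concatenate shards along the feature axis."""
    return [sum((p[r] for p in parts), []) for r in range(len(parts[0]))]


def all_reduce(parts: List[Matrix]) -> Matrix:
    """Hypothetical stand-in for all-reduce: element-wise sum of partial outputs."""
    return [[sum(p[r][c] for p in parts) for c in range(len(parts[0][0]))]
            for r in range(len(parts[0]))]


def column_parallel(x: Matrix, w: Matrix, devices: int) -> Matrix:
    # Each device reads the full input and produces its slice of the output.
    return all_gather([matmul(x, shard) for shard in split_columns(w, devices)])


def row_parallel(x: Matrix, w: Matrix, devices: int) -> Matrix:
    # Each device reads only its slice of the input features (x's columns)
    # and produces a partial sum of the full output.
    x_shards = split_columns(x, devices)
    return all_reduce([matmul(xs, ws) for xs, ws in zip(x_shards, split_rows(w, devices))])


def parallel_mlp(x: Matrix, w1: Matrix, w2: Matrix, devices: int) -> Matrix:
    # Column-shard w1 and row-shard w2 so each device keeps its hidden slice
    # local; the whole block needs only one all-reduce at the end.
    partials = [matmul(relu(matmul(x, s1)), s2)
                for s1, s2 in zip(split_columns(w1, devices), split_rows(w2, devices))]
    return all_reduce(partials)


# Usage: one token with 4 features, a 4 x 4 weight matrix, 2 simulated GPUs.
x = [[1.0, 2.0, 3.0, 4.0]]
w = [[float(4 * i + j) for j in range(4)] for i in range(4)]
print(matmul(x, w))                      # [[80.0, 90.0, 100.0, 110.0]]
print(column_parallel(x, w, devices=2))  # same values, computed in two shards
print(row_parallel(x, w, devices=2))     # same values, summed from two partials
```

Pairing a column-parallel layer with a row-parallel one is a common design choice (popularized by Megatron-LM). The activation is element-wise, so each device can apply it to its own hidden slice. Communication then drops to one all-reduce per block instead of a gather after every layer.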

Related Terms

GPU Cluster, Model Serving, Inference
