NVLink vs PCIe for AI GPU Servers — Bandwidth, Use Cases, and Platform Selection
When selecting an NVIDIA data center GPU server, one of the most consequential decisions is whether to use NVLink SXM GPUs (H100 SXM5, H200 SXM5, B200 SXM5) or PCIe GPUs (H100 PCIe, H200 PCIe, L40S). The difference is not just a form factor choice — NVLink and PCIe provide fundamentally different GPU-to-GPU communication bandwidth, which determines whether multi-GPU training jobs are communication-bound or compute-bound. Getting this decision wrong in either direction wastes either money (over-buying NVLink for workloads that don't need it) or training time (using PCIe when NVLink bandwidth is the bottleneck).
What NVLink Is
NVLink is NVIDIA's proprietary high-bandwidth chip-to-chip interconnect. In SXM GPU servers, the GPUs sit on a specialized baseboard and their NVLink ports connect through NVSwitch chips, forming a fabric that lets all 8 GPUs in a server communicate with each other at full bandwidth simultaneously. In the H100 SXM5 generation, NVLink 4.0 provides 900 GB/s bidirectional bandwidth per GPU. In B200 SXM5, NVLink 5.0 provides 1,800 GB/s per GPU. Through the NVSwitch fabric, this bandwidth is available between any pair of GPUs in the server without contention.
NVLink SXM GPUs are physically different from PCIe GPUs: they use a different module format (SXM, which mounts in a socket on the baseboard) and cannot be installed in standard PCIe slots. NVLink servers (DGX H100, Supermicro SYS-821GE-TNHR, Dell XE9680) are purpose-built platforms where the NVSwitch fabric is integrated into the server baseboard.
What PCIe GPU Connectivity Provides
PCIe (Peripheral Component Interconnect Express) is the standard server expansion bus. PCIe Gen4 x16 provides ~64 GB/s bidirectional bandwidth per slot; PCIe Gen5 x16 provides ~128 GB/s. H100 PCIe and L40S are standard PCIe add-in cards that install in any compatible PCIe server slot. Multi-GPU PCIe servers connect GPUs through the server's PCIe switch fabric — typically providing 50–100 GB/s peer-to-peer GPU communication bandwidth, compared to NVLink's 900 GB/s.
Bandwidth Comparison
- H100 SXM5 NVLink 4.0: 900 GB/s bidirectional GPU-to-GPU bandwidth per GPU
- H200 SXM5 NVLink 4.0: 900 GB/s bidirectional (same NVLink generation as H100 SXM5)
- B200 SXM5 NVLink 5.0: 1,800 GB/s bidirectional GPU-to-GPU bandwidth per GPU
- H100 PCIe Gen5: ~128 GB/s per slot, ~50–80 GB/s effective peer-to-peer GPU bandwidth via PCIe switch
- L40S PCIe Gen4: ~64 GB/s per slot
NVLink 4.0 provides roughly 7× the GPU-to-GPU bandwidth of a PCIe Gen5 x16 slot and roughly 14× that of a PCIe Gen4 x16 slot (900 GB/s vs ~128 GB/s and ~64 GB/s). This difference only matters for workloads that require frequent large data transfers between GPUs.
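The 7–14× figure can be reproduced from the nominal per-slot numbers listed above. The following is a minimal Python sketch of that arithmetic; the dictionary and function names are illustrative and not taken from any library, and the inputs are the nominal bidirectional figures quoted in this article rather than measured values.

```python
# Nominal bidirectional bandwidth figures quoted in this article, in GB/s.
NVLINK_GB_PER_S = {
    "NVLink 4.0 (H100/H200 SXM5)": 900,
    "NVLink 5.0 (B200 SXM5)": 1800,
}
PCIE_X16_GB_PER_S = {
    "PCIe Gen4 x16": 64,
    "PCIe Gen5 x16": 128,
}


def bandwidth_ratio(nvlink_gb_per_s: float, pcie_gb_per_s: float) -> float:
    """Return how many times more bandwidth NVLink offers than a PCIe x16 slot."""
    return nvlink_gb_per_s / pcie_gb_per_s


# Print every NVLink / PCIe pairing so the range of ratios is visible.
for nv_name, nv_bw in NVLINK_GB_PER_S.items():
    for pcie_name, pcie_bw in PCIE_X16_GB_PER_S.items():
        ratio = bandwidth_ratio(nv_bw, pcie_bw)
        print(f"{nv_name} vs {pcie_name}: {ratio:.1f}x")
```

Because the effective peer-to-peer rate through a PCIe switch is lower than the nominal slot rate, the gap observed in practice is typically wider than these ratios suggest.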
When NVLink SXM Matters: Distributed Training
In distributed LLM training using tensor parallelism or pipeline parallelism across multiple GPUs, the training framework (PyTorch, Megatron-LM, DeepSpeed) continuously exchanges gradients and activations between GPUs during forward and backward passes. The communication operations, specifically all-reduce across all GPUs, require every GPU to send and receive data at the same time, so all GPU-to-GPU links are loaded simultaneously.
For large model training (70B+ parameters), the all-reduce communication volume per training step can reach hundreds of gigabytes. With NVLink at 900 GB/s per GPU, 8 GPUs can exchange 7.2 TB/s of aggregate data, fast enough that individual collectives finish quickly and can largely overlap with compute instead of idling the compute units. With PCIe at 64–128 GB/s per GPU, the same all-reduce takes 7–14× longer, meaning the GPU compute units sit idle waiting for communication to complete. This is a communication bottleneck, and it directly increases total training time.
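To get a feel for the scale of the difference, the sketch below estimates the duration of a single all-reduce using the standard ring all-reduce cost model, in which each GPU sends and receives 2(N-1)/N times the message size. This is a bandwidth-only estimate under stated assumptions: the usable one-way bandwidth is taken as half the bidirectional figure, latency and protocol overhead are ignored, and NCCL may choose a different algorithm in practice. The function name and the 10 GB message size are hypothetical.

```python
def ring_allreduce_seconds(message_gb: float, num_gpus: int,
                           bidir_gb_per_s: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce, in seconds.

    Each GPU sends (and receives) 2*(N-1)/N times the message size.
    Assumes one-way bandwidth is half the bidirectional figure and
    ignores latency, protocol overhead, and overlap with compute.
    """
    if num_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    one_way_gb_per_s = bidir_gb_per_s / 2
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * message_gb
    return traffic_gb / one_way_gb_per_s


# Hypothetical example: a 10 GB buffer reduced across 8 GPUs.
links = [("NVLink 4.0", 900), ("PCIe Gen5 x16", 128), ("PCIe Gen4 x16", 64)]
for name, bandwidth in links:
    milliseconds = ring_allreduce_seconds(10, 8, bandwidth) * 1000
    print(f"{name}: {milliseconds:.1f} ms")
```

The absolute numbers depend entirely on the assumed message size, but the ratio between the NVLink and PCIe estimates is independent of it, which is why the bandwidth gap translates directly into longer communication phases.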
Specific workloads where NVLink SXM is required or strongly recommended:
- LLM pre-training with tensor parallelism across all 8 GPUs (Llama 70B, 405B, GPT-4 class models)
- Training with very large per-step batch sizes that require frequent large all-reduce operations
- Pipeline parallelism across GPUs where activation tensors are large
- Any training job where NCCL all-reduce communication occupies more than 10–15% of total step time on PCIe
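The last criterion can be checked against profiler output. The sketch below assumes you already have two per-step timings from a profiler of your choice: total step time and the time spent waiting on communication that is not hidden behind compute. Both function names are hypothetical, the readings are made-up illustration values, and the default threshold of 0.10 is simply the lower end of the 10–15% range above.

```python
def exposed_comm_share(step_ms: float, exposed_comm_ms: float) -> float:
    """Fraction of one training step spent waiting on communication."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    if not 0 <= exposed_comm_ms <= step_ms:
        raise ValueError("exposed_comm_ms must be between 0 and step_ms")
    return exposed_comm_ms / step_ms


def nvlink_likely_helps(step_ms: float, exposed_comm_ms: float,
                        threshold: float = 0.10) -> bool:
    """Apply the 10-15% rule of thumb; 0.10 is the conservative lower end."""
    return exposed_comm_share(step_ms, exposed_comm_ms) > threshold


# Hypothetical profiler readings, in milliseconds per step.
print(nvlink_likely_helps(step_ms=400.0, exposed_comm_ms=90.0))
print(nvlink_likely_helps(step_ms=400.0, exposed_comm_ms=20.0))
```

Treating only non-overlapped communication as the numerator is an interpretation of the rule of thumb; communication that is fully hidden behind compute does not lengthen the step.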
When PCIe GPUs Are Sufficient: Inference and Smaller Training
For AI inference — serving a trained model to users — the GPU serves individual requests independently. Each request's forward pass happens entirely within one GPU (for models that fit in one GPU) or across a small number of GPUs for tensor-parallel inference. The inter-GPU communication volume per inference request is small relative to the compute per request. PCIe bandwidth is almost always sufficient for inference:
- Inference of models that fit in a single GPU (for example, 7B–13B at FP16 on a single 48 GB L40S, or up to about 34B at FP16 on an 80 GB H100 PCIe) requires no GPU-to-GPU communication at all
- Inference of 70B models across 2 GPUs (tensor parallel degree 2) requires moderate GPU-to-GPU communication; PCIe bandwidth is typically sufficient and throughput is dominated by compute time
- Inference serving with high request concurrency is typically memory-bandwidth-bound (time to load model weights from HBM) rather than inter-GPU communication-bound
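Whether a model fits in a single GPU comes down mostly to the size of its weights. A rough sketch of that estimate follows, assuming weights only (parameters times bytes per parameter) and reserving a fraction of VRAM for KV cache and activations. The function names and the 0.9 headroom factor are illustrative assumptions, not measured values.

```python
# Bytes per parameter for common weight formats (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits_on(params_billion: float, precision: str, vram_gb: float,
            headroom: float = 0.9) -> bool:
    """True if the weights fit within `headroom` of the available VRAM.

    The remaining VRAM is left for KV cache and activations; the 0.9
    default is an illustrative assumption, not a measured value.
    """
    return weight_memory_gb(params_billion, precision) <= vram_gb * headroom


# Weight footprint at FP16 for a few common model sizes.
for size in (7, 13, 34, 70):
    print(f"{size}B at fp16: {weight_memory_gb(size, 'fp16'):.0f} GB")
```

For example, `fits_on(13, "fp16", 48)` checks a 13B FP16 model against a 48 GB L40S, and `fits_on(70, "int4", 48)` checks a 4-bit 70B model against the same card. Real deployments need additional memory that grows with context length and batch size, so treat a narrow fit as a warning rather than a guarantee.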
Other workloads where PCIe GPUs are appropriate:
- Fine-tuning 7B–13B models with LoRA/QLoRA where the active parameter count is small and gradient communication volume is low
- Data-parallel training of smaller models where each GPU trains on an independent data shard and gradients are averaged once per optimizer step (lower communication frequency than tensor parallelism)
- Embedding generation and vector indexing pipelines where each GPU processes independent batches without inter-GPU coordination
- Model evaluation and benchmarking where training throughput is not the goal
Cost Implications
NVLink SXM server platforms (DGX H100, Supermicro SYS-821GE-TNHR, Dell XE9680) are significantly more expensive than equivalent-GPU-count PCIe platforms because they require a custom NVSwitch baseboard, specialized power delivery, and more complex cooling infrastructure:
- An 8× H100 SXM5 server (e.g., Supermicro SYS-821GE-TNHR) costs substantially more than an 8× H100 PCIe server (e.g., Supermicro SYS-420GP-TNR) for the same GPU count
- A 4× L40S PCIe server (e.g., Supermicro SYS-221GE), or a smaller SYS-111E with 2× L40S, costs a fraction of an H100 SXM server and is appropriate for inference serving
- For inference workloads, over-purchasing NVLink SXM capacity provides no performance benefit and wastes capital
Decision Framework
Use the following criteria to select between NVLink SXM and PCIe:
- Training models 70B+ parameters with tensor/pipeline parallelism across all GPUs → NVLink SXM required
- Training models 7B–34B with data parallelism across multiple servers → PCIe per server, InfiniBand between servers
- Inference of models that fit in a single GPU (up to 48 GB on L40S, 80 GB on H100 PCIe) → PCIe (L40S, H100 PCIe) sufficient
- Inference of 70B models (requires 35–80 GB VRAM depending on quantization) → PCIe (dual L40S 96 GB, H200 PCIe 141 GB) sufficient
- Inference of 70B models at maximum throughput for high concurrency → H100 SXM or H200 SXM for better memory bandwidth
- Local AI workstation for small team → RTX Ada (PCIe) or NVIDIA DGX Spark; NVLink SXM unnecessary
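The criteria above can be written down as a small lookup function, which makes the gaps between them explicit. This is a sketch of the list as stated, not a sizing tool: the function name, parameter names, and return strings are hypothetical, and any combination the list does not cover returns a "not covered" answer rather than a guess.

```python
def recommend_platform(workload: str, params_billion: float,
                       parallelism: str = "data",
                       max_throughput: bool = False) -> str:
    """Map a workload description to the platform suggested by the list above.

    workload: "training" or "inference"
    parallelism: "tensor", "pipeline", or "data" (used for training only)
    max_throughput: inference at maximum throughput under high concurrency
    """
    if workload == "training":
        if params_billion >= 70 and parallelism in ("tensor", "pipeline"):
            return "NVLink SXM"
        if params_billion <= 34 and parallelism == "data":
            return "PCIe per server, InfiniBand between servers"
        return "not covered: profile communication share first"
    if workload == "inference":
        if params_billion >= 70 and max_throughput:
            return "H100 SXM or H200 SXM"
        if params_billion <= 70:
            return "PCIe"
        return "not covered: profile communication share first"
    raise ValueError(f"unknown workload: {workload!r}")


print(recommend_platform("training", 70, parallelism="tensor"))
print(recommend_platform("inference", 13))
```

Returning an explicit "not covered" value for cases such as 34B–70B training keeps the function honest about where the list gives no guidance; those cases are better decided by measuring communication share on real hardware.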
InfiniBand: GPU-to-GPU Across Servers
NVLink connects GPUs within a single server. When training spans multiple servers (multi-node distributed training), GPUs communicate between servers via InfiniBand (or RoCEv2 Ethernet). NVIDIA ConnectX-7 provides 400G NDR InfiniBand per adapter for H100/H200 clusters; ConnectX-8 provides 800G per adapter for B200 clusters. Per-link InfiniBand bandwidth across servers (400 Gbps = ~50 GB/s) is much lower than intra-server NVLink bandwidth (900 GB/s per GPU), which is why training frameworks minimize inter-server communication and maximize intra-server NVLink communication through careful model partitioning.
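The unit conversion behind the 400 Gbps = ~50 GB/s figure is worth making explicit, because network links are quoted in gigabits per second while NVLink is quoted in gigabytes per second. A minimal sketch follows, using raw line rates and ignoring encoding and protocol overhead; the function name is illustrative.

```python
def gbps_to_gb_per_s(gigabits_per_s: float) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gigabits_per_s / 8


# (link name, link rate in Gbps, NVLink GB/s per GPU in the same generation)
pairings = [
    ("400G NDR (ConnectX-7)", 400, 900),
    ("800G (ConnectX-8)", 800, 1800),
]
for name, gbps, nvlink_gb_per_s in pairings:
    link_gb_per_s = gbps_to_gb_per_s(gbps)
    ratio = nvlink_gb_per_s / link_gb_per_s
    print(f"{name}: ~{link_gb_per_s:.0f} GB/s per link, NVLink is {ratio:.0f}x higher")
```

Note that this compares a bidirectional NVLink figure with a per-direction network rate, as the text above does; on a like-for-like one-way basis the gap would be about half as large, which is still close to an order of magnitude.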
Frequently Asked Questions
What is the difference between H100 SXM and H100 PCIe?
H100 SXM5 uses the NVLink 4.0 interconnect providing 900 GB/s bidirectional GPU-to-GPU bandwidth; H100 PCIe uses standard PCIe Gen5 providing ~128 GB/s per slot. Both use the same GH100 die, but they are not identical in compute: H100 SXM5 has more SMs enabled and a higher power limit, so its peak FP8 throughput is higher. H100 SXM5 also has higher memory bandwidth (3.35 TB/s vs 2 TB/s on PCIe) because it uses HBM3 rather than HBM2e, and a higher TDP (700W vs 350W for PCIe), and it requires specialized SXM server platforms. For training large models with tensor parallelism, SXM5 is required. For inference, PCIe is sufficient and less expensive.
Can I use H100 PCIe for LLM training?
Yes, with important caveats. H100 PCIe is suitable for data-parallel training of small models (7B–13B) where each GPU processes independent data and communication is infrequent. For tensor-parallel or pipeline-parallel training of 70B+ models where all 8 GPUs must exchange large tensors every training step, PCIe bandwidth becomes a significant bottleneck and training time grows with the share of each step spent waiting on communication. The practical threshold: if NCCL communication occupies more than 10–15% of your step time on PCIe hardware, NVLink SXM will materially reduce total training duration.
Is NVIDIA L40S good for AI inference?
NVIDIA L40S (48 GB GDDR6, Ada Lovelace, PCIe) is an excellent inference GPU for models up to 70B when quantized. It fits in standard PCIe server slots, requires no liquid cooling, and costs significantly less than H100 per GPU. For serving 7B–34B models to teams of 10–200 concurrent users, dual L40S in a 1U server (96 GB combined) provides strong throughput. For maximum inference throughput of 70B models at scale, H100 SXM or H200 SXM provides better memory bandwidth.
