NVLink vs PCIe for AI GPU Servers — Bandwidth, Use Cases, and Platform Selection

When selecting an NVIDIA data center GPU server, one of the most consequential decisions is whether to use NVLink SXM GPUs (H100 SXM5, H200 SXM5, B200 SXM5) or PCIe GPUs (H100 PCIe, H200 PCIe, L40S). The difference is not just a form factor choice — NVLink and PCIe provide fundamentally different GPU-to-GPU communication bandwidth, which determines whether multi-GPU training jobs are communication-bound or compute-bound. Getting this decision wrong in either direction wastes either money (over-buying NVLink for workloads that don't need it) or training time (using PCIe when NVLink bandwidth is the bottleneck).

What NVLink Is

NVLink is NVIDIA's proprietary high-bandwidth chip-to-chip interconnect. In SXM GPU servers, the GPUs mount on a specialized baseboard and connect to each other through NVSwitch chips. Together these form a switched fabric that lets all 8 GPUs in a server communicate with each other at full NVLink bandwidth simultaneously. In the H100 SXM5 generation, NVLink 4.0 provides 900 GB/s of bidirectional bandwidth per GPU (450 GB/s in each direction). In B200 SXM5, NVLink 5.0 provides 1,800 GB/s per GPU. Each GPU can use its full NVLink bandwidth to reach any other GPU in the server, rather than being limited to a fixed subset of direct links.

NVLink SXM GPUs are physically different from PCIe GPUs: they are modules that mount in SXM sockets on the baseboard and cannot be installed in standard PCIe slots. NVLink servers (DGX H100, Supermicro SYS-821GE-TNHR, Dell XE9680) are purpose-built platforms where the NVSwitch fabric is integrated into the server baseboard.

What PCIe GPU Connectivity Provides

PCIe (Peripheral Component Interconnect Express) is the standard server expansion bus. PCIe Gen4 x16 provides ~64 GB/s bidirectional bandwidth per slot (~32 GB/s in each direction); PCIe Gen5 x16 provides ~128 GB/s (~64 GB/s in each direction). H100 PCIe and L40S are standard PCIe add-in cards that install in any compatible PCIe server slot. Multi-GPU PCIe servers connect GPUs through the server's PCIe switch fabric, which typically delivers 50–100 GB/s of peer-to-peer GPU communication bandwidth, compared with 900 GB/s for NVLink on H100 SXM5.

Bandwidth Comparison

Measured against the per-slot figures above, H100-generation NVLink (900 GB/s) provides roughly 7× the GPU-to-GPU bandwidth of PCIe Gen5 x16 and roughly 14× that of PCIe Gen4 x16. This difference only matters for workloads that require frequent large data transfers between GPUs.
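The ratios can be checked directly from the bandwidth figures quoted in this article. The snippet below is a small illustrative calculation; the dictionary and function names are hypothetical and chosen only for this example.

```python
# Per-GPU bidirectional bandwidth figures quoted in this article, in GB/s.
NVLINK_GB_S = {"NVLink 4.0 (H100 SXM5)": 900, "NVLink 5.0 (B200 SXM5)": 1800}
PCIE_GB_S = {"PCIe Gen4 x16": 64, "PCIe Gen5 x16": 128}


def bandwidth_ratio(nvlink_gb_s, pcie_gb_s):
    """Return how many times more bandwidth NVLink offers than PCIe."""
    return nvlink_gb_s / pcie_gb_s


for nv_name, nv_bw in NVLINK_GB_S.items():
    for pcie_name, pcie_bw in PCIE_GB_S.items():
        ratio = bandwidth_ratio(nv_bw, pcie_bw)
        print(f"{nv_name} vs {pcie_name}: {ratio:.1f}x")
```

These are peak link figures, so the ratios describe an upper bound on the gap rather than measured application speedups.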

When NVLink SXM Matters: Distributed Training

In distributed LLM training using tensor parallelism or pipeline parallelism across multiple GPUs, the training framework (PyTorch, Megatron-LM, DeepSpeed) continuously exchanges activations and gradients between GPUs during the forward and backward passes. The dominant collective operation, all-reduce, requires every GPU to contribute its data and receive the combined result, so all GPUs are sending and receiving at the same time.

For large model training (70B+ parameters), the all-reduce communication volume per training step can reach hundreds of gigabytes. With NVLink at 900 GB/s per GPU, 8 GPUs have 7.2 TB/s of aggregate bandwidth, so even that volume moves in on the order of tens of milliseconds and can largely be overlapped with compute instead of idling the compute units. With PCIe at 64–128 GB/s per GPU, the same all-reduce takes 7–14× longer, and the GPU compute units sit idle waiting for communication to complete. This is a communication bottleneck, and it directly increases total training time.
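A rough way to see the effect is a bandwidth-only estimate of all-reduce time. The sketch below assumes a ring all-reduce, in which each GPU sends 2 × (N − 1) / N times the payload, and it ignores latency, protocol overhead, and compute overlap. The 10 GB payload is a hypothetical value, and the per-direction link speeds are taken as half of the bidirectional figures quoted in this article.

```python
def ring_allreduce_seconds(payload_bytes, num_gpus, link_bytes_per_s):
    """Bandwidth-only estimate of ring all-reduce time.

    In a ring all-reduce each GPU sends (and receives)
    2 * (N - 1) / N times the payload size.
    """
    sent_per_gpu = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return sent_per_gpu / link_bytes_per_s


PAYLOAD_BYTES = 10e9  # hypothetical 10 GB all-reduce payload
NUM_GPUS = 8

# Per-direction bandwidth in bytes/s (assumed: half the bidirectional figure).
LINKS = {
    "NVLink 4.0": 450e9,
    "PCIe Gen5 x16": 64e9,
    "PCIe Gen4 x16": 32e9,
}

for name, bandwidth in LINKS.items():
    seconds = ring_allreduce_seconds(PAYLOAD_BYTES, NUM_GPUS, bandwidth)
    print(f"{name}: {seconds * 1e3:.0f} ms")
```

Because the estimate is purely bandwidth-bound, the time ratio between two links equals the inverse of their bandwidth ratio. Real NCCL runs will differ, since the library chooses among several algorithms and overlaps communication with compute where it can.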

Specific workloads where NVLink SXM is required or strongly recommended:

When PCIe GPUs Are Sufficient: Inference and Smaller Training

For AI inference (serving a trained model to users), the GPU serves individual requests independently. Each request's forward pass happens entirely within one GPU (for models that fit in one GPU) or across a small number of GPUs for tensor-parallel inference. The inter-GPU communication volume per inference request is small relative to the compute per request, so PCIe bandwidth is almost always sufficient for inference.

Other workloads where PCIe GPUs are appropriate:

Cost Implications

NVLink SXM server platforms (DGX H100, Supermicro SYS-821GE-TNHR, Dell XE9680) are significantly more expensive than equivalent-GPU-count PCIe platforms because they require a custom NVSwitch baseboard, specialized power delivery, and more complex cooling infrastructure.

Decision Framework

Use the following criteria to select between NVLink SXM and PCIe:
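As a minimal sketch, the selection rules stated elsewhere in this article (inference favors PCIe, tensor-parallel or pipeline-parallel training of 70B+ models favors NVLink SXM, data-parallel training of 7B–13B models works on PCIe, and an NCCL share above 15% of step time favors NVLink SXM) could be encoded as follows. The function name, argument names, and return strings are hypothetical, and the cutoffs are simply the article's figures, not vendor guidance.

```python
def recommend_interconnect(workload, params_billion=None,
                           parallelism="data", comm_fraction=None):
    """Suggest a platform using only the rules stated in this article.

    workload: "inference" or "training"
    parallelism: "data", "tensor", or "pipeline"
    comm_fraction: measured share of step time spent in NCCL (0.0 to 1.0)
    """
    if workload == "inference":
        return "PCIe"
    # A measured communication share above 15% is the article's threshold.
    if comm_fraction is not None and comm_fraction > 0.15:
        return "NVLink SXM"
    if parallelism in ("tensor", "pipeline"):
        if params_billion is not None and params_billion >= 70:
            return "NVLink SXM"
    if parallelism == "data":
        if params_billion is not None and params_billion <= 13:
            return "PCIe"
    # Cases the article does not cover: measure before deciding.
    return "profile NCCL share first"


print(recommend_interconnect("inference"))
print(recommend_interconnect("training", 70, "tensor"))
print(recommend_interconnect("training", 7, "data"))
print(recommend_interconnect("training", 30, "tensor"))
```

Workloads that fall between the stated cases are deliberately left undecided, since the article gives a measurement-based threshold rather than a rule for them.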

InfiniBand: GPU-to-GPU Across Servers

NVLink connects GPUs within a single server. When training spans multiple servers (multi-node distributed training), GPUs communicate between servers via InfiniBand (or RoCEv2 Ethernet). NVIDIA ConnectX-7 provides 400G NDR InfiniBand per adapter for H100/H200 clusters; ConnectX-8 provides 800G for B200 clusters. InfiniBand bandwidth across servers (400 Gbps = ~50 GB/s in each direction per adapter) is much lower than intra-server NVLink bandwidth (900 GB/s bidirectional per GPU), which is why training frameworks minimize inter-server communication and maximize intra-server NVLink communication through careful model partitioning.
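The unit conversion behind that comparison is worth making explicit, because network links are quoted in gigabits per second while NVLink is quoted in gigabytes per second. The snippet below is an illustrative calculation; it compares per-direction figures on both sides (450 GB/s for NVLink 4.0, taken as half of the 900 GB/s bidirectional figure) and ignores encoding and protocol overhead. The names are hypothetical.

```python
def gbps_to_gb_per_s(gigabits_per_s):
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return gigabits_per_s / 8


NVLINK4_PER_DIRECTION_GB_S = 450  # assumed: half of 900 GB/s bidirectional

for name, gbps in {"400G NDR InfiniBand": 400, "800G InfiniBand": 800}.items():
    gb_per_s = gbps_to_gb_per_s(gbps)
    gap = NVLINK4_PER_DIRECTION_GB_S / gb_per_s
    print(f"{name}: {gb_per_s:.0f} GB/s per direction, "
          f"NVLink 4.0 is {gap:.1f}x faster")
```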

Frequently Asked Questions

What is the difference between H100 SXM and H100 PCIe?

H100 SXM5 uses the NVLink 4.0 interconnect providing 900 GB/s bidirectional GPU-to-GPU bandwidth; H100 PCIe uses standard PCIe Gen5 providing ~128 GB/s per slot. Both are built on the GH100 die, but they are not identical in compute: H100 SXM5 enables more SMs (132 vs 114) and runs at a higher power limit, so its FP8 throughput is higher. H100 SXM5 also has higher memory bandwidth (3.35 TB/s of HBM3 vs 2 TB/s of HBM2e on PCIe), a higher TDP (700W vs 350W for PCIe), and requires specialized SXM server platforms. For training large models with tensor parallelism, SXM5 is strongly recommended. For inference, PCIe is sufficient and less expensive.

Can I use H100 PCIe for LLM training?

Yes, with important caveats. H100 PCIe is suitable for data-parallel training of small models (7B–13B) where each GPU processes independent data and communication is infrequent. For tensor-parallel or pipeline-parallel training of 70B+ models where all 8 GPUs must exchange large tensors every training step, PCIe bandwidth becomes a significant bottleneck, and training time grows with the share of each step spent waiting on communication. The practical threshold: if NCCL communication occupies more than 15% of your step time on PCIe hardware, NVLink SXM will materially reduce total training duration.
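The 15% rule can be applied to timings taken from any profiler. The sketch below assumes you already have the NCCL time and total step time in milliseconds; the function names and the sample timings are hypothetical, and only the non-overlapped communication time should be counted.

```python
COMM_THRESHOLD = 0.15  # the 15% threshold stated above


def comm_fraction(comm_ms, step_ms):
    """Share of one training step spent in NCCL communication."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return comm_ms / step_ms


def nvlink_likely_helps(comm_ms, step_ms, threshold=COMM_THRESHOLD):
    """True when communication exceeds the threshold share of step time."""
    return comm_fraction(comm_ms, step_ms) > threshold


# Hypothetical profiler readings: (NCCL ms, total step ms).
for comm_ms, step_ms in [(120, 600), (30, 600)]:
    share = comm_fraction(comm_ms, step_ms)
    verdict = nvlink_likely_helps(comm_ms, step_ms)
    print(f"comm share {share:.0%}: NVLink likely helps = {verdict}")
```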

Is NVIDIA L40S good for AI inference?

NVIDIA L40S (48 GB GDDR6, Ada Lovelace, PCIe) is an excellent inference GPU for models up to 70B with quantization. It fits in standard PCIe server slots, requires no liquid cooling, and costs significantly less than H100 per GPU. For serving 7B–34B models to teams of 10–200 concurrent users, dual L40S in a 1U server (96 GB combined) provides strong throughput. For maximum inference throughput of 70B models at scale, H100 SXM or H200 SXM provide better memory bandwidth.
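Whether a model fits is mostly a matter of weight size: parameters times bits per parameter, divided by 8. The sketch below is a rough sizing aid only. It counts weights and nothing else, so KV cache, activations, and runtime overhead must come out of whatever memory remains; the `reserve_gb` parameter is a hypothetical placeholder for that headroom.

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def fits(params_billion, bits_per_param, gpu_memory_gb, reserve_gb=0.0):
    """True if the weights plus a reserved headroom fit in GPU memory."""
    needed = weight_gb(params_billion, bits_per_param) + reserve_gb
    return needed <= gpu_memory_gb


L40S_GB = 48  # per-GPU memory stated above

for bits in (16, 8, 4):
    size = weight_gb(70, bits)
    print(f"70B at {bits}-bit: {size:.0f} GB of weights, "
          f"fits one L40S: {fits(70, bits, L40S_GB)}, "
          f"fits two: {fits(70, bits, 2 * L40S_GB)}")
```

By this estimate a 70B model needs 4-bit quantization to fit on a single 48 GB card and 8-bit to fit on a pair, which is consistent with the "up to 70B" guidance only when quantization is used.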

© 2026 Haink. All rights reserved. Hong Kong · Dubai · Beijing · Delaware (USA)