GPU Server Buying Guide 2026 — How to Choose the Right GPU Server for AI

Choosing the right GPU server is a significant capital decision with multi-year implications. The wrong choice (a GPU too small for the model, insufficient memory for inference, or a liquid-cooled system in a facility without direct liquid cooling (DLC) infrastructure) is expensive to correct. This guide provides a structured decision framework for selecting GPU server hardware based on the actual workload, facility, and budget.

Step 1: Define the Primary Workload

GPU server design priorities differ fundamentally between training and inference. Answer this before comparing hardware:

AI Training (including fine-tuning)

Training requires maximum compute throughput (FLOPS) and GPU-to-GPU communication bandwidth. Prioritize SXM form factor GPUs (NVLink), multi-GPU servers (8 GPUs per node), and InfiniBand networking for multi-node clusters. For most fine-tuning use cases, the binding constraints are FLOPS and inter-GPU bandwidth rather than memory capacity.

AI Inference (production model serving)

Inference requires GPU memory capacity to hold the model and KV cache, and throughput (tokens/second) at the target concurrency. The binding constraint is memory — a model that doesn't fit in GPU memory cannot run. Prioritize memory capacity per GPU and memory bandwidth. PCIe form factor GPUs are acceptable for inference; SXM is not required.
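
As a rough illustration of why the KV cache belongs in the inference memory budget, the sketch below uses the common estimate of one key vector and one value vector per layer per token. The function name and the example model shape (32 layers, 32 KV heads, head dimension 128) are assumptions chosen for illustration, not figures from this guide.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV-cache size in bytes (FP16/BF16 values by default)."""
    # Factor of 2: one key and one value vector per layer, per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape; 8 concurrent requests at 4,096 tokens each.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # 16.0 GiB
```

Under these assumptions the cache alone takes more memory than the 14 GB of FP16 weights for a 7B model, so target concurrency and context length should be decided before sizing the GPU.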

Mixed Training + Inference

If the same hardware will be used for both training and inference at different times (common in enterprise deployments), 8× H100 SXM5 or 8× H200 SXM5 is the most versatile configuration: NVLink and InfiniBand provide training performance, and 80 GB (H100) or 141 GB (H200) per GPU provides adequate inference memory for 70B models, sharded across two or more GPUs on H100.

Step 2: Determine Model Size

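Model weights in GPU memory at FP16 / BF16 precision take about 2 bytes per parameter: 7B model ≈ 14 GB; 13B ≈ 26 GB; 34B ≈ 68 GB; 70B ≈ 140 GB; 180B ≈ 360 GB; 405B ≈ 810 GB. For inference, add 20–40% overhead for KV cache and activation storage. Training needs considerably more than that, because gradients and optimizer states are held alongside the weights; this is why the fine-tuning column in the table below calls for multiple GPUs even at 7B. Quantized inference (INT4/GPTQ) reduces weight memory by approximately 4× relative to FP16: 70B INT4 ≈ 35 GB.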

Model Size	FP16 Memory (weights only)	Minimum GPU (inference)	Minimum GPU (fine-tune)
7B	14 GB	1× L40S (48 GB)	2× H100 SXM5
34B	68 GB	1× H100 SXM5 (80 GB)	4× H100 SXM5
70B	140 GB	1× H200 (141 GB) or 2× H100	8× H100 SXM5
405B	810 GB	6× H200 or 3× B300	32+ H100 SXM5
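
The arithmetic behind these figures can be sketched in a few lines. This is a minimal estimate, assuming weights dominate and that memory pools evenly across GPUs; the function names and the `overhead` parameter are illustrative, and real deployments often round GPU counts up to a power of two for tensor parallelism.

```python
import math

# Bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion, precision="fp16"):
    """Weight memory in GB: billions of parameters x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion, gpu_mem_gb, precision="fp16", overhead=0.0):
    """Smallest GPU count whose combined memory holds weights plus overhead."""
    required_gb = weights_gb(params_billion, precision) * (1.0 + overhead)
    return math.ceil(required_gb / gpu_mem_gb)


print(weights_gb(70))                   # 140.0
print(weights_gb(70, "int4"))           # 35.0
print(min_gpus(70, 141))                # 1  (weights only, 141 GB GPU)
print(min_gpus(70, 141, overhead=0.3))  # 2  (with 30% inference overhead)
print(min_gpus(405, 141))               # 6
```

The inference minimums in the table match the weights-only results; once the 20–40% overhead is included, a 70B FP16 model no longer fits on a single 141 GB GPU, so treat the table as a floor rather than a production recommendation.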

Step 3: Choose the GPU Generation

NVIDIA H100 SXM5 — Buy When:

You need hardware quickly (4–10 week lead time vs 12–20 weeks for B200)
Training 7B–70B models where H100 FLOPS are sufficient for your timeline
Existing data center infrastructure supports air cooling only
Budget is the primary constraint — H100 systems are lower cost than B200
Your training code is already optimized for H100/Hopper architecture

NVIDIA H200 SXM5 — Buy When:

Primary use case is inference of 70B+ models (H200's 141 GB eliminates the 2-GPU requirement)
Training runs are memory-bandwidth-bound or long-context (128K+ token sequences)
You want a drop-in upgrade over H100 without changing server infrastructure

NVIDIA B200 SXM5 — Buy When:

Building a new training cluster where training speed is the priority
Data center has or can install direct liquid cooling infrastructure
The 2.3× FLOPS improvement over H100 meaningfully impacts your training timeline
Budget allows the 40–60% premium over equivalent H100 configurations
You want the most forward-looking hardware for a 4+ year asset lifecycle

NVIDIA B300 SXM — Buy When:

Pre-training frontier models at 1T+ parameter scale
Serving 405B+ models in production at the highest concurrency
Memory per GPU (not FLOPS) is the limiting constraint
DLC infrastructure is already in place

NVIDIA L40S PCIe — Buy When:

Inference-only workload (no training requirement)
Budget is the primary concern and liquid cooling is not available
Serving 7B–70B models (at INT4 for 70B) for small-to-medium production APIs
Standard rackmount PCIe server is sufficient (no SXM baseboard required)
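
The criteria in Step 3 can be condensed into a coarse first-pass filter. The sketch below is one simplified reading of those lists, not a rule from any vendor: the function name, its parameters, and the exact thresholds are assumptions, and it deliberately ignores budget, lead time, and concurrency.

```python
def suggest_gpu(workload, params_billion, has_dlc, speed_priority=False):
    """Coarse first-pass GPU pick from workload, model size, and cooling.

    workload: 'training', 'inference', or 'mixed'.
    """
    if workload not in ("training", "inference", "mixed"):
        raise ValueError(f"unknown workload: {workload!r}")

    if workload == "inference":
        if params_billion >= 405 and has_dlc:
            return "B300"   # largest models, liquid cooling already in place
        if params_billion >= 70:
            return "H200"   # 141 GB per GPU suits 70B+ serving
        return "L40S"       # smaller models on air-cooled PCIe servers

    # Training or mixed use: Blackwell options assume liquid cooling.
    if has_dlc and params_billion >= 1000:
        return "B300"
    if has_dlc and speed_priority:
        return "B200"
    return "H100"


print(suggest_gpu("inference", 13, has_dlc=False))                      # L40S
print(suggest_gpu("training", 70, has_dlc=True, speed_priority=True))   # B200
print(suggest_gpu("mixed", 70, has_dlc=False))                          # H100
```

A filter like this is only useful for narrowing the shortlist; the final choice still depends on the budget, lead-time, and facility checks in the later steps.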

Step 4: Choose the Server Platform

For SXM GPU servers (H100/H200/B200/B300), the server platform (the chassis, baseboard, CPU, memory, and NVLink interconnect) is as important as the GPU. Validated platforms available through Haink:

NVIDIA DGX H100 / H200 / B200: Factory-integrated, NVIDIA-validated, includes 2 TB DDR5 system RAM, 30 TB NVMe storage, 8× ConnectX-7/8 InfiniBand NICs. The reference platform — premium cost, zero integration risk.
Supermicro SYS-821GE-TNHR (H100/H200 SXM5): Most commonly deployed 8-GPU SXM server platform in the market. Validated with all major InfiniBand configurations. Lower cost than DGX with equivalent GPU performance.
Supermicro ARS-821GL-NHR (B200/B300 SXM): Primary B200/B300 platform from Supermicro. Supports DLC cold plates.
Dell PowerEdge XE9680 (H100/H200 SXM5): Enterprise-grade platform with Dell ProSupport coverage. Preferred by enterprises requiring single-vendor hardware support.
HPE Cray XD670 (H100/H200 SXM5): HPE's enterprise AI server with HPE Pointnext support. Suitable for organizations already in the HPE ecosystem.
Lenovo ThinkSystem SR680a V3 (H100 SXM5): Strong option for deployments in Mainland China where Lenovo's local support and supply chain are advantageous.

Step 5: Size the InfiniBand Network

Size the fabric by node count:

Single 8-GPU node: no InfiniBand needed between servers
2–4 nodes: 1 InfiniBand leaf switch (NVIDIA QM9700, 64-port NDR 400G) handles the full cluster with room to grow
8–16 nodes: 2 leaf switches + 1 spine switch in a non-blocking fat-tree
32+ nodes: full 3-tier fat-tree design with spine/core layers

If RoCEv2 is chosen instead of InfiniBand: Arista 7800R3 or NVIDIA Spectrum-4 400GbE switches configured for lossless PFC.
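
A quick port budget helps sanity-check the switch count before requesting a network design. The sketch below assumes one InfiniBand NIC per GPU (8 per node, as in the DGX configuration listed in Step 4) and counts only node-facing links; the function names are illustrative, and uplinks between switches are not modeled.

```python
QM9700_PORTS = 64  # port count quoted in the text for the NVIDIA QM9700


def ib_ports_needed(nodes, nics_per_node=8):
    """Switch ports consumed by node-facing links (one per NIC)."""
    return nodes * nics_per_node


def fits_single_switch(nodes, nics_per_node=8, switch_ports=QM9700_PORTS):
    """True if every NIC in the cluster can plug into one switch."""
    return ib_ports_needed(nodes, nics_per_node) <= switch_ports


for nodes in (2, 4, 8, 16):
    print(nodes, ib_ports_needed(nodes), fits_single_switch(nodes))
# 2 16 True
# 4 32 True
# 8 64 True
# 16 128 False
```

At 4 nodes half the ports stay free, which is consistent with the "room to grow" remark; at 16 nodes the node-facing links alone exceed one switch, so a multi-switch fabric is required.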

Step 6: Assess Facility Requirements

Before finalizing the hardware order, confirm:

Power capacity per rack: H100 server = 10.2 kW; B200 server = 14–16 kW; a 4-server rack needs 40–65 kW minimum
Cooling type available: air for H100/H200; DLC mandatory for B200/B300 at full load
Physical space: GPU servers are deep (900 mm depth is typical), so verify rack depth
Network infrastructure: overhead cable trays for InfiniBand copper/fiber
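
The rack power range follows directly from the per-server figures. The sketch below reproduces that arithmetic and inverts it to show how many servers a given rack feed can carry; the function names are illustrative, and the calculation assumes servers are the only load, with no allowance for switches, PDUs, or headroom.

```python
import math


def rack_power_kw(server_kw, servers_per_rack):
    """Total server power draw for one rack, in kW."""
    return server_kw * servers_per_rack


def servers_that_fit(rack_capacity_kw, server_kw):
    """Whole servers that fit within a rack's power capacity."""
    return math.floor(rack_capacity_kw / server_kw)


print(round(rack_power_kw(10.2, 4), 1))  # 40.8  (4 x H100 servers)
print(rack_power_kw(16, 4))              # 64    (4 x B200 servers, upper bound)
print(servers_that_fit(30, 10.2))        # 2     (a 30 kW rack holds two H100 servers)
```

Many existing data center racks are provisioned well below 40 kW, so the second function is often the more useful one when planning around a fixed facility.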

Step 7: Factor Lead Times into Planning

Current GPU server lead times (from Haink, mid-2026):

H100 SXM5 systems: 4–10 weeks
H200 SXM5 systems: 6–12 weeks
B200 SXM5 systems: 12–20 weeks
B300 SXM systems: 14–24 weeks
L40S PCIe servers: 3–8 weeks
InfiniBand switches: 4–8 weeks

Order hardware with enough lead time that it arrives before your target deployment date, not on it.
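
One way to turn these ranges into a planning date is to work backward from the deployment target using the longest worst-case lead time in the order. The sketch below does that; the function name, the dictionary keys, and the two-week `buffer_weeks` default for racking and burn-in are assumptions, not figures from this guide.

```python
from datetime import date, timedelta

# (min_weeks, max_weeks) per item, taken from the lead times listed above.
LEAD_TIME_WEEKS = {
    "H100": (4, 10),
    "H200": (6, 12),
    "B200": (12, 20),
    "B300": (14, 24),
    "L40S": (3, 8),
    "IB switch": (4, 8),
}


def latest_order_date(target, items, buffer_weeks=2):
    """Latest safe order date: target minus worst-case lead time and buffer."""
    worst_case = max(LEAD_TIME_WEEKS[item][1] for item in items)
    return target - timedelta(weeks=worst_case + buffer_weeks)


# A B200 cluster with InfiniBand, targeted for 1 December 2026.
print(latest_order_date(date(2026, 12, 1), ["B200", "IB switch"]))  # 2026-06-30
```

Planning against the upper end of each range is a conservative choice; using the lower end would assume every component ships at its best-case date.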

GPU Server Checklist

Run through this checklist before finalizing a GPU server order:

Workload confirmed: training / inference / mixed?
Model size and memory requirement calculated?
GPU generation selected (H100 / H200 / B200 / B300 / L40S)?
Server platform selected (DGX / Supermicro / Dell / HPE / Lenovo)?
InfiniBand or RoCEv2 specified for multi-node clusters?
Storage sized for dataset and checkpoint requirements?
Facility power density sufficient for selected hardware?
Cooling type compatible (air OK for H100/H200; DLC required for B200/B300)?
Lead time factored into deployment planning?
Support contract (DGX Care, HPE Pointnext, Dell ProSupport) included?