GPU Server Buying Guide 2026 — How to Choose the Right GPU Server for AI
Choosing the right GPU server is a significant capital decision with multi-year implications. A wrong choice is expensive to correct: a GPU too small for the model, insufficient memory for inference, or liquid-cooled hardware ordered for a facility without direct liquid cooling (DLC) infrastructure. This guide provides a structured decision framework for selecting GPU server hardware based on the actual workload, facility, and budget.
Step 1: Define the Primary Workload
GPU server design priorities differ fundamentally between training and inference. Answer this before comparing hardware:
AI Training (including fine-tuning)
Training requires maximum compute throughput (FLOPS) and GPU-to-GPU communication bandwidth. Prioritize: SXM form factor GPUs (NVLink), multi-GPU servers (8 GPUs per node), and InfiniBand networking for multi-node clusters. The binding constraints are FLOPS and inter-GPU bandwidth, not memory capacity (for most fine-tuning use cases).
AI Inference (production model serving)
Inference requires GPU memory capacity to hold the model and KV cache, and throughput (tokens/second) at the target concurrency. The binding constraint is memory — a model that doesn't fit in GPU memory cannot run. Prioritize memory capacity per GPU and memory bandwidth. PCIe form factor GPUs are acceptable for inference; SXM is not required.
Mixed Training + Inference
If the same hardware will be used for both training and inference at different times (common in enterprise deployments), 8× H100 SXM5 or 8× H200 SXM5 is the most versatile configuration — NVLink and InfiniBand provide training performance; 80–141 GB per GPU provides adequate inference memory for 70B models.
Step 2: Determine Model Size
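At FP16/BF16 precision, model weights take about 2 bytes per parameter: 7B model ≈ 14 GB; 13B ≈ 26 GB; 34B ≈ 68 GB; 70B ≈ 140 GB; 180B ≈ 360 GB; 405B ≈ 810 GB. For inference, add 20–40% overhead for KV cache and activation storage. Training needs more than that, because gradients and optimizer states are held alongside the weights, which is why the fine-tuning column below calls for more GPUs than the inference column. Quantized inference (INT4/GPTQ) reduces weight memory by approximately 4×: 70B INT4 ≈ 35 GB.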
| Model Size | FP16 Memory | Minimum GPU (inference) | Minimum GPU (fine-tune) |
|---|---|---|---|
| 7B | 14 GB | 1× L40S (48 GB) | 2× H100 SXM5 |
| 34B | 68 GB | 1× H100 SXM5 (80 GB) | 4× H100 SXM5 |
| 70B | 140 GB | 1× H200 (141 GB) or 2× H100 | 8× H100 SXM5 |
| 405B | 810 GB | 6× H200 or 3× B300 | 32+ H100 SXM5 |
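The arithmetic behind the FP16 and INT4 figures can be sketched in a few lines of Python. This is a minimal estimate, assuming 2 bytes per parameter for FP16/BF16 and 0.5 bytes for INT4; the function names and the `overhead` parameter are illustrative, not part of any vendor tool.

```python
import math

# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str = "fp16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def gpus_needed(params_billion: float, gpu_memory_gb: float,
                precision: str = "fp16", overhead: float = 0.0) -> int:
    """GPUs required to hold the weights plus a fractional overhead."""
    total_gb = weights_gb(params_billion, precision) * (1.0 + overhead)
    return math.ceil(total_gb / gpu_memory_gb)


print(weights_gb(70))                      # 140.0
print(weights_gb(70, "int4"))              # 35.0
print(gpus_needed(70, 80))                 # 2 (H100, 80 GB each)
print(gpus_needed(405, 141))               # 6 (H200, 141 GB each)
print(gpus_needed(70, 141, overhead=0.2))  # 2 once 20% headroom is added
```

With `overhead=0.0` the counts reflect weights only. Passing 0.2 to 0.4 shows how quickly KV cache headroom can push a model onto an additional GPU.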
Step 3: Choose the GPU Generation
NVIDIA H100 SXM5 — Buy When:
- You need hardware quickly (4–10 week lead time vs 12–20 weeks for B200)
- Training 7B–70B models where H100 FLOPS are sufficient for your timeline
- Existing data center infrastructure supports air cooling only
- Budget is the primary constraint — H100 systems are lower cost than B200
- Your training code is already optimized for H100/Hopper architecture
NVIDIA H200 SXM5 — Buy When:
- Primary use case is inference of 70B+ models (H200's 141 GB eliminates the 2-GPU requirement)
- Training runs are memory-bandwidth-bound or long-context (128K+ token sequences)
- You want a drop-in upgrade over H100 without changing server infrastructure
NVIDIA B200 SXM5 — Buy When:
- Building a new training cluster where training speed is the priority
- Data center has or can install direct liquid cooling infrastructure
- The 2.3× FLOPS improvement over H100 meaningfully impacts your training timeline
- Budget allows the 40–60% premium over equivalent H100 configurations
- You want the most forward-looking hardware for a 4+ year asset lifecycle
NVIDIA B300 SXM — Buy When:
- Pre-training frontier models at 1T+ parameter scale
- Serving 405B+ models in production at the highest concurrency
- Memory per GPU (not FLOPS) is the limiting constraint
- DLC infrastructure is already in place
NVIDIA L40S PCIe — Buy When:
- Inference-only workload (no training requirement)
- Budget is primary concern and liquid cooling is not available
- Serving 7B–70B models (at INT4 for 70B) for small-to-medium production APIs
- Standard rackmount PCIe server is sufficient (no SXM baseboard required)
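The criteria above can be partly mechanized as an elimination filter. The following is a rough sketch that uses only three hard constraints stated in this guide: cooling (Step 6), whether training is required, and the upper end of the lead-time ranges in Step 7. The `GpuOption` class and `candidates` function are hypothetical names; budget, model size, and FLOPS requirements still need human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    needs_dlc: bool      # direct liquid cooling required at full load
    trains: bool         # suitable for training, not only inference
    max_lead_weeks: int  # upper end of the quoted lead-time range


# Values taken from this guide; lead times are the mid-2026 upper bounds.
OPTIONS = [
    GpuOption("H100 SXM5", needs_dlc=False, trains=True, max_lead_weeks=10),
    GpuOption("H200 SXM5", needs_dlc=False, trains=True, max_lead_weeks=12),
    GpuOption("B200 SXM5", needs_dlc=True, trains=True, max_lead_weeks=20),
    GpuOption("B300 SXM", needs_dlc=True, trains=True, max_lead_weeks=24),
    GpuOption("L40S PCIe", needs_dlc=False, trains=False, max_lead_weeks=8),
]


def candidates(has_dlc: bool, needs_training: bool, weeks_available: int) -> list[str]:
    """Return the GPUs that no hard constraint rules out."""
    return [
        o.name for o in OPTIONS
        if (has_dlc or not o.needs_dlc)
        and (o.trains or not needs_training)
        and o.max_lead_weeks <= weeks_available
    ]


# Air-cooled facility, training required, 12 weeks until deployment.
print(candidates(has_dlc=False, needs_training=True, weeks_available=12))
# ['H100 SXM5', 'H200 SXM5']
```

The filter only removes options that cannot work. Choosing among the survivors still depends on the per-GPU "Buy When" criteria.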
Step 4: Choose the Server Platform
For SXM GPU servers (H100/H200/B200/B300), the server platform (the chassis, baseboard, CPU, memory, and NVLink interconnect) is as important as the GPU. Validated platforms available through Haink:
- NVIDIA DGX H100 / H200 / B200: Factory-integrated, NVIDIA-validated, includes 2 TB DDR5 system RAM, 30 TB NVMe storage, 8× ConnectX-7/8 InfiniBand NICs. The reference platform — premium cost, zero integration risk.
- Supermicro SYS-821GE-TNHR (H100/H200 SXM5): Most commonly deployed 8-GPU SXM server platform in the market. Validated with all major InfiniBand configurations. Lower cost than DGX with equivalent GPU performance.
- Supermicro ARS-821GL-NHR (B200/B300 SXM): Primary B200/B300 platform from Supermicro. Supports DLC cold plates.
- Dell PowerEdge XE9680 (H100/H200 SXM5): Enterprise-grade platform with Dell ProSupport coverage. Preferred by enterprises requiring single-vendor hardware support.
- HPE Cray XD670 (H100/H200 SXM5): HPE's enterprise AI server with HPE Pointnext support. Suitable for organizations already in the HPE ecosystem.
- Lenovo ThinkSystem SR680a V3 (H100 SXM5): Strong option for deployments in Mainland China where Lenovo's local support and supply chain are advantageous.
Step 5: Size the InfiniBand Network
The number of switches depends on the number of nodes:
- Single 8-GPU node: no InfiniBand needed between servers.
- 2–4 nodes: 1 InfiniBand leaf switch (NVIDIA QM9700, 64-port NDR 400G) handles the full cluster with room to grow.
- 8–16 nodes: 2 leaf switches + 1 spine switch in a non-blocking fat-tree.
- 32+ nodes: full 3-tier fat-tree design with spine/core layers.
- RoCEv2 instead of InfiniBand: Arista 7800R3 or NVIDIA Spectrum-4 400GbE switches configured for lossless PFC.
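The single-switch tier can be sanity-checked with a port count. This is a minimal sketch, assuming each node connects 8 NICs to the fabric (the DGX configuration listed in Step 4; other platforms may use fewer) and a 64-port switch such as the QM9700. The function name and return fields are illustrative.

```python
def leaf_port_budget(nodes: int, nics_per_node: int = 8, switch_ports: int = 64) -> dict:
    """Count the switch ports a cluster needs and compare with one switch."""
    ports_needed = nodes * nics_per_node
    return {
        "ports_needed": ports_needed,
        "fits_one_switch": ports_needed <= switch_ports,
        "free_ports": max(switch_ports - ports_needed, 0),
    }


print(leaf_port_budget(4))
# {'ports_needed': 32, 'fits_one_switch': True, 'free_ports': 32}
print(leaf_port_budget(16))
# {'ports_needed': 128, 'fits_one_switch': False, 'free_ports': 0}
```

Once the port count exceeds one switch, the leaf and spine counts depend on the chosen oversubscription ratio and NICs per node, which this sketch does not model.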
Step 6: Assess Facility Requirements
Before finalizing the hardware order, confirm:
- Power capacity per rack: an H100 server draws 10.2 kW and a B200 server 14–16 kW, so a 4-server rack needs 40–65 kW.
- Cooling type available: air for H100/H200; DLC is mandatory for B200/B300 at full load.
- Physical space: GPU servers are deep (900 mm is typical), so verify rack depth.
- Network infrastructure: overhead cable trays for InfiniBand copper/fiber.
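The rack power figures follow from simple multiplication, which is easy to rerun for other rack layouts. A small sketch using the per-server draws quoted above; the dictionary keys and function names are illustrative, and real deployments should also budget for switches and PDU headroom, which are not modeled here.

```python
import math

# Per-server power draw in kW as (low, high), from the figures above.
SERVER_KW = {"H100": (10.2, 10.2), "B200": (14.0, 16.0)}


def rack_power_kw(server: str, servers_in_rack: int = 4) -> tuple[float, float]:
    """Low and high estimate of the rack's total draw in kW."""
    low, high = SERVER_KW[server]
    return (low * servers_in_rack, high * servers_in_rack)


def max_servers(server: str, rack_budget_kw: float) -> int:
    """Servers that fit a rack power budget, sized on the high estimate."""
    _, high = SERVER_KW[server]
    return math.floor(rack_budget_kw / high)


print(rack_power_kw("H100"))    # (40.8, 40.8)
print(rack_power_kw("B200"))    # (56.0, 64.0)
print(max_servers("H100", 40))  # 3
print(max_servers("B200", 65))  # 4
```

Note that a rack provisioned at exactly 40 kW holds only three H100 servers at 10.2 kW each, so the low end of the range leaves no margin.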
Step 7: Factor Lead Times into Planning
Current GPU server lead times (from Haink, mid-2026):
- H100 SXM5 systems: 4–10 weeks
- H200 SXM5 systems: 6–12 weeks
- B200 SXM5 systems: 12–20 weeks
- B300 SXM systems: 14–24 weeks
- L40S PCIe servers: 3–8 weeks
- InfiniBand switches: 4–8 weeks

Order hardware with enough lead time that it arrives before your target deployment date, not on it.
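Working backward from the deployment date gives a latest safe order date. The sketch below plans against the upper end of each quoted range; the `buffer_weeks` parameter (time for racking, burn-in, and acceptance testing) is an assumption added for illustration, not a figure from the lead-time list.

```python
from datetime import date, timedelta

# Lead-time ranges in weeks (min, max), from the mid-2026 figures above.
LEAD_WEEKS = {
    "H100 SXM5": (4, 10),
    "H200 SXM5": (6, 12),
    "B200 SXM5": (12, 20),
    "B300 SXM": (14, 24),
    "L40S PCIe": (3, 8),
    "InfiniBand switch": (4, 8),
}


def latest_order_date(system: str, target: date, buffer_weeks: int = 2) -> date:
    """Latest order date that still meets the target in the worst case."""
    worst_case_weeks = LEAD_WEEKS[system][1]
    return target - timedelta(weeks=worst_case_weeks + buffer_weeks)


target = date(2026, 12, 1)
print(latest_order_date("B200 SXM5", target))                  # 2026-06-30
print(latest_order_date("H100 SXM5", target, buffer_weeks=0))  # 2026-09-22
```

For a cluster, the order date is set by the slowest component, so it is worth running this for the switches as well as the servers.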
GPU Server Checklist
Confirm each item before finalizing a GPU server order:
- Workload confirmed: training / inference / mixed?
- Model size and memory requirement calculated?
- GPU generation selected (H100 / H200 / B200 / B300 / L40S)?
- Server platform selected (DGX / Supermicro / Dell / HPE / Lenovo)?
- InfiniBand or RoCEv2 specified for multi-node clusters?
- Storage sized for dataset and checkpoint requirements?
- Facility power density sufficient for selected hardware?
- Cooling type compatible (air OK for H100/H200; DLC required for B200/B300)?
- Lead time factored into deployment planning?
- Support contract (DGX Care, HPE Pointnext, Dell ProSupport) included?
