
GPU Server Buying Guide 2026 — How to Choose the Right GPU Server for AI

Choosing a GPU server is a significant capital decision with multi-year implications. The wrong choice is expensive to correct: a GPU too small for the model, insufficient memory for inference, or a liquid-cooled system in a facility without direct liquid cooling (DLC) infrastructure. This guide provides a structured decision framework for selecting GPU server hardware based on the actual workload, facility, and budget.

Step 1: Define the Primary Workload

GPU server design priorities differ fundamentally between training and inference. Answer this before comparing hardware:

AI Training (including fine-tuning)

Training requires maximum compute throughput (FLOPS) and GPU-to-GPU communication bandwidth. Prioritize: SXM form factor GPUs (NVLink), multi-GPU servers (8 GPUs per node), and InfiniBand networking for multi-node clusters. The binding constraints are FLOPS and inter-GPU bandwidth, not memory capacity (for most fine-tuning use cases).

AI Inference (production model serving)

Inference requires GPU memory capacity to hold the model and KV cache, and throughput (tokens/second) at the target concurrency. The binding constraint is memory — a model that doesn't fit in GPU memory cannot run. Prioritize memory capacity per GPU and memory bandwidth. PCIe form factor GPUs are acceptable for inference; SXM is not required.
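
The KV cache share of that memory budget can be estimated with a short calculation. The sketch below assumes a standard decoder-only transformer that stores one key and one value vector per token, per layer, per KV head. The model dimensions in the example (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions for a 70B-class model with grouped-query attention, not figures from this guide.

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens,
                concurrent_requests, bytes_per_value=2):
    """Approximate KV cache size in decimal GB for a decoder-only transformer."""
    # Factor of 2: one key vector and one value vector per token, per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * concurrent_requests / 1e9


# Hypothetical 70B-class model, FP16 cache, 16 concurrent 8K-token requests.
example = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                      context_tokens=8192, concurrent_requests=16)
print(f"KV cache: {example:.1f} GB")
```

Under these assumptions the cache alone is roughly 43 GB, which is why concurrency and context length matter as much as the weight footprint when sizing inference hardware.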

Mixed Training + Inference

If the same hardware will be used for both training and inference at different times (common in enterprise deployments), 8× H100 SXM5 or 8× H200 SXM5 is the most versatile configuration — NVLink and InfiniBand provide training performance; 80–141 GB per GPU provides adequate inference memory for 70B models.

Step 2: Determine Model Size

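Model size in GPU memory (FP16 / BF16 precision, 2 bytes per parameter): 7B model ≈ 14 GB; 13B ≈ 26 GB; 34B ≈ 68 GB; 70B ≈ 140 GB; 180B ≈ 360 GB; 405B ≈ 810 GB. For inference, add 20–40% overhead for KV cache and activation storage. Training needs considerably more than the weights alone, because gradients and optimizer states must also be held in GPU memory; this is why the fine-tuning minimums below are several times the inference minimums. Quantized inference (INT4/GPTQ) reduces memory by approximately 4×: 70B INT4 ≈ 35 GB.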

| Model Size | FP16 Memory | Minimum GPU (inference) | Minimum GPU (fine-tune) |
|---|---|---|---|
| 7B | 14 GB | 1× L40S (48 GB) | 2× H100 SXM5 |
| 34B | 68 GB | 1× H100 SXM5 (80 GB) | 4× H100 SXM5 |
| 70B | 140 GB | 1× H200 (141 GB) or 2× H100 | 8× H100 SXM5 |
| 405B | 810 GB | 6× H200 or 3× B300 | 32+ H100 SXM5 |
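
The sizing arithmetic in this step can be reproduced with a small helper. This is a minimal sketch: the function names are hypothetical, the bytes-per-parameter values follow the FP16 and INT4 figures above, and the 288 GB capacity used for B300 in the usage note is an assumption that does not appear in this guide.

```python
import math

# Bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion, precision="fp16"):
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion, gpu_memory_gb, precision="fp16", overhead=0.0):
    """Smallest GPU count whose combined memory holds the weights plus overhead.

    overhead is a fraction, e.g. 0.2 for the 20% end of the 20-40% range.
    """
    required = weights_gb(params_billion, precision) * (1 + overhead)
    return math.ceil(required / gpu_memory_gb)


print(weights_gb(70))                      # 140.0
print(weights_gb(70, "int4"))              # 35.0
print(min_gpus(405, 141))                  # 6
print(min_gpus(70, 141, overhead=0.2))     # 2
```

With `overhead=0.0` the helper matches the inference column of the table (for example, `min_gpus(405, 288)` gives 3 under the assumed B300 capacity). Applying even 20% overhead moves a borderline case such as a 70B FP16 model on a single 141 GB GPU to two GPUs, so the table's minimums are best read as weights-only floors.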

Step 3: Choose the GPU Generation

NVIDIA H100 SXM5 — Buy When:

  • You need hardware quickly (4–10 week lead time vs 12–20 weeks for B200)
  • Training 7B–70B models where H100 FLOPS are sufficient for your timeline
  • Existing data center infrastructure supports air cooling only
  • Budget is the primary constraint — H100 systems are lower cost than B200
  • Your training code is already optimized for H100/Hopper architecture

NVIDIA H200 SXM5 — Buy When:

  • Primary use case is inference of 70B+ models (H200's 141 GB eliminates the 2-GPU requirement)
  • Training runs are memory-bandwidth-bound or long-context (128K+ token sequences)
  • You want a drop-in upgrade over H100 without changing server infrastructure

NVIDIA B200 SXM5 — Buy When:

  • Building a new training cluster where training speed is the priority
  • Data center has or can install direct liquid cooling infrastructure
  • The 2.3× FLOPS improvement over H100 meaningfully impacts your training timeline
  • Budget allows the 40–60% premium over equivalent H100 configurations
  • You want the most forward-looking hardware for a 4+ year asset lifecycle

NVIDIA B300 SXM — Buy When:

  • Pre-training frontier models at 1T+ parameter scale
  • Serving 405B+ models in production at the highest concurrency
  • Memory per GPU (not FLOPS) is the limiting constraint
  • DLC infrastructure is already in place

NVIDIA L40S PCIe — Buy When:

  • Inference-only workload (no training requirement)
  • Budget is primary concern and liquid cooling is not available
  • Serving 7B–70B models (at INT4 for 70B) for small-to-medium production APIs
  • Standard rackmount PCIe server is sufficient (no SXM baseboard required)
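
The buy-when criteria in this step can be approximated as an ordered set of rules. The sketch below is one possible ordering, not a definitive selector: the function name, the parameter names, and the choice to check frontier scale first are all assumptions, and it ignores lead time, existing code optimization, and asset lifecycle.

```python
def suggest_gpu(workload, model_params_b, has_dlc, budget_first=False):
    """Return a first-pass GPU suggestion.

    workload: 'training' or 'inference'
    model_params_b: largest model size in billions of parameters
    has_dlc: True if direct liquid cooling is in place or can be installed
    budget_first: True if budget is the primary constraint
    """
    if workload not in ("training", "inference"):
        raise ValueError("workload must be 'training' or 'inference'")
    # Frontier scale: 405B+ serving or 1T+ pre-training, only with DLC available.
    frontier = ((workload == "inference" and model_params_b >= 405)
                or (workload == "training" and model_params_b >= 1000))
    if has_dlc and frontier:
        return "B300"
    if workload == "inference":
        # L40S: budget-led, inference-only serving up to 70B (INT4 at 70B).
        if budget_first and model_params_b <= 70:
            return "L40S"
        # H200: 141 GB per GPU suits 70B+ inference; otherwise H100.
        return "H200" if model_params_b >= 70 else "H100"
    # Training: B200 when DLC is available and budget is not the main constraint.
    if has_dlc and not budget_first:
        return "B200"
    return "H100"


print(suggest_gpu("inference", 70, has_dlc=False, budget_first=True))  # L40S
print(suggest_gpu("training", 13, has_dlc=True))                       # B200
```

Treat the result as a starting point for discussion; the bullet lists above carry conditions that a few boolean flags cannot capture.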

Step 4: Choose the Server Platform

For SXM GPU servers (H100/H200/B200/B300), the server platform (the chassis, baseboard, CPU, memory, and NVLink interconnect) is as important as the GPU. Validated platforms available through Haink:

  • NVIDIA DGX H100 / H200 / B200: Factory-integrated, NVIDIA-validated, includes 2 TB DDR5 system RAM, 30 TB NVMe storage, 8× ConnectX-7/8 InfiniBand NICs. The reference platform — premium cost, zero integration risk.
  • Supermicro SYS-821GE-TNHR (H100/H200 SXM5): Most commonly deployed 8-GPU SXM server platform in the market. Validated with all major InfiniBand configurations. Lower cost than DGX with equivalent GPU performance.
  • Supermicro ARS-821GL-NHR (B200/B300 SXM): Primary B200/B300 platform from Supermicro. Supports DLC cold plates.
  • Dell PowerEdge XE9680 (H100/H200 SXM5): Enterprise-grade platform with Dell ProSupport coverage. Preferred by enterprises requiring single-vendor hardware support.
  • HPE Cray XD670 (H100/H200 SXM5): HPE's enterprise AI server with HPE Pointnext support. Suitable for organizations already in the HPE ecosystem.
  • Lenovo ThinkSystem SR680a V3 (H100 SXM5): Strong option for deployments in Mainland China where Lenovo's local support and supply chain are advantageous.

Step 5: Size the InfiniBand Network

InfiniBand switch requirements scale with the number of nodes:

  • Single 8-GPU node: no InfiniBand needed between servers.
  • 2–4 nodes: 1 InfiniBand leaf switch (NVIDIA QM9700, 64-port NDR 400G) handles the full cluster with room to grow.
  • 8–16 nodes: 2 leaf switches + 1 spine switch in a non-blocking fat-tree.
  • 32+ nodes: full 3-tier fat-tree design with spine/core layers.

If RoCEv2 is chosen instead of InfiniBand, use Arista 7800R3 or NVIDIA Spectrum-4 400GbE switches configured for lossless PFC.
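
A quick port count shows where a single 64-port switch stops being enough. The sketch below assumes one InfiniBand NIC port per GPU (8 per node); the function name is hypothetical, and it counts only node-facing ports, so it does not size spine layers or uplinks.

```python
def ib_port_demand(nodes, nics_per_node=8, switch_ports=64):
    """Count node-facing InfiniBand ports and compare against one switch."""
    # A single node needs no inter-server fabric at all.
    endpoint_ports = 0 if nodes <= 1 else nodes * nics_per_node
    return {
        "endpoint_ports": endpoint_ports,
        "fits_single_switch": endpoint_ports <= switch_ports,
        "spare_ports": max(switch_ports - endpoint_ports, 0),
    }


for n in (1, 4, 8, 16):
    print(n, ib_port_demand(n))
```

Under this assumption 4 nodes use 32 of 64 ports, 8 nodes fill the switch with no spare ports for uplinks or growth, and 16 nodes exceed it, which is the point at which a multi-switch fat-tree becomes necessary.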

Step 6: Assess Facility Requirements

Before finalizing the hardware order, confirm:

  • Power capacity per rack: an H100 server draws 10.2 kW and a B200 server 14–16 kW, so a 4-server rack needs roughly 40–65 kW.
  • Cooling type available: air for H100/H200; DLC mandatory for B200/B300 at full load.
  • Physical space: GPU servers are deep (900 mm is typical), so verify rack depth.
  • Network infrastructure: overhead cable trays for InfiniBand copper/fiber.
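
The rack power figures follow directly from the per-server draw. A minimal sketch using the numbers above; the function names are hypothetical, and the calculation covers servers only, so switches, PDUs, and cooling overhead would need to be added separately.

```python
# Per-server draw in kW as (low, high), taken from the figures in this step.
SERVER_KW = {"H100": (10.2, 10.2), "B200": (14.0, 16.0)}


def rack_power_kw(gpu, servers_per_rack=4):
    """Return the (low, high) power draw in kW for a rack of identical servers."""
    low, high = SERVER_KW[gpu]
    return (low * servers_per_rack, high * servers_per_rack)


def max_servers(rack_budget_kw, gpu):
    """Servers that fit in a rack power budget, sized on the high-end draw."""
    return int(rack_budget_kw // SERVER_KW[gpu][1])


print(rack_power_kw("H100"))    # about 40.8 kW for four servers
print(rack_power_kw("B200"))    # (56.0, 64.0)
print(max_servers(30, "B200"))  # 1
```

Sizing on the high-end draw is a deliberately conservative choice: a rack provisioned for 30 kW holds only one B200 server under this assumption.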

Step 7: Factor Lead Times into Planning

Current GPU server lead times (from Haink, mid-2026):

  • H100 SXM5 systems: 4–10 weeks
  • H200 SXM5 systems: 6–12 weeks
  • B200 SXM5 systems: 12–20 weeks
  • B300 SXM systems: 14–24 weeks
  • L40S PCIe servers: 3–8 weeks
  • InfiniBand switches: 4–8 weeks

Order hardware with enough lead time that it arrives before your target deployment date, not on it.
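
Working backwards from a deployment date gives a latest safe order date. The sketch below uses the upper end of each lead-time range; the function name and the two-week buffer for racking and burn-in are assumptions, not figures from this guide.

```python
from datetime import date, timedelta

# Lead times in weeks as (best case, worst case), from the figures above.
LEAD_TIME_WEEKS = {
    "H100": (4, 10),
    "H200": (6, 12),
    "B200": (12, 20),
    "B300": (14, 24),
    "L40S": (3, 8),
    "InfiniBand switch": (4, 8),
}


def latest_order_date(target, item, buffer_weeks=2):
    """Latest order date so that worst-case delivery still precedes the target."""
    worst_case = LEAD_TIME_WEEKS[item][1]
    return target - timedelta(weeks=worst_case + buffer_weeks)


print(latest_order_date(date(2026, 12, 1), "B200"))  # 2026-06-30
```

For a cluster, the binding date is the earliest one across all components, so the same function can be applied to each GPU system and switch on the order.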

GPU Server Checklist

Confirm each item before finalizing a GPU server order:

  • Workload confirmed: training / inference / mixed?
  • Model size and memory requirement calculated?
  • GPU generation selected (H100 / H200 / B200 / B300 / L40S)?
  • Server platform selected (DGX / Supermicro / Dell / HPE / Lenovo)?
  • InfiniBand or RoCEv2 specified for multi-node clusters?
  • Storage sized for dataset and checkpoint requirements?
  • Facility power density sufficient for selected hardware?
  • Cooling type compatible (air OK for H100/H200; DLC required for B200/B300)?
  • Lead time factored into deployment planning?
  • Support contract (DGX Care, HPE Pointnext, Dell ProSupport) included?
