GPU Cluster Deployment — Planning and Building an Enterprise AI Cluster
Deploying a GPU cluster for AI training or inference involves more than purchasing GPU servers. A production-grade cluster requires hardware selection aligned to the workload, a network architecture that does not bottleneck distributed training, storage sized to feed GPU memory, physical infrastructure that can handle multi-kilowatt rack densities, and a software stack that orchestrates the hardware efficiently. Haink supplies GPU cluster hardware across Hong Kong, Dubai, and Mainland China and coordinates end-to-end procurement for enterprises building AI clusters from the ground up.
Step 1: Define the Workload
GPU cluster design starts with the workload, not the hardware. Key questions before specifying hardware:
- Training or inference? Training requires SXM GPUs with NVLink and InfiniBand. Inference can use PCIe GPUs in standard servers.
- Model size? Fine-tuning a 7B model fits on a single 8-GPU node. Pre-training a 70B model typically requires 8–32 nodes, and a 405B model typically requires 64+ nodes, depending on dataset size and deadline.
- Training duration constraint? A tighter deadline means more GPUs running in parallel — not just more powerful GPUs.
- Concurrency for inference? Inference serving scales horizontally — more users means more server replicas, not necessarily bigger GPUs.
- Future growth? A cluster that will scale from 8 to 64 nodes needs an InfiniBand fabric designed for that scale from day one, even if only 8 nodes are installed initially.
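The link between model size, deadline, and GPU count can be roughed out with the widely used approximation that training costs about 6 × parameters × tokens floating-point operations. The sketch below is illustrative only: the token count, peak FLOP/s per GPU, and utilization figure are placeholder assumptions, not measured values for any specific GPU.

```python
import math

def gpus_needed(params, tokens, deadline_days, peak_flops_per_gpu, mfu):
    """Estimate GPUs needed to finish a training run within a deadline.

    Uses the common approximation: training FLOPs ~= 6 * params * tokens.
    mfu is the assumed fraction of peak FLOP/s the job actually sustains.
    """
    total_flops = 6 * params * tokens
    sustained = peak_flops_per_gpu * mfu      # FLOP/s achieved per GPU
    seconds = deadline_days * 86_400
    return math.ceil(total_flops / (sustained * seconds))

def nodes_needed(gpus, gpus_per_node=8):
    """Round a GPU count up to whole 8-GPU nodes."""
    return math.ceil(gpus / gpus_per_node)

# Placeholder inputs: 70B parameters, 1T tokens, 90-day deadline,
# 1e15 FLOP/s peak per GPU, 40% utilization.
gpus = gpus_needed(70e9, 1e12, 90, 1e15, 0.40)
print(gpus, "GPUs ->", nodes_needed(gpus), "nodes")
```

Halving the deadline roughly doubles the GPU count, which is why a tight schedule pushes the design toward more nodes rather than a faster GPU model alone.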
Step 2: Select the GPU Platform
GPU selection drives every downstream decision. For AI training clusters built in 2025–2026, the choice is between:
- NVIDIA H100 SXM5 (80 GB): Proven, broad software ecosystem, mature supply chain, shorter lead times. Best for organizations that need the cluster operational quickly or have existing H100-optimized training code.
- NVIDIA H200 SXM5 (141 GB): Same compute as H100, 76% more memory. Best for memory-bound workloads or inference of 70B+ models.
- NVIDIA B200 SXM (192 GB): About 2.3× the FP8 TFLOPS of H100, NVLink 5.0. Best for new cluster builds prioritizing training throughput. Requires DLC.
- NVIDIA B300 SXM (288 GB): Maximum memory per GPU, highest throughput. Best for frontier model training at 1T+ scale or highest-concurrency inference. Requires DLC.
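For memory-bound decisions, a quick check is whether a model's weights fit in one GPU or must be split across several. The sketch below uses the per-GPU memory figures from the list above; the 80% headroom factor (leaving room for KV cache, activations, and runtime overhead) is an assumption chosen for illustration.

```python
import math

# Per-GPU memory in GB, taken from the list above.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192, "B300": 288}

def weights_gb(params_billions, bytes_per_param=2):
    """Weight footprint in GB (BF16/FP16 = 2 bytes per parameter)."""
    return params_billions * bytes_per_param

def min_gpus_for_weights(params_billions, gpu, headroom=0.8, bytes_per_param=2):
    """Smallest GPU count whose usable memory holds the weights.

    headroom is an assumed fraction of memory available for weights.
    """
    usable = GPU_MEMORY_GB[gpu] * headroom
    return math.ceil(weights_gb(params_billions, bytes_per_param) / usable)

# A 70B model in BF16 on each platform.
for gpu in GPU_MEMORY_GB:
    print(gpu, min_gpus_for_weights(70, gpu))
```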
Standard AI training node: 8 GPUs per server, 2 CPU sockets, 1–2 TB system RAM, 8 InfiniBand or 400GbE NICs, local NVMe storage for dataset staging. Haink-supplied servers: Supermicro SYS-821GE (H100/H200), Supermicro ARS-821GL (B200/B300), Dell PowerEdge XE9680 (H100/H200), HPE Cray XD670 (H100/H200).
Step 3: Design the Network Architecture
The network is the most consequential architectural decision for a multi-node training cluster. An undersized network degrades GPU utilization because GPUs sit idle while waiting for gradient communication. The goal is a non-blocking design with full bisection bandwidth across all nodes at full scale.
InfiniBand Fat-Tree Topology
A two-tier fat-tree InfiniBand fabric (leaf switches + spine switches) provides full bisection bandwidth: every GPU can communicate with every other GPU at the full link rate simultaneously. Each node connects to one leaf switch; leaf switches uplink to spine switches. In a non-blocking design, each leaf dedicates half its ports to nodes and half to spine uplinks. For a 16-node cluster (128 GPUs, one NDR 400G port per GPU) built on 64-port NVIDIA QM9700 switches, that means 4 leaf switches (32 ports down, 32 up) and 2 spine switches. Two tiers of 64-port switches scale to 2,048 endpoints, so a 64-node (512 GPU) cluster still fits in two tiers with 16 leaf and 8 spine switches; 3-tier fat-tree topologies with core switches are required only beyond that limit. NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol), supported on these InfiniBand switches, performs all-reduce operations in the network, offloading reduction work from GPUs during gradient synchronization.
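Switch counts for a non-blocking two-tier fat-tree follow from simple port arithmetic. The sketch below assumes one InfiniBand port per GPU (eight per node) and identical switches at both tiers, with each leaf splitting its ports evenly between downlinks and uplinks. The function name and return structure are illustrative.

```python
import math

def two_tier_fat_tree(endpoints, radix=64):
    """Size a non-blocking two-tier fat-tree built from identical switches.

    Each leaf uses radix/2 ports down to endpoints and radix/2 ports up to
    spines, so uplink capacity matches downlink capacity.
    """
    down = radix // 2
    max_endpoints = radix * down        # at most `radix` leaves, `down` endpoints each
    if endpoints > max_endpoints:
        raise ValueError("too many endpoints for two tiers; a third tier is needed")
    leaves = math.ceil(endpoints / down)
    uplinks = leaves * down             # one uplink per downlink port
    spines = math.ceil(uplinks / radix)
    return {"leaves": leaves, "spines": spines, "max_endpoints": max_endpoints}

# 16 nodes x 8 ports and 64 nodes x 8 ports on 64-port switches.
print(two_tier_fat_tree(128))
print(two_tier_fat_tree(512))
```

Running the same function with the planned final endpoint count, rather than the day-one count, shows how many leaf and spine positions to reserve in the rack layout.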
RoCEv2 400GbE Alternative
For clusters of 4–16 nodes where InfiniBand budget is a constraint, 400GbE RoCEv2 (RDMA over Converged Ethernet) provides similar RDMA semantics at lower switch cost. RoCEv2 requires lossless Ethernet configuration (PFC — Priority Flow Control) and careful QoS design to maintain training efficiency. Arista 7800R3, Cisco Nexus 9300/9500, and NVIDIA Spectrum-4 platforms support lossless 400GbE for RoCE AI clusters.
Step 4: Size the Storage
Storage must be specified to exceed GPU I/O demands. A 16-node H100 cluster (128 GPUs) can consume training data at 20–100 GB/s sustained — standard NAS cannot sustain this. Options:
- Parallel file system (WEKA, GPFS): Multiple all-flash NVMe storage nodes aggregated into a parallel namespace. Scales linearly with nodes added. Best for large clusters (16+ nodes).
- Shared NVMe-oF (NVMe over Fabrics): Networked NVMe storage arrays (NetApp AFF, Pure Storage FlashArray) accessed over RDMA Ethernet or FC. Simpler to manage than parallel file systems. Suitable for 4–16 node clusters.
- Local NVMe staging: Pre-stage the dataset, or the shards needed next, on each server's local NVMe (typically 30–50 TB per node in GPU servers) before each training run. Works for small datasets; impractical for trillion-token pre-training datasets.
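Model checkpoint storage needs: a 70B parameter model checkpoint in BF16 is approximately 140 GB for the weights alone (70 billion parameters × 2 bytes); optimizer state, when saved, adds to this. Training runs typically checkpoint every few hundred steps, so checkpoint storage of 10–50 TB is typical for large training jobs.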
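The checkpoint arithmetic can be written out as a small helper. This sketch counts model weights only, so a checkpoint that also stores optimizer state will be larger. The retention count of 100 checkpoints is an assumed example value.

```python
def checkpoint_gb(params_billions, bytes_per_param=2):
    """Size of the model weights in one checkpoint, in GB (BF16 = 2 bytes)."""
    return params_billions * bytes_per_param

def retained_storage_tb(params_billions, checkpoints_kept, bytes_per_param=2):
    """Storage needed to retain a number of weight-only checkpoints, in TB."""
    return checkpoint_gb(params_billions, bytes_per_param) * checkpoints_kept / 1000

print(checkpoint_gb(70))             # 70B parameters in BF16
print(retained_storage_tb(70, 100))  # keeping 100 checkpoints (assumed)
```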
Step 5: Assess Facility Requirements
Before hardware arrives, the data center facility must be qualified:
- Power per rack: An 8-GPU H100 SXM5 server draws up to about 10.2 kW. A 4-node rack (4× 8-GPU servers) requires 40–50 kW including networking and storage. B200/B300 clusters with DLC: 50–80 kW per rack. Standard data center racks support 10–20 kW, so high-density GPU clusters require pre-contracted high-density power zones (30–50 kW/rack or more).
- Cooling: H100 can run in air-cooled racks with high-CFM rear-door heat exchangers. B200 and B300 require direct liquid cooling (DLC) cold plate loops or immersion cooling. Facility must provision chilled water or coolant supply to each rack before server delivery.
- Physical space: In standard 10–20 kW racks, which can power only one GPU server each, an 8-node InfiniBand cluster occupies 8 GPU server racks + 1–2 switch racks + 1–2 storage racks = 10–12 racks total. Plan overhead cable trays for InfiniBand copper or fiber cabling.
- Out-of-band management network: Every GPU server requires IPMI/BMC access on a separate management network for firmware updates, power cycling, and health monitoring without going through the production network.
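The rack power figures above can be turned into a quick budget check. The sketch below uses the 10.2 kW per-server figure from the list; the 5 kW allowance for in-rack networking and storage is an assumption and should be replaced with measured or vendor-supplied values.

```python
import math

def rack_power_kw(servers, server_kw=10.2, overhead_kw=5.0):
    """Estimated rack draw: GPU servers plus an assumed in-rack overhead."""
    return servers * server_kw + overhead_kw

def servers_per_rack(rack_budget_kw, server_kw=10.2, overhead_kw=5.0):
    """How many servers fit under a rack power budget (never negative)."""
    return max(0, math.floor((rack_budget_kw - overhead_kw) / server_kw))

def racks_needed(nodes, rack_budget_kw, server_kw=10.2, overhead_kw=5.0):
    """GPU server racks required for a node count at a given rack budget."""
    per_rack = servers_per_rack(rack_budget_kw, server_kw, overhead_kw)
    if per_rack == 0:
        raise ValueError("rack budget cannot power even one server")
    return math.ceil(nodes / per_rack)

# Compare a standard 20 kW rack with a 50 kW high-density rack for 8 nodes.
for budget in (20, 50):
    print(budget, "kW:", servers_per_rack(budget), "per rack,",
          racks_needed(8, budget), "racks")
```

Under these assumptions the same 8-node cluster needs far fewer racks in a high-density zone, which is the trade-off between contracting dense power and consuming floor space.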
Step 6: Software Stack Deployment
GPU cluster hardware without the right software stack does not run efficiently. The essential software layers:
- OS and drivers: Ubuntu 22.04 LTS is the standard for AI training clusters. NVIDIA GPU drivers (current stable release), CUDA toolkit, cuDNN, NCCL, and InfiniBand OFED (OpenFabrics) stack must be installed and tested on all nodes.
- Container runtime: NVIDIA Container Toolkit enables Docker/Podman containers to access GPUs. All training workloads are containerized for reproducibility and dependency isolation.
- Cluster orchestration: Kubernetes (with NVIDIA GPU Operator) or Slurm handles job scheduling, GPU allocation, and multi-tenant cluster sharing. NVIDIA Base Command Manager is a turnkey cluster management solution for GPU clusters.
- Training frameworks: PyTorch with FSDP (Fully Sharded Data Parallel) or Megatron-LM for distributed training. DeepSpeed for memory-optimized training. Hugging Face Transformers + Accelerate for fine-tuning workflows.
- Monitoring: DCGM (Data Center GPU Manager) provides per-GPU utilization, temperature, ECC error, and NVLink bandwidth metrics. Prometheus + Grafana dashboards for cluster-wide monitoring.
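Before DCGM and dashboards are in place, a per-node health snapshot can be taken with `nvidia-smi`, which ships with the NVIDIA driver. The sketch below is a stand-in for DCGM, not a replacement: it parses the CSV output of `nvidia-smi --query-gpu` and flags hot GPUs. The 85 °C limit and the sample rows are illustrative assumptions, and fields that report as unsupported on a given system would need extra handling.

```python
import csv
import io
import subprocess

QUERY = "index,name,temperature.gpu,utilization.gpu,memory.used"

def read_nvidia_smi():
    """Run nvidia-smi on this node and return its CSV output as text."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def parse_gpus(text):
    """Turn nvidia-smi CSV rows into dicts with numeric fields."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, temp, util, mem = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "temp_c": int(temp),
                     "util_pct": int(util), "mem_used_mib": int(mem)})
    return gpus

def hot_gpus(gpus, limit_c=85):
    """Return indices of GPUs at or above an assumed temperature limit."""
    return [g["index"] for g in gpus if g["temp_c"] >= limit_c]

# Made-up sample rows in the same shape nvidia-smi prints.
sample = ("0, NVIDIA H100 80GB HBM3, 41, 97, 61234\n"
          "1, NVIDIA H100 80GB HBM3, 88, 99, 61102\n")
print(hot_gpus(parse_gpus(sample)))
```

On a real node, `parse_gpus(read_nvidia_smi())` replaces the sample string.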
Step 7: Burn-In and Validation
Before production use, a new GPU cluster must be validated with:
- An all-GPU stress test: run NCCL allreduce at full bandwidth to verify InfiniBand connectivity and detect any cabling or switch configuration errors.
- Storage throughput benchmarks.
- Power-on/off cycle testing.
- A training dry run from a known-good checkpoint to confirm that training loss decreases correctly on the new cluster.
ECC errors detected during burn-in indicate GPU or memory defects that should be addressed under warranty before the cluster enters production use.
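When judging the NCCL allreduce result, the figure usually compared against link speed is the bus bandwidth, which the nccl-tests suite defines for all-reduce as algorithm bandwidth × 2(n − 1)/n for n ranks. The sketch below applies that formula; the 80% pass threshold and the example numbers are assumptions, and the appropriate link rate depends on how many NICs each node uses.

```python
def allreduce_bus_bw(bytes_reduced, seconds, ranks):
    """Bus bandwidth in GB/s for an all-reduce, per the nccl-tests convention:
    busbw = algbw * 2 * (n - 1) / n, where algbw = size / time."""
    if ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    algbw = bytes_reduced / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks

def passes(busbw_gbs, link_gbs, min_fraction=0.8):
    """True when measured bus bandwidth reaches an assumed fraction of link rate."""
    return busbw_gbs >= link_gbs * min_fraction

# Example: 8 GB reduced in 0.4 s across 16 ranks, judged against a
# 50 GB/s link (400 Gb/s NDR). Numbers are illustrative.
bw = allreduce_bus_bw(8e9, 0.4, 16)
print(round(bw, 1), passes(bw, 50))
```

A result well below the threshold on only some node pairs usually points to a specific cable, port, or switch setting rather than a fabric-wide problem.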
Haink GPU Cluster Supply and Deployment Support
Haink sources all cluster hardware components — GPU servers, InfiniBand switches and cables, storage systems, and management networking — for AI cluster deployments in Hong Kong, Dubai, and Mainland China. For enterprise clients who need coordination beyond hardware supply, Haink works with local installation partners in each market for rack-and-stack, InfiniBand cabling, and initial OS/driver deployment.
Frequently Asked Questions
How long does it take to build and deploy a GPU cluster?
A 4–8 node cluster: 8–16 weeks from purchase order to operational, including hardware lead time (4–10 weeks), shipping, facility preparation, rack-and-stack, and software configuration. A 32+ node cluster: 16–32 weeks, including phased delivery, full InfiniBand fabric installation, parallel storage deployment, and cluster-wide burn-in. B200 and B300 systems have extended lead times due to demand; factor in 12–20 weeks of hardware lead time for Blackwell-generation clusters.
What is the total cost of a 32-GPU (4-node H100) cluster?
Approximate all-in costs for a 4-node × 8× H100 SXM5 cluster: GPU servers (4× DGX H100 equivalent): USD 1.4M–1.8M; InfiniBand HDR/NDR switches and cables: USD 80,000–150,000; shared storage (100 TB all-flash): USD 200,000–400,000; management networking (Cisco/Aruba): USD 20,000–40,000. Total hardware: approximately USD 1.7M–2.4M before facility costs. B200-based equivalent: approximately USD 2.5M–3.5M.
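The total follows directly from summing the line items. The short sketch below reproduces that arithmetic so the ranges can be adjusted per quote; the dictionary keys are illustrative labels.

```python
# Line items in USD (low, high), copied from the estimate above.
LINE_ITEMS = {
    "gpu_servers": (1_400_000, 1_800_000),
    "infiniband": (80_000, 150_000),
    "storage": (200_000, 400_000),
    "mgmt_network": (20_000, 40_000),
}

def total_range(items):
    """Sum the low and high ends of each line item separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high

low, high = total_range(LINE_ITEMS)
print(f"USD {low:,} to {high:,}")
```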
Can I start with a small cluster and expand it later?
Yes, but the network must be designed for the final size from the start. An InfiniBand leaf switch (64 ports) that is 25% utilized today can accept additional nodes up to its port count without replacement. Adding a new GPU server to an existing cluster requires one available InfiniBand port for each of the server's NICs (eight for the standard training node described earlier) and the same switch firmware version. Plan the network for 2–3× current node count from day one to avoid costly switch replacement when scaling.
