
GPU Cluster Deployment — Planning and Building an Enterprise AI Cluster

Deploying a GPU cluster for AI training or inference involves more than purchasing GPU servers. A production-grade cluster requires hardware selection aligned to the workload, a network architecture that does not bottleneck distributed training, storage sized to feed GPU memory, physical infrastructure that can handle multi-kilowatt rack densities, and a software stack that orchestrates the hardware efficiently. Haink supplies GPU cluster hardware across Hong Kong, Dubai, and Mainland China and coordinates end-to-end procurement for enterprises building AI clusters from the ground up.

Step 1: Define the Workload

GPU cluster design starts with the workload, not the hardware. Key questions before specifying hardware:

Step 2: Select the GPU Platform

GPU selection drives every downstream decision. For AI training clusters built in 2025–2026, the choice is between:

Standard AI training node: 8 GPUs per server, 2 CPU sockets, 1–2 TB system RAM, 8 InfiniBand or 400GbE NICs, local NVMe storage for dataset staging. Haink-supplied servers: Supermicro SYS-821GE (H100/H200), Supermicro ARS-821GL (B200/B300), Dell PowerEdge XE9680 (H100/H200), HPE Cray XD670 (H100/H200).

Step 3: Design the Network Architecture

The network is the most consequential architectural decision for a multi-node training cluster. An undersized network lowers GPU utilization because GPUs sit idle while they wait for gradient communication to finish. The design goal is a non-blocking fabric with full bisection bandwidth across all nodes at the cluster's planned final size.
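To see why link speed matters, the cost of one gradient synchronization can be approximated with the standard ring all-reduce traffic model, in which each GPU sends and receives 2(N-1)/N times the payload. The sketch below is a rough lower bound under assumptions that are not part of this guide: a hypothetical 7B-parameter model with BF16 gradients, one network link per GPU, and no latency, intra-node NVLink shortcuts, or compute/communication overlap. Real collective libraries use more sophisticated algorithms, so the numbers are illustrative only.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound for one ring all-reduce (ignores latency and overlap)."""
    if num_gpus < 2:
        return 0.0
    link_bytes_per_s = link_gbps * 1e9 / 8  # convert Gb/s to bytes/s
    # In a ring all-reduce each GPU moves 2*(N-1)/N of the payload over its link.
    traffic_bytes = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_bytes / link_bytes_per_s


# Hypothetical workload: 7B parameters, 2 bytes per BF16 gradient.
GRADIENT_BYTES = 7e9 * 2

for gbps in (400, 100):
    seconds = ring_allreduce_seconds(GRADIENT_BYTES, num_gpus=128, link_gbps=gbps)
    print(f"{gbps} Gb/s link: ~{seconds:.2f} s per full gradient all-reduce")
```

If a training step takes about as long as the slower figure, the GPUs spend a large share of each step waiting on the network unless communication is overlapped with compute.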

InfiniBand Fat-Tree Topology

A two-tier fat-tree InfiniBand fabric (leaf switches plus spine switches) provides full bisection bandwidth: every GPU can communicate with every other GPU at the full link rate simultaneously. Each node's InfiniBand NICs connect to leaf switches, and leaf switches uplink to spine switches. To stay non-blocking, a leaf switch splits its ports evenly between node-facing downlinks and spine-facing uplinks. For a 16-node cluster (128 GPUs, one NDR 400G port per GPU) built on 64-port NDR switches (NVIDIA QM9700), that works out to 4 leaf switches (32 downlinks and 32 uplinks each) and 2 spine switches. A two-tier fabric of 64-port switches scales to 2,048 endpoints, so a 64-node (512 GPU) cluster still fits in two tiers with 16 leaf and 8 spine switches; three-tier topologies with core switches become necessary only beyond that limit. InfiniBand switches that support NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) perform all-reduce operations in the network, offloading reduction work from the GPUs during gradient synchronization.
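Switch counts for a non-blocking two-tier fat-tree follow from simple port arithmetic. The sketch below assumes one fabric port per GPU (eight per node, matching the standard training node) and that each leaf switch dedicates half of its ports to uplinks; the function name and return format are illustrative, and a real design should be checked against the switch vendor's reference architecture.

```python
import math


def two_tier_fat_tree(endpoints: int, switch_ports: int = 64) -> dict:
    """Leaf and spine counts for a non-blocking two-tier fat-tree of identical switches."""
    downlinks_per_leaf = switch_ports // 2            # other half is reserved for uplinks
    max_endpoints = downlinks_per_leaf * switch_ports  # two-tier limit: ports^2 / 2
    if endpoints > max_endpoints:
        raise ValueError(f"{endpoints} endpoints exceed the two-tier limit of {max_endpoints}")
    leaves = math.ceil(endpoints / downlinks_per_leaf)
    # Every leaf uplink must terminate on a spine port.
    spines = math.ceil(leaves * downlinks_per_leaf / switch_ports)
    return {"leaves": leaves, "spines": spines, "max_endpoints": max_endpoints}


for nodes in (16, 64):
    gpus = nodes * 8  # assumption: 8 GPUs per node, one fabric port per GPU
    print(nodes, "nodes:", two_tier_fat_tree(gpus))
```

Under these assumptions, 128 endpoints need 4 leaf and 2 spine switches, and 512 endpoints need 16 leaf and 8 spine switches.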

RoCEv2 400GbE Alternative

For clusters of 4–16 nodes where InfiniBand budget is a constraint, 400GbE RoCEv2 (RDMA over Converged Ethernet) provides similar RDMA semantics at lower switch cost. RoCEv2 requires a lossless Ethernet configuration using Priority Flow Control (PFC), typically paired with ECN-based congestion control, and careful QoS design to maintain training efficiency. Arista 7800R3, Cisco Nexus 9300/9500, and NVIDIA Spectrum-4 platforms support lossless 400GbE for RoCE AI clusters.

Step 4: Size the Storage

Storage must be specified to exceed GPU I/O demands. A 16-node H100 cluster (128 GPUs) can consume training data at 20–100 GB/s sustained — standard NAS cannot sustain this. Options:

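Model checkpoint storage needs: the weights of a 70B-parameter model in BF16 occupy approximately 140 GB (70 billion parameters × 2 bytes). A checkpoint that also stores optimizer state is several times larger. Training runs checkpoint every few hundred steps, so checkpoint storage of 10–50 TB is typical for large training jobs.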
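The checkpoint arithmetic can be sketched as follows. The 2 bytes per parameter for BF16 weights comes from the figure above; the 14 bytes per parameter for a full training state is an assumption (BF16 weights plus FP32 master weights and two FP32 Adam moments) and varies with the optimizer and framework.

```python
BYTES_PER_PARAM_BF16_WEIGHTS = 2
# Assumption: BF16 weights (2) + FP32 master weights (4) + Adam momentum (4) + variance (4).
BYTES_PER_PARAM_FULL_STATE = 2 + 4 + 4 + 4


def checkpoint_gb(params_billions: int, bytes_per_param: int) -> int:
    """Checkpoint size in GB (1 GB = 1e9 bytes): billions of params x bytes per param."""
    return params_billions * bytes_per_param


def checkpoints_retained(capacity_tb: int, ckpt_gb: int) -> int:
    """How many whole checkpoints fit in the given capacity (1 TB = 1000 GB)."""
    return capacity_tb * 1000 // ckpt_gb


weights_gb = checkpoint_gb(70, BYTES_PER_PARAM_BF16_WEIGHTS)
full_gb = checkpoint_gb(70, BYTES_PER_PARAM_FULL_STATE)
print(f"70B weights only: {weights_gb} GB, with optimizer state: {full_gb} GB")
for capacity_tb in (10, 50):
    print(f"{capacity_tb} TB holds {checkpoints_retained(capacity_tb, weights_gb)} weight-only "
          f"or {checkpoints_retained(capacity_tb, full_gb)} full-state checkpoints")
```

The gap between the two sizes is why the retention policy (how many full-state checkpoints to keep) tends to drive checkpoint capacity more than the model size alone.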

Step 5: Assess Facility Requirements

Before hardware arrives, the data center facility must be qualified:

Step 6: Software Stack Deployment

GPU cluster hardware without the right software stack does not run efficiently. The essential software layers:

Step 7: Burn-In and Validation

Before production use, a new GPU cluster must be validated with: an all-GPU stress test (NCCL all-reduce at full bandwidth, which verifies InfiniBand connectivity and exposes cabling or switch configuration errors); storage throughput benchmarks; power cycle testing; and a training dry run from a known-good checkpoint to confirm that training loss decreases as expected on the new cluster. ECC errors detected during burn-in can indicate GPU or memory defects, which should be addressed under warranty before the cluster enters production use.
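When interpreting all-reduce benchmark timings, a common convention (the one used by NVIDIA's nccl-tests) is to convert the measured time into "bus bandwidth", which scales the raw algorithm bandwidth by 2(n-1)/n so that results are comparable across rank counts. The sketch below applies that formula to a made-up measurement; the 80% pass threshold and the sample timing are hypothetical, not a vendor acceptance criterion.

```python
def allreduce_bus_bandwidth_gbs(message_bytes: float, seconds: float, num_ranks: int) -> float:
    """Bus bandwidth in GB/s for an all-reduce: algbw * 2*(n-1)/n."""
    algorithm_bw = message_bytes / seconds / 1e9  # GB/s actually observed by the caller
    return algorithm_bw * 2 * (num_ranks - 1) / num_ranks


def passes_burn_in(bus_bw_gbs: float, link_gbps: float = 400, min_fraction: float = 0.8) -> bool:
    """Hypothetical acceptance rule: bus bandwidth must reach a fraction of link line rate."""
    line_rate_gbs = link_gbps / 8  # Gb/s -> GB/s
    return bus_bw_gbs >= line_rate_gbs * min_fraction


# Made-up sample: an 8 GB all-reduce across 128 ranks completing in 0.35 s.
bus_bw = allreduce_bus_bandwidth_gbs(8e9, 0.35, 128)
print(f"bus bandwidth: {bus_bw:.1f} GB/s, pass: {passes_burn_in(bus_bw)}")
```

A result far below line rate on only some node pairs usually points to a specific cable, port, or switch setting rather than a cluster-wide problem, so comparing per-node results is worthwhile.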

Haink GPU Cluster Supply and Deployment Support

Haink sources all cluster hardware components — GPU servers, InfiniBand switches and cables, storage systems, and management networking — for AI cluster deployments in Hong Kong, Dubai, and Mainland China. For enterprise clients who need coordination beyond hardware supply, Haink works with local installation partners in each market for rack-and-stack, InfiniBand cabling, and initial OS/driver deployment.

Frequently Asked Questions

How long does it take to build and deploy a GPU cluster?

A 4–8 node cluster: 8–16 weeks from purchase order to operational, including hardware lead time (4–10 weeks), shipping, facility preparation, rack-and-stack, and software configuration. A 32+ node cluster: 16–32 weeks, including phased delivery, full InfiniBand fabric installation, parallel storage deployment, and cluster-wide burn-in. B200 and B300 systems have extended lead times due to demand, so factor in 12–20 weeks of hardware lead time for Blackwell-generation clusters.

What is the total cost of a 32-GPU (4-node H100) cluster?

Approximate all-in costs for a 4-node × 8× H100 SXM5 cluster: GPU servers (4× DGX H100 equivalent): USD 1.4M–1.8M; InfiniBand HDR/NDR switches and cables: USD 80,000–150,000; shared storage (100 TB all-flash): USD 200,000–400,000; management networking (Cisco/Aruba): USD 20,000–40,000. Total hardware: approximately USD 1.7M–2.4M before facility costs. B200-based equivalent: approximately USD 2.5M–3.5M.
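The total can be reproduced by summing the low and high ends of each line item. The sketch below uses only the ranges quoted in this answer; the per-GPU figure is a derived convenience number and excludes facility costs.

```python
# (low, high) hardware cost ranges in USD for the 4-node, 32-GPU H100 example.
LINE_ITEMS = {
    "GPU servers": (1_400_000, 1_800_000),
    "InfiniBand switches and cables": (80_000, 150_000),
    "Shared storage (100 TB all-flash)": (200_000, 400_000),
    "Management networking": (20_000, 40_000),
}
NUM_GPUS = 32

total_low = sum(low for low, _ in LINE_ITEMS.values())
total_high = sum(high for _, high in LINE_ITEMS.values())

print(f"Total hardware: USD {total_low:,} to {total_high:,}")
print(f"Per GPU: USD {total_low / NUM_GPUS:,.0f} to {total_high / NUM_GPUS:,.0f}")
```

Keeping the line items in one structure makes it easy to substitute a current quote for any single component and see the effect on the total.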

Can I start with a small cluster and expand it later?

Yes, but the network must be designed for the final size from the start. An InfiniBand leaf switch (64 ports) that is 25% utilized today can accept additional nodes up to its available port count without replacement; in a non-blocking fat-tree, only the ports not reserved for spine uplinks are available to nodes. Adding a new GPU server to an existing cluster requires one available InfiniBand port per NIC in that server (typically eight for an 8-GPU training node) and switch firmware consistent with the rest of the fabric. Plan the network for 2–3× the current node count from day one to avoid costly switch replacement when scaling.
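Expansion headroom on a leaf switch is a matter of counting free ports. The sketch below is a simplified illustration; the function and its parameters are hypothetical, and the result depends on how many fabric ports each server attaches and how many switch ports are set aside for spine uplinks.

```python
def additional_nodes(total_ports: int, used_ports: int, ports_per_node: int,
                     reserved_uplinks: int = 0) -> int:
    """Whole nodes that still fit on one leaf switch given its free node-facing ports."""
    free_ports = total_ports - reserved_uplinks - used_ports
    if free_ports < 0:
        raise ValueError("used ports plus reserved uplinks exceed the switch port count")
    return free_ports // ports_per_node


USED = 64 * 25 // 100  # a 64-port leaf at 25% utilization has 16 ports in use

print("1 port per server:", additional_nodes(64, USED, ports_per_node=1))
print("8 ports per server:", additional_nodes(64, USED, ports_per_node=8))
print("8 ports per server, 32 uplinks reserved:",
      additional_nodes(64, USED, ports_per_node=8, reserved_uplinks=32))
```

The spread between these cases shows why the port budget should be fixed at design time rather than when the expansion order is placed.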

Haink
info@haink.org

Winning House
72–76 Wing Lok Street
Sheung Wan, Hong Kong

© 2026 Haink. All rights reserved.
Hong Kong · Dubai · Singapore · Mainland China · Delaware (USA)