
AI GPU Cluster Architecture — Compute, Networking, Storage, and Management Layers

A production AI training cluster is not a collection of GPU servers — it is a tightly integrated system where compute, networking, storage, and management planes must be designed together to achieve the GPU utilization and throughput that justifies the hardware investment. This page describes the architecture of enterprise AI training clusters at varying scales, with reference designs from single-node to 512-GPU multi-rack deployments.

The Four Planes of AI Cluster Architecture

Every AI training cluster has four distinct network and functional planes: compute, networking, storage, and management.

Single-Node Architecture (8 GPUs)

A single 8-GPU SXM server (DGX H100, Supermicro SYS-821GE, Dell XE9680) contains its eight GPUs, the NVLink fabric that connects them, and local NVMe storage.

For a single-node deployment, there is no external cluster fabric to design. The NVLink fabric is fixed hardware. A management switch (1GbE) and a connection to external storage (if not using local NVMe only) complete the single-node architecture.

Small Cluster Architecture: 4–8 Nodes (32–64 GPUs)

InfiniBand Layer

A single InfiniBand leaf switch (NVIDIA QM9700, 64 ports of NDR 400G) connects all nodes. Each GPU server has 8 NDR 400G ports (one per GPU), so each server occupies 8 switch ports. A 4-node cluster uses 32 ports on the leaf switch, leaving 32 ports available for future expansion or uplinks. An 8-node cluster fills all 64 ports, so growth beyond 8 nodes requires a second switch tier. Within the single switch, the fabric provides full bisection bandwidth: every GPU can communicate with every other GPU at full link rate simultaneously.
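The port arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a vendor sizing tool: `leaf_port_budget` is a hypothetical helper, and the 64-port switch and 8 NICs per node are taken from the paragraph above.

```python
# Port budget for a single InfiniBand leaf switch.
SWITCH_PORTS = 64    # QM9700: 64 NDR 400G ports (from the text)
NICS_PER_NODE = 8    # one NDR 400G port per GPU (from the text)


def leaf_port_budget(nodes: int,
                     switch_ports: int = SWITCH_PORTS,
                     nics_per_node: int = NICS_PER_NODE) -> dict:
    """Return used and free switch ports for a given node count."""
    used = nodes * nics_per_node
    if used > switch_ports:
        raise ValueError(
            f"{nodes} nodes need {used} ports; the switch has {switch_ports}"
        )
    return {"used": used, "free": switch_ports - used}


if __name__ == "__main__":
    for nodes in (4, 6, 8):
        print(nodes, leaf_port_budget(nodes))
```

At 8 nodes the free-port count reaches zero, which is why larger clusters need a second switch tier.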

NVIDIA SHARP (in-network computing) is supported on Quantum-2 switches, including the QM9700 and its externally managed QM9790 variant. It offloads allreduce operations to the switch ASIC and is cited as reducing GPU load during gradient synchronization by 40–60% for large clusters. For 4–8 node clusters, the benefit is modest; for 16+ node clusters, SHARP provides a meaningful training throughput improvement.

Storage Layer (Small Cluster)

A small cluster can use a dedicated all-flash storage appliance (NetApp AFF, Pure Storage FlashArray) connected via 100GbE or 400GbE NFS/NVMe-oF. The storage network can share the InfiniBand fabric (storage over InfiniBand using NVMe-oF/IB) or use a separate 100GbE Ethernet storage network. For 4-node clusters, a single storage appliance with 100 TB+ all-flash capacity and 20–40 GB/s throughput is typically sufficient.

Management Layer

A separate 1GbE management switch provides BMC/IPMI access to all servers and the InfiniBand switch management port. This network is used for OS deployment, firmware updates, power management, and health monitoring — completely isolated from the training fabric. Cisco Catalyst 1000 or Aruba 2530 series switches are common for the management plane.

Medium Cluster Architecture: 16–32 Nodes (128–256 GPUs)

Two-Tier Fat-Tree InfiniBand

A 16-node cluster (128 GPUs, 128 NDR 400G ports from servers) does not fit on two fully populated leaf switches: a 64-port QM9700 with all 64 ports facing servers has no ports left for uplinks. A non-blocking two-tier fat-tree splits each leaf evenly into 32 server ports and 32 uplink ports, so 128 server ports need four leaf switches, and the resulting 128 uplinks fill two 64-port QM9700 spine switches. This design provides full bisection bandwidth across all 128 GPUs: any GPU can communicate with any other GPU at the full 400G link rate without bandwidth contention at the spine.

Cable topology: each server's 8 NICs are distributed across the leaf switches (2 NICs to each of the four leaves) so that a single leaf failure removes only a quarter of each server's bandwidth instead of all of it. Where cost matters more than bisection bandwidth, a split of 48 server ports and 16 uplinks per leaf needs only three leaf switches and one spine, but it is 3:1 oversubscribed and no longer non-blocking. A 32-node cluster (256 server ports) doubles the non-blocking design to eight leaves and four spines.
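Leaf and spine counts for a two-tier fat-tree follow from port arithmetic. The sketch below assumes 64-port switches throughout and that every leaf uplink terminates on a spine port. `two_tier_fat_tree` is a hypothetical helper, not part of any NVIDIA tool.

```python
import math


def two_tier_fat_tree(endpoints: int, radix: int = 64,
                      down_per_leaf: int = 32) -> dict:
    """Size a two-tier fat-tree for a given number of server-facing ports.

    down_per_leaf is the number of leaf ports facing servers; the rest of
    the leaf's ports are uplinks. A ratio of 1.0 means non-blocking.
    """
    up_per_leaf = radix - down_per_leaf
    if up_per_leaf <= 0:
        raise ValueError("a leaf needs at least one uplink port")
    leaves = math.ceil(endpoints / down_per_leaf)
    spines = math.ceil(leaves * up_per_leaf / radix)
    return {
        "leaves": leaves,
        "spines": spines,
        "oversubscription": down_per_leaf / up_per_leaf,
    }


if __name__ == "__main__":
    # 16 nodes x 8 NICs = 128 server-facing ports
    print(two_tier_fat_tree(128, down_per_leaf=32))  # even split
    print(two_tier_fat_tree(128, down_per_leaf=48))  # 48 down / 16 up
```

An even 32/32 split yields an oversubscription ratio of 1.0 (non-blocking); a 48/16 split uses fewer switches but yields 3.0.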

Parallel Storage (Medium Cluster)

128-GPU clusters require aggregate storage throughput of 30–100 GB/s to keep GPUs fed. A single all-flash NAS appliance typically cannot sustain the upper end of this range. Parallel file systems (WEKA Data Platform, IBM Spectrum Scale (GPFS), Lustre) distribute data across multiple NVMe storage nodes, so aggregate bandwidth grows roughly linearly with node count. Typical medium cluster storage: 4–8 WEKA nodes (each with 12× NVMe U.2 drives and 2× 100GbE uplinks) providing 60–200 GB/s aggregate read throughput to the compute nodes. The upper figure is the network line-rate ceiling for 8 nodes (8 × 2 × 100 Gb/s = 200 GB/s), so sustained throughput will be somewhat lower.
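The throughput figures above reduce to simple arithmetic. The sketch below uses hypothetical helper names and assumes decimal units (1 GB/s = 8 Gb/s) with no protocol overhead. It estimates the read bandwidth available per GPU and the network ceiling of a storage tier.

```python
def per_gpu_read_gbs(aggregate_gbs: float, gpus: int) -> float:
    """Average read bandwidth available to each GPU, in GB/s."""
    return aggregate_gbs / gpus


def storage_network_ceiling_gbs(nodes: int, links_per_node: int = 2,
                                link_gbit: float = 100.0) -> float:
    """Upper bound on storage throughput imposed by the node uplinks, in GB/s."""
    return nodes * links_per_node * link_gbit / 8  # bits -> bytes


if __name__ == "__main__":
    for aggregate in (30, 100):
        share = per_gpu_read_gbs(aggregate, 128)
        print(f"{aggregate} GB/s over 128 GPUs -> {share:.2f} GB/s per GPU")
    for nodes in (4, 8):
        ceiling = storage_network_ceiling_gbs(nodes)
        print(f"{nodes} storage nodes -> {ceiling:.0f} GB/s network ceiling")
```

If the network ceiling sits close to the target throughput, adding storage nodes or faster uplinks is usually the first lever to consider.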

Large Cluster Architecture: 64+ Nodes (512+ GPUs)

Three-Tier Fat-Tree

At 64 nodes (512 GPUs, 512 NDR 400G ports), a two-tier fat-tree built from 64-port switches still has non-blocking capacity: 16 leaf switches (32 server ports and 32 uplinks each) and 8 spine switches. The two-tier limit with 64-port switches is 2,048 endpoints (64 leaves × 32 server ports), or 256 eight-GPU nodes. Beyond that, a three-tier fat-tree is required: compute nodes connect to leaf switches; leaf switches connect to aggregation/spine switches; aggregation switches connect to core switches. Full bisection bandwidth at this scale is expensive in switches and optics, so the oversubscription ratio needs careful analysis. Some 512-GPU clusters use 2:1 oversubscription at the spine layer if training workloads tolerate occasional bandwidth contention during allreduce.
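The cost of oversubscription can be estimated with a worst-case model. The sketch below assumes every GPU sends across the oversubscribed tier at the same time and that uplink bandwidth is shared evenly. Traffic that stays within one leaf switch is unaffected. The helper name is hypothetical.

```python
def worst_case_per_gpu_gbps(link_gbps: float, oversubscription: float) -> float:
    """Per-GPU bandwidth across an oversubscribed tier when all GPUs send at once."""
    if oversubscription < 1.0:
        raise ValueError("oversubscription ratio must be >= 1.0")
    return link_gbps / oversubscription


if __name__ == "__main__":
    for ratio in (1.0, 2.0, 3.0):
        gbps = worst_case_per_gpu_gbps(400, ratio)
        print(f"{ratio:.0f}:1 -> {gbps:.0f} Gb/s per GPU")
```

At 2:1, a 400G link delivers 200 Gb/s per GPU in the worst case. Whether that is acceptable depends on how much of the allreduce traffic actually crosses the spine.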

Rail-Optimized Topology

NVIDIA's recommended topology for large AI training clusters is "rail-optimized": each of the 8 GPUs in a server connects to a different leaf switch, giving 8 "rails" across the cluster with one leaf switch per rail. All servers' GPU-0 ports connect to switch-1, all GPU-1 ports to switch-2, and so on. This topology maximizes allreduce performance for data-parallel training: as long as a rail fits on one switch, gradient synchronization between all instances of GPU-0 across all servers goes through a single switch with no inter-switch hop, minimizing latency.
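The rail mapping can be written as a small function. This is an illustrative sketch: `rail_switch` and `build_rails` are hypothetical names, and the numbering follows the text (GPU-0 to switch-1, GPU-1 to switch-2).

```python
from collections import defaultdict

GPUS_PER_SERVER = 8


def rail_switch(gpu_index: int) -> int:
    """Leaf switch number for a GPU; depends only on the GPU's local index."""
    if not 0 <= gpu_index < GPUS_PER_SERVER:
        raise ValueError("gpu_index must be in 0..7")
    return gpu_index + 1  # GPU-0 -> switch-1, GPU-1 -> switch-2, ...


def build_rails(num_servers: int) -> dict:
    """Map each switch number to the (server, gpu) ports cabled to it."""
    rails = defaultdict(list)
    for server in range(num_servers):
        for gpu in range(GPUS_PER_SERVER):
            rails[rail_switch(gpu)].append((server, gpu))
    return dict(rails)


if __name__ == "__main__":
    rails = build_rails(num_servers=4)
    for switch, ports in sorted(rails.items()):
        print(f"switch-{switch}: {ports}")
```

Each switch ends up holding the same local GPU index from every server, so peers on the same rail are one switch hop apart.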

GB200 NVL72 Rack-Scale Architecture

NVIDIA GB200 NVL72 replaces the traditional cluster architecture for the largest deployments. Each NVL72 rack contains 72 Blackwell GPUs and 36 Grace CPUs (36 GB200 superchips, each pairing one Grace CPU with two GPUs) connected by a single NVLink 5.0 fabric with 130 TB/s total bandwidth, so the entire rack operates as a single NVLink domain. Multiple NVL72 racks connect via 800G InfiniBand (ConnectX-8 SuperNIC) for inter-rack communication. This architecture removes InfiniBand from the intra-rack path, where NVLink takes over, and retains it for traffic between racks in clusters of multiple NVL72 units.
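A rough comparison shows why intra-rack traffic benefits from NVLink. The sketch below makes two assumptions that are not stated in the text: the 130 TB/s figure is a bidirectional total (so it is halved for a per-direction comparison), and each GPU has one 800G InfiniBand port for inter-rack traffic.

```python
RACK_NVLINK_TBS = 130.0   # total NVLink bandwidth per NVL72 rack (from the text)
GPUS_PER_RACK = 72        # from the text
IB_LINK_GBPS = 800.0      # inter-rack link rate (from the text)


def nvlink_per_gpu_tbs(rack_tbs: float = RACK_NVLINK_TBS,
                       gpus: int = GPUS_PER_RACK) -> float:
    """NVLink bandwidth per GPU in TB/s (bidirectional total, by assumption)."""
    return rack_tbs / gpus


def link_tbs(link_gbps: float) -> float:
    """Convert a link rate in Gb/s to TB/s (decimal units)."""
    return link_gbps / 8 / 1000


per_gpu = nvlink_per_gpu_tbs()
per_direction = per_gpu / 2                       # assumption: figure is bidirectional
ratio = per_direction / link_tbs(IB_LINK_GBPS)    # assumption: one 800G port per GPU

if __name__ == "__main__":
    print(f"NVLink per GPU: {per_gpu:.2f} TB/s total, "
          f"{per_direction:.2f} TB/s per direction")
    print(f"NVLink vs 800G InfiniBand, per direction: about {ratio:.0f}x")
```

Under these assumptions, NVLink offers roughly an order of magnitude more per-GPU bandwidth than the inter-rack link, which is why communication-heavy parallelism is usually kept inside the rack.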

Reference Architecture Summary

Scale  | Nodes | GPUs    | IB Topology                   | Storage
Entry  | 1     | 8       | None (NVLink only)            | Local NVMe or NFS
Small  | 4–8   | 32–64   | 1 leaf switch                 | All-flash NAS
Medium | 16–32 | 128–256 | 2-tier fat-tree               | Parallel FS (WEKA/GPFS)
Large  | 64+   | 512+    | 2- or 3-tier fat-tree / rail  | Large-scale parallel FS

© 2026 Haink. All rights reserved.