AI GPU Cluster Architecture — Compute, Networking, Storage, and Management Layers
A production AI training cluster is not a collection of GPU servers. It is a tightly integrated system in which the compute, networking, storage, and management planes must be designed together to achieve the GPU utilization and throughput that justify the hardware investment. This page describes the architecture of enterprise AI training clusters at varying scales, with reference designs from single-node to 512-GPU multi-rack deployments.
The Four Planes of AI Cluster Architecture
Every AI training cluster has four distinct network and functional planes:
- Compute plane (NVLink): GPU-to-GPU communication within a single server, provided by NVLink fabric (NVLink 4.0 on H100/H200, NVLink 5.0 on B200/B300). Non-programmable — this is hardware fabric, not a network you design.
- High-speed interconnect plane (InfiniBand or RoCE): GPU-to-GPU communication between servers — the distributed training fabric. This is the most performance-critical plane you design. Typically 400G InfiniBand NDR or 400GbE RoCEv2.
- Storage plane: Connects GPU servers to shared dataset storage and checkpoint storage. Typically a separate 100GbE network, or InfiniBand shared with the interconnect plane, depending on cluster design.
- Management plane: Out-of-band BMC/IPMI access, OS management, monitoring. Typically 1GbE or 10GbE on a completely separate physical network from the training fabric.
Single-Node Architecture (8 GPUs)
A single 8-GPU SXM server (DGX H100, Supermicro SYS-821GE, Dell XE9680) contains:
- 8 GPUs connected via NVLink 4.0 (H100/H200) or NVLink 5.0 (B200/B300) through NVSwitch, giving each GPU 900 GB/s or 1,800 GB/s of bidirectional NVLink bandwidth, respectively, to any other GPU in the server
- 2 CPU sockets connected to GPUs via PCIe Gen5 (for system memory access and host-device data transfers)
- 8 InfiniBand or Ethernet NICs — each GPU has a dedicated NIC for external cluster communication
- Local NVMe storage (30 TB in DGX) for dataset staging
For a single-node deployment, there is no external cluster fabric to design. The NVLink fabric is fixed hardware. A management switch (1GbE) and a connection to external storage (if not using local NVMe only) complete the single-node architecture.
Small Cluster Architecture: 4–8 Nodes (32–64 GPUs)
InfiniBand Layer
A single InfiniBand leaf switch (NVIDIA QM9700, 64 ports NDR 400G) connects all nodes. Each GPU server has 8 NDR 400G ports (one per GPU), so each server occupies 8 switch ports. A 4-node cluster uses 32 ports on the leaf switch, leaving 32 ports available for future expansion or uplinks; an 8-node cluster fills all 64 ports, so growing past 8 nodes means adding switches. The leaf switch provides full bisection bandwidth: every GPU can communicate with every other GPU at full link rate simultaneously.
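The port arithmetic for a single-leaf design can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the defaults (8 NICs per node, 64 switch ports) are taken from the figures in this section.

```python
def leaf_port_budget(nodes: int, nics_per_node: int = 8, switch_ports: int = 64) -> dict:
    """Return used and free ports when every NIC plugs into one leaf switch."""
    used = nodes * nics_per_node
    if used > switch_ports:
        raise ValueError(f"{nodes} nodes need {used} ports; switch has {switch_ports}")
    return {"used": used, "free": switch_ports - used}


def max_nodes_on_one_leaf(nics_per_node: int = 8, switch_ports: int = 64) -> int:
    """Largest node count that fits on a single leaf when no ports are kept for uplinks."""
    return switch_ports // nics_per_node


if __name__ == "__main__":
    for n in (4, 8):
        print(n, "nodes:", leaf_port_budget(n))
    print("max nodes on one leaf:", max_nodes_on_one_leaf())
```

Calling `leaf_port_budget(9)` raises `ValueError`, which is the point at which a second switch tier becomes necessary.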
NVIDIA SHARP (in-network computing), supported across the Quantum-2 switch family (QM9700 and QM9790), offloads allreduce reductions to the switch ASIC and can reduce GPU compute load during gradient synchronization by 40–60% on large clusters. For 4–8 node clusters, the benefit is modest; for 16+ node clusters, SHARP provides a meaningful improvement in training throughput.
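One way to see why in-network reduction helps more as clusters grow is to compare how much data each GPU must send per allreduce. The sketch below uses the standard cost model for ring allreduce and an idealized model for in-switch reduction in which each GPU sends its buffer once. It is a traffic model under those assumptions, not a measurement of SHARP, and the function names are illustrative.

```python
def ring_allreduce_bytes_sent(num_gpus: int, message_bytes: float) -> float:
    """Bytes each GPU sends in a ring allreduce (reduce-scatter plus all-gather)."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def in_network_bytes_sent(num_gpus: int, message_bytes: float) -> float:
    """Idealized in-switch reduction: each GPU sends its buffer up the tree once."""
    if num_gpus < 2:
        return 0.0
    return float(message_bytes)


if __name__ == "__main__":
    size = 1.0  # gradient buffer size, in GiB
    for gpus in (8, 64, 512):
        ring = ring_allreduce_bytes_sent(gpus, size)
        tree = in_network_bytes_sent(gpus, size)
        print(f"{gpus:4d} GPUs: ring sends {ring:.3f} GiB, in-network sends {tree:.3f} GiB")
```

In this model the ring cost approaches twice the buffer size as the GPU count grows, while the in-network cost stays at one buffer.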
Storage Layer (Small Cluster)
A small cluster can use a dedicated all-flash storage appliance (NetApp AFF, Pure Storage FlashArray) connected via 100GbE or 400GbE NFS/NVMe-oF. The storage network can share the InfiniBand fabric (storage over InfiniBand using NVMe-oF/IB) or use a separate 100GbE Ethernet storage network. For 4-node clusters, a single storage appliance with 100 TB+ all-flash capacity and 20–40 GB/s throughput is typically sufficient.
Management Layer
A separate 1GbE management switch provides BMC/IPMI access to all servers and the InfiniBand switch management port. This network is used for OS deployment, firmware updates, power management, and health monitoring — completely isolated from the training fabric. Cisco Catalyst 1000 or Aruba 2530 series switches are common for the management plane.
Medium Cluster Architecture: 16–32 Nodes (128–256 GPUs)
Two-Tier Fat-Tree InfiniBand
A 16-node cluster presents 128 NDR 400G server ports (128 GPUs). Two 64-port QM9700 leaf switches have exactly 128 ports, which leaves none for uplinks, so a non-blocking design needs four leaf switches, each with 32 server-facing ports and 32 uplink ports, connected to two 64-port spine switches. A non-blocking two-tier fat-tree provides full bisection bandwidth across all 128 GPUs: at full utilization, any GPU can communicate with any other GPU at the full 400G link rate without bandwidth contention at the spine.
Cable topology: each server's 8 NICs are distributed evenly across the four leaf switches (2 NICs per leaf), so a single leaf failure removes a quarter of each server's bandwidth rather than all of it. Each leaf switch carries 32 server ports (16 servers × 2 NICs) and 32 uplink ports, 16 to each spine. Each spine is a 64-port QM9700 with all 64 ports used as downlinks (4 leaves × 16 links). A cheaper variant with 48 server ports and 16 uplink ports per leaf is 3:1 oversubscribed, not non-blocking.
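Switch counts for a two-tier fat-tree follow from port arithmetic. The sketch below assumes a strictly non-blocking design in which every leaf dedicates half its ports to servers and half to uplinks, and all switches have the same radix. It also reports the oversubscription ratio of an arbitrary split such as 48 server ports to 16 uplinks. Function names are illustrative.

```python
import math


def size_two_tier_fat_tree(endpoints: int, radix: int = 64) -> dict:
    """Size a non-blocking two-tier fat-tree built from identical `radix`-port switches."""
    down_per_leaf = radix // 2            # ports facing servers
    up_per_leaf = radix - down_per_leaf   # ports facing spines
    leaves = math.ceil(endpoints / down_per_leaf)
    if leaves > radix:
        # Every spine needs at least one port per leaf.
        raise ValueError("endpoint count exceeds two-tier capacity for this radix")
    spines = math.ceil(leaves * up_per_leaf / radix)
    return {"leaves": leaves, "spines": spines,
            "down_per_leaf": down_per_leaf, "up_per_leaf": up_per_leaf}


def oversubscription(server_ports: int, uplink_ports: int) -> float:
    """Ratio of server-facing to spine-facing bandwidth on one leaf (1.0 is non-blocking)."""
    return server_ports / uplink_ports


if __name__ == "__main__":
    print("16 nodes:", size_two_tier_fat_tree(16 * 8))
    print("32 nodes:", size_two_tier_fat_tree(32 * 8))
    print("48/16 split ratio:", oversubscription(48, 16))
```

For 128 endpoints the sketch returns 4 leaves and 2 spines; for 256 endpoints it returns 8 leaves and 4 spines.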
Parallel Storage (Medium Cluster)
128-GPU clusters require aggregate storage throughput of 30–100 GB/s to keep GPUs fed. A single all-flash NAS appliance typically cannot sustain this. Parallel file systems (WEKA Data Platform, IBM Spectrum Scale (GPFS), Lustre) distribute data across multiple NVMe storage nodes, so aggregate bandwidth scales roughly linearly with node count. Typical medium cluster storage: 4–8 WEKA nodes (each with 12× NVMe U.2 drives, 2× 100GbE uplinks) providing 60–200 GB/s aggregate read throughput to the compute nodes. The upper figure equals the network line rate of 8 nodes (8 × 2 × 100 Gb/s = 200 GB/s), so sustained throughput will be somewhat lower.
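The storage figures above can be sanity-checked against the network line rate of the storage nodes. This sketch only converts uplink capacity to GB/s and divides a target throughput across GPUs. It ignores protocol overhead and drive limits, and the function names are illustrative.

```python
def storage_network_ceiling_gbytes(nodes: int, uplinks_per_node: int = 2,
                                   link_gbits: float = 100.0) -> float:
    """Aggregate line rate of the storage nodes' uplinks in GB/s (8 bits per byte)."""
    return nodes * uplinks_per_node * link_gbits / 8


def per_gpu_share_gbytes(aggregate_gbytes: float, gpus: int) -> float:
    """Even share of an aggregate throughput figure per GPU, in GB/s."""
    return aggregate_gbytes / gpus


if __name__ == "__main__":
    for nodes in (4, 8):
        print(nodes, "storage nodes:", storage_network_ceiling_gbytes(nodes), "GB/s ceiling")
    for target in (30, 100):
        print(target, "GB/s over 128 GPUs:", per_gpu_share_gbytes(target, 128), "GB/s each")
```

With 2× 100GbE per node, 4 nodes cap out at 100 GB/s and 8 nodes at 200 GB/s. The 30–100 GB/s requirement works out to roughly 0.23–0.78 GB/s per GPU on a 128-GPU cluster.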
Large Cluster Architecture: 64+ Nodes (512+ GPUs)
Three-Tier Fat-Tree
At 64 nodes (512 GPUs, 512 NDR 400G ports), the fabric grows substantially. A non-blocking two-tier fat-tree built from 64-port switches can still carry this load (16 leaf switches with 32 server ports each, plus 8 spine switches), because its limit is 64 leaves × 32 server ports = 2,048 endpoints. A three-tier fat-tree is required once the cluster outgrows that two-tier limit: compute nodes connect to leaf switches; leaf switches connect to aggregation/spine switches; aggregation switches connect to core switches. Maintaining full bisection bandwidth at this scale requires careful oversubscription analysis. Some 512-GPU clusters use 2:1 oversubscription at the spine layer if training workloads tolerate occasional bandwidth contention during allreduce.
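The cost of oversubscription can be made concrete with a small calculation. The sketch below applies a target ratio at a single tier boundary (a leaf's server ports versus its uplinks) and reports the bandwidth each server port gets when all of them send through the uplinks at once. The function names and the single-tier simplification are assumptions made for illustration.

```python
import math


def split_leaf_ports(radix: int, ratio: float) -> tuple[int, int]:
    """Split a switch's ports into (server_ports, uplink_ports) so that
    server_ports / uplink_ports does not exceed `ratio`."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1 (1 means non-blocking)")
    uplinks = math.ceil(radix / (1 + ratio))
    return radix - uplinks, uplinks


def worst_case_gbits(link_gbits: float, server_ports: int, uplink_ports: int) -> float:
    """Per-port bandwidth when every server port sends through the uplinks at once."""
    return min(link_gbits, link_gbits * uplink_ports / server_ports)


if __name__ == "__main__":
    down, up = split_leaf_ports(64, 2)  # 2:1 target on a 64-port switch
    print(f"{down} server ports, {up} uplinks, "
          f"{worst_case_gbits(400, down, up):.0f} Gb/s per port under full load")
```

At an exact 2:1 split, each 400G port is left with 200 Gb/s of cross-tier bandwidth in the worst case, which is the contention the workload has to tolerate.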
Rail-Optimized Topology
NVIDIA's recommended topology for large AI training clusters is "rail-optimized": each of the 8 GPUs in a server connects to a different leaf switch, giving 8 "rails" across the cluster with one leaf switch per rail. All servers' GPU-0 ports connect to switch-1, all GPU-1 ports to switch-2, etc. This topology maximizes allreduce performance for data-parallel training: gradient synchronization between all instances of GPU-0 across all servers goes through a single switch with no inter-switch hop, minimizing latency. Once a rail has more servers than one leaf switch has server ports, the rail spans several leaf switches and traffic between them crosses the spine.
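The rail mapping described above is simple enough to express as a function from (server, GPU) to (switch, port). This is a minimal sketch: the naming scheme follows the "GPU-0 to switch-1" convention used in the text, and the default of 32 server ports per rail switch is an assumption (half of a 64-port switch, with the rest kept for uplinks).

```python
def rail_port(server: int, gpu: int, gpus_per_server: int = 8,
              server_ports_per_switch: int = 32) -> tuple[str, int]:
    """Return (switch name, port index) for one GPU's NIC in a rail-optimized fabric."""
    if not 0 <= gpu < gpus_per_server:
        raise ValueError("gpu index out of range")
    if not 0 <= server < server_ports_per_switch:
        raise ValueError("server does not fit on a single rail switch")
    return f"switch-{gpu + 1}", server


def same_rail(gpu_a: int, gpu_b: int) -> bool:
    """NICs of GPUs with the same local index share a rail, and so a leaf switch."""
    return gpu_a == gpu_b


if __name__ == "__main__":
    for server in range(3):
        print(f"server {server}:", [rail_port(server, g) for g in range(8)])
```

Because data-parallel allreduce pairs GPUs with the same local index, `same_rail` is true for that traffic and it stays on one leaf switch.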
GB200 NVL72 Rack-Scale Architecture
NVIDIA GB200 NVL72 replaces the traditional cluster architecture for the largest deployments. Each NVL72 rack contains 72 Blackwell GPUs and 36 Grace CPUs connected by a single NVLink 5.0 fabric with 130 TB/s total bandwidth, so the entire rack operates as a single NVLink domain. Multiple NVL72 racks connect via 800G InfiniBand (ConnectX-8 SuperNIC) for inter-rack communication. This architecture removes InfiniBand from the intra-rack path (replaced by NVLink) while retaining it for inter-rack communication in clusters of multiple NVL72 units.
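The 130 TB/s figure is consistent with the per-GPU NVLink 5.0 bandwidth quoted earlier for B200, as the arithmetic below shows. The comparison with InfiniBand assumes one 800G port per GPU, which this page does not state for NVL72, so treat that ratio as illustrative only.

```python
NVLINK5_GBYTES_PER_GPU = 1800   # GB/s per GPU, from the single-node section
GPUS_PER_NVL72_RACK = 72


def nvlink_domain_gbytes(gpus: int = GPUS_PER_NVL72_RACK,
                         per_gpu_gbytes: int = NVLINK5_GBYTES_PER_GPU) -> int:
    """Total NVLink bandwidth of the rack: GPU count times per-GPU bandwidth."""
    return gpus * per_gpu_gbytes


def nvlink_to_ib_ratio(per_gpu_gbytes: int = NVLINK5_GBYTES_PER_GPU,
                       ib_gbits_per_gpu: float = 800.0) -> float:
    """How many times faster the intra-rack path is than an assumed per-GPU IB port."""
    return per_gpu_gbytes / (ib_gbits_per_gpu / 8)


if __name__ == "__main__":
    print("rack NVLink total:", nvlink_domain_gbytes() / 1000, "TB/s")
    print("NVLink vs 800G IB per GPU:", nvlink_to_ib_ratio(), "x")
```

72 × 1,800 GB/s gives 129.6 TB/s, which rounds to the quoted 130 TB/s.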
Reference Architecture Summary
| Scale | Nodes | GPUs | IB Topology | Storage |
|---|---|---|---|---|
| Entry | 1 | 8 | None (NVLink only) | Local NVMe or NFS |
| Small | 4–8 | 32–64 | 1 leaf switch | All-flash NAS |
| Medium | 16–32 | 128–256 | 2-tier fat-tree | Parallel FS (WEKA/GPFS) |
| Large | 64+ | 512+ | 2- or 3-tier fat-tree / rail | Large-scale parallel FS |
