
AI Training Infrastructure Supplier — Multi-Node GPU Clusters for LLM Training and Fine-Tuning

Haink supplies AI training infrastructure to enterprises and research institutions in Hong Kong, Dubai (UAE), and Mainland China: GPU servers, InfiniBand networking, and parallel storage configured for large language model pre-training and fine-tuning workloads. Training infrastructure is the most demanding category of AI hardware: it requires the highest GPU-to-GPU communication bandwidth, the most storage throughput, and the greatest power density of any enterprise compute deployment. Haink sources AI training hardware from NVIDIA, Supermicro, Dell, and HPE, with InfiniBand networking from NVIDIA (formerly Mellanox).

What Makes AI Training Infrastructure Different

Training a large AI model involves distributed computation across multiple GPUs, sometimes hundreds or thousands in parallel. At each training step, the gradients computed on each GPU must be aggregated across all GPUs in the cluster (an all-reduce operation) before the next step begins. The latency and bandwidth of this communication are a fundamental bottleneck: a cluster whose GPUs spend significant time waiting for gradient communication is an expensive cluster running at reduced efficiency. AI training infrastructure is therefore defined by its communication architecture, meaning the GPU interconnect within a server (NVLink) and between servers (InfiniBand), as much as by raw GPU FLOPS.
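To make the communication bottleneck concrete, the following is a rough sketch of how long one gradient all-reduce takes when only link bandwidth is considered. It assumes a ring all-reduce, in which each GPU sends and receives 2 × (N − 1) / N times the gradient payload; latency and compute/communication overlap are ignored. The function name, the 7B-parameter example, and the usable NVLink bandwidth figure are illustrative assumptions, not measured values.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    A ring all-reduce is a reduce-scatter pass followed by an all-gather
    pass, so each GPU moves 2 * (N - 1) / N times the payload.
    Latency and overlap with compute are deliberately ignored.
    """
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / link_bytes_per_s


# Hypothetical example: 7B parameters with 2-byte (BF16) gradients.
grad_bytes = 7e9 * 2
nvlink_bytes_per_s = 450e9  # assumed per-direction share of 900 GB/s bidirectional
ndr_bytes_per_s = 50e9      # 400 Gbps InfiniBand NDR equals 50 GB/s per port

print("NVLink:", ring_allreduce_seconds(grad_bytes, 8, nvlink_bytes_per_s), "s")
print("NDR:   ", ring_allreduce_seconds(grad_bytes, 8, ndr_bytes_per_s), "s")
```

Comparing the two figures with the compute time of a training step gives a first indication of whether a cluster would be communication-bound.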

GPU Platforms for AI Training

NVIDIA H100 SXM5 — Proven Training Workhorse

H100 SXM5 with NVLink 4.0 (900 GB/s bidirectional) is the most deployed GPU for LLM training globally. The SXM form factor with NVLink provides roughly 7× the GPU-to-GPU bandwidth of PCIe Gen5 x16 peer-to-peer (about 128 GB/s bidirectional), which is critical for all-reduce communication in multi-GPU training. An 8× H100 SXM5 server (DGX H100 or equivalent) provides 640 GB of total GPU memory, and each GPU delivers 3,958 TFLOPS FP8 with sparsity (about half that for dense math). H100 has the broadest software ecosystem for training: the major distributed training frameworks (PyTorch FSDP, Megatron-LM, DeepSpeed) and the NCCL communication library they rely on are fully optimized for H100. For organizations with confirmed H100 allocation and acceptable training timelines, H100 SXM5 remains the most cost-effective proven training platform.

NVIDIA H200 SXM5 — Memory-Bandwidth Training

H200 SXM5 carries 141 GB HBM3e per GPU (vs H100's 80 GB) with 4.8 TB/s memory bandwidth (vs 3.35 TB/s) at identical FP8 compute. H200 improves training performance for memory-bandwidth-bound workloads — specifically models with very large activation sizes, long-sequence training (128K+ token context), or training runs that frequently hit GPU memory limits and require activation checkpointing to fit. H200 drops into the same SXM5 server baseboard as H100, making it an in-place upgrade for existing H100 clusters.

NVIDIA B200 SXM5 — Current-Generation Training Platform

B200 (Blackwell) with 9,000 TFLOPS FP8 (with sparsity), 192 GB HBM3e, and NVLink 5.0 (1,800 GB/s) is the current-generation choice for new AI training clusters. B200 delivers approximately 2.3× the FP8 compute of H100 and 2× the NVLink communication bandwidth, which can reduce training time for the same model by 2–3× versus an equivalent H100 cluster, depending on the workload. B200 introduces FP4 precision (18,000 TFLOPS with sparsity) for compatible training workflows. At up to about 1,000 W per GPU, power and cooling are the primary infrastructure considerations for B200 deployments: air-cooled 8-GPU systems exist, but dense racks running at full training utilization generally call for direct liquid cooling. For new cluster builds where training speed is the priority, B200 is the recommended GPU.

NVIDIA B300 SXM — Frontier Model Training

B300 (Blackwell Ultra) with 288 GB HBM3e per GPU lets the very largest models train with lower tensor- and pipeline-parallel degrees than B200 or H100 allow. At 1T+ parameter scale, model parallelism means distributing the model's weights across dozens of GPUs: the more memory per GPU, the fewer GPUs each model replica needs, and the less communication overhead the parallelism topology incurs. B300 targets organizations pre-training frontier models where GPU memory capacity, not FLOPS, is the limiting architectural constraint.

Distributed Training: How Multi-Node Clusters Work

Data Parallelism

In data-parallel training, each GPU holds a full copy of the model and processes a different batch of training data. At the end of each step, all GPUs synchronize gradients via all-reduce. This is the simplest distributed training approach and works well for models that fit in the GPU memory of a single device. The communication pattern is an all-reduce over the full gradient vector — bandwidth-intensive but latency-tolerant. InfiniBand provides the bandwidth to make this efficient at scale.
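The gradient synchronization described above can be illustrated with a small pure-Python simulation. It is a sketch only: a linear model with a mean-squared-error loss stands in for a real network, the "workers" are list slices rather than GPUs, and `all_reduce_mean` is a hypothetical helper that mimics what a collective library would do across devices. It shows that averaging per-worker gradients over equal-sized shards reproduces the gradient of the full batch.

```python
import random


def mse_gradient(w, xs, ys):
    """Gradient of 0.5 * mean((x . w - y)^2) with respect to w."""
    n = len(ys)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj / n
    return grad


def all_reduce_mean(local_grads):
    """Simulated all-reduce: element-wise sum across workers, then average."""
    k = len(local_grads)
    return [sum(column) / k for column in zip(*local_grads)]


random.seed(0)
num_workers, per_worker, dim = 4, 16, 3
total = num_workers * per_worker
xs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(total)]
ys = [random.gauss(0, 1) for _ in range(total)]
w = [random.gauss(0, 1) for _ in range(dim)]  # every worker holds the same weights

# Each worker computes a gradient on its own equal-sized shard of the batch.
local_grads = [
    mse_gradient(w, xs[i * per_worker:(i + 1) * per_worker],
                 ys[i * per_worker:(i + 1) * per_worker])
    for i in range(num_workers)
]

synced_grad = all_reduce_mean(local_grads)  # what every worker sees after sync
full_grad = mse_gradient(w, xs, ys)         # single-device reference

print(all(abs(a - b) < 1e-9 for a, b in zip(synced_grad, full_grad)))
```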

Tensor Parallelism

In tensor parallelism, individual model layers are split across multiple GPUs — each GPU holds a slice of each layer's weight matrix. This enables training (and inference) of models too large to fit on a single GPU. Communication happens inside each forward and backward pass — the latency and bandwidth requirements are much stricter than data parallelism. Tensor parallelism within an 8-GPU NVLink node is efficient (900 GB/s NVLink 4.0 on H100 or 1,800 GB/s NVLink 5.0 on B200); tensor parallelism across nodes requires InfiniBand NDR (400 Gbps) at minimum.
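As an illustration of splitting a layer's weight matrix, the sketch below simulates column-wise tensor parallelism for a single matrix multiply in pure Python. Each simulated GPU holds one column slice of the weights, computes its part of the output locally, and the slices are then concatenated, which plays the role of an all-gather. The helper names are hypothetical, and column splitting is only one of several partitioning schemes used in practice.

```python
def matmul(a, b):
    """Plain matrix multiply for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(matrix, parts):
    """Split a matrix into `parts` equal column slices, one per simulated GPU."""
    cols = len(matrix[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of GPUs")
    width = cols // parts
    return [[row[i * width:(i + 1) * width] for row in matrix] for i in range(parts)]


def column_parallel_matmul(x, w, num_gpus):
    """Compute x @ w with w sharded by columns across simulated GPUs."""
    shards = split_columns(w, num_gpus)
    # Local compute: each GPU multiplies the full input by its weight slice.
    partials = [matmul(x, shard) for shard in shards]
    # All-gather: concatenate the output slices along the column dimension.
    return [[v for part in partials for v in part[r]] for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(column_parallel_matmul(x, w, num_gpus=2) == matmul(x, w))
```

The concatenation step is the communication that must happen inside every forward pass, which is why this form of parallelism is usually kept within an NVLink domain.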

Pipeline Parallelism

In pipeline parallelism, consecutive model layers are assigned to different GPUs or nodes. Each GPU processes its layers sequentially in a pipeline. This approach is used for the very largest models where neither data nor tensor parallelism alone is sufficient, or where minimizing per-GPU communication is valuable. Modern LLM training typically uses a combination of data, tensor, and pipeline parallelism (3D parallelism) to maximize GPU utilization in large clusters.
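One cost of pipeline parallelism is idle time while the pipeline fills and drains. A commonly used estimate for a simple synchronous schedule (the GPipe-style schedule is assumed here, not anything specific to the hardware above) puts the idle fraction at (p − 1) / (m + p − 1) for p pipeline stages and m micro-batches per step. The short sketch below evaluates that expression; the stage and micro-batch counts are illustrative.

```python
def pipeline_bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a simple synchronous pipeline schedule.

    With p stages and m micro-batches, a step spans m + p - 1 time slots,
    of which p - 1 are spent filling and draining the pipeline.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be at least 1")
    return (stages - 1) / (micro_batches + stages - 1)


# Illustrative values: more micro-batches per step shrink the bubble.
for m in (4, 16, 64):
    print(f"stages=8 micro_batches={m} bubble={pipeline_bubble_fraction(8, m):.3f}")
```

This is one reason pipeline parallelism is combined with the other two forms: keeping the stage count modest limits the bubble, while data and tensor parallelism absorb the rest of the scale.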

InfiniBand: The Training Cluster Backbone

InfiniBand NDR, provisioned with one 400 Gbps port per GPU, is the standard interconnect for production AI training clusters. InfiniBand provides RDMA (Remote Direct Memory Access): data moves directly between the memory of GPUs in different servers without staging through the CPU, eliminating CPU-side latency overhead. A fat-tree InfiniBand fabric (leaf switches connected to spine switches) provides non-blocking bandwidth: every GPU can communicate with every other GPU at full link speed simultaneously, which is the requirement for efficient all-reduce operations. NVIDIA QM9700/QM9790 are the NDR switch platforms for H100/B200 training clusters.
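For a first-pass sizing of such a fabric, the sketch below counts leaf and spine switches for a two-tier non-blocking fat tree. It assumes each leaf dedicates half of its ports to GPUs and half to uplinks, one fabric port per GPU, and 64-port switches (the port count is an assumption to be checked against the chosen switch model). The function name is hypothetical, and real designs also account for rail-optimized layouts, storage ports, and spare capacity.

```python
import math


def two_tier_fat_tree(num_endpoints: int, switch_ports: int = 64) -> tuple[int, int]:
    """Return (leaf_switches, spine_switches) for a non-blocking two-tier fat tree."""
    down_ports = switch_ports // 2            # leaf ports facing GPUs
    max_endpoints = switch_ports * down_ports  # each spine port reaches one leaf
    if num_endpoints > max_endpoints:
        raise ValueError("too many endpoints for two tiers; a third tier is needed")
    leaves = math.ceil(num_endpoints / down_ports)
    if leaves == 1:
        return 1, 0                           # a single switch needs no spine layer
    uplinks = leaves * down_ports             # one uplink per downlink keeps it non-blocking
    spines = math.ceil(uplinks / switch_ports)
    return leaves, spines


# Illustrative cluster: 32 nodes with 8 GPUs each, one fabric port per GPU.
print(two_tier_fat_tree(32 * 8))
```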

RoCEv2 (RDMA over Converged Ethernet) using 400GbE is an alternative to InfiniBand that provides similar RDMA semantics over standard Ethernet switching. RoCE clusters cost less than InfiniBand due to commodity switch pricing but require careful network engineering (lossless Ethernet with priority flow control, careful topology design) to achieve comparable training efficiency. For clusters of 8–16 nodes, RoCE is a cost-effective alternative. Above 16 nodes, InfiniBand is typically preferred.

Storage for Training Clusters

AI training storage must deliver the dataset to GPU memory faster than GPUs consume it, and must handle model checkpoint writes reliably during long training runs. A 128-GPU cluster training a 70B model can demand on the order of 10–50 GB/s of aggregate storage throughput, particularly during checkpoint writes, which standard NAS storage cannot sustain. Parallel distributed file systems (WEKA, IBM Storage Scale (formerly Spectrum Scale/GPFS), Lustre, BeeGFS) distribute data across multiple all-flash NVMe storage nodes, aggregating bandwidth to match cluster I/O demands. Haink supplies NetApp AFF, Pure Storage FlashArray, and parallel NVMe storage configurations for AI training infrastructure.
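The checkpoint side of the storage requirement can be estimated with simple arithmetic. The sketch below assumes a full training checkpoint stores BF16 weights (2 bytes per parameter) plus FP32 master weights and two Adam moment tensors (12 bytes per parameter), for 14 bytes per parameter in total; actual checkpoint formats vary, and the throughput values are illustrative rather than measured.

```python
def checkpoint_bytes(params: float, bytes_per_param: float = 14.0) -> float:
    """Approximate size of a full training checkpoint.

    Default assumption: 2 bytes of BF16 weights plus 12 bytes of FP32
    master weights and Adam moments per parameter.
    """
    return params * bytes_per_param


def write_seconds(size_bytes: float, storage_bytes_per_s: float) -> float:
    """Time to write a checkpoint at a sustained aggregate write throughput."""
    return size_bytes / storage_bytes_per_s


size = checkpoint_bytes(70e9)  # hypothetical 70B-parameter model
print(f"checkpoint size: {size / 1e9:.0f} GB")
for gb_per_s in (1, 10, 50):   # illustrative aggregate write throughputs
    print(f"{gb_per_s} GB/s -> {write_seconds(size, gb_per_s * 1e9):.1f} s")
```

Unless checkpointing is asynchronous, the GPUs wait for the duration of each write, so the write time multiplied by the checkpoint frequency is lost training time.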

Training Infrastructure Sizes

Fine-tuning a 7B–13B model with LoRA on a proprietary dataset fits comfortably on a single 8-GPU node, such as one 8× H100 SXM5 server. Full supervised fine-tuning of a 70B model (no LoRA) needs at least 8 H100 SXM5 or 4 H200 GPUs, and at that size it relies on memory-saving techniques such as gradient checkpointing and optimizer-state offload. Pre-training a custom 7B model from scratch takes hundreds to thousands of GPU-days depending on the token budget, so a small 4–8 node cluster is the practical minimum for completing training in a reasonable time window. Pre-training a 70B model from scratch requires 4–32 GPU nodes (32–256 GPUs) for a training run measured in weeks or longer. Pre-training at 405B+ scale requires clusters of 256 or more GPUs with a full InfiniBand fabric and parallel storage.
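Pre-training time can be roughed out with the widely used approximation that training a dense transformer costs about 6 × parameters × tokens floating-point operations. The sketch below turns that into GPU-days. The token budget, the per-GPU peak throughput, and the model FLOPS utilization (MFU) are all assumptions chosen for illustration, and real runs depend heavily on the software stack and parallelism configuration.

```python
SECONDS_PER_DAY = 86_400


def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost for a dense transformer: 6 * N * D."""
    return 6.0 * params * tokens


def gpu_days(params: float, tokens: float,
             peak_flops_per_gpu: float, mfu: float) -> float:
    """GPU-days needed at a given peak throughput and utilization."""
    sustained = peak_flops_per_gpu * mfu  # FLOPS actually achieved per GPU
    return training_flops(params, tokens) / sustained / SECONDS_PER_DAY


# Assumed inputs: 7B parameters, 1T training tokens, roughly 989 TFLOPS
# dense BF16 peak per GPU, and 40% utilization.
total_gpu_days = gpu_days(7e9, 1e12, 989e12, 0.40)
for num_gpus in (8, 64, 256):
    print(f"{num_gpus:>3} GPUs: about {total_gpu_days / num_gpus:.0f} days")
```

Dividing by the GPU count assumes perfect scaling, which a well-designed interconnect approaches but never fully reaches.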

Haink AI Training Infrastructure Supply

Haink sources and delivers complete AI training infrastructure stacks to enterprise data centers in Hong Kong, Dubai, and Mainland China: GPU servers (NVIDIA DGX H100/H200/B200, Supermicro SYS-821GE/ARS-821GL, Dell XE9680, HPE Cray XD670), InfiniBand networking (NVIDIA QM9700/QM9790 switches, ConnectX-7/8 HCAs), and parallel storage (NetApp AFF, Pure Storage). For enterprises without data center capacity, Haink advises on colocation facility selection in Hong Kong, Singapore, and Dubai.

Frequently Asked Questions

How many GPUs do I need to fine-tune a 70B model?

For LoRA fine-tuning of a 70B model in FP16: 4 NVIDIA H100 SXM5 (320 GB total, which holds the roughly 140 GB of frozen weights with room for adapter optimizer states and activations at LoRA rank 16). For full supervised fine-tuning (SFT) of 70B in BF16: 8 H100 SXM5 or 4 H200 SXM5 as a minimum, with gradient checkpointing plus optimizer-state offload or a memory-efficient optimizer, because standard mixed-precision Adam keeps about 16 bytes per parameter (roughly 1.1 TB at 70B). For DPO or RLHF fine-tuning, which requires two model instances simultaneously: 16 H100 SXM5 or 8 H200 SXM5 minimum. A single 8× H100 SXM5 node covers most enterprise fine-tuning use cases up to 70B with LoRA or QLoRA.
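The GPU counts above follow from a memory budget that can be sketched in a few lines. The estimate assumes mixed-precision Adam, which keeps about 16 bytes per trainable parameter (2 for weights, 2 for gradients, 12 for FP32 master weights and optimizer moments), and 2 bytes per frozen parameter. Activations, framework overhead, and any offloading are excluded, and the adapter size is a hypothetical figure, so the results are lower bounds for model state only. Where the figure exceeds the pooled GPU memory, sharding alone is not enough and offloading or a lighter optimizer is needed.

```python
GB = 1e9


def full_finetune_gb(params: float, bytes_per_param: float = 16.0) -> float:
    """Model-state memory for full fine-tuning with mixed-precision Adam."""
    return params * bytes_per_param / GB


def lora_finetune_gb(base_params: float, adapter_params: float,
                     base_bytes: float = 2.0, adapter_bytes: float = 16.0) -> float:
    """Model-state memory for LoRA: frozen 16-bit base plus trainable adapters."""
    return (base_params * base_bytes + adapter_params * adapter_bytes) / GB


base = 70e9       # 70B-parameter base model
adapter = 0.2e9   # hypothetical LoRA adapter size; depends on rank and target layers

print(f"full fine-tune: {full_finetune_gb(base):.0f} GB of model state")
print(f"LoRA:           {lora_finetune_gb(base, adapter):.1f} GB of model state")
print(f"8x H100 pool:   {8 * 80} GB, 4x H200 pool: {4 * 141} GB")
```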

What is the difference between fine-tuning and pre-training infrastructure?

Fine-tuning starts from an existing pre-trained model (Llama 3.1, Mistral, Qwen 2.5) and trains it on a smaller proprietary dataset for 100–10,000 steps. Pre-training builds a model from random initialization on trillions of tokens and requires GPU-weeks to GPU-months of compute. Fine-tuning is accessible with 1–4 GPU nodes; pre-training typically requires 8–256+ GPU nodes. For most enterprises, fine-tuning an existing open-source model on proprietary data is the practical AI training workload — dedicated pre-training clusters are built by foundation model companies and hyperscalers.

Do I need InfiniBand for a small training cluster?

Within a single 8-GPU NVLink server, InfiniBand is not needed for communication between GPUs, because NVLink provides the required bandwidth. InfiniBand becomes relevant for multi-node training, where GPUs in different servers need to communicate. For 2–4 node clusters, 200GbE RoCEv2 is a cost-effective alternative. For clusters of 8 or more nodes where training efficiency is the priority, InfiniBand NDR (400G) is recommended, although a well-engineered 400GbE RoCE fabric remains viable up to about 16 nodes. For small-scale fine-tuning on a single 8-GPU node, no InfiniBand is needed at all.