
For a long time, the path to powerful Foundation Models (FMs) seemed straightforward: more compute during pre-training invariably led to better capabilities. This intuition was well-supported by early research, like Kaplan et al. (2020), which revealed predictable power-law trends as model parameters, dataset size, and training compute scaled up. Such findings justified significant investments in large-scale accelerator capacity and the distributed infrastructure needed to keep it humming efficiently.
However, the landscape has evolved dramatically, and scaling is no longer a one-dimensional problem. NVIDIA’s insightful “from one to three scaling laws” framework now highlights that beyond initial pre-training, performance increasingly scales through crucial post-training stages (think supervised fine-tuning and reinforcement learning) and through sophisticated test-time compute strategies, often dubbed “long thinking.” These include advanced search, verification, and multi-sample approaches.
Converging Infrastructure for Modern AI
These diverse scaling regimes—pre-training, post-training, and inference—are converging on similar infrastructure requirements. At their core, they demand tightly coupled accelerator compute, a high-bandwidth, low-latency network, and a robust distributed storage backend. This shift also elevates the importance of effective resource orchestration and application- and hardware-level observability to maintain cluster health and pinpoint performance issues at scale.
Another significant trend is the growing reliance of the entire FM lifecycle on a vibrant open-source software (OSS) ecosystem. This ecosystem spans everything from model development frameworks to cluster resource management and operational tooling. For instance, resource management at the cluster layer often leverages systems like Slurm and Kubernetes, while model development and distributed training commonly utilize frameworks such as PyTorch and JAX.
Monitoring and visualization, or “observability,” are frequently achieved with Prometheus for collecting metrics and Grafana for visualization and alerting, creating an essential operational layer. This layered architecture, which we’ll explore in detail, shows how underlying hardware infrastructure supports resource orchestration, which in turn enables powerful ML frameworks, with observability providing critical insights across all layers.
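To make the observability layer concrete, here is a minimal sketch of how a training process might expose step-level metrics for a Prometheus server to scrape and a Grafana dashboard to plot. The metric names, port, and per-batch token count are illustrative assumptions, and it presumes the prometheus_client Python package is installed.

import time
from prometheus_client import Counter, Gauge, start_http_server

# Expose an HTTP endpoint that a Prometheus server can scrape.
start_http_server(9100)  # port is an assumption; pick one open on the node

# Illustrative metric names; align them with your Grafana dashboards.
step_time = Gauge("train_step_seconds", "Wall-clock time of the last training step")
tokens_seen = Counter("train_tokens_total", "Total training tokens processed")

for step in range(1000):
    start = time.time()
    # ... run one training step here ...
    step_time.set(time.time() - start)
    tokens_seen.inc(65536)  # tokens per global batch, illustrative value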
AWS Building Blocks for Foundation Models
This article series is designed for machine learning engineers and researchers focused on foundation model training and inference, particularly those working with OSS frameworks. We’ll delve into how AWS infrastructure—including multi-node accelerator compute, high-bandwidth/low-latency networking, distributed shared storage, and associated managed services—interacts with common OSS stacks throughout the foundation model lifecycle. Our primary goal is to establish a technical foundation for understanding system bottlenecks and scaling characteristics across pre-training, post-training, and inference.
This introductory post will survey the overall system architecture, highlighting key integration points between AWS infrastructure components and the OSS tools essential for large-scale distributed training and inference. Subsequent sections will examine each layer in detail: infrastructure, resource orchestration, the ML software stack, and observability.
Infrastructure: Compute, Network, and Storage
At the very foundation of this architecture are three tightly coupled building blocks: accelerated compute with expansive device memory, a high-bandwidth interconnect for collective communication, and scalable distributed storage for critical data and checkpoints.

Accelerated compute forms the bedrock for large-scale foundation model pre-training, post-training, and inference. AWS provides access to multiple generations of NVIDIA GPUs through its Amazon EC2 accelerated computing instances, notably the P instance family. The P5 instance family includes options like the p5.48xlarge with eight NVIDIA H100 GPUs, and newer p5e.48xlarge/p5en.48xlarge variants featuring NVIDIA H200 GPUs. Looking ahead, the P6 instance family will introduce NVIDIA Blackwell B200 and B300 architectures with instances like the p6-b200.48xlarge and p6-b300.48xlarge.
Across these generations, the key scaling dimensions are peak Tensor Core throughput, High Bandwidth Memory (HBM) capacity and bandwidth, and interconnect bandwidth, both within and across nodes. For a first-order approximation, peak Tensor Core throughput, measured in FLOPS, helps compare these accelerators. For instance, the NVIDIA H100 delivers high dense BF16/FP16 and FP8 Tensor Core throughput, coupled with substantial HBM capacity and bandwidth. The forthcoming NVIDIA Blackwell B200 and B300 GPUs promise even greater advancements in these areas.
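As a back-of-the-envelope illustration of that first-order view, the common C ≈ 6·N·D approximation for dense transformer training turns a peak-FLOPS figure into a rough training-time estimate. The sketch below relies on that approximation plus an assumed Model FLOPs Utilization (MFU); all numeric inputs are placeholders rather than measured values for any particular instance type.

def estimated_training_days(params, tokens, peak_flops_per_gpu, num_gpus, mfu=0.4):
    """First-order estimate: total FLOPs ~= 6 * params * tokens for a dense transformer."""
    total_flops = 6 * params * tokens
    sustained = peak_flops_per_gpu * num_gpus * mfu  # assumed utilization, not measured
    return total_flops / sustained / 86400           # convert seconds to days

# Placeholder inputs: a 70B-parameter model, 2T tokens,
# roughly 1e15 peak Tensor FLOPS per GPU, 2,048 GPUs.
print(estimated_training_days(params=70e9, tokens=2e12,
                              peak_flops_per_gpu=1e15, num_gpus=2048))

With these placeholder numbers the estimate comes out to roughly twelve days, which is the kind of sanity check this approximation is good for; it says nothing about communication or input-pipeline bottlenecks.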
As models grow, the duration of each training step is increasingly dominated by collective communication and memory movement, rather than raw compute throughput alone. This necessitates careful attention to both scale-up and scale-out bandwidth. For multi-GPU instances, GPU communication operates in two primary regimes. Within a node, NVLink/NVSwitch provides high-bandwidth, low-latency GPU-to-GPU connectivity, allowing critical collectives like all-reduce and all-gather to execute without involving the host network stack.
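For example, a minimal PyTorch sketch of an intra-node all-reduce over NCCL, which rides on NVLink/NVSwitch inside the node, might look like the following. It assumes one process per GPU launched with torchrun; the tensor size is arbitrary.

import os
import torch
import torch.distributed as dist

# NCCL routes intra-node traffic over NVLink/NVSwitch automatically.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor; all ranks receive the elementwise sum.
grad_shard = torch.ones(1024, device="cuda") * dist.get_rank()
dist.all_reduce(grad_shard, op=dist.ReduceOp.SUM)

if dist.get_rank() == 0:
    print(grad_shard[0].item())  # sum of ranks 0..world_size-1

dist.destroy_process_group()

On a single eight-GPU instance this could be launched with, for example, torchrun --nproc_per_node=8 all_reduce_demo.py, keeping all collective traffic on the NVLink domain.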
For communication across nodes, Elastic Fabric Adapter (EFA) delivers OS-bypass networking. AWS leverages EFA as a fundamental component for Amazon EC2 UltraClusters, where communication-heavy collectives can span thousands of instances. EFA, a specialized network interface for Amazon EC2, provides Remote Direct Memory Access (RDMA) capability using the Scalable Reliable Datagram (SRD) protocol. By enabling applications to communicate directly with the network device, bypassing the operating system kernel, EFA dramatically reduces latency and boosts throughput for collective operations in distributed training.
Several generations of EFA are available across different instance families. EFA version 2 (EFAv2) is found on Amazon EC2 P5 and P5e instances, while EFA version 3 (EFAv3) on P5en instances reduces packet latency by approximately 35%. The latest EFA version 4 (EFAv4), available on P6 instances, delivers an additional 18% improvement in collective communication performance, showcasing continuous innovation in network performance.
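In application code, using EFA is mostly a matter of making sure NCCL selects the Libfabric EFA provider via the aws-ofi-nccl plugin. The sketch below shows typical environment settings; AWS Deep Learning AMIs and containers often set these already, so treat it as an illustrative starting point rather than a required recipe.

import os
import torch.distributed as dist

# Typical settings for NCCL over EFA via the aws-ofi-nccl plugin.
os.environ.setdefault("FI_PROVIDER", "efa")            # use the Libfabric EFA provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")   # GPUDirect RDMA where supported
os.environ.setdefault("NCCL_DEBUG", "INFO")            # log which provider NCCL selected

dist.init_process_group(backend="nccl")

With NCCL_DEBUG set to INFO, the startup logs should indicate that the EFA provider was selected, which is a quick way to confirm traffic is not silently falling back to TCP.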
At scale, both distributed training (which involves streaming massive datasets and writing multi-terabyte checkpoints) and large-scale inference (staging model weights and managing growing KV caches) necessitate a tiered storage hierarchy. This typically includes local NVMe SSDs for hot data, Lustre for shared high-throughput access, and Amazon S3 for durable, long-term persistence.
These multi-GPU instances feature substantial local NVMe storage, often provided as ephemeral instance store with capacities reaching tens of terabytes. Amazon FSx for Lustre offers a fully managed, high-performance distributed file system, delivering terabytes per second of throughput, millions of IOPS, and sub-millisecond latencies. Its integration with Amazon S3 via Data Repository Associations allows for lazy loading of training datasets and automatic checkpoint export, ensuring data durability and accessibility.
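In training code, this hierarchy often shows up as little more than a path choice. The sketch below saves a checkpoint to a hypothetical FSx for Lustre mount (/fsx is an assumed mount point, not a fixed convention); with a Data Repository Association configured on that path, the file can then flow out to Amazon S3 without extra application code.

import os
import torch

CKPT_DIR = "/fsx/checkpoints/run-001"   # assumed FSx for Lustre mount point
os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(model, optimizer, step):
    # Write to the shared Lustre file system; a Data Repository Association
    # can export this object to Amazon S3 for durable, long-term persistence.
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pt")
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )
    return path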
At the cluster level, these instances are deployed within Amazon EC2 UltraClusters, which provision thousands of accelerated instances as a tightly integrated cluster within an Availability Zone. These UltraClusters are interconnected using a petabit-scale nonblocking network, designed to handle the most demanding communication patterns.
For workloads with very high per-step communication intensity, such as expert parallelism in Mixture-of-Experts (MoE) models where all-to-all token dispatch spans many GPUs, the size of the NVLink domain can become a critical constraint. To address this, Amazon EC2 UltraServers extend the NVLink domain beyond a single EC2 instance by connecting multiple component instances through a dedicated accelerator interconnect. For example, AWS reports that P6e-GB200 UltraServers are built on the NVIDIA GB200 NVL72 platform, exposing up to 72 Blackwell GPUs and 13.4 TB of aggregate HBM3e within a single NVLink domain.
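To show the communication pattern behind that constraint, here is a minimal sketch of an all-to-all token dispatch using PyTorch's all_to_all_single with equal splits. Real MoE routing uses variable split sizes and fused kernels, and the buffer shapes here are arbitrary assumptions.

import torch
import torch.distributed as dist

# Assumes an NCCL process group is already initialized, one rank per GPU.
world_size = dist.get_world_size()
tokens_per_rank, hidden = 4096, 8192    # illustrative sizes, divisible by world_size

# Each rank holds tokens destined for every other rank (equal splits for simplicity).
send_buf = torch.randn(tokens_per_rank, hidden, device="cuda")
recv_buf = torch.empty_like(send_buf)

# All-to-all token dispatch: every rank exchanges a slice with every other rank.
dist.all_to_all_single(recv_buf, send_buf)

The larger the NVLink domain, the more of this exchange stays on the high-bandwidth scale-up fabric instead of crossing the inter-node network.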
These advanced systems are built from NVIDIA Grace-Blackwell superchips, which integrate Grace CPU memory and Blackwell GPU HBM via cache-coherent NVLink-C2C. This innovative design enables direct access across CPU- and GPU-attached memory without explicit host-device copies. In practice, this can significantly extend the effective memory available to GPU workloads, allowing colder model states or KV caches to reside in CPU-attached memory, albeit with higher latency and lower bandwidth than local HBM.
Resource Orchestration: Slurm and Kubernetes
When training jobs involve hundreds or even thousands of accelerators, manual resource management quickly becomes impractical. Imagine a training job that requires 512 GPUs; this means co-scheduling 64 eight-GPU nodes simultaneously and releasing those resources cleanly upon completion or failure. Orchestrators like Slurm and Kubernetes address this challenge with a control-plane architecture: a centralized scheduler maintains the cluster's state and makes allocation decisions, while worker nodes execute the assigned workloads.
Slurm (Simple Linux Utility for Resource Management) is the leading workload manager in high-performance computing (HPC) environments. Its robust, modular plugin architecture allows for extensive customization of scheduling algorithms, topology models, resource types, and accounting mechanisms, making it incredibly flexible for complex, distributed workloads.
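As a concrete illustration of how a training entrypoint picks up what Slurm scheduled, the sketch below maps standard Slurm-provided environment variables to torch.distributed ranks. It assumes the batch script launches one task per GPU with srun and exports MASTER_ADDR and MASTER_PORT; both of those are assumptions about the surrounding job script rather than Slurm defaults.

import os
import torch
import torch.distributed as dist

# Slurm exports these for every task launched with srun.
rank = int(os.environ["SLURM_PROCID"])         # global rank across the allocation
world_size = int(os.environ["SLURM_NTASKS"])   # e.g. 512 for 64 nodes x 8 GPUs
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    init_method=f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}",
)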