# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4x_max_jobset_workload
description: a4x_max_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
<!-- mdformat global-off -->
# Pretrain Llama 3.1 405B workloads on a4x-max GKE node pools with the NVIDIA Megatron-Bridge framework

This recipe outlines the steps for running a Llama 3.1 405B pretraining
workload on [a4x-max GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
  [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/) to create your a4x-max GKE cluster.
- Node Configuration: 16 nodes (4 GPUs per node, 64 GPUs total).
- GPU Architecture: NVIDIA Blackwell.
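
As a quick sanity check, the node configuration above together with the batch-size and sequence-length values encoded in this recipe's name (`gbs64`, `seq8192`) imply the following per-step numbers; a minimal sketch:

```shell
# Back-of-the-envelope numbers for this recipe's scale.
# Values are taken from this recipe (16 nodes x 4 GPUs, gbs64, seq8192).
NODES=16
GPUS_PER_NODE=4
TOTAL_GPUS=$((NODES * GPUS_PER_NODE))   # 64 GPUs total
GBS=64                                  # global batch size (sequences per step)
SEQ_LEN=8192                            # tokens per sequence
TOKENS_PER_STEP=$((GBS * SEQ_LEN))      # 524288 tokens per optimizer step
echo "GPUs: $TOTAL_GPUS, tokens per step: $TOKENS_PER_STEP"
```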

## Training dataset

This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework.

## Docker container image

This recipe uses the following Docker image:

- `nvcr.io/nvidia/nemo:26.02`
- Note: This recipe uses `nccl-plugin-gib v1.1.1-2` which is installed inside the container.

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
export HF_TOKEN=<YOUR_HF_TOKEN>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x-max`. Make sure to verify the name of the local queue in your cluster.
- `<YOUR_HF_TOKEN>`: your Hugging Face access token.
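
Because every later step depends on these variables, you can optionally fail fast on a missing one before running any `gcloud` or `helm` command. A minimal sketch (the `check_vars` helper is hypothetical, not part of the recipe):

```shell
# Hypothetical helper: report any variable from the list that is unset or empty,
# so a typo fails here rather than deep inside the helm install step.
check_vars() {
  missing=0
  for var in "$@"; do
    eval "val=\${$var:-}"        # indirect lookup of the variable named in $var
    if [ -z "$val" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_vars PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME HF_TOKEN \
  || echo "Set the variables above before continuing." >&2
```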

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x-max/llama3-1-405b/megatron-bridge-gke/nemo2602/64gpus-fp8cs-seq8192-gbs64/recipe
cd $RECIPE_ROOT
```
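
Before moving on, it can help to confirm that the recipe folder resolved correctly and contains the files the later `helm install` command references. A small optional check, assuming the repository layout above:

```shell
# Optional sanity check: confirm the recipe directory and the files referenced
# later (launcher.sh, values.yaml) exist before running helm install.
for path in "$RECIPE_ROOT" "$RECIPE_ROOT/launcher.sh" "$RECIPE_ROOT/values.yaml"; do
  [ -e "$path" ] || echo "missing: $path" >&2
done
```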

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Configure and submit a pretraining job

#### Using 64 GPUs fp8cs precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set workload.image=nvcr.io/nvidia/nemo:26.02 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.hfToken=$HF_TOKEN
```
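
One thing worth checking before install: the release name feeds into Kubernetes object names, and Helm rejects release names longer than 53 characters, so an unusually long `$USER` can overflow the limit. A hedged sketch of a pre-flight check:

```shell
# Helm enforces a 53-character limit on release names; warn before install
# rather than letting helm fail. WORKLOAD_NAME follows this recipe's pattern.
WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
if [ ${#WORKLOAD_NAME} -gt 53 ]; then
  echo "WARNING: release name is ${#WORKLOAD_NAME} chars (limit 53); shorten it" >&2
fi
```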

**Examples**

- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set workload.image=nvcr.io/nvidia/nemo:26.02 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100"
```

### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4x-max-llama3-1-405b-64gpus`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods listed in the previous step.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4x-max-llama3-1-405b-64gpus-workload-0-0-s9zrv`.
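
The rank 0 pod can therefore be picked out of the `kubectl get pods` output with a simple pattern match on the `-workload-0-0-` infix. A minimal sketch of the pattern, using a hypothetical pod name:

```shell
# Hypothetical pod name following the JOB_NAME_PREFIX-workload-0-0-<suffix> pattern.
POD_NAME="alice-a4x-max-llama3-1-405b-64gpus-workload-0-0-s9zrv"
case "$POD_NAME" in
  *-workload-0-0-*) echo "rank 0 pod: $POD_NAME" ;;
  *)                echo "not the rank 0 pod" ;;
esac
```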

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-max-llama3-1-405b-64gpus
```