# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4x_max_jobset_workload
description: a4x_max_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
<!-- mdformat global-off -->
# Pretrain Llama 3.1 405B workloads on a4x-max GKE node pools with the NVIDIA Megatron-Bridge framework

This recipe outlines the steps for running a Llama 3.1 405B pretraining
workload on [a4x-max GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
  [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/) to create your a4x-max GKE cluster.
- Node Configuration: 16 nodes (4 GPUs per node, 64 GPUs total).
- GPU Architecture: NVIDIA Blackwell.
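
As a quick sanity check, the node configuration above together with the batch-size and sequence-length values encoded in this recipe's name (`gbs64`, `seq8192`) imply the following per-step numbers; a minimal sketch:

```shell
# Back-of-the-envelope numbers for this recipe's scale.
# Values are taken from this recipe (16 nodes x 4 GPUs, gbs64, seq8192).
NODES=16
GPUS_PER_NODE=4
TOTAL_GPUS=$((NODES * GPUS_PER_NODE))   # 64 GPUs total
GBS=64                                  # global batch size (sequences per step)
SEQ_LEN=8192                            # tokens per sequence
TOKENS_PER_STEP=$((GBS * SEQ_LEN))      # 524288 tokens per optimizer step
echo "GPUs: $TOTAL_GPUS, tokens per step: $TOKENS_PER_STEP"
```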

## Training dataset

This recipe uses a mock pretraining dataset provided by the Megatron-Bridge framework.

## Docker container image

This recipe uses the following Docker image:

- `nvcr.io/nvidia/nemo:26.02`
- Note: This recipe uses `nccl-plugin-gib v1.1.1-2` which is installed inside the container.

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
export HF_TOKEN=<YOUR_HF_TOKEN>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x-max`. Make sure to verify the name of the local queue in your cluster.
- `<YOUR_HF_TOKEN>`: your Hugging Face access token.
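
Because every later step depends on these variables, you can optionally fail fast on a missing one before running any `gcloud` or `helm` command. A minimal sketch (the `check_vars` helper is hypothetical, not part of the recipe):

```shell
# Hypothetical helper: report any variable from the list that is unset or empty,
# so a typo fails here rather than deep inside the helm install step.
check_vars() {
  missing=0
  for var in "$@"; do
    eval "val=\${$var:-}"        # indirect lookup of the variable named in $var
    if [ -z "$val" ]; then
      echo "ERROR: $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_vars PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME HF_TOKEN \
  || echo "Set the variables above before continuing." >&2
```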

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x-max/llama3-1-405b/megatron-bridge-gke/nemo2602/64gpus-fp8cs-seq8192-gbs64/recipe
cd $RECIPE_ROOT
```
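
Before moving on, it can help to confirm that the recipe folder resolved correctly and contains the files the later `helm install` command references. A small optional check, assuming the repository layout above:

```shell
# Optional sanity check: confirm the recipe directory and the files referenced
# later (launcher.sh, values.yaml) exist before running helm install.
for path in "$RECIPE_ROOT" "$RECIPE_ROOT/launcher.sh" "$RECIPE_ROOT/values.yaml"; do
  [ -e "$path" ] || echo "missing: $path" >&2
done
```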

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Configure and submit a pretraining job

#### Using 64 GPUs fp8cs precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set workload.image=nvcr.io/nvidia/nemo:26.02 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.hfToken=$HF_TOKEN
```
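
One thing worth checking before install: the release name feeds into Kubernetes object names, and Helm rejects release names longer than 53 characters, so an unusually long `$USER` can overflow the limit. A hedged sketch of a pre-flight check:

```shell
# Helm enforces a 53-character limit on release names; warn before install
# rather than letting helm fail. WORKLOAD_NAME follows this recipe's pattern.
WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
if [ ${#WORKLOAD_NAME} -gt 53 ]; then
  echo "WARNING: release name is ${#WORKLOAD_NAME} chars (limit 53); shorten it" >&2
fi
```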

**Examples**

- To set the number of training steps to 100, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-max-llama3-1-405b-64gpus
helm install $WORKLOAD_NAME . -f values.yaml \
--set-file workload_launcher=launcher.sh \
--set workload.image=nvcr.io/nvidia/nemo:26.02 \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set volumes.gcsMounts[0].mountPath=/job-logs \
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
--set queue=${KUEUE_NAME} \
--set workload.arguments[0]="trainer.max_steps=100"
```

### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4x-max-llama3-1-405b-64gpus`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of one of the pods listed in the previous step.

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4x-max-llama3-1-405b-64gpus-workload-0-0-s9zrv`.
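
The rank 0 pod can therefore be picked out of the `kubectl get pods` output with a simple pattern match on the `-workload-0-0-` infix. A minimal sketch of the pattern, using a hypothetical pod name:

```shell
# Hypothetical pod name following the JOB_NAME_PREFIX-workload-0-0-<suffix> pattern.
POD_NAME="alice-a4x-max-llama3-1-405b-64gpus-workload-0-0-s9zrv"
case "$POD_NAME" in
  *-workload-0-0-*) echo "rank 0 pod: $POD_NAME" ;;
  *)                echo "not the rank 0 pod" ;;
esac
```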

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-max-llama3-1-405b-64gpus
```