This recipe outlines the steps for running a Llama-3.1-70B pretraining workload on A3 Mega GKE Node pools by using the NVIDIA NeMo framework.
For this recipe, the following setup is used:
- Orchestration - Google Kubernetes Engine (GKE)
- Job configuration and deployment - a Helm chart is used to configure and deploy the Kubernetes JobSet. The JobSet encapsulates the NVIDIA NeMo Megatron GPT pretraining workload. The chart generates the job's manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), including optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
This recipe has been optimized for and tested with the following configuration:
- GKE cluster
- A regional Standard cluster, version 1.32.4-gke.1767000 or later.
- A GPU node pool with 32 a3-megagpu-8g nodes provisioned using the DENSE deployment type.
- Workload Identity Federation for GKE enabled.
- Cloud Storage FUSE CSI driver for GKE enabled.
- DCGM metrics enabled.
- Kueue and JobSet APIs installed.
- Kueue configured to support Topology Aware Scheduling.
- GPUDirect-TCPXO component versions:
- NCCL Plugin: v1.0.3
- RxDM sidecar: v1.0.9
- NVIDIA NeMo NGC container image: 24.07
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
- Artifact Registry for storing the NeMo container image.
To prepare the required environment, see the GKE environment setup guide.
The recipe uses a mock pretraining dataset provided by the NeMo framework.
It's recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including
kubectl, the Google Cloud SDK, and Helm.
In the Google Cloud console, start a Cloud Shell instance.
From your client, complete the following steps:
- Set the environment variables to match your environment:

  export PROJECT_ID=<PROJECT_ID>
  export REGION=<REGION>
  export CLUSTER_REGION=<CLUSTER_REGION>
  export CLUSTER_NAME=<CLUSTER_NAME>
  export GCS_BUCKET=<GCS_BUCKET>
  export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
  export KUEUE_NAME=<KUEUE_NAME>

  Replace the following values:
  - <PROJECT_ID>: your Google Cloud project ID
  - <REGION>: the region where you want to run Cloud Build
  - <CLUSTER_REGION>: the region where your cluster is located
  - <CLUSTER_NAME>: the name of your GKE cluster
  - <GCS_BUCKET>: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
  - <ARTIFACT_REGISTRY>: the full name of your Artifact Registry in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
  - <KUEUE_NAME>: the name of the Kueue local queue configured for TAS. The default queue created by the Cluster Toolkit is a3-mega. Make sure to verify the name of the local queue in your cluster.
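For orientation, here is a sketch with hypothetical values filled in (my-gcp-project, us-central1, my-repo, and the other names are placeholders, not values required by this recipe); the final echo shows how the container image reference used by the Helm commands later in the recipe is assembled:

```shell
# Hypothetical example values -- replace every one with your own.
export PROJECT_ID=my-gcp-project
export REGION=us-central1
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a3mega-cluster
export GCS_BUCKET=my-training-logs        # no gs:// prefix
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-gcp-project/my-repo
export KUEUE_NAME=a3-mega

# The Helm commands later in the recipe reference the image as:
echo "${ARTIFACT_REGISTRY}/nemo_workload:24.07"
```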
- Set the default project:

  gcloud config set project $PROJECT_ID

From your client, clone the gpu-recipes repository and set a reference to the recipe folder.
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke

From your client, get the credentials for your cluster.

gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION

To build the container, complete the following steps from your client:
- Use Cloud Build to build and push the container image.

  cd $REPO_ROOT/src/docker/nemo-24.07
  gcloud builds submit --region=${REGION} \
    --config cloudbuild.yml \
    --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY \
    --timeout "2h" \
    --machine-type=e2-highcpu-32 \
    --quiet \
    --async
This command outputs the build ID.
- You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command. Replace <BUILD_ID> with your build ID.

  BUILD_ID=<BUILD_ID>

  gcloud beta builds log $BUILD_ID --region=$REGION
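If you would rather not copy the build ID by hand, it can be parsed from the submit command's output. The sample line below is hypothetical (the exact output format of gcloud builds submit is not guaranteed across versions), so treat this as a sketch:

```shell
# Hypothetical sample line from `gcloud builds submit --async` output;
# the real format may differ between gcloud versions.
SAMPLE='Created [https://cloudbuild.googleapis.com/v1/projects/my-project/locations/us-central1/builds/abc123-def456].'

# Extract the build ID between the final "/builds/" and the closing "]".
BUILD_ID=$(printf '%s\n' "$SAMPLE" | sed -E 's|.*/builds/([^]]+)\].*|\1|')
echo "$BUILD_ID"
```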
The default job setting is 50 training steps and bf16 precision. To execute the job with the default settings, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
--set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
--set queue=${KUEUE_NAME} \
--set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
$USER-llama-3-1-70b-nemo \
$REPO_ROOT/src/helm-charts/a3mega/jobset

You can overwrite any of the default NeMo configurations for this job. To do this, set the new arguments using --set workload.arguments.
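As context for the override syntax, each dotted path passed through workload.arguments addresses a nested key in the NeMo config file. The excerpt below is a hypothetical illustration of that mapping, not the actual contents of llama3-1-70b-256gpus-bf16.yaml:

```yaml
# Hypothetical excerpt; the dotted override path trainer.max_steps
# addresses the nested key below.
trainer:
  max_steps: 50      # default; workload.arguments[0]=trainer.max_steps=100 overrides it
  precision: bf16    # default precision for this recipe
```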
Examples
-
To set the number of training steps to 100, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
  --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
  --set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
  --set queue=${KUEUE_NAME} \
  --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
  --set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
  --set "workload.arguments[0]=trainer.max_steps=100" \
  $USER-llama-3-1-70b-nemo \
  $REPO_ROOT/src/helm-charts/a3mega/jobset
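When several overrides are needed, each one takes the next consecutive workload.arguments index. Here is a minimal sketch of how the extra --set flags line up; the second override name, trainer.log_every_n_steps, is illustrative and not taken from this recipe:

```shell
# Build --set flags for multiple (partly hypothetical) NeMo overrides.
# Each override must use the next consecutive array index.
ARGS=""
i=0
for override in trainer.max_steps=100 trainer.log_every_n_steps=10; do
  ARGS="$ARGS --set workload.arguments[$i]=$override"
  i=$((i + 1))
done
echo "$ARGS"
```

The resulting flags are appended to the helm install command in the same way as the single-override example above.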
To check the status of pods in the JobSet, run the following command from your client:
kubectl get pods -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs of all pods in the JobSet, run the following command from your client:

kubectl logs -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs for a single pod, run the following command from your client:

kubectl logs POD_NAME

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:
helm uninstall $USER-llama-3-1-70b-nemo