This recipe outlines the steps for running a Llama-3.1-70B pretraining workload on A3 Mega GKE Node pools by using the NVIDIA NeMo framework.
For this recipe, the following setup is used:
- Orchestration - Google Kubernetes Engine (GKE)
- Job configuration and deployment - a Helm chart is used to configure and deploy the Kubernetes JobSet. The JobSet encapsulates the NVIDIA NeMo Megatron GPT pretraining workload. The chart generates the job's manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), including optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
This recipe has been optimized for and tested with the following configuration:
- GKE cluster
- A regional Standard cluster, version 1.32.4-gke.1767000 or later.
- A GPU node pool with 32 a3-megagpu-8g nodes provisioned using the DENSE deployment type.
- Workload Identity Federation for GKE enabled.
- Cloud Storage FUSE CSI driver for GKE enabled.
- DCGM metrics enabled.
- Kueue and JobSet APIs installed.
- Kueue configured to support Topology Aware Scheduling.
- GPUDirect-TCPXO component versions:
- NCCL Plugin: v1.0.3
- RxDM sidecar: v1.0.9
- NVIDIA NeMo NGC container image: 24.07
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
- Artifact Registry for storing the NeMo container image.
To prepare the required environment, see the GKE environment setup guide.
The recipe uses a mock pretraining dataset provided by the NeMo framework.
It's recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including
kubectl, the Google Cloud SDK, and Helm.
In the Google Cloud console, start a Cloud Shell instance.
From your client, complete the following steps:
- Set the environment variables to match your environment:

  export PROJECT_ID=<PROJECT_ID>
  export REGION=<REGION>
  export CLUSTER_REGION=<CLUSTER_REGION>
  export CLUSTER_NAME=<CLUSTER_NAME>
  export GCS_BUCKET=<GCS_BUCKET>
  export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
  export KUEUE_NAME=<KUEUE_NAME>

  Replace the following values:
  - <PROJECT_ID>: your Google Cloud project ID
  - <REGION>: the region where you want to run Cloud Build
  - <CLUSTER_REGION>: the region where your cluster is located
  - <CLUSTER_NAME>: the name of your GKE cluster
  - <GCS_BUCKET>: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
  - <ARTIFACT_REGISTRY>: the full name of your Artifact Registry in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
  - <KUEUE_NAME>: the name of the Kueue local queue configured for TAS. The default queue created by the Cluster Toolkit is a3-mega. Make sure to verify the name of the local queue in your cluster.
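For orientation, here is a sketch with hypothetical values filled in (my-gcp-project, us-central1, my-repo, and the other names are placeholders, not values required by this recipe); the final echo shows how the container image reference used by the Helm commands later in the recipe is assembled:

```shell
# Hypothetical example values -- replace every one with your own.
export PROJECT_ID=my-gcp-project
export REGION=us-central1
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a3mega-cluster
export GCS_BUCKET=my-training-logs        # no gs:// prefix
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-gcp-project/my-repo
export KUEUE_NAME=a3-mega

# The Helm commands later in the recipe reference the image as:
echo "${ARTIFACT_REGISTRY}/nemo_workload:24.07"
```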
- Set the default project:

  gcloud config set project $PROJECT_ID

From your client, clone the gpu-recipes repository and set a reference to the recipe folder.
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke

From your client, get the credentials for your cluster.

gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION

To build the container, complete the following steps from your client:
- Use Cloud Build to build and push the container image.

  cd $REPO_ROOT/src/docker/nemo-24.07
  gcloud builds submit --region=${REGION} \
    --config cloudbuild.yml \
    --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY \
    --timeout "2h" \
    --machine-type=e2-highcpu-32 \
    --quiet \
    --async
This command outputs the build ID.
- You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command. Replace <BUILD_ID> with your build ID.

  BUILD_ID=<BUILD_ID>

  gcloud beta builds log $BUILD_ID --region=$REGION
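If you would rather not copy the build ID by hand, it can be parsed from the submit command's output. The sample line below is hypothetical (the exact output format of gcloud builds submit is not guaranteed across versions), so treat this as a sketch:

```shell
# Hypothetical sample line from `gcloud builds submit --async` output;
# the real format may differ between gcloud versions.
SAMPLE='Created [https://cloudbuild.googleapis.com/v1/projects/my-project/locations/us-central1/builds/abc123-def456].'

# Extract the build ID between the final "/builds/" and the closing "]".
BUILD_ID=$(printf '%s\n' "$SAMPLE" | sed -E 's|.*/builds/([^]]+)\].*|\1|')
echo "$BUILD_ID"
```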
The default job setting is 50 training steps and bf16 precision. To execute the job with the default settings, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
--set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
--set queue=${KUEUE_NAME} \
--set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
--set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
$USER-llama-3-1-70b-nemo \
$REPO_ROOT/src/helm-charts/a3mega/jobset

You can overwrite any of the default NeMo configurations for this job. To do this, set the new arguments using --set workload.arguments.
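As context for the override syntax, each dotted path passed through workload.arguments addresses a nested key in the NeMo config file. The excerpt below is a hypothetical illustration of that mapping, not the actual contents of llama3-1-70b-256gpus-bf16.yaml:

```yaml
# Hypothetical excerpt; the dotted override path trainer.max_steps
# addresses the nested key below.
trainer:
  max_steps: 50      # default; workload.arguments[0]=trainer.max_steps=100 overrides it
  precision: bf16    # default precision for this recipe
```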
Examples
-
To set the number of training steps to 100, run the following command from your client:
cd $RECIPE_ROOT
helm install -f values.yaml \
  --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
  --set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
  --set queue=${KUEUE_NAME} \
  --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
  --set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
  --set "workload.arguments[0]=trainer.max_steps=100" \
  $USER-llama-3-1-70b-nemo \
  $REPO_ROOT/src/helm-charts/a3mega/jobset
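When several overrides are needed, each one takes the next consecutive workload.arguments index. Here is a minimal sketch of how the extra --set flags line up; the second override name, trainer.log_every_n_steps, is illustrative and not taken from this recipe:

```shell
# Build --set flags for multiple (partly hypothetical) NeMo overrides.
# Each override must use the next consecutive array index.
ARGS=""
i=0
for override in trainer.max_steps=100 trainer.log_every_n_steps=10; do
  ARGS="$ARGS --set workload.arguments[$i]=$override"
  i=$((i + 1))
done
echo "$ARGS"
```

The resulting flags are appended to the helm install command in the same way as the single-override example above.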
To check the status of pods in the JobSet, run the following command from your client:
kubectl get pods -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs of all pods in the JobSet, run the following command from your client:

kubectl logs -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs for a single pod, run the following command from your client:

kubectl logs POD_NAME

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:
helm uninstall $USER-llama-3-1-70b-nemo