
Pretrain Llama-3.1-70B workloads on A3 Mega GKE Node pools

This recipe outlines the steps for running a Llama-3.1-70B pretraining workload on A3 Mega GKE Node pools by using the NVIDIA NeMo framework.

Orchestration and deployment tools

For this recipe, the following setup is used: Google Kubernetes Engine (GKE) for orchestration, a Helm chart to configure and deploy the workload as a Kubernetes JobSet, and Kueue to queue and schedule the job.

Test environment

This recipe has been optimized for and tested on a GKE cluster with A3 Mega node pools, using the 256-GPU NeMo configuration referenced below (llama3-1-70b-256gpus-bf16.yaml).

To prepare the required environment, see the GKE environment setup guide.

Training Dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

Run the recipe

It's recommended to use Cloud Shell as your client to complete the steps. Cloud Shell comes pre-installed with the necessary utilities, including kubectl, the Google Cloud SDK, and Helm.

Launch Cloud Shell

In the Google Cloud console, start a Cloud Shell instance.

Configure environment settings

From your client, complete the following steps:

  1. Set the environment variables to match your environment:
export PROJECT_ID=<PROJECT_ID>
export REGION=<REGION>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
export KUEUE_NAME=<KUEUE_NAME>

Replace the following values:

  • <PROJECT_ID>: your Google Cloud project ID
  • <REGION>: the region where you want to run Cloud Build
  • <CLUSTER_REGION>: the region where your cluster is located
  • <CLUSTER_NAME>: the name of your GKE cluster
  • <GCS_BUCKET>: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
  • <ARTIFACT_REGISTRY>: the full name of your Artifact Registry in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
  • <KUEUE_NAME>: the name of the Kueue local queue configured for Topology Aware Scheduling (TAS). The default queue created by the Cluster Toolkit is a3-mega. Make sure to verify the name of the local queue in your cluster.
  2. Set the default project:
gcloud config set project $PROJECT_ID
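Note that the Helm chart expects the bare bucket name in GCS_BUCKET. A minimal shell sketch (with an example value, not from the recipe) showing how to strip an accidentally included gs:// prefix:

```shell
# Sketch with an example value: the chart expects the bare bucket name,
# so strip an accidental gs:// prefix before using the variable.
GCS_BUCKET="gs://my-bucket"        # example value with the prefix mistakenly included
GCS_BUCKET="${GCS_BUCKET#gs://}"
echo "$GCS_BUCKET"                 # prints my-bucket
```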

Get the recipe

From your client, clone the gpu-recipes repository and set a reference to the recipe folder.

git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=$(git rev-parse --show-toplevel)
export RECIPE_ROOT=$REPO_ROOT/training/a3mega/llama3-1-70b/nemo-pretraining-gke

Get cluster credentials

From your client, get the credentials for your cluster.

gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION

Build and push a docker container image to Artifact Registry

To build the container, complete the following steps from your client:

  1. Use Cloud Build to build and push the container image.

    cd $REPO_ROOT/src/docker/nemo-24.07
    gcloud builds submit --region=${REGION} \
        --config cloudbuild.yml \
        --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY \
        --timeout "2h" \
        --machine-type=e2-highcpu-32 \
        --quiet \
        --async

This command outputs the build ID.

  2. You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command, replacing <BUILD_ID> with your build ID.

    BUILD_ID=<BUILD_ID>
    gcloud beta builds log $BUILD_ID --region=$REGION
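Because the build was submitted with --async, you may want to wait for it to finish before deploying the workload. A minimal polling sketch (a hypothetical helper, not part of the recipe) that reruns a status command until a terminal Cloud Build state is reported; with a real build you would pass `gcloud builds describe "$BUILD_ID" --region="$REGION" --format='value(status)'` as the command:

```shell
# Hypothetical helper: poll a status command until it prints a terminal
# Cloud Build state (SUCCESS, FAILURE, TIMEOUT, or CANCELLED), or give up
# after the given number of attempts.
poll_status() {
  attempts=$1
  shift
  i=0
  status=""
  while [ "$i" -lt "$attempts" ]; do
    status=$("$@")
    case "$status" in
      SUCCESS|FAILURE|TIMEOUT|CANCELLED) echo "$status"; return 0 ;;
    esac
    i=$((i + 1))
    sleep 5
  done
  echo "$status"
  return 1
}
```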

Configure and submit a pretraining job

The default job setting is 50 training steps and bf16 precision. To execute the job with the default settings, run the following command from your client:

cd $RECIPE_ROOT
helm install -f values.yaml \
  --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
  --set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
  --set queue=${KUEUE_NAME} \
  --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
  --set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
  $USER-llama-3-1-70b-nemo \
  $REPO_ROOT/src/helm-charts/a3mega/jobset
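The workload.image value passed above is composed from your Artifact Registry path and the image name and tag built earlier. A sketch with an example registry value (an assumption, not from the recipe):

```shell
# Sketch: how the workload.image reference is composed.
ARTIFACT_REGISTRY="us-central1-docker.pkg.dev/my-project/my-repo"  # example value
IMAGE="${ARTIFACT_REGISTRY}/nemo_workload:24.07"
echo "$IMAGE"   # prints us-central1-docker.pkg.dev/my-project/my-repo/nemo_workload:24.07
```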

Configure job settings

You can override any of the default NeMo configurations for this job by setting new arguments with the --set workload.arguments flag.

Examples

  • To set the number of training steps to 100, run the following command from your client:

    cd $RECIPE_ROOT
    helm install -f values.yaml \
      --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
      --set-file workload_config=$REPO_ROOT/src/frameworks/a3mega/nemo-configs/llama3-1-70b-256gpus-bf16.yaml \
      --set queue=${KUEUE_NAME} \
      --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
      --set workload.image=${ARTIFACT_REGISTRY}/nemo_workload:24.07 \
      --set "workload.arguments[0]=trainer.max_steps=100" \
      $USER-llama-3-1-70b-nemo \
      $REPO_ROOT/src/helm-charts/a3mega/jobset
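You can pass more than one override by increasing the workload.arguments index. A sketch of how the flags compose; trainer.val_check_interval is shown as an assumed example of a second NeMo trainer key:

```shell
# Sketch: each additional override takes the next workload.arguments index.
OVERRIDE_0="workload.arguments[0]=trainer.max_steps=100"
OVERRIDE_1="workload.arguments[1]=trainer.val_check_interval=50"   # assumed example key
echo "--set $OVERRIDE_0 --set $OVERRIDE_1"
```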

Monitor the JobSet

To check the status of pods in the JobSet, run the following command from your client:

kubectl get pods -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs of pods in the JobSet, run the following command from your client:

kubectl logs -l app.kubernetes.io/instance=$USER-llama-3-1-70b-nemo

To get the logs for one of the pods, run the following command from your client, replacing POD_NAME with the name of the pod:

kubectl logs POD_NAME
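The label selector used in the commands above is derived from the Helm release name you chose at install time. A sketch with an example user name (the recipe itself uses $USER):

```shell
# Sketch: the app.kubernetes.io/instance selector matches the Helm release name.
USER_NAME="alice"                                   # example; the recipe uses $USER
RELEASE="${USER_NAME}-llama-3-1-70b-nemo"
SELECTOR="app.kubernetes.io/instance=${RELEASE}"
echo "$SELECTOR"   # prints app.kubernetes.io/instance=alice-llama-3-1-70b-nemo
```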

Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To uninstall the Helm release, run the following command from your client:

helm uninstall $USER-llama-3-1-70b-nemo