This recipe outlines the steps to benchmark inference of the DeepSeek R1 671B model using SGLang on an A3 Mega GKE node pool with multiple nodes.
The recipe uses the LeaderWorkerSet (LWS) API in Kubernetes to spin up multiple nodes and handle the distributed inference workload. LWS enables treating multiple Pods as a single group, simplifying the management of distributed model serving.
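In this recipe the Helm chart generates the actual LeaderWorkerSet manifests for you; purely for illustration, a minimal LWS sketch (all names and images below are hypothetical) looks like this:
kubectl apply -f - <<'EOF'
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang-demo                    # hypothetical name
spec:
  replicas: 1                          # one group = one serving replica
  leaderWorkerTemplate:
    size: 2                            # Pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: example.com/sglang:demo   # hypothetical image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: example.com/sglang:demo   # hypothetical image
EOF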
For this recipe, the following setup is used:
- Orchestration - Google Kubernetes Engine (GKE)
- LeaderWorkerSet Deployment - A Helm chart is used to configure and deploy multi-node inference with the LeaderWorkerSet API, provisioning leader and worker Pods for distributed inference of the DeepSeek R1 671B model using SGLang. The chart generates the manifest, adhering to best practices for using GPUDirect-TCPXO with Google Kubernetes Engine (GKE), including optimal values for NVIDIA NCCL and the TCPXO NCCL plugin.
Before running this recipe, ensure your environment is configured as follows:
- A GKE cluster with the following setup:
- An A3 Mega node pool (2 nodes, 16 GPUs)
- An Artifact Registry repository to store the Docker image.
- A Google Cloud Storage (GCS) bucket to store results. Important: This bucket must be in the same region as the GKE cluster.
- A client workstation with the following pre-installed:
- Google Cloud SDK
- Helm
- kubectl
- To access the DeepSeek R1 671B model through Hugging Face, you'll need a Hugging Face token. Follow these steps to generate a new token if you don't have one already:
- Create a Hugging Face account, if you don't already have one.
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
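Optionally, you can verify the token before proceeding with a quick call to the Hugging Face API (a sanity check only; assumes curl is available on your workstation):
curl -s -H "Authorization: Bearer <YOUR_HUGGINGFACE_TOKEN>" https://huggingface.co/api/whoami-v2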
To prepare the required environment, see the GKE environment setup guide.
It is recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including
kubectl, the Google Cloud SDK, and Helm.
In the Google Cloud console, start a Cloud Shell Instance.
From your client, complete the following steps:
- Set the environment variables to match your environment:
export PROJECT_ID=<PROJECT_ID>
export REGION=<REGION>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
export SGLANG_IMAGE=sglang
export SGLANG_VERSION=v0.4.3.post2-cu125-srt
Replace the following values:
- <PROJECT_ID>: your Google Cloud project ID
- <REGION>: the region where you want to run Cloud Build
- <CLUSTER_REGION>: the region where your cluster is located
- <CLUSTER_NAME>: the name of your GKE cluster
- <GCS_BUCKET>: the name of your Cloud Storage bucket. Do not include the gs:// prefix
- <ARTIFACT_REGISTRY>: the full name of your Artifact Registry in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
- <SGLANG_IMAGE>: the name of the SGLang image
- <SGLANG_VERSION>: the version of the SGLang image. We recommend running the recipe with SGLang v0.4.3.post2-cu125-srt.
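For example, a fully populated set of variables might look like the following (all values here are hypothetical; substitute your own):
export PROJECT_ID=my-gpu-project
export REGION=us-central1
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=a3mega-cluster
export GCS_BUCKET=my-benchmark-results
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-gpu-project/my-repo
export SGLANG_IMAGE=sglang
export SGLANG_VERSION=v0.4.3.post2-cu125-srt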
- Set the default project:
gcloud config set project $PROJECT_ID
From your client, clone the gpu-recipes repository and set a reference to the recipe folder.
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/inference/a3mega/deepseek-r1-671b/sglang-serving-gke
From your client, get the credentials for your cluster.
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
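You can optionally confirm that kubectl can now reach the cluster; the output should include your A3 Mega nodes:
kubectl get nodes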
To build the container, complete the following steps from your client:
- Use Cloud Build to build and push the container image.
cd $REPO_ROOT/src/docker/sglang
gcloud builds submit --region=${REGION} \
    --config cloudbuild.yml \
    --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY,_SGLANG_IMAGE=$SGLANG_IMAGE,_SGLANG_VERSION=$SGLANG_VERSION \
    --timeout "2h" \
    --machine-type=e2-highcpu-32 \
    --disk-size=1000 \
    --quiet \
    --async
This command outputs the build ID.
- You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command, replacing <BUILD_ID> with your build ID:
BUILD_ID=<BUILD_ID>
gcloud beta builds log $BUILD_ID --region=$REGION
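Alternatively, you can poll the build until it reports SUCCESS (assumes the build was submitted in ${REGION} as above):
gcloud builds describe $BUILD_ID --region=$REGION --format='value(status)'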
The recipe serves the DeepSeek R1 671B model using SGLang on multiple A3 Mega nodes in native FP8 mode.
To start serving, the recipe performs the following steps:
- Downloads the full DeepSeek R1 671B model checkpoints from Hugging Face.
- Starts the SGLang server on two A3 Mega nodes, each with 8 GPUs, and sets up the necessary communication between the nodes.
- Loads the model checkpoints on multiple nodes and applies SGLang optimizations.
- Marks the server as ready to respond to requests.
The recipe uses the Helm chart to run these steps.
- Create a Kubernetes Secret with the Hugging Face token to allow the job to download the model checkpoints.
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
kubectl create secret generic hf-secret \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -
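You can confirm the Secret was created (the token value itself stays base64-encoded and is not printed):
kubectl get secret hf-secret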
- Install the LeaderWorkerSet (LWS) API. Follow the instructions here to install a specific version of the LWS API.
kubectl apply --server-side -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml
Validate that the LeaderWorkerSet controller is running in the lws-system namespace, using the following command:
kubectl get pod -n lws-system
The output is similar to the following:
NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-56956867cb-4km9g   1/1     Running   0          24h
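You can also verify that the LeaderWorkerSet CRD itself is installed (the CRD name below assumes a standard LWS installation):
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io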
- Install the Helm chart to prepare the model.
cd $RECIPE_ROOT
helm install -f values.yaml \
    --set job.image.repository=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE} \
    --set clusterName=${CLUSTER_NAME} \
    --set job.image.tag=${SGLANG_VERSION} \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    $USER-serving-deepseek-r1-model \
    $REPO_ROOT/src/helm-charts/a3mega/sglang-inference
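To confirm the release was created and watch the leader and worker Pods come up, you can run:
helm status $USER-serving-deepseek-r1-model
kubectl get pods -w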
- To view the logs for the deployment, run:
kubectl logs -f service/$USER-serving-deepseek-r1-model-svc
- Verify that the deployment has started by running:
kubectl get service/$USER-serving-deepseek-r1-model-svc
Once the deployment has started, you will see logs similar to this:
[2025-02-19 16:39:10 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-02-19 16:39:11 DP7 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=576,K=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-02-19 16:39:16] INFO: 127.0.0.1:36440 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:23] INFO: 127.0.0.1:36454 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:29] INFO: 127.0.0.1:52874 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:36] INFO: 127.0.0.1:52888 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:42] INFO: 127.0.0.1:49466 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:50] INFO: 127.0.0.1:53222 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:39:56] INFO: 127.0.0.1:53238 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:40:03] INFO: 127.0.0.1:53292 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:40:10] INFO: 127.0.0.1:46284 - "POST /generate HTTP/1.1" 200 OK
[2025-02-19 16:40:10] The server is fired up and ready to roll!
- To make API requests to the service, you can port-forward the service to your local machine.
kubectl port-forward svc/$USER-serving-deepseek-r1-model-svc 30000:30000
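With the port forward active, you can optionally check that the server responds before sending chat requests (this assumes SGLang's standard /health and /get_model_info endpoints):
curl http://localhost:30000/health
curl http://localhost:30000/get_model_info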
- Make API requests to the service.
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "default",
      "messages": [
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "How many r are there in strawberry ?"}
      ],
      "temperature": 0.6,
      "top_p": 0.95,
      "max_tokens": 2048
    }'
If everything is set up correctly, you should see a response similar to this:
{ "id":"dd176721e73246b5a0ce0490fd9ba798", "object":"chat.completion", "created":1738368064, "model":"default", "choices":[ { "index":0, "message":{ "role":"assistant", "content":"<think>\nOkay, let's figure out how many times the letter \"r\" appears in the word \"strawberry.\" First, I need to spell out the word and check each letter one by one.\n\nSo, the word is S-T-R-A-W-B-E-R-R-Y. Let me write it out slowly to make sure I don't miss any letters. S, T, R, A, W, B, E, R, R, Y. Wait, let me count again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). Hmm, that's 10 letters total. Now, I need to count how many times the letter \"r\" appears.\n\nStarting from the beginning: S - no. T - no. R - that's the first R. Then A, W, B, E. Next comes R again, that's the second R. Then another R right after, so that's the third R. Finally, Y. So in total, there are three R's in \"strawberry.\"\n\nWait, let me double-check. Spelling it out: S-T-R-A-W-B-E-R-R-Y. The R is at the third position, then after E, there's two R's in a row. So that's three R's. Yeah, that seems right. I think that's correct. Maybe I should write it out again to confirm.\n\nS T R A W B E R R Y. Positions 3, 8, and 9 are R's. So three times. Yep, that's three R's. I don't think I missed any. The answer should be three.\n</think>\n\nThe word \"strawberry\" contains **3** instances of the letter **r**. \n\n**Breakdown:** \nS - T - **R** - A - W - B - E - **R** - **R** - Y.", "tool_calls":null }, "logprobs":null, "finish_reason":"stop", "matched_stop":1 } ], "usage":{ "prompt_tokens":17, "total_tokens":435, "completion_tokens":418, "prompt_tokens_details":null } }The thoughts of the models are enclosed in
<think>tags which can be parsed out to get the reasoning of the model. -
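For example, a minimal sketch for separating the reasoning from the final answer (assumes jq is installed and the response above was saved to a hypothetical file response.json):
jq -r '.choices[0].message.content' response.json | python3 -c "import sys; print(sys.stdin.read().split('</think>')[-1].strip())"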
- You may also make use of the utility script stream_chat.sh to stream responses in real time.
./stream_chat.sh "Which is bigger 9.9 or 9.11 ?"
- To run benchmarks for inference, you can use the default benchmarking tool from SGLang, like this:
kubectl exec -it service/$USER-serving-deepseek-r1-model-svc -- /bin/bash -c \
    "python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-range-ratio 1 --num-prompt 1100 --random-input 1000 --random-output 1000 --host 0.0.0.0 --port 30000 --output-file /gcs/benchmark_logs/sglang/ds_1000_1000_1100_output.jsonl"
Once the benchmark is done, you can find the results in the GCS bucket. You should see logs similar to this:
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                not set
Successful requests:                     1100
Benchmark duration (s):                  ...
Total input tokens:                      1100000
Total generated tokens:                  1100000
Total generated tokens (retokenized):    1096494
Request throughput (req/s):              xxx
Input token throughput (tok/s):          xxxx
Output token throughput (tok/s):         xxxx
Total token throughput (tok/s):          xxxx
Concurrency:                             xxx
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   xxxxxxx
Median E2E Latency (ms):                 xxxxxxx
---------------Time to First Token----------------
Mean TTFT (ms):                          xxxxxxx
Median TTFT (ms):                        xxxxxxx
P99 TTFT (ms):                           xxxxxxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          xxxxxxx
Median TPOT (ms):                        xxxxxxx
P99 TPOT (ms):                           xxxxxxx
---------------Inter-token Latency----------------
Mean ITL (ms):                           xxxxxxx
Median ITL (ms):                         xxxxxxx
P99 ITL (ms):                            xxxxxxx
==================================================
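Since the chart mounts the results bucket at /gcs (via volumes.gcsMounts[0].bucketName above), the raw benchmark record can also be read directly from Cloud Storage, for example:
gcloud storage cat gs://${GCS_BUCKET}/benchmark_logs/sglang/ds_1000_1000_1100_output.jsonl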
To clean up the resources created by this recipe, complete the following steps:
- Uninstall the Helm chart.
helm uninstall $USER-serving-deepseek-r1-model
- Delete the Kubernetes Secret.
kubectl delete secret hf-secret
If you created your cluster using the GKE environment setup guide, it is configured with default settings that include the names for the networks and subnetworks used for:
- Host-to-external-services communication.
- GPU-to-GPU communication.
For clusters with this default configuration, the Helm chart can automatically generate the required networking annotations in a Pod's metadata. Therefore, you can use the streamlined command to install the chart, as described in the Multi node inference benchmark of DeepSeek R1 671B with SGLang on A3 Mega GKE Node Pool section.
To configure the correct networking annotations for a cluster that uses non-default names for GKE Network resources, you must provide the names of the GKE Network resources in your cluster when installing the chart. Use the following example command, replacing the example values with the actual names of your cluster's GKE Network resources:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set job.image.repository=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE} \
--set job.image.tag=${SGLANG_VERSION} \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set network.subnetworks[0]=default \
--set network.subnetworks[1]=gvnic-1 \
--set network.subnetworks[2]=rdma-0 \
--set network.subnetworks[3]=rdma-1 \
--set network.subnetworks[4]=rdma-2 \
--set network.subnetworks[5]=rdma-3 \
--set network.subnetworks[6]=rdma-4 \
--set network.subnetworks[7]=rdma-5 \
--set network.subnetworks[8]=rdma-6 \
--set network.subnetworks[9]=rdma-7 \
$USER-serving-deepseek-r1-model \
$REPO_ROOT/src/helm-charts/a3mega/sglang-inference
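To discover the GKE Network resource names in your cluster (assuming GKE multi-networking is enabled on the cluster), you can list them with:
kubectl get networks.networking.gke.io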