Single Host Model Serving with SGLang on A3 Ultra GKE Node Pool

This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the SGLang framework on a single A3 Ultra GKE Node pool.

This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference.

1. Test Environment
2. High-Level Flow
3. Environment Setup (One-Time)
4. Run the Recipe
- 4.1. Serving DeepSeek R1 671B
- 4.2. Serving Llama 4 Models
5. Monitoring and Troubleshooting
6. Cleanup

1. Test Environment

The recipe uses the following setup:

Orchestration: Google Kubernetes Engine (GKE)
Deployment Configuration: A Helm chart is used to configure and deploy a Kubernetes Deployment. This deployment encapsulates the inference of the target LLM using the SGLang framework.

This recipe has been optimized for and tested with the following configuration:

GKE Cluster:
- A regional Standard cluster running GKE version 1.31.7-gke.1265000 or later.
- A GPU node pool with 1 a3-ultragpu-8g machine.
- Workload Identity Federation for GKE enabled.
- Cloud Storage FUSE CSI driver for GKE enabled.
- DCGM metrics enabled.
- Kueue and JobSet APIs installed.
- Kueue configured to support Topology Aware Scheduling.
A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.

Important

To prepare the required environment, see the GKE environment setup guide. Provisioning a new GKE cluster is a long-running operation and can take 20-30 minutes.

2. High-Level Flow

Here is a simplified diagram of the flow that we follow in this recipe:

---
config:
  layout: dagre
---
flowchart TD
 subgraph workstation["Client Workstation"]
    T["Cluster Toolkit"]
    B("Kubernetes API")
    A["helm install"]
    Y["gcloud"]
  end
 subgraph imagerepo["Build Image"]
    H["Artifact Registry"]
    G["Cloud Build"]
  end
 subgraph huggingface["Hugging Face Hub"]
    I["Model Weights"]
  end
 subgraph gke["GKE Cluster (A3 Ultra)"]
    C["Deployment"]
    D["Pod"]
    E["SGLang Container"]
    F["Service"]
  end
 subgraph storage["Cloud Storage"]
    J["Bucket"]
  end

    %% Logical/actual flow
    T -- Create Cluster --> gke
    A --> B
    G -- Pushes Image --> H
    B --> C & F
    C --> D
    D --> E
    F --> C
    H -- Pulls Image --> E
    E -- Downloads at runtime --> I
    E -- Write logs --> J
    Y -- Run Build --> imagerepo


    %% Layout control
    gke ~~~ imagerepo

helm: A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment.
Deployment: Manages the lifecycle of your model server pod, ensuring it stays running.
Service: Provides a stable network endpoint (a DNS name and IP address) to access your model server.
Pod: The smallest deployable unit in Kubernetes. The SGLang container runs inside this pod on a GPU-enabled node.
Cloud Build: A service to run build jobs on Google Cloud to build the SGLang container image.
Artifact Registry: A single place to manage container images.
Cloud Storage: A Cloud Storage bucket to store benchmark logs and other artifacts.

3. Environment Setup (One-Time)

First, you'll configure your local environment. These steps are required once before you can deploy any models.

3.1. Clone the Repository

git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=$(pwd)
export RECIPE_ROOT=$REPO_ROOT/inference/a3ultra/single-host-serving/sglang

3.2. Configure Environment Variables

This is the most critical step. These variables are used in subsequent commands to target the correct resources.

export PROJECT_ID=<PROJECT_ID>
export REGION=<REGION_for_cloud_build>
export CLUSTER_REGION=<REGION_of_your_cluster>
export CLUSTER_NAME=<YOUR_GKE_CLUSTER_NAME>
export KUEUE_NAME=<YOUR_KUEUE_NAME>
export ARTIFACT_REGISTRY=<your-artifact-registry-repo-full-path>
export GCS_BUCKET=<your-gcs-bucket-for-logs>
export SGLANG_IMAGE=lmsysorg/sglang
export SGLANG_VERSION=v0.4.6.post4-cu124

# Set the project for gcloud commands
gcloud config set project $PROJECT_ID

Replace the following values:

| Variable | Description | Example |
| --- | --- | --- |
| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
| `REGION` | The GCP region to run the Cloud Build job. | `us-central1` |
| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
| `CLUSTER_NAME` | The name of your GKE cluster. | `a3-ultra-cluster` |
| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a3-ultra`. Verify the name in your cluster. | `a3-ultra` |
| `ARTIFACT_REGISTRY` | Full path to your Artifact Registry repository. | `us-central1-docker.pkg.dev/gcp-project-12345/my-repo` |
| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
| `SGLANG_IMAGE` | The name for the Docker image to be built. | `lmsysorg/sglang` |
| `SGLANG_VERSION` | The tag/version for the Docker image. | `v0.4.6.post4-cu124` |
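Because every later command depends on these variables, it can help to confirm that none of them is empty before continuing. The following is a minimal sketch, not part of the recipe: `check_required_vars` is a hypothetical helper name, and the check only looks for empty values or values that still look like a `<PLACEHOLDER>`.

```shell
# Hypothetical helper (not part of the recipe): report every listed variable
# that is empty or still looks like a <PLACEHOLDER>.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    case "$value" in
      "" | "<"*">")
        echo "Unset or placeholder value: $name" >&2
        missing=1
        ;;
    esac
  done
  return "$missing"
}

# Usage:
#   check_required_vars PROJECT_ID REGION CLUSTER_REGION CLUSTER_NAME \
#     KUEUE_NAME ARTIFACT_REGISTRY GCS_BUCKET SGLANG_IMAGE SGLANG_VERSION
```

The function returns a non-zero status when anything is missing, so it can guard the remaining setup steps in a script.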

3.3. Connect to your GKE Cluster

Fetch credentials for kubectl to communicate with your cluster.

gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
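To confirm that kubectl now targets the intended cluster, you can compare the active context with the name gcloud is expected to create. This is a sketch under one assumption: that gcloud uses its default context naming of `gke_<project>_<location>_<cluster>`. Both function names are hypothetical.

```shell
# Hypothetical helper: build the kubeconfig context name that
# `gcloud container clusters get-credentials` creates by default.
expected_gke_context() {
  printf 'gke_%s_%s_%s\n' "$PROJECT_ID" "$CLUSTER_REGION" "$CLUSTER_NAME"
}

# Succeeds only when kubectl currently points at that context.
on_expected_cluster() {
  [ "$(kubectl config current-context)" = "$(expected_gke_context)" ]
}

# Usage:
#   on_expected_cluster || echo "kubectl is pointing at a different cluster" >&2
```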

3.4. Get Hugging Face token

To access models through Hugging Face, you'll need a Hugging Face token.

1. Create a Hugging Face account if you don't have one.
2. For gated models like Llama 4, ensure you have requested and been granted access on Hugging Face before proceeding.
3. Generate an Access Token: Go to Your Profile > Settings > Access Tokens.
4. Select New Token.
5. Specify a Name and a Role of at least Read.
6. Select Generate a token.
7. Copy the generated token to your clipboard. You'll use this later.

3.5. Create Hugging Face Kubernetes Secret

Create a Kubernetes Secret with your Hugging Face token so that the deployment can download model checkpoints from Hugging Face.

# Paste your Hugging Face token here
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>

kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
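A quick way to confirm that the Secret was stored is to check that the `hf_api_token` key exists and is non-empty. The sketch below is illustrative only; `hf_secret_ready` is a hypothetical name, and the function deliberately never prints the token.

```shell
# Hypothetical helper: succeed only if hf-secret exists and carries a
# non-empty hf_api_token key. The (base64-encoded) value is never printed.
hf_secret_ready() {
  encoded=$(kubectl get secret hf-secret \
    -o jsonpath='{.data.hf_api_token}' 2>/dev/null) || return 1
  [ -n "$encoded" ]
}

# Usage:
#   hf_secret_ready && echo "hf-secret is in place"
```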

3.6. Build the SGLang Serving Image

This step uses Cloud Build to create a custom Docker image with SGLang and push it to your Artifact Registry repository.

Note

This build process can take up to 30 minutes as it compiles and installs several dependencies.

cd $REPO_ROOT/src/docker/sglang
gcloud builds submit --region=${REGION} \
    --config cloudbuild.yml \
    --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY,_SGLANG_IMAGE=$SGLANG_IMAGE,_SGLANG_VERSION=$SGLANG_VERSION \
    --timeout "2h" \
    --machine-type=e2-highcpu-32 \
    --disk-size=1000 \
    --quiet \
    --async

Optionally, you can monitor the build progress by streaming its logs. Replace <BUILD_ID> with the ID from the previous command's output.

BUILD_ID=<BUILD_ID>
gcloud builds log $BUILD_ID --stream --region=$REGION
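Because the build was submitted with `--async`, a script may need to wait for it to finish before deploying. The following sketch polls the build status with `gcloud builds describe`. The function name and the 30-second default interval are illustrative assumptions, and the sketch assumes `REGION` is set as above.

```shell
# Hypothetical helper: poll a Cloud Build job until it reaches a terminal state.
# $1 = build ID, $2 = seconds between polls (illustrative default: 30).
wait_for_build() {
  build_id=$1
  interval=${2:-30}
  while :; do
    status=$(gcloud builds describe "$build_id" \
      --region="$REGION" --format='value(status)') || return 1
    case "$status" in
      SUCCESS)
        echo "Build $build_id succeeded"
        return 0
        ;;
      FAILURE | INTERNAL_ERROR | TIMEOUT | CANCELLED | EXPIRED)
        echo "Build $build_id ended with status $status" >&2
        return 1
        ;;
      *)
        # Still queued or running: wait and poll again.
        sleep "$interval"
        ;;
    esac
  done
}

# Usage:
#   wait_for_build "$BUILD_ID"
```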

Warning

You may see pip's dependency resolver warnings in the build logs. These are generally safe to ignore as long as the Cloud Build job completes successfully.

You have now completed the environment setup! You are ready to deploy a model.

4. Run the Recipe

This recipe supports the deployment of the following models:

DeepSeek R1 671B
Llama 4 Maverick & Scout

Now, select a model to deploy. Each section below is self-contained for deploying a specific model.

Note

After running the recipe with helm install, it can take up to 30 minutes for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face.

4.1. Serving DeepSeek R1 671B

This recipe serves the DeepSeek R1 671B model using the SGLang framework on a single A3 Ultra node in native FP8 mode.

When the SGLang server launches, it performs the following steps:

1. Downloads the full DeepSeek R1 671B model checkpoints from Hugging Face.
2. Loads the model checkpoints and applies SGLang optimizations.
3. Starts serving requests.

4.1.1. Deploy DeepSeek R1 671B

Install the Helm chart to prepare and serve the model using the SGLang framework:

cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file workload_launcher=$REPO_ROOT/src/launchers/sglang-launcher.sh \
--set-file serving_config=$REPO_ROOT/src/frameworks/a3ultra/sglang-configs/deepseek-r1-671b.yaml \
--set queue=${KUEUE_NAME} \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set workload.model.name=deepseek-ai/DeepSeek-R1 \
--set workload.image=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE}:${SGLANG_VERSION} \
--set workload.framework=sglang \
$USER-serving-deepseek-r1-model \
$REPO_ROOT/src/helm-charts/a3ultra/inference-templates/deployment

This creates a Helm release and a Deployment named $USER-serving-deepseek-r1-model, and a Service named $USER-serving-deepseek-r1-model-svc.

Check the deployment status.
```
kubectl get deployment/$USER-serving-deepseek-r1-model
```
Wait until the READY column shows 1/1. See the Monitoring and Troubleshooting section to view the deployment logs.
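Instead of re-running the status command by hand, you can block until the rollout finishes. This is a minimal sketch using `kubectl rollout status`. The function name and the 45-minute default timeout are illustrative assumptions, chosen to leave headroom over the roughly 30-minute startup.

```shell
# Hypothetical helper: block until the Deployment's rollout completes,
# or fail once the timeout expires.
# $1 = deployment name, $2 = timeout (illustrative default: 45m).
wait_for_deployment() {
  kubectl rollout status "deployment/$1" --timeout="${2:-45m}"
}

# Usage:
#   wait_for_deployment "$USER-serving-deepseek-r1-model"
```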

Note

Deployment time varies because the pod first downloads the model weights from Hugging Face and the server then loads them. Expect this to take around 30 minutes.

4.1.2. Interact with DeepSeek R1 671B model

Make an API request:

Send a chat message and receive a JSON response from the model:

kubectl exec -it deployment/$USER-serving-deepseek-r1-model -- \
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"default",
"messages":[
    {
        "role":"system",
        "content":"You are a helpful AI assistant"
    },
    {
        "role":"user",
        "content":"How many r are there in strawberry ?"
    }
],
"temperature":0.6,
"top_p":0.95,
"max_tokens":2048
}'

You should receive a JSON response from the model.

Stream a chat response:

First, open a new terminal session and forward a local port to the service to allow your local machine to communicate with the model server:
```
kubectl port-forward svc/$USER-serving-deepseek-r1-model-svc 8000:8000
```
In a separate terminal, run the stream_chat.sh utility script:
```
$RECIPE_ROOT/stream_chat.sh "Which is bigger 9.9 or 9.11 ?"
```
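If you prefer to send a streaming request yourself, the sketch below shows one way to do it with curl through the same port-forward. It is not the contents of `stream_chat.sh`. It assumes the server accepts the OpenAI-style `"stream": true` field on the chat completions endpoint used earlier, and that the prompt contains no double quotes or backslashes, since no JSON escaping is done. `stream_chat_sketch` is a hypothetical name.

```shell
# Hypothetical helper, NOT the contents of stream_chat.sh: send one streaming
# chat request to the port-forwarded server on localhost:8000.
# $1 = prompt (must not contain double quotes or backslashes)
# $2 = model name (defaults to "default", as in the request above)
stream_chat_sketch() {
  prompt=$1
  model=${2:-default}
  # -N disables curl's output buffering so chunks appear as they arrive.
  curl -sN http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"$prompt\"}],\"stream\":true}"
}

# Usage (with the port-forward running in another terminal):
#   stream_chat_sketch "Which is bigger 9.9 or 9.11 ?"
```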

4.1.3. Benchmark DeepSeek R1 671B

Run the SGLang benchmarking tool directly inside the running deployment:

kubectl exec -it deployment/$USER-serving-deepseek-r1-model -- /bin/sh -c \
'mkdir -p /gcs/benchmark_logs/sglang && python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  --num-prompt 1100 \
  --random-input 1000 \
  --random-output 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --output-file /gcs/benchmark_logs/sglang/ds_1000_1000_1100_output.jsonl'

The benchmark prints its results to the terminal and also writes them to the file given by --output-file.
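To repeat the benchmark with other sizes, the command can be wrapped in a function that takes the lengths and prompt count as arguments. This is a sketch only. `run_sglang_benchmark` is a hypothetical name, and the output file naming (`<prefix>_<input>_<output>_<num-prompts>_output.jsonl`) is an assumption inferred from the file name used above.

```shell
# Hypothetical helper: run the same sglang.bench_serving command with
# configurable sizes.
# $1 = deployment name, $2 = file prefix, $3 = input length,
# $4 = output length, $5 = number of prompts
run_sglang_benchmark() {
  deployment=$1
  prefix=$2
  input_len=$3
  output_len=$4
  num_prompts=$5
  # Assumed naming scheme, inferred from ds_1000_1000_1100_output.jsonl.
  out="/gcs/benchmark_logs/sglang/${prefix}_${input_len}_${output_len}_${num_prompts}_output.jsonl"
  kubectl exec -it "deployment/$deployment" -- /bin/sh -c \
    "mkdir -p /gcs/benchmark_logs/sglang && python3 -m sglang.bench_serving \
      --backend sglang --dataset-name random --random-range-ratio 1 \
      --num-prompt $num_prompts --random-input $input_len --random-output $output_len \
      --host 0.0.0.0 --port 8000 --output-file $out"
}

# Usage:
#   run_sglang_benchmark "$USER-serving-deepseek-r1-model" ds 1000 1000 1100
```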

4.2. Serving Llama 4 Models

This recipe serves various Llama 4 models using the SGLang framework on a single A3 Ultra node in full precision (BF16).

4.2.1. Compatible Llama 4 Models

Llama 4 models are offered in various sizes and precisions. This recipe is compatible with:

| Model Name | Total Size | Precision | Context Length |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E | 109B | BF16 | 3.6M |
| Llama-4-Scout-17B-16E-Instruct | 109B | BF16 | 3.6M |
| Llama-4-Maverick-17B-128E | 400B | BF16 | 1M |
| Llama-4-Maverick-17B-128E-Instruct | 400B | BF16 | 1M |

Note

Llama 4 models are gated. Ensure you have requested and been granted access on Hugging Face.

4.2.2. Deploy Llama 4 Models

Install the Helm chart to prepare and serve the model using the SGLang framework (choose one):

To serve Llama-4-Scout-17B-16E or Llama-4-Scout-17B-16E-Instruct (3.6M context):

cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file workload_launcher=$REPO_ROOT/src/launchers/sglang-launcher.sh \
--set-file serving_config=$REPO_ROOT/src/frameworks/a3ultra/sglang-configs/llama4-scout.yaml \
--set queue=${KUEUE_NAME} \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set workload.model.name=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--set workload.image=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE}:${SGLANG_VERSION} \
--set workload.framework=sglang \
$USER-serving-llama-4-scout-model \
$REPO_ROOT/src/helm-charts/a3ultra/inference-templates/deployment

This creates a Helm release and a Deployment named $USER-serving-llama-4-scout-model, and a Service named $USER-serving-llama-4-scout-model-svc.

To serve Llama-4-Maverick-17B-128E or Llama-4-Maverick-17B-128E-Instruct (1M context):

cd $RECIPE_ROOT
helm install -f values.yaml \
--set-file workload_launcher=$REPO_ROOT/src/launchers/sglang-launcher.sh \
--set-file serving_config=$REPO_ROOT/src/frameworks/a3ultra/sglang-configs/llama4-maverick.yaml \
--set queue=${KUEUE_NAME} \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set workload.model.name=meta-llama/Llama-4-Maverick-17B-128E-Instruct \
--set workload.image=${ARTIFACT_REGISTRY}/${SGLANG_IMAGE}:${SGLANG_VERSION} \
--set workload.framework=sglang \
$USER-serving-llama-4-maverick-model \
$REPO_ROOT/src/helm-charts/a3ultra/inference-templates/deployment

This creates a Helm release and a Deployment named $USER-serving-llama-4-maverick-model, and a Service named $USER-serving-llama-4-maverick-model-svc.
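The two commands above differ only in the serving config file, the model name, and the release name. The sketch below captures that in one function. `deploy_llama4` is a hypothetical helper, not part of the recipe, and it uses the Instruct model names exactly as in the commands above.

```shell
# Hypothetical helper: install the chart for one Llama 4 variant.
# $1 = scout or maverick
deploy_llama4() {
  variant=$1
  case "$variant" in
    scout)    model=meta-llama/Llama-4-Scout-17B-16E-Instruct ;;
    maverick) model=meta-llama/Llama-4-Maverick-17B-128E-Instruct ;;
    *)
      echo "usage: deploy_llama4 scout|maverick" >&2
      return 2
      ;;
  esac
  # Run in a subshell so the caller's working directory is left unchanged.
  (
    cd "$RECIPE_ROOT" && helm install -f values.yaml \
      --set-file workload_launcher="$REPO_ROOT/src/launchers/sglang-launcher.sh" \
      --set-file serving_config="$REPO_ROOT/src/frameworks/a3ultra/sglang-configs/llama4-$variant.yaml" \
      --set queue="$KUEUE_NAME" \
      --set "volumes.gcsMounts[0].bucketName=$GCS_BUCKET" \
      --set workload.model.name="$model" \
      --set workload.image="$ARTIFACT_REGISTRY/$SGLANG_IMAGE:$SGLANG_VERSION" \
      --set workload.framework=sglang \
      "$USER-serving-llama-4-$variant-model" \
      "$REPO_ROOT/src/helm-charts/a3ultra/inference-templates/deployment"
  )
}

# Usage:
#   deploy_llama4 scout
```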

Note

Deployment time varies because the pod first downloads the model weights from Hugging Face and the server then loads them. Expect this to take around 30 minutes.

Check the status of your deployment.

Change the deployment name to match the Llama 4 variant you deployed. For example, for Llama-4-Scout-17B-16E-Instruct, run:
```
kubectl get deployment/$USER-serving-llama-4-scout-model
```
Wait until the READY column shows 1/1. See the Monitoring and Troubleshooting section to view the deployment logs.

4.2.3. Interact with Llama 4 Models

Make an API request:

Send a chat message and receive a JSON response from the model. Here's the example for Llama-4-Scout-17B-16E-Instruct (assuming it is deployed as $USER-serving-llama-4-scout-model):

kubectl exec -it deployment/$USER-serving-llama-4-scout-model -- \
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model":"meta-llama/Llama-4-Scout-17B-16E-Instruct",
  "messages":[
      {
        "role":"system",
        "content":"You are a helpful AI assistant"
      },
      {
        "role":"user",
        "content":"What is the meaning of life?"
      }
  ],
  "temperature":0.6,
  "top_p":0.9,
  "max_tokens":128
}'

To get a response from a different deployed Llama 4 variant, change the deployment name in the command and the "model" field in the JSON payload.

Stream a chat response:

First, open a new terminal session and forward a local port to the service to allow your local machine to communicate with the model server. For example, for Llama-4-Scout-17B-16E-Instruct, run:
```
kubectl port-forward svc/$USER-serving-llama-4-scout-model-svc 8000:8000
```
In a separate terminal, run the stream_chat.sh utility script with the model name as the second argument:
```
$RECIPE_ROOT/stream_chat.sh "what is the meaning of life ?" "meta-llama/Llama-4-Scout-17B-16E-Instruct"
```

4.2.4. Benchmark Llama 4 Models

Run the SGLang benchmarking tool directly inside the running deployment. For example, to run benchmarks for the model Llama-4-Scout-17B-16E-Instruct with deployment $USER-serving-llama-4-scout-model:

kubectl exec -it deployment/$USER-serving-llama-4-scout-model -- /bin/sh -c \
'mkdir -p /gcs/benchmark_logs/sglang && python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  --num-prompt 1100 \
  --random-input 1000 \
  --random-output 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --output-file /gcs/benchmark_logs/sglang/llama4_1000_1000_1100_output.jsonl'

The benchmark prints its results to the terminal and also writes them to the file given by --output-file.

5. Monitoring and Troubleshooting

After the model is deployed via Helm as described in the sections above, use the following steps to monitor the deployment and interact with the model. Replace <deployment-name> and <service-name> with the appropriate names from the model-specific deployment instructions (e.g., $USER-serving-deepseek-r1-model and $USER-serving-deepseek-r1-model-svc).

5.1. Check Deployment Status

Check the status of your deployment. Replace the name if you deployed a different model.

# Example for DeepSeek
kubectl get deployment/$USER-serving-deepseek-r1-model

Wait until the READY column shows 1/1. If it shows 0/1, the pod is still starting up.
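The same readiness check can be scripted, for example before starting a port-forward. This is a sketch: `deployment_is_ready` is a hypothetical name, and it treats a missing `readyReplicas` field as zero ready replicas, which is how the status appears while the pod is still starting.

```shell
# Hypothetical helper: succeed only when all desired replicas are ready.
# $1 = deployment name
deployment_is_ready() {
  ready=$(kubectl get "deployment/$1" -o jsonpath='{.status.readyReplicas}') || return 1
  want=$(kubectl get "deployment/$1" -o jsonpath='{.spec.replicas}') || return 1
  # readyReplicas is omitted from the status while no replica is ready.
  [ "${ready:-0}" -ge 1 ] && [ "${ready:-0}" -eq "${want:-1}" ]
}

# Usage:
#   deployment_is_ready "$USER-serving-deepseek-r1-model" && echo "ready"
```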

Note

In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready.

5.2. View Logs

To see the logs from the SGLang server (useful for debugging), use the -f flag to follow the log stream:

kubectl logs -f deployment/$USER-serving-deepseek-r1-model

You should see logs showing the SGLang server downloading and loading the model and then starting the API server, similar to this:

[2025-01-31 11:43:03] INFO:     Started server process [1]
[2025-01-31 11:43:03] INFO:     Waiting for application startup.
[2025-01-31 11:43:03] INFO:     Application startup complete.
[2025-01-31 11:43:03] INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-01-31 11:43:04] INFO:     127.0.0.1:40408 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-31 11:43:04 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-31 11:43:06 TP7] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/N=4096,K=512,device_name=NVIDIA_H200,dtype=fp8_w8a8,block_shape=[128, 128].json for W8A8 Block FP8 kernel.
[2025-01-31 11:43:06 TP0] Using configuration from /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/
[2025-01-31 11:43:10] INFO:     127.0.0.1:40424 - "POST /generate HTTP/1.1" 200 OK
[2025-01-31 11:43:10] The server is fired up and ready to roll!

5.3. Common Issues

Error: Connection refused when using port-forward

If you are trying to stream responses using kubectl port-forward and get a connection error, check the following:
1. Is the deployment ready? Run kubectl get deployment and ensure the READY column is 1/1.
2. Is the port-forward command running? The command must remain active in its own terminal while you make requests.
3. Check Pod Logs: Use kubectl logs -f ... to check for any error messages.
4. Try again: Sometimes transient network issues can cause this. Stop the port-forward command (Ctrl+C) and run it again.

Error: denied: requested access to the resource is denied during Cloud Build

This almost always means the ARTIFACT_REGISTRY environment variable is incorrect. It must be the full path: <location>-docker.pkg.dev/<project-id>/<repository-name>.

Error: deployments.apps "..." not found

This indicates a typo in the deployment name. Use helm list to see the correct release names or kubectl get deployments to see all available deployment names.
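For the Cloud Build access error above, the shape of the Artifact Registry path can be checked before submitting a build. The sketch below tests only the `<location>-docker.pkg.dev/<project-id>/<repository-name>` format. It does not verify that the repository exists or that you have access to it. `valid_artifact_registry_path` is a hypothetical name.

```shell
# Hypothetical helper: check that a value has the form
# <location>-docker.pkg.dev/<project-id>/<repository-name>.
valid_artifact_registry_path() {
  case "$1" in
    */*/*/*) return 1 ;;                 # too many segments (URL scheme, trailing slash, image name)
    ?*-docker.pkg.dev/?*/?*) return 0 ;; # host, project, and repository are all present
    *) return 1 ;;
  esac
}

# Usage:
#   valid_artifact_registry_path "$ARTIFACT_REGISTRY" || echo "Check ARTIFACT_REGISTRY" >&2
```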

6. Cleanup

To avoid incurring further charges, clean up the resources you created.

Uninstall the Helm Release:

First, list your releases to get the deployed models:
```
# list deployed models
helm list --filter $USER-serving-
```
Then, uninstall the desired release:
```
# uninstall the deployed model
helm uninstall <release_name>
```
Replace <release_name> with the helm release names listed.
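If you deployed several models and want to remove all of them, the two steps can be combined. This is a sketch only. `uninstall_serving_releases` is a hypothetical name, and the function uninstalls every release in the current namespace whose name matches `$USER-serving-`, so review the output of `helm list` first.

```shell
# Hypothetical helper: uninstall every Helm release whose name matches
# "$USER-serving-" in the current namespace.
uninstall_serving_releases() {
  # --short prints release names only, one per line.
  for release in $(helm list --short --filter "$USER-serving-"); do
    helm uninstall "$release"
  done
}

# Usage:
#   uninstall_serving_releases
```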

Delete the Kubernetes Secret:

kubectl delete secret hf-secret --ignore-not-found=true

- (Optional) Delete the built Docker image from Artifact Registry if it is no longer needed.
- (Optional) Delete Cloud Build logs.
- (Optional) Clean up files in your GCS bucket if benchmarking was performed.
- (Optional) Delete the provisioned test environment, including the GKE cluster.
