The Shared Pathways Service accelerates developer iteration by providing a persistent, multi-tenant TPU environment. This decouples service creation from the development loop, allowing JAX clients to connect on-demand from a familiar local environment (like a laptop or cloud VM) to a long-running Pathways service that manages scheduling and error handling.
1. Ensure you have a GKE cluster with at least one TPU slice (v5e, v5p, or v6e).
2. Start the Shared Pathways Service using pw-service-example.yaml. Modify the following values before deploying the Pathways pods:
- A unique Jobset name for the head pod
- GCS bucket path
- TPU type and topology
- Number of slices
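These values map onto fields of the JobSet manifest. As a rough illustration only — the field names below are typical of Pathways-on-GKE JobSet manifests and may not match pw-service-example.yaml exactly:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pathways-cluster            # unique JobSet name for the head pod
spec:
  replicatedJobs:
    - name: worker
      replicas: 2                   # number of slices
      template:
        spec:
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice  # TPU type
                cloud.google.com/gke-tpu-topology: 2x2               # topology
# The head pod's container args typically carry the GCS bucket path,
# e.g. a flag pointing at gs://your-bucket/path.
```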
3. Verify that the pods created in Step#2 are running and that the Shared Pathways Service components have started, specifically the Pathways resource manager (RM) and the Pathways workers.
# Set the environment variables.
$ PROJECT=<your-project>
$ CLUSTER_NAME=<your-cluster>
$ REGION=<cluster-region> # e.g., us-central2
# Get credentials for your cluster.
$ gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project=$PROJECT && kubectl config view && kubectl config set-context --current --namespace=default

# List the Pathways pods.
$ kubectl get pods
# Sample expected output (1 Head pod and 1 or more Worker pods)
NAME                                       READY   STATUS    RESTARTS   AGE
pathways-cluster-pathways-head-0-0-zzmn2   2/2     Running   0          3m49s   # HEAD POD
pathways-cluster-worker-0-0-bdzq4          1/1     Running   0          3m36s   # WORKER 0
pathways-cluster-worker-1-0-km2rf          1/1     Running   0          3m36s   # WORKER 1

# e.g., pathways-cluster
$ JOBSET_NAME=<your-jobset-name> # same as you used in [pw-service-example.yaml](#pw-service-yaml)
# e.g., pathways-cluster-pathways-head-0-0-zzmn2
$ HEAD_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep head)
# e.g., pathways-cluster-worker-0-0-bdzq4
$ WORKER0_POD_NAME=$(kubectl get pods --selector=jobset.sigs.k8s.io/jobset-name=${JOBSET_NAME} -o jsonpath='{.items[?(@.status.phase=="Running")].metadata.name}' | sed 's/ /\n/g' | grep 'worker-0-0-')
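The jsonpath output is a single space-separated string of pod names; the `sed`/`grep` pipeline splits it into one name per line and keeps the matching one. A minimal sketch of that filtering step, using `echo` with sample pod names in place of `kubectl`:

```shell
# Sample space-separated pod list, standing in for the kubectl jsonpath output.
PODS="pathways-cluster-pathways-head-0-0-zzmn2 pathways-cluster-worker-0-0-bdzq4 pathways-cluster-worker-1-0-km2rf"

# Split on spaces into one name per line, then keep the head pod.
HEAD_POD_NAME=$(echo "$PODS" | sed 's/ /\n/g' | grep head)
echo "$HEAD_POD_NAME"   # prints pathways-cluster-pathways-head-0-0-zzmn2
```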
Find the address of the Pathways service in the logs. The command below checks the worker pod logs.
$ kubectl logs $WORKER0_POD_NAME --container pathways-worker | grep "\-\-resource_manager_address"
I1208 20:10:18.148825 ...] argv[2]: '--resource_manager_address=pathways-cluster-pathways-head-0-0.pathways-cluster:29001'

Install the pathways-utils package:

$ git clone https://github.com/AI-Hypercomputer/pathways-utils.git
$ pip install ./pathways-utils

In your script:
- Import `isc_pathways`.
- Add a `with isc_pathways.connect(...)` statement. The function takes the following values:
  - Cluster name
  - Project name
  - Region
  - GCS bucket name
  - Pathways service address (the RM address found from the worker logs above)
- Write your ML code under this context manager (the `with` block) to run your JAX code on the underlying TPUs.
See run_connect_example.py for reference. Example code:
python3 pathwaysutils/experimental/shared_pathways_service/run_connect_example.py \
--cluster="my-cluster" \
--project="my-project" \
--region="cluster-region" \
--gcs_bucket="gs://user-bucket" \
--pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001" \
--tpu_type="tpuv6e:2x2" \
  --tpu_count=1 # number of slices

The connect block will deploy a proxy pod dedicated to your client and connect your local runtime environment to the proxy pod via port-forwarding.
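For orientation, a hypothetical sketch of what such a client script might look like, based on the flags above. The import path and keyword-argument names here are assumptions, not the library's confirmed API; run_connect_example.py in pathways-utils is the authoritative reference.

```python
# Hypothetical sketch only: the isc_pathways import path and the
# connect() keyword names below are assumed from the flags above.
import jax.numpy as jnp

import isc_pathways  # from the pathways-utils package

with isc_pathways.connect(
    cluster="my-cluster",
    project="my-project",
    region="cluster-region",
    gcs_bucket="gs://user-bucket",
    pathways_service="pathways-cluster-pathways-head-0-0.pathways-cluster:29001",
    tpu_type="tpuv6e:2x2",
    tpu_count=1,
):
    # JAX code inside the with block runs on the TPUs managed by the
    # Shared Pathways Service.
    x = jnp.ones((1024, 1024))
    print((x @ x).mean())
```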
- You can start another client that uses the same `pathways_service` (similar to Step#3). If the Shared Pathways Service finds available TPU(s) that match your request, your workload will start running on those resources. If all TPUs are occupied, expect your script to block until TPUs become available again.
- Refer to this guide if your Pathways pods do not come up!
- Known errors: the service's cleanup process is not fully clean yet. You can safely ignore a `Segmentation fault` error, if you see one, after your ML job completes.