Distributed task orchestration platform in Go built around gRPC contracts, shared Postgres and Redis state, Kustomize-based GitOps delivery, and an Argo CD deployment path.
dispatchd uses Semantic Versioning for repository releases.
`0.x.y` covers the current hardening phase, where the public API and operational model can still evolve.

- patch releases are for backward-compatible fixes
- minor releases are for backward-compatible capabilities and hardening milestones
`1.0.0` is reserved for the point where the documented public API and operating model are stable enough for real production adoption.
Released tags are immutable. Once a version is published, any correction ships as a new version instead of mutating the existing tag.
The release contract for dispatchd currently includes:
- the protobuf and gRPC contract under `proto/dispatchd/v1`
- the generated Go client/server stubs under `gen/go/dispatchd`
- the documented operator-facing environment and deployment shape used by Docker Compose, Kustomize, Argo CD, and GitHub Actions
- the published image naming scheme under `ghcr.io/gnix0/dispatchd-*`
Until 1.0.0, these surfaces may still change, but any change that affects compatibility should be called out in release notes and pull requests.
The runtime is organized around three service roles:
- `control-plane`: accepts job submissions, serves query APIs, and persists job state
- `scheduler`: reconciles runnable and expired executions, maintains queue readiness, and holds the active scheduling lease
- `worker-gateway`: manages bidirectional worker streams for registration, heartbeats, assignment delivery, and result feedback
The core domain is centered on durable Job and Execution records:
- a submitted job creates an initial queued execution
- the scheduler moves runnable executions into Redis-backed ready queues
- workers receive assignments through the gRPC stream and renew execution leases through heartbeats
- failed executions are retried with bounded exponential backoff
- retry exhaustion produces a terminal dead-lettered execution
The protobuf module under proto/ defines the external system contract.
- `JobService` exposes submission, cancellation, lookup, and execution history
- `WorkerService` defines the worker control stream used for registration, heartbeat, assignment delivery, results, and acknowledgements
- generated protobuf and gRPC stubs are committed under `gen/go`
Key message families:
- `Job`: submitted work, metadata, priority, and retry policy
- `Execution`: an individual attempt for a job
- `RetryPolicy`: max attempts plus backoff configuration
- `ConnectRequest`/`ConnectResponse`: worker control-plane stream messages
Shared orchestration state is split across two storage layers:
- Postgres: source of truth for jobs, executions, retries, idempotency, worker registry data, and execution metadata
- Redis: scheduler leadership lock and ready queues keyed by capability namespace
This allows the control-plane, scheduler, and worker-gateway to coordinate as separate processes instead of relying on process-local memory.
The gRPC transport layer lives under internal/transport/grpcapi.
- request payloads are mapped into application-layer inputs
- application errors are translated into explicit gRPC status codes
- unary and stream interceptors can enforce JWT-based authentication and role-based authorization
- worker stream messages validate worker identity against the authenticated principal when security is enabled
- reflection and standard gRPC health services are registered centrally
The application layer lives under internal/application/.
- `jobs`: job creation, idempotency, execution tracking, completion handling, retry scheduling, and dead-letter decisions
- `scheduler`: single-leader reconciliation of expired leases and runnable execution queueing
- `workers`: worker registration, heartbeat state, capabilities, labels, concurrency metadata, and dispatch eligibility
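The idempotency behavior in the `jobs` use case can be illustrated with a minimal sketch. The `jobStore` type and `Submit` signature are assumptions for illustration; in dispatchd the duplicate check is backed by Postgres (for example a unique constraint on an idempotency key), not an in-memory map.

```go
package main

import "fmt"

// jobStore stands in for the Postgres-backed idempotency store.
type jobStore struct {
	byKey map[string]string // idempotency key -> job ID
	next  int
}

// Submit returns the existing job ID when the idempotency key has been seen
// before, otherwise it records a new job; resubmitting is therefore safe.
func (s *jobStore) Submit(idempotencyKey string) (jobID string, created bool) {
	if id, ok := s.byKey[idempotencyKey]; ok {
		return id, false
	}
	s.next++
	id := fmt.Sprintf("job-%d", s.next)
	s.byKey[idempotencyKey] = id
	return id, true
}

func main() {
	s := &jobStore{byKey: map[string]string{}}
	id1, created1 := s.Submit("req-42")
	id2, created2 := s.Submit("req-42") // client retry of the same submission
	fmt.Println(id1, created1)
	fmt.Println(id2, created2)
}
```

This is what lets clients retry a submission after a timeout without creating duplicate work.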
The repository includes a zero-trust-ready foundation that can be enabled without changing the public API shape:
- JWT authentication and role-based authorization at the gRPC interceptor layer
- worker identity validation on the bidirectional worker stream
- audit logging for authenticated gRPC operations
- TLS and client-certificate validation boundaries in the server bootstrap path
- Kubernetes secret mounts for JWT material and TLS certificates
Default local development keeps:
- `AUTH_ENABLED=false`
- `TLS_ENABLED=false`
Security-relevant role mapping is:
- `submitter`: submit jobs
- `viewer`: query jobs and execution history
- `operator`: submit, query, cancel, and operate service paths
- `worker`/`service`: connect through the worker stream
- `admin`: unrestricted gRPC access
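An interceptor-level authorization check over this role mapping could look like the sketch below. The method names and the role table are illustrative assumptions shaped after the services described earlier, not dispatchd's actual policy configuration.

```go
package main

import "fmt"

// requiredRoles maps gRPC full method names to roles allowed to call them;
// both the method names and role sets are illustrative.
var requiredRoles = map[string][]string{
	"/dispatchd.v1.JobService/SubmitJob":  {"submitter", "operator"},
	"/dispatchd.v1.JobService/GetJob":     {"viewer", "operator"},
	"/dispatchd.v1.JobService/CancelJob":  {"operator"},
	"/dispatchd.v1.WorkerService/Connect": {"worker", "service"},
}

// authorized reports whether any of the caller's roles permits the method.
func authorized(method string, roles []string) bool {
	for _, r := range roles {
		if r == "admin" {
			return true // admin has unrestricted gRPC access
		}
	}
	allowed, ok := requiredRoles[method]
	if !ok {
		return false // deny unknown methods for non-admin callers
	}
	for _, r := range roles {
		for _, a := range allowed {
			if r == a {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(authorized("/dispatchd.v1.JobService/SubmitJob", []string{"submitter"}))
	fmt.Println(authorized("/dispatchd.v1.JobService/CancelJob", []string{"viewer"}))
}
```

In the real interceptor the roles would come from validated JWT claims, and the decision would be made once per unary call or stream establishment.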
- a single multi-stage Dockerfile builds any service binary through the `SERVICE` build argument
- docker-compose.yml packages the three services together with Postgres and Redis for local distributed execution
- deploy/base contains reusable Deployments, Services, ConfigMap, Secrets, ServiceAccount, and NetworkPolicy resources
- deploy/overlays/dev defines the development overlay used by local `kind` clusters and GitOps updates
- deploy/overlays/staging defines a promotion-oriented staging overlay
- deploy/overlays/prod defines a production-oriented overlay and PodDisruptionBudgets
- deploy/kind/cluster.yaml maps NodePorts for local access to the control-plane and worker-gateway
- image tags are managed through the Kustomize `images` section in the dev overlay
- scripts/update-dev-image-tags.sh updates the dev overlay tags for automated GitOps PRs
- the release workflow publishes images from semantic version tags and opens a PR with updated deployment tags instead of mutating manifests directly on the default branch
- deploy/argocd/dev-application.yaml defines the dev `Application`
- deploy/argocd/prod-region-a-application.yaml defines the primary production-region application
- deploy/argocd/prod-region-b-application.yaml defines the passive production-region application
- deploy/argocd/project.yaml defines the Argo CD project boundaries
- Argo CD targets the `deploy/overlays/dev`, `deploy/overlays/prod-region-a`, and `deploy/overlays/prod-region-b` paths and can self-heal and prune once the repo is connected
If you fork or rename the repository, update the Argo CD repoURL fields to match the canonical Git URL for your deployment source.
- contribution expectations are documented in CONTRIBUTING.md
- security intake is documented in .github/SECURITY.md
- release history is tracked in CHANGELOG.md
- repository ownership defaults are declared in .github/CODEOWNERS
The repository includes active/passive region controls and explicit DR validation assets:
- scheduler leadership is coordinated through Redis so only one active scheduler instance reconciles work per region
- staging and production Kustomize overlays separate promotion targets from development
- production region overlays model active/passive control-plane and scheduler ownership through prod-region-a and prod-region-b
- production PodDisruptionBudgets keep the core services available during voluntary disruptions
- backup and restore helpers live in backup-postgres.sh, restore-postgres.sh, and validate-backup-restore.sh
- a scheduler restart drill lives in failover-smoke.sh
The region model is active/passive: only the primary region allows mutable orchestration work, while the passive region stays read-only until promoted.
Measured drill results:
- scheduler restart drill: 20/20 jobs succeeded after a live scheduler restart; submit_to_succeeded p95 2513.668 ms; evidence in scheduler-restart-drill.json
- backup/restore validation: restored counts matched the live snapshot exactly for jobs, executions, and workers; evidence in backup-restore-validation.json
The repository includes a Dockerized performance and observability workflow built around:
- Prometheus metrics scraping
- Grafana dashboards
- Jaeger trace collection through OTLP
- `k6` gRPC load generation for the control-plane unary path
- a dedicated `perf-worker` load client for worker registration and heartbeat pressure
- a dedicated perf-e2e driver for submit-to-claim and submit-to-succeeded measurements
Reference evidence lives under assets/perf, evidence/performance, and evidence/drills.
SLOs:
- control-plane unary gRPC availability: >= 99.9%
- control-plane unary gRPC latency: p95 < 50 ms in the reference Docker Compose environment
- worker heartbeat acknowledgement latency: p95 < 50 ms
- worker heartbeat error rate: < 0.1%
- end-to-end submit-to-succeeded latency: p95 < 7 s on the average reference-compose run
- scheduler restart drill latency: p95 < 3 s for the post-restart smoke run
- backup/restore validation integrity: restored job, execution, and worker counts match the source snapshot exactly
Measured SLIs:
- control-plane unary smoke: 5772 iterations, 23088/23088 checks passed, 192.05 iterations/s, grpc_req_duration avg 4.15 ms, p95 10.28 ms; evidence in control-plane-smoke.json
- worker heartbeat smoke: 180 heartbeats, 0 errors, avg ack 10.37 ms, p95 20.451 ms, p99 22.555 ms; evidence in worker-heartbeats-smoke.json
- end-to-end smoke: 20/20 jobs succeeded, submit_to_claim p95 1474.958 ms, submit_to_succeeded p95 1577.84 ms; evidence in end-to-end-smoke.json
- end-to-end average: 80/80 jobs succeeded, submit_to_claim p95 6200.305 ms, submit_to_succeeded p95 6201.945 ms; evidence in end-to-end-average.json
- scheduler restart drill: 20/20 jobs succeeded after restart, submit_to_claim p95 2511.935 ms, submit_to_succeeded p95 2513.668 ms; evidence in scheduler-restart-drill.json
- backup/restore validation: source and restored counts matched exactly at jobs=6196, executions=6196, workers=14; evidence in backup-restore-validation.json
Evidence:
- Grafana request-rate and latency captures such as grafana_grpc_req_rate.png and grafana_grpc_p95_latency.png
- worker and scheduler activity captures such as grafana_worker_stream_events.png, grafana_dispatch_events.png, and grafana_scheduler_ticks_avg_duration.png
- JSON benchmark and drill summaries under evidence/performance and evidence/drills
The repository includes a dedicated delivery and security layer:
- SAST: CodeQL for static analysis of the Go codebase
- SCA: `govulncheck`, Dependabot, and repository-level Trivy filesystem scanning
- Container Scanning: Trivy image scanning across the service images
- Secret Leakage Detection: Gitleaks plus Trivy secret scanning
- DAST: OWASP ZAP baseline scans against an explicitly provided environment URL
Because the core services expose gRPC rather than a browser-oriented HTTP application surface, DAST is modeled as an environment-level scan against an ingress, gateway, or externally exposed endpoint rather than as a local repo-only check.
cmd/ service entrypoints
deploy/ Kubernetes base, overlays, kind, and Argo CD manifests
gen/go/ generated protobuf and gRPC stubs
integration/ distributed smoke and future environment-level tests
internal/application/ application use cases
internal/platform/ shared runtime, config, security, store, and gRPC helpers
internal/transport/ gRPC handlers and protocol mapping
proto/ protobuf contracts
scripts/ developer automation helpers
tools/proto/ Dockerized protobuf toolchain
Core validation:
make fmt-check
make test
make build
make lint
make proto-check

Distributed smoke:

DISPATCHD_INTEGRATION=1 go test ./integration -run TestDistributedSubmitAssignCompleteFlow -count=1 -v

Protobuf workflows:
make proto
make proto-breaking

Container workflows:
make compose-config
make compose-up
make compose-down
make backup-postgres
make failover-smoke
make validate-backup-restore

Observability and performance workflows:
make perf-stack-up
make perf-k6-smoke
make perf-worker-smoke
make perf-e2e-smoke
make perf-e2e-average
make perf-stack-down

Kubernetes workflows:
make k8s-render
make k8s-render-staging
make k8s-render-prod
make k8s-validate
make argocd-render
make kind-up
make kind-down

GitOps tag update helper:
make gitops-update-dev IMAGE_TAG=sha-1234567

The protobuf toolchain runs in Docker, so local protoc or buf installation is not required.