Provide Spire Workload Attestation Support to ztunnel#1676

Open
MikeZappa87 wants to merge 14 commits into istio:master from MikeZappa87:mzappa/spirepid
@MikeZappa87
Contributor

@MikeZappa87 MikeZappa87 commented Nov 21, 2025

SPIRE Delegated Identity API Integration for ztunnel

Overview

This document describes the design and implementation of SPIRE integration in ztunnel using the Delegated Identity API. The implementation supports one attestation mode, PID-based, which trades higher SPIRE server load and memory usage for stronger per-workload attestation.

Background

Current ztunnel Certificate Management

The existing ztunnel certificate management uses SecretManager to cache certificates by Identity (SPIFFE ID). When multiple pods share the same service account, they share a single cached certificate, reducing CA calls and memory usage.
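
The sharing behaviour described above can be sketched as follows. This is a minimal illustration, not ztunnel's actual SecretManager: the `Identity` struct is a simplified stand-in, and `IdentityCache`, `fetch`, and `ca_calls` are hypothetical names, with a counter standing in for real CA round trips.

```rust
use std::collections::HashMap;

/// Simplified stand-in for ztunnel's `Identity` (a SPIFFE ID); hypothetical.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity {
    pub trust_domain: String,
    pub namespace: String,
    pub service_account: String,
}

/// Toy cache keyed by identity. `ca_calls` counts simulated CA round trips.
#[derive(Default)]
pub struct IdentityCache {
    certs: HashMap<Identity, String>,
    pub ca_calls: usize,
}

impl IdentityCache {
    /// Returns the cached certificate, calling the simulated CA only on a miss.
    pub fn fetch(&mut self, id: &Identity) -> String {
        if let Some(cert) = self.certs.get(id) {
            return cert.clone();
        }
        self.ca_calls += 1;
        let cert = format!(
            "cert for spiffe://{}/ns/{}/sa/{}",
            id.trust_domain, id.namespace, id.service_account
        );
        self.certs.insert(id.clone(), cert.clone());
        cert
    }
}
```

Two pods running under the same service account produce the same `Identity`, so the second `fetch` is served from the cache without another CA call.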

SPIRE Delegated Identity API

SPIRE's Delegated Identity API allows a trusted delegate (ztunnel) to request certificates on behalf of workloads. The API supports two attestation methods:

  1. Selectors: Identify workloads by Kubernetes namespace + service account (not used here, because it is SPIRE-specific)
  2. PID: Identify workloads by their process ID for stronger attestation

Design

Attestation Modes

In PID mode, each workload is attested individually using its container process ID. This approach:

  • Provides stronger security through per-workload attestation
  • Issues each pod its own certificate from SPIRE
  • Increases SPIRE server load and memory usage
  • Has SPIRE verify the actual running process, not just Kubernetes metadata

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Pod A (PID 1)  │     │  Pod B (PID 2)  │     │  Pod C (PID 3)  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Certificate A  │     │  Certificate B  │     │  Certificate C  │
│ (SPIRE call #1) │     │ (SPIRE call #2) │     │ (SPIRE call #3) │
└─────────────────┘     └─────────────────┘     └─────────────────┘

CompositeId Design

Motivation

The original SecretManager used Identity as the cache key. To support PID-based attestation while maintaining backward compatibility with the existing CaClientTrait interface, we introduced CompositeId<RequestKeyEnum>.

Structure

pub struct CompositeId<RequestKeyEnum> {
    id: Identity,           // The SPIFFE identity (ns/sa)
    key: RequestKeyEnum,    // Distinguishes individual workloads
}

pub enum RequestKeyEnum {
    Identity(Identity),     // For backwards compatibility
    Workload(WorkloadUid),  // For PID mode: key by workload UID
}

Trade-offs

This design was chosen to maintain backward compatibility with CaClientTrait:

#[async_trait]
pub trait CaClientTrait: Send + Sync {
    async fn fetch_certificate(
        &self, 
        id: &CompositeId<RequestKeyEnum>
    ) -> Result<tls::WorkloadCertificate, Error>;
}

Benefits:

  • Single interface works for both SPIRE attestation and the original CA client
  • No breaking changes to existing code paths
  • SecretManager can track per-workload state when needed

Consequences:

  • In PID mode, SecretManager caches by CompositeId, resulting in one cache entry per pod even if they share the same identity
  • This is intentional—each workload must be individually attested
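
The effect on the cache can be sketched with simplified stand-in types. This is an illustration under assumptions, not the PR's code: `Identity` and `WorkloadUid` are reduced to string newtypes, the generic parameter is named `K` for readability, and the constructors `by_identity` / `by_workload` and the helper `cache_entries` are hypothetical.

```rust
use std::collections::HashSet;

// Simplified stand-ins; the real ztunnel types carry more structure.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Identity(pub String);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct WorkloadUid(pub String);

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum RequestKeyEnum {
    Identity(Identity),
    Workload(WorkloadUid),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CompositeId<K> {
    id: Identity,
    key: K,
}

impl CompositeId<RequestKeyEnum> {
    /// Legacy key: workloads sharing an identity collapse to one cache entry.
    pub fn by_identity(id: Identity) -> Self {
        let key = RequestKeyEnum::Identity(id.clone());
        Self { id, key }
    }

    /// PID-mode key: the workload UID keeps each pod's entry distinct.
    pub fn by_workload(id: Identity, uid: WorkloadUid) -> Self {
        Self { id, key: RequestKeyEnum::Workload(uid) }
    }

    /// The SPIFFE identity this key was built for.
    pub fn identity(&self) -> &Identity {
        &self.id
    }
}

/// Counts how many distinct cache entries a list of keys would occupy.
pub fn cache_entries(keys: &[CompositeId<RequestKeyEnum>]) -> usize {
    keys.iter().collect::<HashSet<_>>().len()
}
```

With two pods under one service account, identity keys occupy a single entry, while workload keys occupy two.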

PID Verification Flow

In PID mode, ztunnel performs the following steps:

  1. Receive certificate request with WorkloadUid
  2. Query Container Runtime (CRI) for the container's PID
  3. Call SPIRE with the PID for attestation
  4. Re-verify PID after SPIRE returns (guards against PID reuse attacks)
  5. Return certificate to caller
async fn get_cert_by_pid(&self, pid: i32, wl_uid: &WorkloadUid) -> Result<...> {
    // 1. Get certificate from SPIRE using PID
    let certs = self.get_cert_from_spire(DelegateAttestationRequest::Pid(pid)).await;
    
    // 2. Re-verify PID hasn't changed (TOCTOU protection)
    if let Some(pid_client) = &self.pid {
        let fetched_pid = pid_client.fetch_pid(wl_uid).await?;
        if fetched_pid.into_i32() != pid {
            return Err(Error::UnableToDeterminePidForWorkload(...));
        }
    }
    
    Ok(certs?)
}
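
For readers who want to experiment with the re-verification step in isolation, here is a synchronous, self-contained sketch of the same idea. Everything in it is an assumption made for illustration: the `PidLookup` trait stands in for the CRI query, `StaticPids` is an in-memory lookup, the `issue` closure stands in for the SPIRE call, and `AttestError` is a hypothetical error type rather than ztunnel's `Error`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum AttestError {
    UnknownWorkload(String),
    PidChanged { expected: i32, found: i32 },
}

/// Hypothetical stand-in for the container-runtime (CRI) PID query.
pub trait PidLookup {
    fn fetch_pid(&self, workload_uid: &str) -> Option<i32>;
}

/// In-memory lookup used only for illustration.
pub struct StaticPids(pub HashMap<String, i32>);

impl PidLookup for StaticPids {
    fn fetch_pid(&self, workload_uid: &str) -> Option<i32> {
        self.0.get(workload_uid).copied()
    }
}

/// Requests a certificate for `pid`, then re-resolves the workload's PID and
/// rejects the result if it no longer matches the PID that was attested.
pub fn cert_by_pid<L, F>(
    lookup: &L,
    workload_uid: &str,
    pid: i32,
    issue: F,
) -> Result<String, AttestError>
where
    L: PidLookup,
    F: FnOnce(i32) -> String,
{
    // 1. Ask the issuer (SPIRE in the real flow) for a certificate.
    let cert = issue(pid);

    // 2. Re-check the PID after issuance to narrow the PID-reuse window.
    let current = lookup
        .fetch_pid(workload_uid)
        .ok_or_else(|| AttestError::UnknownWorkload(workload_uid.to_string()))?;
    if current != pid {
        return Err(AttestError::PidChanged { expected: pid, found: current });
    }
    Ok(cert)
}
```

The certificate is only returned when the second lookup still maps the workload to the PID that was attested; a mismatch discards the issued certificate.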

Comparison Summary

| Aspect | Selector Mode | PID Mode |
|---|---|---|
| Attestation Granularity | Per identity (ns/sa) | Per workload (pod) |
| Certificate Sharing | Yes: same identity shares cert | No: each pod gets own cert |
| SPIRE Calls | 1 per unique identity | 1 per pod |
| Memory Usage | Lower | Higher |
| Security Level | Standard | Enhanced |
| Cache Key | CompositeId with Identity key | CompositeId with WorkloadUid key |

Configuration

# Enable SPIRE integration
spire_enabled: true

# Choose attestation mode
spire_mode: "ByPid"  # or "BySelectors"

# SPIRE socket path
spire_socket_path: "/run/spire/sockets/agent.sock"

# Timeout for SPIRE operations  
spire_timeout: "30s"
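
As a rough illustration of how values like these could be validated, the sketch below parses the mode string and the timeout using only the standard library. It is an assumption-laden example: `SpireMode` and `parse_seconds` are hypothetical names, only the `"<n>s"` timeout form shown above is handled, and ztunnel's real configuration parsing may differ.

```rust
use std::str::FromStr;
use std::time::Duration;

/// Hypothetical enum mirroring the two `spire_mode` strings shown above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpireMode {
    ByPid,
    BySelectors,
}

impl FromStr for SpireMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ByPid" => Ok(SpireMode::ByPid),
            "BySelectors" => Ok(SpireMode::BySelectors),
            other => Err(format!("unknown spire_mode: {other}")),
        }
    }
}

/// Parses a whole-seconds duration written as "<n>s", for example "30s".
pub fn parse_seconds(s: &str) -> Result<Duration, String> {
    let digits = s
        .strip_suffix('s')
        .ok_or_else(|| format!("expected a value like \"30s\", got {s:?}"))?;
    let secs: u64 = digits
        .parse()
        .map_err(|e| format!("invalid seconds in {s:?}: {e}"))?;
    Ok(Duration::from_secs(secs))
}
```

Rejecting unknown mode strings at load time surfaces a typo in `spire_mode` as a startup error instead of a silent fallback.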

Future Considerations

  1. Certificate Caching with Per-Pod Attestation: In PID mode, we should cache and reuse certificates by Identity while still attesting every pod individually. This would reduce SPIRE server load and memory usage—multiple pods with the same identity would share one certificate after each pod passes local PID verification. The first pod triggers a SPIRE call; subsequent pods with the same identity only require local PID verification before reusing the cached certificate.

  2. Collaborate with SPIRE/SPIFFE Community: Work with the SPIRE and SPIFFE community to improve the Delegated Identity API and related interfaces to better support delegated attestation use cases like ztunnel's.

  3. Consider a different trait for attested workloads instead of modifying fetch_certificate.
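
Item 1 above could look roughly like the following. This is a speculative sketch of a possible future design, not something this PR implements: `SharedCertCache` and its methods are hypothetical, identities and workload UIDs are plain strings, the local PID check is reduced to a boolean supplied by the caller, and a counter stands in for SPIRE calls.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical cache that shares certificates per identity but still
/// requires every workload to pass a local PID check first.
#[derive(Default)]
pub struct SharedCertCache {
    certs: HashMap<String, String>, // identity -> certificate
    verified: HashSet<String>,      // workload UIDs that passed the PID check
    pub spire_calls: usize,         // simulated SPIRE round trips
}

impl SharedCertCache {
    /// `pid_ok` is the outcome of the local PID verification for this workload.
    pub fn fetch(
        &mut self,
        identity: &str,
        workload_uid: &str,
        pid_ok: bool,
    ) -> Result<String, String> {
        // Every pod is gated individually, even when a certificate is cached.
        if !pid_ok {
            return Err(format!("PID verification failed for {workload_uid}"));
        }
        self.verified.insert(workload_uid.to_string());

        // Reuse the certificate when another pod with this identity fetched it.
        if let Some(cert) = self.certs.get(identity) {
            return Ok(cert.clone());
        }

        // First pod for this identity: one simulated SPIRE call.
        self.spire_calls += 1;
        let cert = format!("cert for {identity}");
        self.certs.insert(identity.to_string(), cert.clone());
        Ok(cert)
    }

    pub fn verified_workloads(&self) -> usize {
        self.verified.len()
    }
}
```

In this sketch the number of SPIRE calls tracks the number of identities, while the number of local verifications still tracks the number of pods.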

@istio-testing istio-testing added do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 21, 2025
@istio-testing
Contributor

Hi @MikeZappa87. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MikeZappa87 MikeZappa87 force-pushed the mzappa/spirepid branch 6 times, most recently from 9105d4e to c20b489 Compare November 27, 2025 04:17
@MikeZappa87 MikeZappa87 marked this pull request as ready for review December 5, 2025 23:02
@MikeZappa87 MikeZappa87 requested a review from a team as a code owner December 5, 2025 23:02
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Dec 5, 2025
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 18, 2025
@Dimss

Dimss commented Jan 7, 2026

Hey @MikeZappa87, we (RH) are interested in this feature. Can I help push it forward somehow?
It looks like this PR needs a rebase.
In addition, can you shed some light on the supported attestation methods?
In this PR you mention Selector mode, but this PR was closed.
In the current PR you do not mention Selector mode, only PID mode.
To clarify: does the current PR support both Selector and PID attestation modes, or only PID mode?

@MikeZappa87
Contributor Author

@Dimss feel free to message me and Arndt on the Istio Slack to discuss. We went to the Istio community sync before the holidays and came away with a couple of action items.

I removed the selector approach to reduce friction with this PR. Selector mode is a SPIRE-specific implementation and does not exist in the SPIFFE specification. Right now, Arndt is working on the SPIFFE broker API spec, which is what I believe the Istio community would want: the current implementation is SPIRE-specific because a SPIFFE API does not exist yet.

Slack thread: https://istio.slack.com/archives/C049TCZMPCP/p1765304313250799

@tjons

tjons commented Jan 17, 2026

@MikeZappa87 hey Mike! This is going to be super helpful... Anything I can do to help move this along? I've got a few commits on SPIRE, do you need help pushing anything on that side forward?

@MikeZappa87
Contributor Author

The Istio community doesn't like the SPIRE-specific Delegated Identity API and wants the SPIFFE broker endpoint API instead. We are working with the SPIFFE community to get that moving. Reach out to me on the Istio Slack and I can add you to the chat.

@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 21, 2026
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Jan 31, 2026
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 5, 2026
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 17, 2026
@MikeZappa87 MikeZappa87 requested a review from a team as a code owner February 20, 2026 02:52
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Feb 20, 2026
@MikeZappa87 MikeZappa87 force-pushed the mzappa/spirepid branch 2 times, most recently from 944e2c0 to 4596b38 Compare February 24, 2026 20:12
nddq and others added 6 commits March 4, 2026 22:16
Document that this fork adds SPIRE as an alternative Certificate
Authority for workload identity, and explain the relationship with
Cilium as a temporary sub-project until upstream merge.

Co-authored-by: Bill Mulligan <billmulligan516@gmail.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Change `readiness_addr` from binding to `0.0.0.0` (all interfaces)
to `127.0.0.1` (localhost only). Since ztunnel runs with
`hostNetwork: true`, binding to all interfaces unnecessarily exposes
the readiness endpoint to the network. The kubelet runs on the same
node and can reach localhost for health probes.

This reduces attack surface for hostNetwork pods.

Update `malicious_calls_inpod` test expectations for port 15021:
captured clients now get a connection reset (Request) and uncaptured
clients get connection refused (Connection) since readiness no longer
listens on the node IP.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Add CI and release workflows for building multi-arch container images.

CI workflow (build-images-ci.yaml):
- Triggers on push to master and PRs
- Builds binaries for amd64/arm64 with both aws-lc and boring TLS modes
- Pushes multi-arch images to dev registry with SHA tags
- FIPS variant uses tag suffix (-fips) on the same repo

Release workflow (build-images-releases.yaml):
- Triggers on version tags (v*)
- Same binary matrix as CI
- Pushes to release registry with version tags
- Creates GitHub Release with auto-generated changelog
- Attaches binaries and container image tarballs as release assets

Supporting files:
- Dockerfile using distroless base with TARGETARCH for multi-arch
- Makefile.docker with local build and multi-arch push targets
- build-release target added to Makefile.core.mk

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Filter artifact download to ztunnel-* pattern to avoid downloading
unrelated dockerbuild metadata artifacts. Add release environment to
create-release job so Quay.io secrets are available for skopeo login.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Enable xds_address to use Unix domain sockets via the unix:// URI
scheme. This allows ztunnel to connect to local control plane
components through Unix sockets instead of only TCP/TLS.

Changes:
- Add unix_socket_path() helper for string-based unix: URI detection
  (hyper's Uri parser does not support the unix scheme)
- Add UdsConnector and UdsGrpcChannel for Unix socket gRPC connections
- Add GrpcChannel enum to support both TLS and Unix socket channels
- Update grpc_channel() factory to route unix: URIs to UDS path
- Update XDS client to detect Unix scheme and skip TLS config
- Validate unix: URIs in config to reject empty socket paths
- Revert CA client to TLS-only (Unix socket scoped to XDS)
- Add unit tests for URI validation, channel factory, and connector

Unix sockets don't require TLS as they are local trusted paths. Auth
headers are still injected for control plane authentication.

Usage:
  export XDS_ADDRESS="unix:///var/run/xds.sock"

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
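
The string-based `unix:` detection mentioned in the commit above might look something like this sketch. It borrows the helper name `unix_socket_path` from the commit message, but the signature and behaviour here are assumptions; in particular, returning `None` for an empty path is a simplification of the config validation the commit describes.

```rust
/// Hypothetical version of a string-based `unix:` URI check.
/// Returns the socket path for `unix:` addresses and `None` for anything else,
/// including `unix:` addresses whose path is empty.
pub fn unix_socket_path(addr: &str) -> Option<&str> {
    let rest = addr.strip_prefix("unix:")?;
    // Accept both `unix:/path` and `unix:///path` by dropping the `//` marker.
    let path = rest.strip_prefix("//").unwrap_or(rest);
    if path.is_empty() {
        None
    } else {
        Some(path)
    }
}
```

A caller could branch on the result to pick a Unix socket channel when a path is returned and fall back to the TLS channel otherwise.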
@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Mar 5, 2026
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Mar 9, 2026