Title: Add jitter support for HTTP max_connection_duration
Description:
Currently, Envoy supports jitter for TCP max_connection_duration (implemented in #40686), but not for HTTP connections. This causes synchronised connection draining when many HTTP/2 connections reach the same max_connection_duration simultaneously, leading to thundering herd problems.
Use Case / Problem Statement
We are running a production Istio service mesh with approximately x number of HTTP/2 SIDECAR_INBOUND connections using max_connection_duration: 7200s (2 hours).
Observed Behavior
When all connections hit the 2-hour mark simultaneously:
- Synchronised draining: All connections shut down at the same time
- Service disruption: Incoming requests receive 503 errors during the drain window
- Response flags: Metrics show extensive UC (Upstream Connection Termination) flags in Istio telemetry
Evidence
From Istio/Envoy metrics during drain cycles:
istio_requests_total{response_code="503",response_flags="UC"} [high counts]
istio_requests_total{response_code="503",response_flags="UC"}
istio_requests_total{response_code="503",response_flags="UC"}
This is a classic thundering herd problem caused by synchronised connection lifecycle management.
We need to implement connection duration limits for compliance and security reasons but cannot do so with the current behavior. The synchronised draining creates service disruptions. Kindly help with the feature.
Expected behavior: the existing TCP jitter implementation (from #40686) is extended to HTTP connection durations, so that connections are closed in a staggered manner instead of all at once.
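
To illustrate the staggering being requested, here is a minimal sketch of per-connection jitter. It is not Envoy's implementation and does not reflect how #40686 computes jitter. The function name `jittered_duration`, the `jitter_percent` parameter, and the choice to add (rather than subtract) the random offset are all assumptions made for illustration.

```python
import random


def jittered_duration(base_seconds: float, jitter_percent: float,
                      rng: random.Random) -> float:
    """Return base_seconds plus a random extra in [0, jitter_percent% of base].

    Hypothetical helper: names and semantics are assumptions, not Envoy's API.
    """
    if base_seconds < 0 or jitter_percent < 0:
        raise ValueError("base_seconds and jitter_percent must be non-negative")
    max_extra = base_seconds * jitter_percent / 100.0
    return base_seconds + rng.uniform(0.0, max_extra)


# Each connection draws its own lifetime when it is established, so
# connections opened at the same moment no longer expire together.
rng = random.Random(42)  # seeded only to make the sketch reproducible
durations = [jittered_duration(7200.0, 10.0, rng) for _ in range(1000)]
spread_seconds = max(durations) - min(durations)
```

With a 7200s base and 10% jitter, lifetimes fall between 7200s and 7920s, which spreads the drain over up to 12 minutes rather than a single instant. If the duration limit is a hard compliance ceiling, the offset could be subtracted instead so that no connection outlives the configured maximum.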