
Feat/traffic keepalive#2591

Draft
matthewlouisbrockman wants to merge 110 commits into main from feat/traffic-keepalive

Conversation

@matthewlouisbrockman
Contributor

No description provided.

# Conflicts:
#	packages/client-proxy/internal/proxy/paused_sandbox_resumer_grpc.go
Add a shared route IP resolver with the local-cluster fallback needed by CI, and make API/client-proxy callers treat empty resolved routes as unavailable instead of successful resume responses.

This keeps BYOC/remote empty node IPs from being treated as routable while preserving the local 127.0.0.1 path.
# Conflicts:
#	.env.gcp.template
#	iac/provider-gcp/Makefile
# Conflicts:
#	packages/api/internal/handlers/proxy_grpc.go
#	packages/client-proxy/internal/proxy/proxy.go
#	packages/client-proxy/internal/proxy/proxy_test.go
#	packages/client-proxy/internal/proxy/sandbox_lifecycle_client_grpc.go
#	packages/client-proxy/main.go
# Conflicts:
#	packages/api/internal/api/api.gen.go
#	packages/client-proxy/internal/proxy/proxy.go
# Conflicts:
#	packages/api/internal/handlers/proxy_grpc.go
#	packages/client-proxy/internal/proxy/paused_sandbox_resumer_grpc.go
#	packages/client-proxy/internal/proxy/proxy.go
#	packages/client-proxy/internal/proxy/proxy_test.go
#	packages/client-proxy/main.go
@cla-bot cla-bot Bot added the cla-signed label May 7, 2026
@cursor

cursor Bot commented May 7, 2026

PR Summary

Medium Risk
Touches sandbox lifecycle and proxy/orchestrator routing paths and introduces new TTL-extension RPC; bugs here can cause unauthorized access or sandboxes not expiring/refreshing as intended.

Overview
Adds a new sandbox lifecycle surface (lifecycle.*) with traffic keepalive configuration, threads the keepalive config through API create/resume/pause, DB snapshot config, orchestrator gRPC (SandboxKeepaliveConfig), and routing catalog metadata.

Client proxy now triggers asynchronous keepalive refresh on catalog hits when keepalive is enabled, and the API exposes a new gRPC KeepAliveSandbox endpoint; also refactors proxy auth/traffic-token validation. Potential issues: autoPause/autoResume alias conflict checks may reject clients that always send default false, keepalive refresh is fire-and-forget per request (possible bursty background calls under high QPS), and routing-catalog updates now depend on gRPC metadata being correctly forwarded (missing/incorrect metadata will silently skip catalog writes/deletes).

Reviewed by Cursor Bugbot for commit 21be331. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented May 7, 2026

❌ 5 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 2612 | 5 | 2607 | 7 |
View the full list of 9 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 70.45% (Passed 39 times, Failed 93 times)

Stack Traces | 2.51s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (2.51s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 72.96% (Passed 43 times, Failed 116 times)

Stack Traces | 227s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (226.75s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.20% (Passed 41 times, Failed 112 times)

Stack Traces | 4.02s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i3bl4j0q17ppczgjdhbf4
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i3bl4j0q17ppczgjdhbf4
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox i3bl4j0q17ppczgjdhbf4
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1369}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Thu, 07 May 2026 18:10:31 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i3bl4j0q17ppczgjdhbf4
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (4.02s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 54.20% (Passed 60 times, Failed 71 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.43% (Passed 37 times, Failed 52 times)

Stack Traces | 7.6s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1250}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.60s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 60.50% (Passed 47 times, Failed 72 times)

Stack Traces | 85.2s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (85.20s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 64.08% (Passed 37 times, Failed 66 times)

Stack Traces | 23.1s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1264}}
Executing command bash in sandbox iyxwfnc5ano0am6spd12b (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 182 MB\nFree memory before tmpfs mount: 802 MB\nMemory to use in integrity test (80% of free, min 64MB): 641 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"641+0 records in\n641+0 records out\n672137216 bytes (672 MB, 641 MiB) copied, 3.45325 s, 195 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=641\"\n\tUser time (seconds): "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0.00\n\tSystem ti"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"me (seconds)"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": 3.43\n\tPercent of"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"0:03.45\n\tAverage s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"hared text size (kbytes): 0\n\tAverage unshared data "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"size (kbytes"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"): 0\n\tAverag"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"e stack size"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" (kbytes): 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n\tAverage total size (kbytes): 0\n\tMaximum resident set size"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" (kbytes): "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"2600\n\tAverage "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"resident set "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" (reclaimin"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"g a frame) page faults: "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"343\n\tVoluntary context switches: 3\n\tInvoluntary context switch"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"es: 24\n\tSwa"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ps: 0\n\tFile system "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"inputs: 176\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket m"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"essages rec"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"eived: 0\n\tSigna"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ls delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i7xwnhgwdvbam6lzzl1u9
Executing command bash in sandbox i7xwnhgwdvbam6lzzl1u9 (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1281}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"a402dd6a2587b019bf48b7b1a3cbd82619c666aaad1af7e5cc4acafa13c17cfb\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i7xwnhgwdvbam6lzzl1u9
Executing command bash in sandbox i7xwnhgwdvbam6lzzl1u9 (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1284}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i7xwnhgwdvbam6lzzl1u9: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (23.15s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestEnvdAccessTokenAutoResumeViaProxy

Flake rate in main: 49.33% (Passed 38 times, Failed 37 times)

Stack Traces | 12.4s run time
=== RUN   TestEnvdAccessTokenAutoResumeViaProxy
=== PAUSE TestEnvdAccessTokenAutoResumeViaProxy
=== CONT  TestEnvdAccessTokenAutoResumeViaProxy
    traffic_access_token_test.go:359: 
        	Error Trace:	.../tests/proxies/traffic_access_token_test.go:359
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestEnvdAccessTokenAutoResumeViaProxy
--- FAIL: TestEnvdAccessTokenAutoResumeViaProxy (12.42s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxWithTrafficAccessTokenAutoResumeViaProxy

Flake rate in main: 49.33% (Passed 38 times, Failed 37 times)

Stack Traces | 20.2s run time
=== RUN   TestSandboxWithTrafficAccessTokenAutoResumeViaProxy
=== PAUSE TestSandboxWithTrafficAccessTokenAutoResumeViaProxy
=== CONT  TestSandboxWithTrafficAccessTokenAutoResumeViaProxy
    traffic_access_token_test.go:263: [Status code: 502] Response body: {"sandboxId":"ivoddvn2f0mx1y6z1xcw8","message":"The sandbox is running but port is not open","port":8080,"code":502}
    traffic_access_token_test.go:263: [Status code: 502] Response body: {"sandboxId":"ivoddvn2f0mx1y6z1xcw8","message":"The sandbox is running but port is not open","port":8080,"code":502}
    traffic_access_token_test.go:263: [Status code: 502] Response body: {"sandboxId":"ivoddvn2f0mx1y6z1xcw8","message":"The sandbox is running but port is not open","port":8080,"code":502}
Executing command apt-get in sandbox iuq3cq1x0kolg6lx1q91t (user: root)
    traffic_access_token_test.go:292: 
        	Error Trace:	.../tests/proxies/traffic_access_token_test.go:292
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxWithTrafficAccessTokenAutoResumeViaProxy
--- FAIL: TestSandboxWithTrafficAccessTokenAutoResumeViaProxy (20.23s)



@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The metadataFirstValue function returns a boolean indicating if the value was found. Ignoring this boolean for providedToken and providedEnvdToken means that a missing header is treated the same as an empty header. It is more robust and explicit to check if the token was actually provided before attempting to match it, especially for security-sensitive authentication tokens.

Comment on lines +134 to +138
providedToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataTrafficAccessToken)

if !tokensMatch(providedToken, expectedToken) {
	return denyResumePermission()
}

high

The metadataFirstValue function returns a boolean indicating if the value was found. Ignoring this boolean for providedToken means that a missing MetadataTrafficAccessToken header is treated the same as an empty MetadataTrafficAccessToken header. While tokensMatch might handle this correctly if expectedToken is always non-empty, it's more robust and explicit to check if the token was actually provided before attempting to match it, especially for security-sensitive authentication tokens.

Suggested change
- providedToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataTrafficAccessToken)
- if !tokensMatch(providedToken, expectedToken) {
- 	return denyResumePermission()
- }
+ providedToken, found := metadataFirstValue(incomingMetadata, proxygrpc.MetadataTrafficAccessToken)
+ if !found || !tokensMatch(providedToken, expectedToken) {
+ 	return denyResumePermission()
+ }
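
To illustrate why the found flag matters, here is a self-contained sketch. Everything in it is a stand-in: `firstValue` and `tokensEqual` only approximate what `metadataFirstValue` and `tokensMatch` might do (their real bodies are not shown in this PR excerpt), the header key is hypothetical, and the constant-time comparison is an assumption rather than a confirmed detail of the PR.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// firstValue is a stand-in for metadataFirstValue (signature assumed):
// it returns the first value for key and whether the key was present.
func firstValue(md map[string][]string, key string) (string, bool) {
	vals, ok := md[key]
	if !ok || len(vals) == 0 {
		return "", false
	}
	return vals[0], true
}

// tokensEqual is a stand-in for tokensMatch; comparing in constant time is
// an assumption made for this sketch.
func tokensEqual(provided, expected string) bool {
	return subtle.ConstantTimeCompare([]byte(provided), []byte(expected)) == 1
}

// authorize denies when the header is absent, before any comparison runs.
func authorize(md map[string][]string, key, expected string) bool {
	provided, found := firstValue(md, key)
	if !found {
		return false
	}
	return tokensEqual(provided, expected)
}

func main() {
	const key = "x-traffic-access-token" // hypothetical metadata key

	md := map[string][]string{}
	// Without the found check, "" would match an empty expected token.
	fmt.Println(authorize(md, key, "")) // false: header missing

	md[key] = []string{"s3cret"}
	fmt.Println(authorize(md, key, "s3cret")) // true
	fmt.Println(authorize(md, key, "other"))  // false
}
```

The first call shows the edge case the review is worried about: if the expected token were ever empty, a request with no header at all would otherwise pass the comparison.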

Comment on lines +143 to 147
providedEnvdToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataEnvdAccessToken)

var clientProxyClaims oauth.Claims
if s.requireEdgeClientProxyAuth {
var authErr error
clientProxyClaims, authErr = oauth.RequireClaims(ctx, incomingMetadata, s.clientProxyOAuth)
if authErr != nil {
return nil, authErr
if !tokensMatch(providedEnvdToken, *envdAccessToken) {
return denyResumePermission()
}

high

Similar to the traffic access token, ignoring the found boolean for providedEnvdToken can lead to a missing MetadataEnvdAccessToken header being treated as an empty one. For security-sensitive tokens, it's best practice to explicitly check if the token was present.

Suggested change
- providedEnvdToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataEnvdAccessToken)
- var clientProxyClaims oauth.Claims
- if s.requireEdgeClientProxyAuth {
- var authErr error
- clientProxyClaims, authErr = oauth.RequireClaims(ctx, incomingMetadata, s.clientProxyOAuth)
- if authErr != nil {
- return nil, authErr
- if !tokensMatch(providedEnvdToken, *envdAccessToken) {
- return denyResumePermission()
- }
+ providedEnvdToken, found := metadataFirstValue(incomingMetadata, proxygrpc.MetadataEnvdAccessToken)
+ if !found || !tokensMatch(providedEnvdToken, *envdAccessToken) {
+ 	return denyResumePermission()
+ }


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


}

if s.Keepalive != nil && s.Keepalive.Traffic != nil && s.Keepalive.Traffic.Enabled {
trafficKeepalive.MaybeRefresh(ctx, sandboxId, sandboxPort, trafficAccessToken, envdAccessToken, c, s)

Nil pointer dereference when trafficKeepalive manager is nil

Low Severity

catalogResolution accepts a nil trafficKeepalive parameter (tests pass nil), but when a sandbox has keepalive enabled in its catalog entry, trafficKeepalive.MaybeRefresh(...) is called without a nil guard. Since MaybeRefresh accesses m.resumer without checking if m is nil, this would panic. Production is safe because NewClientProxy always initializes the manager, but the function contract allows nil and tests rely on it.
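
A nil-receiver guard is one possible way to make a nil manager safe to call. The sketch below uses hypothetical types (`keepaliveManager`, `resumer`, `fakeResumer`) and a simplified `MaybeRefresh` signature; the PR's real method takes more arguments and its body is not shown here.

```go
package main

import "fmt"

// resumer is a hypothetical stand-in for whatever the manager calls to
// extend a sandbox's TTL.
type resumer interface {
	KeepAlive(sandboxID string) error
}

// keepaliveManager is a simplified stand-in for the traffic keepalive manager.
type keepaliveManager struct {
	resumer resumer
}

// MaybeRefresh is safe to call on a nil *keepaliveManager: Go allows method
// calls on nil pointer receivers as long as the receiver is not dereferenced.
// It reports whether a refresh was attempted.
func (m *keepaliveManager) MaybeRefresh(sandboxID string) bool {
	if m == nil || m.resumer == nil {
		return false // nothing to refresh with; skip instead of panicking
	}
	_ = m.resumer.KeepAlive(sandboxID) // error handling omitted in this sketch
	return true
}

// fakeResumer records the sandbox IDs it was asked to keep alive.
type fakeResumer struct{ calls []string }

func (f *fakeResumer) KeepAlive(sandboxID string) error {
	f.calls = append(f.calls, sandboxID)
	return nil
}

func main() {
	var nilManager *keepaliveManager
	fmt.Println(nilManager.MaybeRefresh("sbx-1")) // false, no panic

	fr := &fakeResumer{}
	m := &keepaliveManager{resumer: fr}
	fmt.Println(m.MaybeRefresh("sbx-1")) // true
	fmt.Println(fr.calls)                // [sbx-1]
}
```

An alternative is to guard at the call site in catalogResolution; the receiver guard has the advantage that tests passing nil keep working without changes.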

