Summary
The daily scheduled E2E pipeline (🧪 E2E Tests) has a ~27% failure rate (8 failures in the last 30 runs). The most frequent flaky test is TestTerraformModuleTest, which intermittently finds 5 permissions instead of the expected 8.
Root Cause
The tests share a single service principal per (os, type) matrix combination. Each test calls DetachRolesFromSP at startup to remove all role assignments, then waits 15 seconds for Azure RBAC de-propagation before creating a new custom role and starting the MPF iteration loop.
However, 15 seconds is often insufficient for Azure to fully de-propagate the previous test's role assignments. When TestTerraformModuleTest starts after TestTerraformACINoTfvarsFile (which discovers resourcegroups/{read,write,delete}), the SP may still effectively have those permissions cached from the prior test's custom role.
What happens on a failed run
Traced from runs #21550858823 (Jan 31) and #21840777556 (Feb 9):
| Iteration | What happened | Permission discovered |
|---|---|---|
| 0 | azurerm_resource_group.this: Creation complete after 13s. RG created successfully, then workspaces/read denied | workspaces/read |
| 1 | RG refreshed OK, workspaces/write denied | workspaces/write |
| 2 | Deployment succeeds → enters destroy phase → "Authorization Successful" | workspaces/delete (from destroy) |
Result: 5 permissions found (workspaces/{read,write,delete} + deployments/{read,write} from initial)
Expected: 8 permissions (the above + resourcegroups/{read,write,delete})
The 3 missing resourcegroups/* permissions were never surfaced as AuthorizationFailed errors because the SP still had them from the previous test's role assignment that hadn't fully de-propagated.
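To make the gap concrete, the difference between the expected and discovered permission sets can be computed directly. The following is a minimal sketch, not code from the repository: `missingPermissions` is a hypothetical helper, and the permission strings are the abbreviated forms used in this issue rather than full Azure action names.

```go
package main

import (
	"fmt"
	"sort"
)

// missingPermissions returns every entry of expected that does not appear
// in found, sorted alphabetically. Hypothetical helper for illustration.
func missingPermissions(expected, found []string) []string {
	seen := make(map[string]struct{}, len(found))
	for _, p := range found {
		seen[p] = struct{}{}
	}
	var missing []string
	for _, p := range expected {
		if _, ok := seen[p]; !ok {
			missing = append(missing, p)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Abbreviated permission names as written in this issue.
	found := []string{
		"workspaces/read", "workspaces/write", "workspaces/delete",
		"deployments/read", "deployments/write",
	}
	expected := append([]string{
		"resourcegroups/read", "resourcegroups/write", "resourcegroups/delete",
	}, found...)

	fmt.Println(missingPermissions(expected, found))
	// prints: [resourcegroups/delete resourcegroups/read resourcegroups/write]
}
```

Reporting the missing names instead of only a count would make a failed run point at the contaminated permissions immediately.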
Key evidence
- The resource group is created by Terraform, not by MPF (autoCreateResourceGroup is false for all Terraform tests)
- The RG creation succeeds without any auth error in iteration 0, proving the SP still has resourcegroups/write from the prior test
- The same 3 permissions are missing in every occurrence (Jan 31 and Feb 9 show identical patterns)
- The tests run sequentially (-p 1 -parallel 1), so the issue is temporal, not concurrent
Affected Code
- pkg/usecase/mpfService.go:112: time.Sleep(15 * time.Second), the wait after role deletion
- e2eTests/e2eTerraform_test.go:229: assert.Equal(t, 8, len(perms)), an exact count assertion
Failed Runs
| Date | Run ID | Test | Error |
|---|---|---|---|
| Feb 24 | 22370086522 | TestARMTemplatMultiResourceTemplateFullDeployment | 54 perms, expected ≥57 (same root cause) |
| Feb 9 | 21840777556 | TestTerraformModuleTest | 5 perms, expected 8 |
| Jan 31 | 21550858823 | TestTerraformModuleTest | 5 perms, expected 8 |
Proposed Solutions
- Short-term: Increase the RBAC de-propagation wait from 15s to 45s
- Medium-term: Replace the fixed sleep with an active probe: poll until a test API call returns 403 AuthorizationFailed, confirming old permissions are fully revoked
- Long-term: Use separate service principals per test function to eliminate cross-test RBAC contamination entirely
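The medium-term option could look roughly like the sketch below. It is an assumption-laden illustration, not the MPF implementation: `waitUntilDenied`, `errDenied`, and `errStillAuthorized` are hypothetical names, and the probe is a plain function the caller supplies, for example a call that needs a permission the previous test's role granted, mapped to `errDenied` when the API answers 403 AuthorizationFailed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDenied is a hypothetical sentinel standing in for a
// 403 AuthorizationFailed response from the probe call.
var errDenied = errors.New("AuthorizationFailed")

// errStillAuthorized is returned when the timeout passes while the
// probe call keeps succeeding.
var errStillAuthorized = errors.New("permissions still effective after timeout")

// waitUntilDenied calls probe every interval until it returns errDenied
// (old permissions revoked), returns some other error, or timeout elapses.
func waitUntilDenied(probe func() error, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := probe()
		switch {
		case errors.Is(err, errDenied):
			return nil // the old role assignment no longer takes effect
		case err != nil:
			return fmt.Errorf("probe failed unexpectedly: %w", err)
		}
		// Stop if another sleep would run past the deadline.
		if time.Now().Add(interval).After(deadline) {
			return errStillAuthorized
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated probe: still authorized on the first two calls, then denied.
	calls := 0
	probe := func() error {
		calls++
		if calls < 3 {
			return nil
		}
		return errDenied
	}

	err := waitUntilDenied(probe, 10*time.Millisecond, time.Second)
	fmt.Println(calls, err) // prints: 3 <nil>
}
```

A single denied response may not guarantee that revocation has propagated everywhere, so a real probe might require several consecutive denials before proceeding; that refinement is left out here to keep the sketch short.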