Skip to content

Add error_message field to bundle deploy telemetry#4793

Draft
shreyas-goenka wants to merge 14 commits intomainfrom
bundle-deploy-error-message
Draft

Add error_message field to bundle deploy telemetry#4793
shreyas-goenka wants to merge 14 commits intomainfrom
bundle-deploy-error-message

Conversation

@shreyas-goenka
Copy link
Contributor

@shreyas-goenka shreyas-goenka commented Mar 19, 2026

Changes

Adds support for logging error messages encountered during bundle deploy in telemetry. This gives the developer ecosystem team visibility into user-facing errors.

What changed:

  • libs/telemetry/protos/bundle_deploy.go: Added ErrorMessage field to BundleDeployEvent struct
  • libs/logdiag/logdiag.go: Added FirstErrorSummary field to LogDiagData to capture the summary of the first error diagnostic. Added GetFirstErrorSummary() getter
  • cmd/bundle/utils/process.go: Moved LogDeployTelemetry into a defer so telemetry is always logged, even when the deploy command fails, for both diagnostic errors and regular Go errors
  • bundle/phases/telemetry.go: Populates ErrorMessage from logdiag.GetFirstErrorSummary(ctx). Error messages are capped at 500 characters
  • bundle/phases/telemetry_scrub.go: Best-effort regex scrubber that runs before telemetry upload. The error message is treated as PII by the logging infrastructure; we scrub to avoid collecting more information than necessary. Scrubs:
    • Windows absolute paths → [REDACTED_PATH]
    • Workspace paths (/Workspace/...) → [REDACTED_WORKSPACE_PATH]
    • Unix absolute paths → [REDACTED_PATH]
    • Explicit relative paths (./..., ../..., .databricks/..., ~/...) → [REDACTED_REL_PATH]
    • Implicit relative paths (resources/file.yml) → [REDACTED_REL_PATH]
    • Email addresses → [REDACTED_EMAIL]
    • Inspired by VS Code's telemetry path scrubbing and Sentry's @userpath rule
  • libs/telemetry/logger.go: Added HasConfigUsed guard before accessing auth config in Upload, preventing panics when config setup fails early

Tests:

  • Unit tests for the scrubber in bundle/phases/telemetry_scrub_test.go
  • Acceptance test bundle/telemetry/deploy-error-message verifies the error message field is populated and scrubbed in telemetry

@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from c7ee5c8 to cb59f1d Compare March 19, 2026 13:01
@eng-dev-ecosystem-bot
Copy link
Collaborator

eng-dev-ecosystem-bot commented Mar 19, 2026

Commit: 968e9fb

Run: 23611240390

Env 🔄​flaky 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
💚​ aws linux 7 10 270 806 6:06
💚​ aws windows 7 10 272 804 4:54
💚​ aws-ucws linux 7 10 366 722 7:40
💚​ aws-ucws windows 7 10 368 720 5:50
💚​ azure linux 1 12 273 804 7:42
🔄​ azure windows 1 1 12 274 802 7:12
🔄​ azure-ucws linux 1 1 12 370 718 9:17
💚​ azure-ucws windows 1 12 373 716 5:01
💚​ gcp linux 1 12 269 807 6:02
💚​ gcp windows 1 12 271 805 5:22
19 interesting tests: 10 SKIP, 7 RECOVERED, 2 flaky
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🔄​ TestSyncEnsureRemotePathIsUsableIfRepoExists ✅​p ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p
🔄​ TestFetchRepositoryInfoAPI_FromRepo ✅​p ✅​p ✅​p ✅​p ✅​p 🔄​f ✅​p ✅​p ✅​p ✅​p
Top 20 slowest tests (at least 2 minutes):
duration env testname
4:52 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
4:15 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
4:09 gcp windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:45 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:41 gcp linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:18 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:17 azure linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:15 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
3:13 aws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
3:06 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:58 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:49 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:45 azure-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:44 aws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:41 aws-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:38 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:35 aws-ucws linux TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct
2:19 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:13 azure windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=terraform
2:12 azure-ucws windows TestAccept/bundle/resources/apps/inline_config/DATABRICKS_BUNDLE_ENGINE=direct

Track the first error diagnostic summary encountered during bundle deploy
in telemetry. Move telemetry logging into a defer so it's always captured,
even when deploy fails.

Co-authored-by: Isaac
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from cb59f1d to 1a8ff40 Compare March 19, 2026 13:21
@shreyas-goenka shreyas-goenka marked this pull request as ready for review March 19, 2026 13:24
@github-actions
Copy link

Suggested reviewers

Based on git history of the changed files, these people are best suited to review:

  • @denik -- recent work in bundle/phases/, cmd/bundle/utils/, libs/logdiag/

Confidence: high

Eligible reviewers

Based on CODEOWNERS, these people or teams could also review:

@andrewnester, @anton-107, @pietern, @simonfaltum

Suggestions based on git history of 5 changed files (5 scored). See CODEOWNERS for path-specific ownership rules.

Copy link
Member

@simonfaltum simonfaltum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Swarm Summary (2 independent reviewers + cross-review)

Verdict: REQUEST CHANGES - The defer-based approach is correct in principle, but has 3 issues that need attention before merge.

  1. [IMPORTANT] Potential panic on auth/config failures - The deferred LogDeployTelemetry() runs whenever b != nil, even before cmdctx.SetConfigUsed() succeeds. On auth/config failures, telemetry upload will call cmdctx.ConfigUsed(ctx) and panic.
  2. [IMPORTANT] Double deploy events on failure - Failed deploys will emit two BundleDeployEvents: one from the new defer and one from the root-level fallback in cmd/root/root.go:185.
  3. [IMPORTANT] PII risk in error messages - ErrorMessage is sent verbatim with no sanitization or size bound. Many deploy errors interpolate local filesystem paths, resource names, and workspace URLs.

Comment on lines +98 to +111
// Log deploy telemetry on all exit paths. This is a defer to ensure
// telemetry is logged even when the deploy command fails, for both
// diagnostic errors and regular Go errors.
if opts.Deploy {
defer func() {
if b == nil {
return
}
errMsg := logdiag.GetFirstErrorSummary(ctx)
if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
errMsg = retErr.Error()
}
phases.LogDeployTelemetry(ctx, b, errMsg)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT] This defer runs whenever b != nil, but cmdctx.SetConfigUsed() may not have been called yet (e.g., if configureBundle fails on auth/profile errors before reaching SetConfigUsed in cmd/root/bundle.go:187). When telemetry upload later calls cmdctx.ConfigUsed(ctx) in libs/telemetry/logger.go, it will panic.

Fix: guard with if !cmdctx.HasConfigUsed(ctx) { return } inside the defer, or move the defer setup to after SetConfigUsed() has succeeded.

Comment on lines +98 to +111
// Log deploy telemetry on all exit paths. This is a defer to ensure
// telemetry is logged even when the deploy command fails, for both
// diagnostic errors and regular Go errors.
if opts.Deploy {
defer func() {
if b == nil {
return
}
errMsg := logdiag.GetFirstErrorSummary(ctx)
if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
errMsg = retErr.Error()
}
phases.LogDeployTelemetry(ctx, b, errMsg)
}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT] This defer will now log a full BundleDeployEvent on deploy failure. But cmd/root/root.go:182-189 still appends a legacy empty BundleDeployEvent on every nonzero bundle_deploy exit. Result: two deploy events per failed deploy, which will skew failure counts and error-rate dashboards.

Fix: remove the root-level failure fallback, or gate it on "no deploy event was already logged".

BundleDeployEvent: &protos.BundleDeployEvent{
BundleUuid: bundleUuid,
DeploymentId: b.Metrics.DeploymentId.String(),
ErrorMessage: errMsg,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[IMPORTANT] ErrorMessage is sent verbatim from logdiag.GetFirstErrorSummary() / retErr.Error(). Many deploy errors interpolate local filesystem paths or user-controlled config values (e.g., from statemgmt/state_pull.go, config/mutator/translate_paths.go). This starts shipping raw PII/workspace details to telemetry with no sanitization or size bound.

Fix: emit a sanitized error code/category, or at least scrub paths and cap length (e.g., 500 chars).

Comment on lines +103 to +108
if b == nil {
return
}
errMsg := logdiag.GetFirstErrorSummary(ctx)
if errMsg == "" && retErr != nil && !errors.Is(retErr, root.ErrAlreadyPrinted) {
errMsg = retErr.Error()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] No new unit tests for the defer + error capture logic. The core behavior change (telemetry always fires, error message captured from logdiag or retErr) should have test coverage. Consider testing:

  • Telemetry fires on deploy failure with error message
  • ErrAlreadyPrinted errors fall through to GetFirstErrorSummary
  • Successful deploy passes empty error message

Two fixes:

1. Remove redundant empty BundleDeployEvent logging in root.go Execute.
   The defer in process.go already logs detailed deploy telemetry via
   LogDeployTelemetry on all exit paths. The root.go code caused duplicate
   events and Upload retries (mock returns numProtoSuccess=1).

2. Add HasConfigUsed guard in telemetry.Upload before calling ConfigUsed.
   When bundle config fails early (e.g., restricted script execution
   blocking preinit), ConfigUsed is never set on the context. The
   LogDeployTelemetry defer still logs an event, so Upload would panic
   trying to create an API client without auth config.

Co-authored-by: Isaac
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from 93f4e62 to a2fb5dd Compare March 26, 2026 16:14
@shreyas-goenka shreyas-goenka marked this pull request as draft March 26, 2026 16:27
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch 2 times, most recently from 426f35a to e50f480 Compare March 26, 2026 17:06
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from e50f480 to b6d60a8 Compare March 26, 2026 17:07
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from b6d60a8 to 36d67ed Compare March 26, 2026 17:15
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from 36d67ed to 7b76e5b Compare March 26, 2026 17:22
Adds a scrubber that runs before error messages are sent to telemetry:
1. Replaces the bundle root path with "." to avoid leaking local paths
2. Redacts remaining home directory paths (/Users/..., /home/..., C:\Users\...)
3. Redacts email addresses (e.g., in workspace paths)

Inspired by VS Code's telemetry path scrubbing and Sentry's @userpath rule.

Co-authored-by: Isaac
@shreyas-goenka shreyas-goenka force-pushed the bundle-deploy-error-message branch from 7b76e5b to ca229c4 Compare March 26, 2026 17:43
… path redaction

- Replace windowsHomeDirRegexp with general windowsAbsPathRegexp
- Add workspacePathRegexp for [REDACTED_WORKSPACE_PATH] label
- Use capturing groups instead of replaceDelimitedMatch helper
- Merge HomeDirRegexFallback tests into AbsolutePaths test

Co-authored-by: Isaac
The replacePath function converted bundle root to "." and home dir to
"~" before regex scrubbing. Since absolute and relative paths are
handled by separate regex patterns anyway, this added complexity for no
benefit. Absolute paths now consistently get [REDACTED_PATH].

Co-authored-by: Isaac
~/... expands to an absolute path, so it should match [REDACTED_PATH]
rather than [REDACTED_REL_PATH].

Co-authored-by: Isaac
Redacted paths now include the file extension for a known set of types,
e.g. [REDACTED_PATH](yml), [REDACTED_REL_PATH](json). This helps with
debugging without leaking sensitive information.

Co-authored-by: Isaac
Add .r, .scala, .sh, .hcl, .ini, .zip, .tar, .csv for better
coverage of file types common in the Databricks CLI ecosystem.

Co-authored-by: Isaac
Add .js, .ts, .jsx, .tsx, .html, .css (Databricks Apps / Vite),
.env, and .md.

Co-authored-by: Isaac
Add .xml, .properties, .conf, .tfstate, .tfvars, .egg, .gz, .tgz,
.dbc, .parquet, .avro, .log, .lock, .pem, .crt.

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants