[SPARK-38101][CORE] Fix executors failing fetching map statuses with INTERNAL_ERROR_BROADCAST by azmatsiddique · Pull Request #54987 · apache/spark

azmatsiddique · 2026-03-24T14:40:16Z

What changes were proposed in this pull request?
This PR introduces a retry mechanism in MapOutputTrackerWorker.getStatuses to mitigate executor failures during shuffle map status fetching. Specifically, it wraps the RPC fetch and broadcast deserialization in a bounded retry loop (up to 3 attempts with a 100ms delay). If MapOutputTracker.deserializeOutputStatuses fails due to the broadcast variable being concurrently invalidated on the driver (marked by a SparkException with "Unable to deserialize"), the worker will now retry the request to obtain a fresh broadcast or the map statuses directly.
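The bounded retry described above might look roughly like the following. This is a minimal, standalone sketch and not the code in this PR: `StaleBroadcastException`, `RetrySketch`, and `fetchWithRetry` are hypothetical names, and the exception class stands in for the SparkException whose message contains "Unable to deserialize".

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical stand-in for the SparkException raised when the broadcast
// was invalidated on the driver before the worker could deserialize it.
final class StaleBroadcastException(msg: String) extends RuntimeException(msg)

object RetrySketch {
  // Runs `fetch` up to `maxAttempts` times, sleeping `delayMs` between tries.
  // Only failures marked "Unable to deserialize" are retried; any other
  // failure, or the failure on the last attempt, is rethrown unchanged.
  @tailrec
  def fetchWithRetry[T](maxAttempts: Int, delayMs: Long, attempt: Int = 1)(fetch: () => T): T = {
    Try(fetch()) match {
      case Success(value) => value
      case Failure(e: StaleBroadcastException)
          if attempt < maxAttempts && e.getMessage.contains("Unable to deserialize") =>
        Thread.sleep(delayMs)
        fetchWithRetry(maxAttempts, delayMs, attempt + 1)(fetch)
      case Failure(e) => throw e
    }
  }
}
```

With the values quoted in the description, a call would be `RetrySketch.fetchWithRetry(3, 100L)(() => ...)`, where the function body performs the RPC fetch and deserialization. Matching on the message keeps unrelated failures from being retried.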

Why are the changes needed?
Executors can fail with [INTERNAL_ERROR_BROADCAST] Failed to get broadcast... if the driver invalidates a cached map status broadcast (via updateMapOutput) while an executor is in the process of fetching or deserializing it. This race condition, while rare, causes MetadataFetchFailedException and task retries. By handling this specifically at the MapOutputTrackerWorker level, we can recover from these transient invalidations without failing the task.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Unit Test: Added a new test case, SPARK-38101: concurrent updateMapOutput not interfering with getStatuses, in MapOutputTrackerSuite.scala. This test simulates aggressive concurrent map status updates on the driver while multiple executor threads fetch statuses, verifying that the retry logic successfully masks the invalidation errors.
Regression Testing: Ran the full MapOutputTrackerSuite (34 tests) and all passed.
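
The shape of such a concurrency test can be illustrated without Spark. The sketch below is a simplified model and not the test added in MapOutputTrackerSuite: `ConcurrentFetchSketch`, `update`, `fetch`, and `simulate` are hypothetical names, and an `AtomicReference` holding an `Option` stands in for the driver's cached broadcast, with `None` modelling the window in which the old broadcast has been invalidated.

```scala
import java.util.concurrent.{Callable, Executors}
import java.util.concurrent.atomic.AtomicReference

object ConcurrentFetchSketch {
  // Hypothetical stand-in for the cached map status broadcast on the driver.
  private val cached = new AtomicReference[Option[Vector[Int]]](Some(Vector(0)))

  // Models an update on the driver: invalidate, then publish a fresh value.
  def update(version: Int): Unit = {
    cached.set(None)
    cached.set(Some(Vector(version)))
  }

  // Models a worker fetch that retries while the cache is invalidated.
  def fetch(maxAttempts: Int): Option[Vector[Int]] = {
    var attempt = 0
    var result: Option[Vector[Int]] = None
    while (result.isEmpty && attempt < maxAttempts) {
      result = cached.get()
      attempt += 1
      if (result.isEmpty) Thread.sleep(1)
    }
    result
  }

  // One writer thread applies `updates` updates while `readers` threads fetch.
  def simulate(readers: Int, updates: Int): Seq[Option[Vector[Int]]] = {
    val pool = Executors.newFixedThreadPool(readers + 1)
    try {
      val writer = pool.submit(new Runnable {
        def run(): Unit = (1 to updates).foreach(update)
      })
      val futures = (1 to readers).map { _ =>
        pool.submit(new Callable[Option[Vector[Int]]] {
          def call(): Option[Vector[Int]] = fetch(1000)
        })
      }
      writer.get()
      futures.map(_.get())
    } finally pool.shutdown()
  }
}
```

A check in this style asserts that every reader ends up with a defined result even though the writer repeatedly invalidates the cached value.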


…n last CSV column

### What changes were proposed in this pull request?
This PR fixes an issue where the CSV reader inconsistently parses empty quoted strings (`""`) when the `escape` option is set to an empty string (`""`). Previously, mid-line empty quoted strings correctly resolved to null/empty, but the last column resolved to a literal `"` character due to univocity parser behavior.

### Why are the changes needed?
To ensure consistent parsing of CSV data regardless of column position.

### Does this PR introduce _any_ user-facing change?
Yes, it fixes a bug where users were receiving incorrect data (a literal quote instead of an empty/null value) for the last column in a row under specific CSV configurations.

### How was this patch tested?
Added a new regression test in `CSVSuite` that verifies consistent parsing of both mid-line and end-of-line empty quoted fields.


azmatsiddique added 6 commits March 22, 2026 22:38

[SPARK-55968][SQL] Do not treat vectorized reader capacity overflow as corrupted file

762186c

Trigger Github Actions

14797e9

[SPARK-55559][SQL] Fix BIT_COUNT for negative tinyint/smallint/int

2096f5b

[SPARK-54916][ML] Fix Parquet footer error in DecisionTree test suites

2ca1250

[SPARK-38101][CORE] Fix executors failing fetching map statuses with INTERNAL_ERROR_BROADCAST

637a415

azmatsiddique mentioned this pull request Mar 24, 2026

[SPARK-38101] execuors fail fetching map statuses with INTERNAL_ERROR_BROADCAST #54723

Open

azmatsiddique wants to merge 6 commits into apache:master from azmatsiddique:fix-spark-38101
