Add Poison Message Handling to the Dispatchers by sophiatev · Pull Request #1331 · Azure/durabletask

sophiatev · 2026-03-24T19:21:32Z

This PR adds poison message handling to the activity, entity, and orchestration dispatchers. When poison message handling is enabled, the general policy is:

- The dispatchers do not throw exceptions for irrecoverable errors, since throwing would cause the work item to be aborted repeatedly.
- We stop dispatching a message once its dispatch count exceeds the user-configured maximum.
- Whenever possible, we surface this information to the customer via a failed orchestration or entity operation.

Depending on the type of "irrecoverable" error, the backends might have to add special edge-case handling for the poison message. The SDK's responsibility is simply to mark the message as poisoned and prevent its processing.

Note that we have intentionally chosen not to include poison message handling for unlock requests. This is because failing to unlock an entity could leave an entire task hub in a bad state, so we retain the current behavior.
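As a rough illustration of the policy described above, here is a minimal sketch of the dispatch-count check. It is not the code in this PR: `PoisonMessagePolicySketch`, `DispatchDecisionSketch`, and `Decide` are hypothetical names, and treating a `null` maximum as "poison message handling disabled" is an assumption made only for this example.

```csharp
using System;

// Hypothetical stand-ins for illustration; these are not DurableTask.Core types.
public enum DispatchDecisionSketch
{
    Process,      // dispatch the message normally
    FailAsPoison, // stop dispatching and surface a failure where possible
}

public sealed class PoisonMessagePolicySketch
{
    // Assumption: null means poison message handling is disabled.
    readonly int? maxDispatchCount;

    public PoisonMessagePolicySketch(int? maxDispatchCount)
    {
        this.maxDispatchCount = maxDispatchCount;
    }

    public DispatchDecisionSketch Decide(int dispatchCount, bool isUnlockRequest)
    {
        // Unlock requests are exempt: failing to unlock an entity could leave
        // the task hub in a bad state, so they keep the existing behavior.
        if (this.maxDispatchCount == null || isUnlockRequest)
        {
            return DispatchDecisionSketch.Process;
        }

        // "Exceeds" is read as strictly greater than the configured maximum.
        return dispatchCount > this.maxDispatchCount.Value
            ? DispatchDecisionSketch.FailAsPoison
            : DispatchDecisionSketch.Process;
    }
}
```

With a maximum of 3, a message on its third dispatch is still processed and a message on its fourth is treated as poisoned, while an unlock request is processed regardless of its count.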

Copilot

Pull request overview

This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.

Changes:

- Adds `DispatchCount` to `HistoryEvent` (and propagates it into entity request messages) and adds `MaxDispatchCount` to `IOrchestrationService`.
- Adds poison detection logic in `TaskOrchestrationDispatcher`, `TaskEntityDispatcher`, and `TaskActivityDispatcher` to fail/drop over-dispatched messages.
- Adds structured logging support for poison message detection.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.


| File | Description |
| --- | --- |
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Detects over-dispatched orchestration events and fails the orchestration with non-retriable `FailureDetails`. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Detects poison activity events and either discards or fails activities based on dispatch count. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds `PoisonMessageDetected` structured logging helpers. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison message detection. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces a new event id for poison detection logs. |
| src/DurableTask.Core/IOrchestrationService.cs | Adds `MaxDispatchCount` configuration knob for providers. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds `DispatchCount` to all history events for serialization/transport. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds `AbandonAcquire()` to reset critical section lock acquisition state. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds `DispatchCount` field to entity request messages. |


Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.


…dingMessage

…arameters

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.


…ase of poison message handling, except for entity unlock requests

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.


cgillum

Sharing some initial feedback - I unfortunately haven't reviewed everything yet.

cgillum

Finished reviewing all the changes.

cgillum · 2026-06-02T19:35:01Z

+                        );
+                    traceActivity?.SetStatus(ActivityStatusCode.Error, message);
                }
+                else


The diff is a little hard to read due to the refactoring churn, and I'm not sure how accurate my quick reading of the code is, but I'm wondering if introducing this `else` is part of the problem. Instead of having an `else` block, can we exit early from the function in the `if` block and remove the `else`? That would make the code simpler, if possible.
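To make the suggestion concrete, here is a generic sketch of the two shapes being compared. It is not the dispatcher code from the diff: `EarlyExitSketch` and both methods are hypothetical, and the string results stand in for whatever the real branches do.

```csharp
using System;

// Hypothetical example; the real method in the diff is not reproduced here.
public static class EarlyExitSketch
{
    // Shape with an else block: the normal path is nested next to the failure path.
    public static string WithElse(bool failed, string message)
    {
        string outcome;
        if (failed)
        {
            outcome = "error: " + message;
        }
        else
        {
            outcome = "ok";
        }

        return outcome;
    }

    // Shape with an early exit: the failure path returns immediately,
    // so the normal path needs no else and stays at the top indentation level.
    public static string WithEarlyExit(bool failed, string message)
    {
        if (failed)
        {
            return "error: " + message;
        }

        return "ok";
    }
}
```

Both methods return the same value for every input; the second only changes how the control flow reads.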

cgillum · 2026-06-02T20:01:28Z

+                            runtimeState.OrchestrationInstance,
+                            request,
+                            $"Entity request has dispatch count {request.DispatchCount} which exceeds the maximum dispatch count " +
+                            $"of {this.maxDispatchCount} and will be failed.");


I see that we log the fact that a message is poisoned, but we still process it anyway? I would have expected some kind of short-circuit logic to stop us from processing it.

We do want to process it, in the sense that we want to make sure to send the calling orchestration the failure details if we can.
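One way to picture the distinction being discussed is the sketch below: a poisoned entity request never runs user code, but it is still turned into a failure result when there is a caller to notify. All types and names here are hypothetical stand-ins rather than the `TaskEntityDispatcher` implementation, and modelling a request with no caller as a `null` `CallerInstanceId` is an assumption.

```csharp
using System;

// Hypothetical stand-ins for illustration; not the DurableTask.Core entity types.
public sealed class EntityRequestSketch
{
    public string Id = "";
    public string CallerInstanceId; // assumption: null when there is no caller to notify
    public int DispatchCount;
}

public sealed class OperationResultSketch
{
    public string RequestId = "";
    public bool Failed;
    public string ErrorMessage;
}

public static class PoisonEntityRequestSketch
{
    // Returns the result to send back to the caller, or null when nothing is sent.
    public static OperationResultSketch Handle(
        EntityRequestSketch request,
        int maxDispatchCount,
        Func<EntityRequestSketch, OperationResultSketch> execute)
    {
        if (request.DispatchCount <= maxDispatchCount)
        {
            // Not poisoned: run the operation as usual.
            return execute(request);
        }

        // Poisoned: user code is never invoked for this request.
        if (request.CallerInstanceId == null)
        {
            return null;
        }

        // There is a caller, so surface the failure instead of dropping it silently.
        return new OperationResultSketch
        {
            RequestId = request.Id,
            Failed = true,
            ErrorMessage =
                $"Entity request has dispatch count {request.DispatchCount} which exceeds the maximum dispatch count " +
                $"of {maxDispatchCount} and will be failed.",
        };
    }
}
```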

Co-authored-by: Chris Gillum <cgillum@microsoft.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 14 comments.

Co-authored-by: Chris Gillum <cgillum@gmail.com>

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 19 comments.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

…ad for trace activities

sophiatev · 2026-06-04T00:57:57Z

+            // This can happen if not all of the operations in the batch were executed, in which case we populate the remaining
+            // activities with the failure details if they are available.
+            // If not, this work will be deferred and tried again, so we do not want to publish the activity.
+            for (int i = results.Count; i < traceActivities.Count; i++)


This is not related to this PR, but while working on it I realized that my old logic for trace activities was incorrect, so I took the chance to fix it.
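For readers without the full diff, the behavior described in the code comment above can be sketched roughly as follows. This is a simplified model, not the PR's code: `SpanSketch` is a hypothetical stand-in for a trace activity, and the `Published` flag is an assumed way to represent "do not publish".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a trace activity; not System.Diagnostics.Activity.
public sealed class SpanSketch
{
    public string Error;          // null when no failure was recorded
    public bool Published = true; // false means the span should not be published
}

public static class TraceActivitySketch
{
    // results: one entry per operation in the batch that actually executed.
    // failureMessage: failure details for the unexecuted remainder,
    // or null when that work is deferred and will be tried again.
    public static void CloseRemaining(
        IReadOnlyList<string> results,
        IReadOnlyList<SpanSketch> spans,
        string failureMessage)
    {
        // Only spans beyond the executed results are touched.
        for (int i = results.Count; i < spans.Count; i++)
        {
            if (failureMessage != null)
            {
                spans[i].Error = failureMessage;
            }
            else
            {
                // The operation will run again later, so publishing now would be misleading.
                spans[i].Published = false;
            }
        }
    }
}
```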

initial implementation

de244ff

Copilot AI review requested due to automatic review settings March 24, 2026 19:21

Copilot started reviewing on behalf of sophiatev March 24, 2026 19:22