Add Poison Message Handling to the Dispatchers#1331
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces “poison message” handling across the orchestration, entity, and activity dispatchers by tracking per-event dispatch attempts and failing/dropping work once a configured maximum dispatch count is exceeded.
Changes:
- Adds
DispatchCounttoHistoryEvent(and propagates it into entity request messages) and addsMaxDispatchCounttoIOrchestrationService. - Adds poison detection logic in
TaskOrchestrationDispatcher,TaskEntityDispatcher, andTaskActivityDispatcherto fail/drop over-dispatched messages. - Adds structured logging support for poison message detection.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| src/DurableTask.Core/TaskOrchestrationDispatcher.cs | Detects over-dispatched orchestration events and fails the orchestration with non-retriable FailureDetails. |
| src/DurableTask.Core/TaskEntityDispatcher.cs | Propagates dispatch counts into entity requests, filters/handles poison operations, and emits poison logs/failures. |
| src/DurableTask.Core/TaskActivityDispatcher.cs | Detects poison activity events and either discards or fails activities based on dispatch count. |
| src/DurableTask.Core/Logging/LogHelper.cs | Adds PoisonMessageDetected structured logging helpers. |
| src/DurableTask.Core/Logging/LogEvents.cs | Adds a new structured log event type for poison message detection. |
| src/DurableTask.Core/Logging/EventIds.cs | Introduces a new event id for poison detection logs. |
| src/DurableTask.Core/IOrchestrationService.cs | Adds MaxDispatchCount configuration knob for providers. |
| src/DurableTask.Core/History/HistoryEvent.cs | Adds DispatchCount to all history events for serialization/transport. |
| src/DurableTask.Core/Entities/OrchestrationEntityContext.cs | Adds AbandonAcquire() to reset critical section lock acquisition state. |
| src/DurableTask.Core/Entities/EventFormat/RequestMessage.cs | Adds DispatchCount field to entity request messages. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ase of poison message handling, except for entity unlock requests
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cgillum
left a comment
There was a problem hiding this comment.
Sharing some initial feedback - I unfortunately haven't reviewed everything yet.
cgillum
left a comment
There was a problem hiding this comment.
Finished reviewing all the changes.
| ); | ||
| traceActivity?.SetStatus(ActivityStatusCode.Error, message); | ||
| } | ||
| else |
There was a problem hiding this comment.
The diff is a little hard to read due to the refactoring churn. I'm not sure how accurate my quick reading of the code is, but I'm wondering if introducing this else is part of the problem. Instead of having an else block, can we exit early from the function in the if block and remove else? That would make the code simpler if possible.
| runtimeState.OrchestrationInstance, | ||
| request, | ||
| $"Entity request has dispatch count {request.DispatchCount} which exceeds the maximum dispatch count " + | ||
| $"of {this.maxDispatchCount} and will be failed."); |
There was a problem hiding this comment.
I see that we log the fact that a message is poisoned, but we still process it anyways? I would have expected some kind of short-circuit logic to stop us from processing it.
There was a problem hiding this comment.
We do want to process it in the sense that we want to make sure to send a calling orchestration the failure details if we can
Co-authored-by: Chris Gillum <cgillum@microsoft.com>
Co-authored-by: Chris Gillum <cgillum@gmail.com>
…ad for trace activities
| // This can happen if not all of the operations in the batch were executed, in which case we populate the remaining | ||
| // activities with the failure details if they are available. | ||
| // If not, this work will be deferred and tried again, so we do not want to publish the activity. | ||
| for (int i = results.Count; i < traceActivities.Count; i++) |
There was a problem hiding this comment.
This is not related to this PR but while working on it I realized this old logic I had for trace activities was incorrect so I took the chance to fix it.
This PR adds poison message handling to the activity, entity, and orchestration dispatchers. The general policy followed is that we want to make sure when poison message handling is enabled:
Depending on the type of "irrecoverable" error, the backends might have to add special edge-case handling for the poison message. The SDK's responsibility is simply to mark the message as poisoned and prevent its processing.
Note that we have intentionally chosen not to include poison message handling for unlock requests. This is because failing to unlock an entity could leave an entire task hub in a bad state, so we retain the current behavior.