Skip to content

Always unwind workflow recursion state before fault handling and unexpected execution failures#18990

Open
Copilot wants to merge 5 commits intomainfrom
copilot/fix-recursion-counter-leak
Open

Always unwind workflow recursion state before fault handling and unexpected execution failures#18990
Copilot wants to merge 5 commits intomainfrom
copilot/fix-recursion-counter-leak

Conversation

Copy link
Contributor

Copilot AI commented Mar 12, 2026

WorkflowManager.ExecuteWorkflowAsync could leave recursion counters elevated when an exception escaped outside the activity execution catch block. Once leaked, subsequent TriggerEventAsync calls for the same workflow type were treated as recursive invocations and skipped with no signal.

  • Description
    • Execution cleanup

      • Wrap ExecuteWorkflowAsync in try/finally so recursion state is always decremented on exit, including unexpected failures while traversing transitions or resolving activities.
      • Keep the activity fault path decrement, but move it to run before OnWorkflowFaultAsync() so fault handlers observe cleared recursion state and can safely trigger workflows again.
      • Use a local guard to avoid double-decrementing when the inner fault path already unwound recursion.
    • Regression coverage

      • Add a focused WorkflowManagerTests case that reproduces the failure with a transition targeting a non-existent activity.
      • Verify a second TriggerEventAsync call is allowed to execute again instead of being silently suppressed by leaked recursion state.
      • Add regression coverage showing a fault handler can trigger the same workflow again after an activity exception.
    • Behavioral effect

      • Expected workflow faults still fault.
      • Fault handlers now run after recursion state has been cleared for the current workflow execution.
      • Unexpected execution failures no longer poison the WorkflowManager instance for later triggers in the same scope.
IncrementRecursion(workflowContext.Workflow);
var recursionDecremented = false;

try
{
    // existing workflow execution
}
catch (Exception)
{
    DecrementRecursion(workflowContext.Workflow);
    recursionDecremented = true;

    await _workflowFaultHandler.OnWorkflowFaultAsync(...);
    throw;
}
finally
{
    if (!recursionDecremented)
    {
        DecrementRecursion(workflowContext.Workflow);
    }
}
Original prompt

This section details on the original issue you should resolve

<issue_title>WorkflowManager recursion counter leaks when ExecuteWorkflowAsync throws outside the activity catch block</issue_title>
<issue_description>### Describe the bug

WorkflowManager.ExecuteWorkflowAsync calls IncrementRecursion at the top of the method but only calls DecrementRecursion in two specific code paths — inside the catch block (faulted activity) and at the very bottom of the method. There is no try/finally wrapping the execution loop.

If an unhandled exception escapes the loop body outside the inner try/catch (for example, a NullReferenceException from workflowContext.GetActivity(activity.ActivityId) when a transition points to a deleted activity, or any other unexpected throw), the DecrementRecursion at line 518 is never reached. The counters in _recursions and _currentRecursionDepth remain permanently elevated for the lifetime of the WorkflowManager instance (which is scoped per request/tenant).

The consequence is silent workflow suppression: the next call to TriggerEventAsync for the same workflow type finds a non-zero recursion count, treats it as an in-progress recursive invocation, and returns an empty result — without logging a warning or error. Workflows simply stop firing with no observable indication.

// WorkflowManager.cs ~L393
public async Task<IEnumerable<ActivityRecord>> ExecuteWorkflowAsync(...)
{
    IncrementRecursion(workflowContext.Workflow);   // ← incremented here
    ...
    while (scheduled.Count > 0)
    {
        activity = scheduled.Pop();
        var activityContext = workflowContext.GetActivity(activity.ActivityId);  // ← can throw NRE outside the try/catch below
        ...
        try { ... }
        catch (Exception ex)
        {
            DecrementRecursion(...);   // ← only reached for exceptions from activity.ExecuteAsync
            return blocking.Distinct();
        }
    }
    DecrementRecursion(...);   // ← never reached when an exception escapes the loop
}

Orchard Core version

Reproducible on current main.

To Reproduce

  1. Create a WorkflowType with a Transition whose DestinationActivityId does not match any activity in WorkflowType.Activities (e.g. after manually editing workflow definitions or via a migration that renames activity IDs).
  2. Call TriggerEventAsync to start the workflow — a NullReferenceException is thrown from inside ExecuteWorkflowAsync.
  3. Catch or observe the exception in logs.
  4. Call TriggerEventAsync again for the same event/workflow type.
  5. Observe: the second call returns an empty list and no workflow is started — even though no recursion is actually occurring.

Expected behavior

DecrementRecursion should always be called when ExecuteWorkflowAsync exits, regardless of how it exits. The fix is to wrap the method body in a try/finally:

IncrementRecursion(workflowContext.Workflow);
try
{
    // ... existing loop ...
    return blockingActivities;
}
finally
{
    DecrementRecursion(workflowContext.Workflow);
}

The DecrementRecursion call inside the inner catch block (line ~473) should also be removed, as the finally will cover it.

Logs and screenshots

No log entry is produced for the silent suppression — that is what makes this bug particularly dangerous. The only observable symptom is workflows that stop executing after a prior exception, with no error in the logs.</issue_description>

<agent_instructions>Attempt to reproduce the issue in the editor too</agent_instructions>

Comments on the Issue (you are @copilot in this section)


🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. Learn more about Advanced Security.

Copilot AI and others added 2 commits March 12, 2026 17:15
Co-authored-by: sebastienros <1165805+sebastienros@users.noreply.github.com>
Co-authored-by: sebastienros <1165805+sebastienros@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix recursion counter leaks in ExecuteWorkflowAsync method Always unwind workflow recursion state after unexpected execution failures Mar 12, 2026
Copilot AI requested a review from sebastienros March 12, 2026 17:18
@sebastienros sebastienros marked this pull request as ready for review March 12, 2026 17:51
@sebastienros
Copy link
Member

@Smfakhoury Please check the changes

Comment on lines -473 to -475
DecrementRecursion(workflowContext.Workflow);
_logger.LogError(ex, "An unhandled error occurred while executing an activity. Workflow ID: '{WorkflowTypeId}'. Activity: '{ActivityId}', '{ActivityName}'. Putting the workflow in the faulted state.", workflowType.Id, activityContext.ActivityRecord.ActivityId, activityContext.ActivityRecord.Name);
workflowContext.Fault(ex, activityContext);

await _workflowFaultHandler.OnWorkflowFaultAsync(this, workflowContext, activityContext, ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot In case of an exception, the recursion counter is now decremented after OnWorkflowFaultAsync() is being called. Should that be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a0ec8df. The fault-path now clears the recursion state before calling OnWorkflowFaultAsync, and I added a regression test covering fault-handler reentry so the same workflow can be triggered again from the fault handler.

Co-authored-by: gvkries <7062785+gvkries@users.noreply.github.com>
Copilot AI changed the title Always unwind workflow recursion state after unexpected execution failures Always unwind workflow recursion state before fault handling and unexpected execution failures Mar 19, 2026
Copilot AI requested a review from gvkries March 19, 2026 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WorkflowManager recursion counter leaks when ExecuteWorkflowAsync throws outside the activity catch block

3 participants