
reinstate no-spawn option for remove #7134

Open
dwsutherland wants to merge 3 commits into cylc:master from dwsutherland:reinstate-remove-no-spawn

Conversation

@dwsutherland
Member

@dwsutherland dwsutherland commented Dec 10, 2025

At present there's no way to remove parentless tasks from the scheduler and not have them spawn their next instance.

This is not ideal. At ESNZ I've run into this problem many times when adding new tasks to a workflow: they get spawned in at the distant past (i.e. at the ICP), and then the only way to fix the workflow is to spawn them forward one cycle at a time to the desired cycle point.

Also, in general, perhaps you wish not to run an isolated task/branch for a gap and then introduce it again at some future cycle point.

This PR fixes the problem by reinstating the --no-spawn option for cylc remove.

Example:

[scheduling]
    initial cycle point = 20260101T0000Z
    [[graph]]
        P1D = """
            @wall_clock => a => b => c
            d
        """

[runtime]
    [[root]]
        script = sleep 5
    [[a,b,c,d]]

Running shows:
(screenshot)
Removing d normally ($ cylc remove couple/run1//20260106T0000Z/d) spawns it into the next cycle:
(screenshot)
With --no-spawn:
(screenshot)

And of course removing the sequentially spawned a (cylc remove --no-spawn couple/run1//20260101T0000Z/a), with this option, gives the expected result:
(screenshot)

Code changes were made in such a way that they should be conflict-free with #7132

Seems our forms are smart enough at the UI end:
(screenshot)

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX. (not needed I think, docs aren't that detailed about removal and options are explained in UI/CLI info)
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@dwsutherland dwsutherland added this to the 8.7.0 milestone Dec 10, 2025
@dwsutherland dwsutherland self-assigned this Dec 10, 2025
@dwsutherland dwsutherland added the "bug? (Not sure if this is a bug or not)" and "could be better (Not exactly a bug, but not ideal)" labels Dec 10, 2025
@dwsutherland dwsutherland force-pushed the reinstate-remove-no-spawn branch 2 times, most recently from 70fdee6 to 6481ee0, December 11, 2025 03:25
@hjoliver
Member

I've run into this problem many times while adding new tasks to a workflow and them being spawned in at the distant past (i.e. the ICP),

How did they get spawned back at the ICP - hopefully that does not happen automatically?

(Even if that was human error, it's still a problem that we need to be able to fix easily, but just wondering...)

@dwsutherland
Member Author

dwsutherland commented Dec 11, 2025

I've run into this problem many times while adding new tasks to a workflow and them being spawned in at the distant past (i.e. the ICP),

How did they get spawned back at the ICP - hopefully that does not happen automatically?

(Even if that was human error, it's still a problem that we need to be able to fix easily, but just wondering...)

If you re-run ICP (i.e. deployment) tasks..

But like you said, it might be a human error of inserting tasks in the wrong place and not being able to remove without spawning.

Also, as mentioned, you may want to remove these tasks for an interval then reintroduce

@hjoliver
Member

If you re-run ICP (i.e. deployment) tasks..

I was going to say unwanted flow-on from retriggering ICP tasks should not be a problem now, since group trigger - now we can re-run in the same flow, and flow-on will be blocked by the prior history ... but adding new tasks (which by definition have no history back in the graph) does bring the flow-on problem back!

@dwsutherland
Member Author

If you re-run ICP (i.e. deployment) tasks..

I was going to say unwanted flow-on from retriggering ICP tasks should not be a problem now, since group trigger - now we can re-run in the same flow, and flow-on will be blocked by the prior history ... but adding new tasks (which by definition have no history back in the graph) does bring the flow-on problem back!

This will also not be a problem when we relegate deployment tasks to a bespoke startup/whatever graph section 😉

@dwsutherland dwsutherland force-pushed the reinstate-remove-no-spawn branch from 6481ee0 to 5655542, December 19, 2025 01:33
@MetRonnie MetRonnie added the "schema change (Change to the Cylc GraphQL schema)" label Jan 6, 2026
@oliver-sanders
Member

Also, as mentioned, you may want to remove these tasks for an interval then reintroduce

FYI, skip mode is probably the best solution for this use case:

  • The task can be broadcasted on/off (no need for trigger/remove).
  • Skip mode tasks get a special icon in the GUI and can be filtered in views.
  • Task spawning is never interrupted, no need for users to understand this.
  • The task cannot be brought back by accident, e.g. by triggering a family or cycle which contains the task.

@dwsutherland
Member Author

dwsutherland commented Feb 18, 2026

FYI, skip mode is probably the best solution for this use case

Yes it does look promising...

The task cannot be brought back by accident, e.g. by triggering a family or cycle which contains the task.

However, on the flip side, you may need to put all downstream tasks in skip mode too, right?
If so, you would have to know the entire branches and/or what families to include/exclude in the broadcast.

Also, the gap might be so large that you don't want these tasks running (even in skip mode)..
Another thing, what if they are xtriggered tasks and the upstream workflow didn't run in that gap (...etc).

Might be better to have both options.

@oliver-sanders
Member

oliver-sanders commented Feb 24, 2026

However, the flip side, you may need to put all downstream tasks in skip mode too right?
If so, you would have to know the entire branches and/or what families to include/exclude in the broadcast.

Yes, however, the flip side of the flip side is that the remove solution requires you to know enough about the graph structure to correctly identify the task(s) which require removal, and enough about SoD behaviour to determine the point at which those tasks need to be removed (in some cases tasks may need to be removed multiple times due to subsequent outputs causing them to re-spawn).

But anyway....

Another option might be to set the tasks. This would be nice and easy, but at present we don't have a syntax for "set all instances of X before a certain cycle", e.g. cylc set 'my-workflow//<2026/foo'. Though this might be something we end up looking into for #1159...

@oliver-sanders
Member

As it stands, the cylc remove command is a bit odd, it can do one of three things:

  1. Remove a single instance of a task.
  2. Temporarily remove a single instance of a task (if it gets re-spawned as the result of another output).
  3. Remove all instances of all tasks (i.e, stop the workflow) or lop off an entire branch of the workflow.

All depending on the exact details of the graph structure and the precise state of the task pool at the time the command was issued. The user can't really tell what will happen without a lot of graph understanding and SoD knowledge which is a shame.

The new option might be a little bit confusing (i.e. it only applies to parentless / sequential-xtriggered tasks), but the status quo is a bit confusing, so whatever.

@hjoliver, thoughts?

@oliver-sanders
Member

Things to test for this PR:

  • Reload - make sure to preserve the no-spawn attr on the reloaded instance.
  • Restart - I think at present this feature breaks if the workflow is restarted as the no-spawn attr is not backed up in the DB?
  • Graph change - CC orphans: remove orphaned tasks on reload/restart #7209 - need to make sure removal continues to function if the task is orphaned (one for that PR).
  • Subsequent task triggering?

@hjoliver
Member

hjoliver commented Feb 25, 2026

@hjoliver, thoughts?

Skip mode is ideal:

  • if the number of tasks (and/or cycles) to be skipped is not too large
  • and if you want to resume live mode after the gap

But otherwise we probably need the --no-spawn option to sort of cut-off the self-spawning (parentless) graph:

  • if there are too many tasks (and/or cycles) to target for skipping
  • or you don't want to resume live mode after a gap

(Note this could be needed to recover after accidentally triggering a new flow with parentless tasks too, in which case cut-off is unequivocally required, not skip).


Generally....

  1. These days I am on board with the idea that it's best if things just work without users having to understand their graph, if/when that's possible. However:
  • we may still need some lower-level commands or options for expert use in some situations, and relatedly:
  • it might be the case that simpler "just work" commands have some failings because they inevitably have to assume some things about what the user wants
  2. Users should actually be able to understand what we call "SOD" easily, as simply how a workflow naturally should work: if a task generates outputs, downstream tasks that depend on those outputs will be spawned just as the graph says; end of story.

Unfortunately, the way we spawn parentless tasks is a pragmatic implementation caveat, not fundamentally a feature of the SOD concept, and it is harder to understand. I've commented on this since the beginning of SOD planning: really, parentless tasks should be spawned by the clock, or by an xtrigger, or by some spawner object (if neither clock- nor xtriggered). But it was easier to get it working by having each parentless task spawn its own next instance to wait on non-task prerequisites - which is really a relic of SOS, not SOD. And that is probably not easy to manipulate without "low level" commands.
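For concreteness, the self-spawning behaviour described here can be sketched with a minimal workflow (hypothetical task names, not taken from this PR):

```ini
# Minimal sketch (hypothetical task names "prep" and "build").
# "prep" has no task parents: its only upstream dependency is the
# sequential @wall_clock xtrigger, so under the current implementation
# each instance of "prep" spawns its own next-cycle instance to wait
# on the clock - the mechanism being discussed above.
[scheduling]
    initial cycle point = 20260101T0000Z
    [[graph]]
        P1D = @wall_clock => prep => build
[runtime]
    [[prep, build]]
        script = true
```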

@dwsutherland
Member Author

Yes, however, the flip side of the flip side is that the remove solution requires you to know enough about the graph structure

This:

we may still need some lower-level commands or options for expert use in some situations

Also, it's not an option that slaps users in the face.. it's there if needed.

Another option might be to set the tasks

Problem here is you run into the same skip-broadcast issue; you may inadvertently trigger a cascade of downstream tasks to run/spawn..
This is one main reason why the --no-spawn option is useful..

@oliver-sanders
Member

oliver-sanders commented Feb 25, 2026

To be clear, not arguing against the use case here, and we do need some kind of a solution to enable the removal of orphan tasks (#7209).

This approach is absolutely an option, but:

  • It might not work reliably:
    • E.g, It's not preserved across restarts / reloads, we can handle this of course.
    • But it's going to introduce a bunch more interactions for us to test, e.g, how does this interact with set, trigger, plain-remove, plain-restart, auto-restart, graph-changes, etc.
  • It will most likely only ever be used by one person!
    • It's really hard to explain what cylc remove does in the first place
      • It's completely contextual, so its behaviours can't be easily predicted ahead of time.
      • Future instances of the task may or may not be spawned according to graph structure, the precise timing of the command, and any outputs received whilst the command is being actioned.
      • In the extreme, removing just one task could cause an entire infinitely-cycling workflow to shut down!
    • It's really really hard to explain the situations in which this new option would be needed in the help text.
      • This involves explaining parentless and sequentially-xtriggered tasks.
      • As well as explaining that sequential xtriggers may be implicit, e.g, as in the case of @wall_clock.
    • It's really really really hard for an operator to determine whether an incoming workflow configuration change is such a situation:
      • They need to diff the before/after graphs, then identify all tasks which are parentless OR triggered by a sequential xtrigger.
      • Edge case, these tasks might not be parentless or sequentially-xtriggered in all cycles.
    • It's really really really really hard for an operator to compose an intervention to mitigate against this.
      • They need to identify the link tasks which join the old tasks onto the new ones.
      • Note, these dependencies may differ from cycle to cycle!
    • Odds are that there is only one operator in the world with the insight required to use this feature.
      • Impressive skills, but I feel there should be a less involved way to achieve this!

So we wouldn't be able to publish generic instructions for this intervention and there are a bunch of caveats. Hence trying to explore whether there are other options here.

In the SoS days, the remove option had spawn/no-spawn modes which worked consistently in all scenarios, whereas, as it stands here, the absence of --no-spawn does not imply that future instances of the task will be spawned (status quo remove behaviour), nor does its presence actually ensure that future instances don't spawn in general - only that parentless/sequential-xtrigger spawning does not propagate forwards from this particular task instance.


Another option might be to set the tasks

Problem here is you run into the same skip-broadcast issue; you may inadvertently trigger a cascade of downstream tasks to run/spawn..

This plays both ways, with the remove approach, you may inadvertently trigger a cascade of downstream tasks if:

  • You fail to identify [all of] the correct task(s) to remove.
  • You fail to use the --no-spawn option.

With the set approach, you may inadvertently trigger a cascade of downstream tasks if:

  • You fail to identify [all of] the correct task(s) to set.

The set approach would be a bit harder to get wrong because we can automatically generate the list of newly added tasks (and already log this for reloads), whereas we cannot automatically determine the correct task(s) to remove (which may be different from one cycle to another).

Moreover, with the remove approach, you may create the conditions for a workflow stall due to missing task outputs, whereas the set and skip approaches ensure that this cannot happen.

@dwsutherland
Member Author

dwsutherland commented Feb 25, 2026

This plays both ways, with the remove approach, you may inadvertently trigger a cascade of downstream tasks if:

  • You fail to identify [all of] the correct task(s) to remove.

There's no graph trigger off the removal of waiting tasks that would cause downstream to run (aside from runahead release, and spawning the next instance), there is a trigger for succeeded (the far more common occurrence) i.e. set/skip..

  • You fail to use the --no-spawn option.

Not using --no-spawn is the default (i.e. the default behaviour is what's expected, so if they intended something different, that's on them).. so I don't see the problem here.

The set approach would be a bit harder to get wrong because we can automatically generate the list of newly added tasks

No, it's much more likely downstream would start running... There is only one downstream that will spawn and potentially run with removal (the next instance), whereas with set you could have any number of direct dependents spawn and/or kick off (including the next instance!)..

It will most likely only ever be used by one person!

The whole ESNZ operations team have run into this problem, and would like the --no-spawn option.
It's not uncommon for us due to the way our operation is set up (60 workflows with hundreds of sequential-xtrigger tasks).

@hjoliver
Member

hjoliver commented Feb 26, 2026

I can see both sides of this to some extent, but I'm not sure I entirely understand your points @oliver-sanders - e.g.:

It will most likely only ever be used by one person!

Surely that's not true?

If it is possible - even by accident - to trigger an unwanted flow that spawns onward via parentless tasks, then we need a way to recover from that situation. Isn't "remove without spawning" the easiest way to do that? (If there really is a more intuitive way, fine, but I'm not convinced yet).

As @dwsutherland notes, this won't be the default, it is just a command option with an obscure sounding name "no-spawn" that should deter casual use, and which we can document the caveats and dangers of: "don't use this low level command option if you don't understand it!".

Would it help to construct a dummy workflow that clearly illustrates the scenario we hope to address here, so we can consider the various options in a more concrete context?

@dwsutherland
Member Author

Would it help to construct a dummy workflow...

A basic example would be to create a gap, i.e. remove without spawning at one cycle and reintroduce at some future cycle..
The tests here just about do that; they would just need a reintroduction (cylc set --pre=all . . .)..
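As a sketch of that gap-then-reintroduce sequence, reusing the couple/run1 example from the PR description (the reintroduction cycle point 20260110T0000Z is just an illustrative choice, not from this PR):

```console
# Cut the self-spawning chain: remove this instance of d without
# spawning its successor (the option added by this PR).
cylc remove --no-spawn 'couple/run1//20260106T0000Z/d'

# Later, reintroduce the task at the desired future cycle point by
# artificially satisfying all of its prerequisites.
cylc set --pre=all 'couple/run1//20260110T0000Z/d'
```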

@oliver-sanders
Member

oliver-sanders commented Feb 26, 2026

Another option might be to set the tasks

Problem here is you run into the same skip-broadcast issue ...

This plays both ways, with the remove approach, you may inadvertently trigger a cascade of downstream tasks if: ...

The set approach would be a bit harder to get wrong because we can ...

No, it's much more likely downstream would start running... There is only one downstream that will spawn and potentially run with removal (the next instance), whereas with set you could have any number of direct dependents spawn and/or kick off (including the next instance!)..

@dwsutherland, you've misunderstood me, the set approach would be safer here.

If we set ALL of the tasks between the initial-cycle-point and the runahead-cycle-point then it would be impossible for ANY of these tasks to run (the DB would prohibit it).




If it is possible - even by accident - to trigger an unwanted flow that spawns onward via parentless tasks, then we need a way to recover from that situation. Isn't "remove without spawning" the easiest way to do that? (If there really is a more intuitive way, fine, but I'm not convinced yet).

Agreed (that we need a solution) of course (and I have suggested two alternatives above), but I think you may have skipped past the issues with remove as a solution for this, I'll elaborate below...

It will most likely only ever be used by one person!

Surely that's not true?

If I thought that I wouldn't have written it!




Would it help to construct a dummy workflow that clearly illustrates the scenario we hope to address here, so we can consider the various options in a more concrete context?

Ok, here's a (purposefully vague) scenario:

  • We've performed a graph change which added the tasks a, b and c.
  • The change was made when the oldest cycle in the workflow was 2026, we only want the new tasks to run from 2027 onwards.

And here's how we would handle it with the three different approaches discussed so far...

Set

Mark all previous instances of the tasks between the ICP and 2026 (inclusive) as succeeded:

cylc set //<2027/{a,b,c} - (theoretical cycle selection syntax)

Succeeded outputs for each task are written to the DB, any tasks spawned will be instantly completed and removed from the pool:

  • The tasks can not be run accidentally (the DB will prevent it).
  • The workflow will not stall if future cycles depend on these tasks.
  • Any inter-workflow triggers from other workflows will not hang.
  • The setting of these tasks will be remembered, so if you perform a future graph change and re-trigger of the R1 tasks, you will not need to re-set these tasks again.

Skip

Tell all the newly added tasks to skip until 2027:

cylc broadcast --expire=2027 -n a -n b -n c -s 'run mode = skip'

Tasks will be configured to skip, succeeded outputs will be written to the DB when they do.

  • It may take a while for the workflow to chew through the simulated tasks, after which, it's all as above (for "set"):
  • The tasks can not be run accidentally (the DB will prevent it).
  • The workflow will not stall if future cycles depend on these tasks.
  • Any inter-workflow triggers from other workflows will not hang.
  • The setting of these tasks will be remembered, so if you perform a future graph change and re-trigger of the R1 tasks, you will not need to re-skip these tasks again.

Remove

Remove the first instances of the tasks ensuring that they don't spawn their successors:

cylc remove '<2027/{a,b,c}' --no-spawn - (theoretical cycle selection syntax)

The tasks will be removed from the pool:

  • This will prevent the tasks from running.
  • But there will be no outputs in the DB, so there is nothing stopping these tasks from running at some point in the future.
  • Any inter-workflow triggers will hang.
  • If the remove command is run before all of the R1 tasks have completed, then downstream tasks may be re-inserted requiring re-removal.
    • But if the remove command isn't run earlier than this some of them might run, so you may require an additional hold or pause operation here.
  • If the operator forgets to use the --no-spawn option (or doesn't realise that they need to), then the tasks will need to be re-removed again.
  • And these tasks will need to be re-removed for each subsequent re-triggering of the R1 tasks.
  • Note the <2027 syntax is required to ensure that we only remove the instances of the tasks that we don't want

Specific examples

Continuing with the same example (where the tasks a, b & c have been added)...

1: First instance of added tasks is not necessarily the ICP

R1 = install_x
+P1D/P1D = install_x => a
+P2D/P1D = install_x => b
+P3D/P1D = install_x => c

This means we need to remove the tasks from all (*) cycles (i.e cylc remove */{a,b,c} --no-spawn) to ensure we mop up all instances.

2: First instance of added tasks are not necessarily n=0 at the time of the remove command:

R1 = """
    install_a => install_b => install_c
    install_a => a
    install_b => b
    install_c => c
"""

With the remove solution, you would need to run the remove command three times:

  • After install_a succeeds
  • After install_b succeeds
  • And after install_c succeeds.

3: Not all previous cycles are necessarily inactive

P1Y = """
    x => y => z
    x => a
    y => b
    z => c
"""

Say when we perform the graph change, the workflow state is:

  • 2025:
    • z:running
  • 2026:
    • y:running
  • 2027:
    • x:running

Then the following tasks will be spawned when the upstreams complete (despite any previous attempts to remove them) and require subsequent re-removal:

  • 2025/a
  • 2025/b
  • 2025/c
  • 2026/b
  • 2026/c

Conclusions

The remove approach is an option, but it isn't the only option and it's far from perfect.

Set and skip are also options, they may also be imperfect, but have the robustness advantage of writing outputs into the DB which results in fewer caveats.

Ideally, we should be able to apply graph changes automatically without encountering these situations in the first place (see #7203). So we may potentially consider some sort of automation here, e.g, by pre-initial rules, Cylc could automatically set newly added tasks to succeeded in earlier cycles, completely avoiding the problem in the first place.

(added as a note in #7203)

@hjoliver
Member

hjoliver commented Feb 26, 2026

Thanks @oliver-sanders - I'll attempt to find the time to digest your write-up in detail later today.

[Update: sorry, failed due to user support, R20 meetings, and ongoing attempts to get access to all users' workflow logs - which I don't currently have because of security ... but I see David has continued the discussion below].

As a first comment, I was preferring the remove option as much simpler in terms of both implementation and command syntax (if perhaps not conceptually simpler, being a low level intervention) - you only need to target a single task.

cylc set //<2027/{a,b,c} - (theoretical cycle selection syntax)

Your other suggestions rely on selecting a potentially large number of tasks over many cycles and modifying all of those in the DB. Maybe that's OK in some cases though, with the right syntax (I like your suggested syntax there). But does it cover all cases, and easily?

@dwsutherland
Member Author

dwsutherland commented Feb 27, 2026

If we set ALL of the tasks between the initial-cycle-point and the runahead-cycle-point then it would be impossible for ANY of these tasks to run (the DB would prohibit it).

"If we set ALL of the tasks" is doing a bit of work here.. It assumes we want all tasks in those cycles to be succeeded, and that everything downstream (tasks/workflows) should also act as if the task(s) did the actual work.
Many workflows have multiple subgraphs in a single cycle, so you'd have to pick out the family/collection of tasks to set knowing the downstream consequences in detail..

If it is possible - even by accident - to trigger an unwanted flow that spawns onward via parentless tasks, then we need a way to recover from that situation. Isn't "remove without spawning" the easiest way to do that? (If there really is a more intuitive way, fine, but I'm not convinced yet).

It will most likely only ever be used by one person!

Surely that's not true?

If I thought that I wouldn't have written it!

Well, this is very common with our workflows, all of the operations team have run into this problem.. In fact, just this week I helped someone by using the skip method (an easy way to deal with a single parentless task) even though it took a while to churn through (which remove --no-spawn would avoid).

Again, as a consequence of our setup, many workflows have downstream workflows (~60 operational workflows). So these downstream workflows have many parentless xtriggered tasks.. And a large subset of these workflows are collaborative with research (so research polls our operations with workflows containing parentless xtriggered tasks)..
All this to say, it's very unlikely to be just one person.

This will be less of a problem when separate graph sections are in:
#7020

However, you can still imagine people wanting to create/avoid a gap without artificially setting/skipping everything in between.

  • We've performed a graph change which added the tasks a, b and c.
  • The change was made when the oldest cycle in the workflow was 2026, we only want the new tasks to run from 2027 onwards.

And here's how we would handle it with the three different approaches discussed so far...

Set

Mark all previous instances of the tasks between the ICP and 2026 (inclusive) as succeeded:

cylc set //<2027/{a,b,c} - (theoretical cycle selection syntax)

Succeeded outputs for each task are written to the DB, any tasks spawned will be instantly completed and removed from the pool:

  • The tasks can not be run accidentally (the DB will prevent it).

This is a simple example, in a real workflow you may have to set more than just {a,b,c}, you'll also have to do that in any downstream workflow task (sub-)graphs.

  • The workflow will not stall if future cycles depend on these tasks.
  • Any inter-workflow triggers from other workflows will not hang.

If these are tasks doing real things, and downstream tasks/workflows have genuine dependence on the things they did, then you may want them to stall/hang. There are complexities in the choice of tasks, what setting them to succeeded does, and the order in which you do it.

  • The setting of these tasks will be remembered, so if you perform a future graph change and re-trigger of the R1 tasks, you will not need to re-set these tasks again.

R1 can be set, and the next cycle task removed... And as mentioned above, it's not necessarily desirable to have fake succeeded tasks, so the same could be said about remembering them.

Skip

Tell all the newly added tasks to skip until 2027:

cylc broadcast --expire=2027 -n a -n b -n c -s 'run mode = skip'

Tasks will be configured to skip, succeeded outputs will be written to the DB when they do.

Again, way more complicated to select an entire subgraph in real workflows

  • It may take a while for the workflow to chew through the simulated tasks, after which, it's all as above (for "set"):
  • The tasks can not be run accidentally (the DB will prevent it).
  • The workflow will not stall if future cycles depend on these tasks.
  • Any inter-workflow triggers from other workflows will not hang.

Same as above, for real tasks, stall/hang might be a good thing for real downstream tasks/workflows.

  • The setting of these tasks will be remembered, so if you perform a future graph change and re-trigger of the R1 tasks, you will not need to re-skip these tasks again.

As with set.

Remove

Remove the first instances of the tasks ensuring that they don't spawn their successors:

cylc remove '<2027/{a,b,c}' --no-spawn - (theoretical cycle selection syntax)

If this is a => b => c, then you only need to remove a; and for sequential xtriggers, only the first instance if the xtrigger isn't satisfied:

cylc remove --no-spawn '2026/a'

(or set R1, remove the next)

The tasks will be removed from the pool:

  • This will prevent the tasks from running.
  • But there will be no outputs in the DB, so there is nothing stopping these tasks from running at some point in the future.
  • Any inter-workflow triggers will hang.

This may be desirable (as explained above).

  • If the remove command is run before all of the R1 tasks have completed, then downstream tasks may be re-inserted requiring re-removal.

Then set R1, remove --no-spawn the next instance.

  • But if the remove command isn't run earlier than this some of them might run, so you may require an additional hold or pause operation here.

Yes, sometimes this is required.

  • If the operator forgets to use the --no-spawn option (or doesn't realise that they need to), then the tasks will need to be re-removed again.

Yes, this is true, and the next instance popping up will remind you.. It could be arguably worse not to remember all the downstream tasks to set/skip.
This is not really an argument against, in my mind, people could forget anything.. But, if they intend to not spawn, the option is there.

  • And these tasks will need to be re-removed for each subsequent re-triggering of the R1 tasks.

As above, set R1 and then remove-no-spawn the next instance(s)..
It would still be easier than looking up/remembering all the downstream tasks/cycles to include in your set, because the one to remove is in view.

Specific examples

Continuing with the same example (where the tasks a, b & c have been added)...

The remove --no-spawn is usually applicable to parentless tasks, most of the continued examples have parents..
The R1 example would be applicable, but you could set those and remove the next instances while the workflow is paused (in many cases)..
Downstream workflows with parentless tasks polling the removed task will also need to be dealt with...

Conclusions

The remove approach is an option, but it isn't the only option and it's far from perfect.

Set and skip are also options, they may also be imperfect, but have the robustness advantage of writing outputs into the DB which results in fewer caveats.

Neither is perfect, yes, but both have their place, in my humble opinion.

Ideally, we should be able to apply graph changes automatically without encountering these situations in the first place (see #7203). So we may potentially consider some sort of automation here, e.g., by pre-initial rules, Cylc could automatically set newly added tasks to succeeded in earlier cycles, completely avoiding the problem in the first place.

As mentioned, fake succeeded tasks may not always be desirable; this would have to be specified by the user.

Ideally, I do agree that there should be a better design/option for dealing with gaps, so as not to expose users to these "low-level" internal workings.
However, at present, users are still exposed to a spawning mechanism, and this small change introduces an intervention that will help in some situations better than other interventions.

@hjoliver
Member

(We might need two approaches here: low-level remove --no-spawn is quick and easy and would help our ops team now; then we can consider the gap-filling approaches as well, with less urgency.)

@oliver-sanders
Member

oliver-sanders commented Feb 27, 2026

@hjoliver, to save time, skip my responses above, I've tried to provide a quick summary here.

@dwsutherland, just to clarify, you don't need to defend your desired solution here (it may still have value), I'm just trying to get a handle on the problem as there are several edge cases and scenarios listed here which we do not have a solution to.

@dwsutherland, I'm guessing there are no arguments with these points:

  1. The remove operation needs to be performed on the whole subgraph in the general case in order to reliably remove these tasks:
    • Only in the edge case that all newly added tasks are in a single chain, e.g. a => b => c, can this be reduced to a single task, providing of course that the operator has the skill to diff the graphs and determine this.
    • You do, which is great! But some deployments are maintained by people who aren't connected with workflow development and aren't necessarily aware of incoming graph changes, or necessarily even the workflow's graph structure in general.
  2. The remove operation may need to be performed multiple times:
    • Because newly-added tasks don't necessarily all spawn in the instant you trigger the R1 tasks, and may appear at different times
    • (Specific examples 2).
  3. Remove can only mop up tasks after they have spawned (and potentially after they have submitted / started).
    • I.e., it can only be reactive after the problem occurs, not proactive to prevent it happening in the first place.
  4. There are difficulties determining the first cycle in which the new tasks should be added (Specific examples 3).
  5. The automated handling of graph changes would require some automatic handling of this situation.
  6. The same tasks will have to be re-removed with any subsequent re-running of the R1 tasks.




@dwsutherland, @hjoliver: To demonstrate some of these problems with an example derived from the above:

flow.cylc
[scheduler]
    allow implicit tasks = True

[scheduling]
    cycling mode = integer
    initial cycle point = 1
    final cycle point = 5
    [[graph]]
        R1 = """
            install_x => install_y => install_z
        """
        P1 = """
            install_z[^] => x => y => z
            z[-P1] => x
        """
        R1/2 = """
            x => remover
        """
        R1/$ = """
            z => stop
        """

        # the graph change:
        ## +P1/P1 = """
        ##     install_x[^] => a
        ## """
        ## +P2/P1 = """
        ##     install_y[^] => b
        ## """
        ## +P3/P1 = """
        ##     install_z[^] => c
        ##     a & b => c
        ##     x => a
        ##     z => c
        ## """
        #
        # dependency to make reporting easier:
        ## R1/$ = """
        ##     c => stop
        ## """

[runtime]
    [[root]]
        script = sleep 5
    [[INSTALL]]
    [[install_x, install_y, install_z]]
        inherit = INSTALL
    [[y]]
        script = """
            if [[ $CYLC_TASK_CYCLE_POINT -eq 3 ]]; then
                sed -i 's/##//' "${CYLC_WORKFLOW_RUN_DIR}/flow.cylc"
                cylc reload "${CYLC_WORKFLOW_ID}"
                cylc trigger "${CYLC_WORKFLOW_ID}//1/INSTALL"

                # we want the tasks a, b & c to run from cycle 4 onwards
                cylc set --pre=all "${CYLC_WORKFLOW_ID}//4/[ab]"
            fi
        """

    [[remover]]
        script = """
            while true; do
                cylc remove "${CYLC_WORKFLOW_ID}//[123]/[abc]" --no-spawn
                sleep 3
            done &
            remover_pid="$!"
            for i in $(seq 1 10); do
                echo $i
                cylc__job__poll_grep_workflow_log '5/x.*succeeded' || true
            done
            kill "$remover_pid" || true
        """

    [[stop]]
        script = """
            echo "$(cylc cat "${CYLC_WORKFLOW_ID}" | grep 'Removed tasks: [123]/[abc]' | wc -l) 'cylc remove's required:"
            cylc cat "${CYLC_WORKFLOW_ID}" | grep 'Removed tasks: [123]/[abc]'
            echo
            echo
            echo "$(cylc cat "${CYLC_WORKFLOW_ID}" | grep -o '\([123]/[abc]/01\)' | sort | uniq | wc -l) unintended tasks became active:"
            cylc cat "${CYLC_WORKFLOW_ID}" | grep '\([123]/[abc]/01\)'
        """

This workflow:

  • Reloads itself to add the tasks "a", "b" and "c".
  • Kicks off a "remover" task which spams the cylc remove command every 3 seconds to get rid of the unwanted tasks.
  • Reports on the success of this in the "stop" task at the end of the workflow.

When I run the example (on this branch):

  • The cylc remove --no-spawn command must be run three times to mop up the added tasks.
  • Two unintended tasks became active and have to be killed!

The exact outcomes are highly timing-dependent (due to the unpredictable behaviour of cylc remove), so you'll get different numbers with each run. Obviously the example could be contrived to exacerbate this as desired; I haven't tried fiddling the numbers yet.

Whereas if we replace the "remove" with "set":

  • The cylc set command needs only be run once.
  • Zero unintended tasks become active and need to be killed.

My points are:

  1. The caveats attached to the remove approach are numerous and somewhat severe, e.g:
    • Unintended stalls.
    • Multiple removals required.
    • Unintended active tasks may occur.
    • High knowledge requirement (of both graph structure, cylc remove mechanics and the special --no-spawn option).
    • (And the small matter that this feature is not yet reload/restart safe).
  2. Providing generic documentation to operators explaining how to handle this isn't really possible:
    • How do they know when they need to use the --no-spawn option?
    • How do we explain this to people who don't know the graph structure of the workflow?
    • I can't reasonably tell them to just "spam" the cylc remove command for a few cycles!
  3. There are alternatives with fewer caveats, we do need to consider them!
    • I haven't exhaustively listed options here, I've just dropped in a couple of quick suggestions!
    • Don't just brush off the alternatives straight away!
    • IMO, we do need to consider safe automation of graph changes ("reinstall-reload: safer automated reloading of workflows", #7203). This is not so important in your environment, as it has full-time operators with an understanding of the workflow, but this isn't always the case.
  4. This problem isn't strictly coupled to R1 tasks (though they are a really good way to demonstrate the issue).

@hjoliver
Member

hjoliver commented Mar 2, 2026

Thanks for the clear example @oliver-sanders. Your argument makes sense for that example, but I think it has several properties that make use of remove harder than we'd encounter in real life:

  • the added tasks spawn into multiple future cycles (that's possible, but I've never seen it in the wild)
  • they run very quickly (which greatly increases the odds of "unintended" tasks spawning - not that they're that hard to kill after the fact)

Further, I don't think you've responded to some of our comments above, such as:

  • what if the number of cycles to set tasks in is very large (or even infinite, with a new flow say)?
  • what if there are a large number of these tasks? (If you don't set them ALL to succeeded, many "unintended" tasks will spawn off of the set operation.)

Surely in those scenarios, remove --no-spawn of one (or a few) parentless tasks at the top of a single cycle would be by far the easiest intervention - in effect, in one go we simply stop the flow from reaching the many downstream tasks and intermediate cycles.

It seems to me that we probably need both approaches, perhaps along with documenting (warning) that remove --no-spawn is a low level intervention that requires a somewhat deeper knowledge of Cylc.

I think David is going to post a more detailed response based on his experience with our operational workflows again.

The caveats attached to the remove approach are numerous and somewhat severe,...

I don't disagree with your caveats in general, but I think most aren't really a problem for our real-world use cases, and further it's OK to have some powerful lower-level commands that are useful in some common situations even if more generic use has caveats and requires more from the user.

Providing generic documentation to operators explaining how to handle this isn't really possible ...

I don't think it's that hard. I would use the simpler real examples of David's type and warn that you really have to know what you're doing if you stray far from that simplicity.

There are alternatives with fewer caveats, we do need to consider them!

Yep, I'm not dismissing those - I just think we need this as well.

[UPDATE] and David has made a good case below that remove IS by far the easiest approach for our ops issues.

@dwsutherland
Member Author

dwsutherland commented Mar 2, 2026

@dwsutherland, just to clarify, you don't need to defend your desired solution here..

I promise I'm only fighting for it because I see remove as desirable on a practical level, as a solution with fewer complications in some scenarios... I'm completely open to not using it if I'm convinced another solution beats it on all fronts.

@dwsutherland, I'm guessing there are no arguments with these points:

  1. The remove operation needs to be performed on the whole subgraph in the general case in order to reliably remove these tasks:

    • Only in the edge case that all newly added tasks are in a single chain, e.g. a => b => c, can this be reduced to a single task, providing of course that the operator has the skill to diff the graphs and determine this.
    • You do, which is great! But some deployments are maintained by people who aren't connected with workflow development and aren't necessarily aware of incoming graph changes, or necessarily even the workflow's graph structure in general.

It's actually the opposite (believe it or not):

  • I use remove --no-spawn on the parentless tasks at the beginning of the subgraph, in order both to avoid automatic spawning of parentless tasks and so I do not have to operate on the "whole subgraph" (in your example, x isn't parentless).
  • In this way, the edge case (a => b => c) is actually the usual case for its employment, and common at ESNZ because of our modular workflow setup.
  • It makes it easier for those with little knowledge because, as the first parentless task(s) is visible, only action on the first task is required.

set/skip, on the other hand, will require knowledge of an entire subgraph if you can't target entire cycle points (and even then there may be off-cycle dependents).

  2. The remove operation may need to be performed multiple times:

    • Because newly-added tasks don't necessarily all spawn in the instant you trigger the R1 tasks, and may appear at different times
    • (Specific examples 2).

I would use set with your example, and remove only on parentless tasks (which x isn't) for reason described above.

  3. Remove can only mop up tasks after they have spawned (and potentially after they have submitted / started).

    • I.E, it can only be reactive as the problem occurs, not precative to prevent it happening in the first place.

True. Usually I would pause the workflow to avoid this.

  4. There are difficulties determining the first cycle in which the new tasks should be added (Specific examples 3).

So set/skip everything before in order to have the task spawn in after the gap? Yeah, this would work, however the user will need to weigh up the pros/cons.

  5. The automated handling of graph changes would require some automatic handling of this situation.

Not sure I follow... I guess you're saying manual interventions should be avoided in the first place? I agree.

  6. The same tasks will have to be re-removed with any subsequent re-running of the R1 tasks.

I would set the first downstream R1 task(s), in this case, and remove any downstream and next instances spawned.
This would have to outweigh the trade-offs of using set here and for future cycles.

@dwsutherland, @hjoliver: To demonstrate some of these problems with an example derived from the above:

I will provide an example below, in line with what we encounter in our operational workflows and my responses above.

My points are:

  1. The caveats attached to the remove approach are numerous and somewhat severe, e.g:

    • Unintended stalls.

Better than unintended downstream running (of tasks or other workflows)?
Stalls/hung tasks would be preferable in many situations.

  • Multiple removals required.

People have to use their judgement. In the situations where I would use remove, it wouldn't require many, and it would target fewer cycle points than alternative actions.

  • Unintended active tasks may occur.

I would say set/skip has a higher risk...

  • High knowledge requirement (of both graph structure, cylc remove mechanics and the special --no-spawn option).

Disagree (as explained); it's the opposite. In the situations where I use remove, it avoids needing to know the downstream/graph structure, and the tasks to be removed are already visually present.

  • (And the small matter that this feature is not yet reload/restart safe).

Oh?

. . .

I think the remaining points (about caveats and documentation) are not entirely applicable, given my response to your remove framing and example.
I'll move on to examples of where I'd employ remove that are close matches to our operations.




Examples

  1. Simple:
flow.cylc
[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20260101T0000Z
    [[xtriggers]]
        xtrig_a = . . .
        xtrig_x = . . .
    [[graph]]
        R1 = """
            install_other
            install_x => install_y => install_z
            install_z & install_other => install_done
        """
        PT10M = """
            install_done[^] => x => y => z?
            @xtrig_x => x
            install_done[^] => a => b => c?
            @xtrig_a => a
            c | z => d
        """
[runtime]
    [[root]]
        script = sleep 5

When removing x:

  • You only need one command, and only one task removed if sequential xtriggers = True.

  • You don't need any downstream knowledge (aside from where you want to add it in again), and in this example you don't need to know what to do with z WRT c | z => d.

  • You can create a gap, and a gap with undetermined end point (i.e. re-introduce at some point in the future, without changing the definition).

  • Have no fake succeeded tasks, so can reintroduce x anywhere in the gap if desired in the future without flow complications.

  • Can combine with set to handle R1 or accidental spawn in.

  • Stalls downstream workflows that are polling the gap, so they don't just trigger off (so you can use remove with them too)... (actually a good thing!)
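As a sketch, the whole intervention here (workflow ID and cycle points hypothetical):

```
# open the gap: remove x without spawning it into the next cycle
cylc remove --no-spawn "myflow//20260106T0000Z/x"

# later, close the gap by re-introducing x at a chosen cycle point
cylc trigger "myflow//20260110T0000Z/x"
```

(cylc trigger bypasses x's xtrigger; as in the worked example further up, cylc set --pre=all is another way to re-introduce a task at a chosen cycle.)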

Compare this to set/skip:

  • You need to know the whole graph, and what do you do with z WRT c | z => d?... (i.e. what if c hasn't run yet, and could fail)

  • You cannot handle/create gaps, or uncertain futures.

  • Need to use flows to re-run over the pseudo run of the previous flow, which may have consequences.

  • Kicks off downstream workflows (if not operated on first), and because the upstream tasks didn't actually run, may cause them to fail.

  2. More complex

Let NDown be a hypothetical collection of downstream tasks and workflows that are so numerous and with dependencies so convoluted that mere mortals cannot fathom it.

flow.cylc
[scheduler]
    allow implicit tasks = True
[scheduling]
    initial cycle point = 20260101T0000Z
    [[xtriggers]]
        xtrig_a = . . .
        xtrig_x = . . .
    [[graph]]
        R1 = """
            install_other
            install_x => install_y => install_z
            install_z & install_other => install_done
        """
        PT10M = """
            install_done[^] => x => <NDown_x>
            @xtrig_x => x
            install_done[^] => a => <NDown_a>
            @xtrig_a => a

        """
[runtime]
    [[root]]
        script = sleep 5

There's actually no way to handle this with set/skip.
Remove x with --no-spawn, and we're done.
You could say "well, what if x already kicked off"; then you'd be in no worse a position to set/skip with that/those cycle(s).




Thoughts:
A large number of workflows at ESNZ are structured this way, with some having hundreds of parentless xtriggered tasks (i.e. ingestion and delivery workflows)...

And often these scenarios will occur with the addition of a new sub-graph on reload and a rerun of R1 tasks; however, for the purposes of handling/creating gaps it's probably still justified.

We also have our DR system that, at the moment, spins up workflows with off-ICP start tasks (which introduces a gap). If such a workflow gets restarted, then tasks are spawned on rerun of the ICP. This problem can be alleviated with isolated graphs:
#7020
However, until then, remove --no-spawn would be very useful in managing this.

@oliver-sanders - I will employ both set/skip and remove --no-spawn (if/when available).
If you want to convince me to not have remove --no-spawn as an option, you'll need to show me that there is a more practical way to handle these examples.

@oliver-sanders

This comment was marked as resolved.

@oliver-sanders
Member

oliver-sanders commented Mar 11, 2026

Caveats to the remove approach

To conclude this lengthy debate, several caveats to the remove approach have been identified.

The following major points have been conceded (by acknowledgement or lack of counter):

  1. The remove operation needs to be performed on the whole subgraph in the general case in order to reliably remove these tasks
    • But not in your special case where all added tasks are in a single chain.
  2. The remove operation may need to be performed multiple times
    • In the likely event that not all downstream tasks spawn simultaneously.
  3. Remove can only mop up tasks after they have spawned (and potentially after they have submitted / started).
    • Note that your approach of pausing the workflow to avoid this will only work if there is no inter-dependence between the R1 tasks, otherwise the workflow will have to be resumed to allow them to run.
  4. The same tasks will have to be re-removed with any subsequent re-running of the R1 tasks.
    • Really surprised you're not concerned about this! From your description, this applies to your use case!

Additionally, this point was not conceded but was demonstrated in a worked example:

  1. Unintended active tasks may occur.
    • The logical result of (3) above, proven in the example.

I'm not going to spend time arguing out the remaining points. The key thing is that we have acknowledged the approach has caveats which make it challenging for the general case, and even some caveats which apply to your specific use case.

@oliver-sanders
Member

oliver-sanders commented Mar 11, 2026

General case solution

If you want to convince me to not have remove --no-spawn as an option,

I'm not necessarily trying to convince you not to have remove --no-spawn as an option (read back: I noted that this is actually necessary, at least internally, for #7198). I'm pointing out that it's not a great solution and doesn't work for all cases; there is more likely than not a better resolution to this scenario.


you'll need to show me that there is a more practical way to handle these examples.

Great, let's talk alternatives!

Outline of the scenario:

  • We have run a bunch of cycles.
  • We have added some tasks, and reloaded/restarted the workflow.
  • We have triggered an explicit selection of tasks somewhere in the workflow's history.
  • We expected only the explicitly triggered tasks to spawn/run, but some other tasks got spawned too.
  • (note that this problem is not coupled to R1 tasks, though this is a pressing use case at present)

Outline of the problem:

  • When we trigger a historical task, it may cause newly added tasks to spawn and either run (if dependencies are satisfied) or force a workflow-stall depending on the specifics of the graph change.

While the status-quo behaviour is "logical" by SoD, it's rather illogical from a user perspective. It can certainly be argued that this is a bug (I lean towards the position that it is, more on that below).

Irrespective of whether we accept this as a bug or not, there are two possible ways for Cylc to handle this situation:

  1. Spawn these tasks.
  2. Don't spawn these tasks.

We have to choose one of these as the default behaviour, currently, we choose (1), however, I would argue that:

  • (2) is a more logical default.
  • Not spawning these tasks is the most likely user intention.
  • The inverse intervention (i.e, to spawn these tasks) is easier and less caveat-prone this way around.
More:

Yes, we could provide tooling and documentation to help operators get out of this situation, but as outlined above, it's very difficult to document this in the general case and the intervention can get rather messy depending on the specifics of the graph.

But it would be much easier to prevent this from happening in the first place. We can always provide a mechanism for running these new-old tasks if anyone were ever to want this.


I did actually outline a mechanism in the above which would achieve this:

e.g, by pre-initial rules, Cylc could automatically set newly added tasks to succeeded in earlier cycles, completely avoiding the problem in the first place.

I.e, all historical instances of newly added tasks would be automatically marked as completed:

  • Historical instances of newly added tasks would not spawn so cannot become active (unless explicitly triggered).
  • Later cycles will not stall due to unsatisfied prereqs (as per standard pre-initial logic).
  • The same applies to inter-workflow triggers.

You can still remove/trigger/set these tasks to cater to any exotic use case, but the default behaviour is much more manageable.
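For reference, a minimal sketch of the existing pre-initial behaviour that this would extend (integer cycling for brevity):

```
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    [[graph]]
        P1 = "a[-P1] => a"
```

At cycle 1, the prerequisite a[-P1] points at 0/a, which falls before the initial cycle point, so it is treated as satisfied and 1/a can run. The proposal applies the same reasoning to newly added tasks: their instances before the reload point would be considered complete rather than left to spawn.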


This problem is actually very similar to the issue of pre-start-cycle-point tasks in the warm start scenario (#7178), an issue which is impacting our operational workflows at the moment:

  • With warm starts, all tasks are newly added, we presume that all historical task instances (tasks before the warm-start point) have completed. This is the so-called "pre-initial" logic.
  • With this scenario, it is arguable that we should consider all historical task instances (tasks before the reload-point) as completed.

The "reload-point" becomes the oldest active cycle point, which is consistent with:

As such, another option would be to set the start-cycle-point for all newly added tasks to the oldest active cycle point which would provide nice consistency.


So there's two simple alternatives:

  • Consider historical instances of newly-added tasks to be completed.
  • Set the start cycle point of the recurrences of newly-added tasks to the oldest active cycle point.

Both of these approaches would completely resolve this situation, removing the need for the operator to run the cylc remove --no-spawn command.

@dwsutherland
Member Author

dwsutherland commented Mar 17, 2026

Bear with me, I do come to an agreement in the end. . .

Caveats to the remove approach

To conclude this lengthy debate, several caveats to the remove approach have been identified.

If you don't like it from an idealistic point of view, that's fine; however, these points are off target.
I'm not looking for something to replace set/skip, or some general solution or conceptual reframing (although maybe you are); I'm looking for a solution to particular scenarios where set/skip are less practical (scenarios common at ESNZ).

The following major points have been conceded (by acknowledgement or lack of counter):

  1. The remove operation needs to be performed on the whole subgraph in the general case in order to reliably remove these tasks

    • But not in your special case where all added tasks are in a single chain.

This point is void; I've explained this already:

  • I would not use it on the "whole subgraph" just the parentless beginning of a sub-graph.
  • The special case is the usual case.
  2. The remove operation may need to be performed multiple times

    • In the likely event that not all downstream tasks spawn simultaneously.

I object here too. To create/avoid a gap, I would usually just target the parentless task(s) of a cycle (or glob of cycle points) if necessary.

  3. Remove can only mop up tasks after they have spawned (and potentially after they have submitted / started).

    • Note that your approach of pausing the workflow to avoid this will only work if there is no inter-dependence between the R1 tasks, otherwise the workflow will have to be resumed to allow them to run.

Yes, this is a post-spawn intervention. Yes, it would be nice to avoid the situation in the first place, but people make mistakes. And it's nice that they can visualize the thing they need to act on (as opposed to the "whole subgraph", which they would have to know about).

  4. The same tasks will have to be re-removed with any subsequent re-running of the R1 tasks.

    • Really surprised you're not concerned about this! From your description, this applies to your use case!

Additionally, this point was not conceded but was demonstrated in a worked example:

I already addressed this:

I would set the first downstream R1 task(s), in this case, and remove any downstream and next instances spawned.
This would have to outweigh the trade-offs of using set here and for future cycles.

That's why I'm not concerned.

  1. Unintended active tasks may occur.

    • The logical result of (3) above, proven in example.

And again, I've already addressed this. There are "unintended active tasks" with skip/set approaches too, and I would say they're more pernicious. I would rather things didn't kick off because of fake succeeded states in the workflow.


General case solution

If you want to convince me to not have remove --no-spawn as an option,

I'm not necessarily trying to convince you not to have remove --no-spawn as an option (read back: I noted that this is actually necessary, at least internally, for #7198). I'm pointing out that it's not a great solution and doesn't work for all cases; there is more likely than not a better resolution to this scenario.

you'll need to show me that there is a more practical way to handle these examples.

Great, let's talk alternatives!

Outline of the scenario:

  • We have run a bunch of cycles.
  • We have added some tasks, and reloaded/restarted the workflow.
  • We have triggered an explicit selection of tasks somewhere in the workflow's history.
  • We expected only the explicitly triggered tasks to spawn/run, but some other tasks got spawned too.
  • (note that this problem is not coupled to R1 tasks, though this is a pressing use case at present)

Outline of the problem:

  • When we trigger a historical task, it may cause newly added tasks to spawn and either run (if dependencies are satisfied) or force a workflow-stall depending on the specifics of the graph change.

While the status-quo behaviour is "logical" by SoD, it's rather illogical from a user perspective. It can certainly be argued that this is a bug (I lean towards the position that it is, more on that below).

Irrespective of whether we accept this as a bug or not, there are two possible ways for Cylc to handle this situation:

  1. Spawn these tasks.
  2. Don't spawn these tasks.

We have to choose one of these as the default behaviour, currently, we choose (1), however, I would argue that:

  • (2) is a more logical default.
  • Not spawning these tasks is the most likely user intention.
  • The inverse intervention (i.e, to spawn these tasks) is easier and less caveat-prone this way around.

I'm with you. Another scenario would be wanting to create a gap for some unspecified number of cycle points by removing some lead (parentless) tasks.

But it would be much easier to prevent this from happening in the first place. We can always provide a mechanism for running these new-old tasks if anyone were ever to want this.

Agree, again.

This problem is actually very similar to the issue of pre-start-cycle-point tasks in the warm start scenario (#7178), an issue which is impacting our operational workflows at the moment:

  • With warm starts, all tasks are newly added, we presume that all historical task instances (tasks before the warm-start point) have completed. This is the so-called "pre-initial" logic.
  • With this scenario, it is arguable that we should consider all historical task instances (tasks before the reload-point) as completed.

There's another issue I've run into, which this would solve: off-ICP start tasks (when you spin up a workflow on specified tasks, cycles ahead of the ICP).
You can accidentally end up spawning these tasks in earlier cycles when re-running R1 (as this gap is meaningless after a reload/restart). Setting earlier instances as complete would help with this.

So there's two simple alternatives:

  • Consider historical instances of newly-added tasks to be completed.
  • Set the start cycle point of the recurrences of newly-added tasks to the oldest active cycle point.

Both of these approaches would completely resolve this situation, removing the need for the operator to run the cylc remove --no-spawn command.

Agreed!
Also, to remove gaps and alleviate the extent to which these alternatives need to play a part:
I want isolated-graphs/inter-graph-dependencies to be a thing:
#7020 (comment)

Along with the ideas around parentless tasks in #7228 (where I agree that xtriggers should be like automatic Python tasks, separate from what they trigger).

However

As mentioned above, there's still the scenario where people may want to create gaps, or a patchwork of them (without removing the tasks altogether, and definitely without having the gaps considered complete).
Although I suppose you may wish to stamp out this devilish behaviour 😆

For now:

This PR is a small change that can be undone once we have some of these solutions ironed out and ready to go (which I assume would be something like 8.8 release).

@oliver-sanders
Member

oliver-sanders commented Mar 17, 2026

If you don't like it from an idealistic point of view, that's fine

Oh wow!

I have given a long list of rational arguments outlining several caveats to the remove approach, no ideals required (note, I'm not trying to push any one solution in particular, just trying to find the best solution in general).

I'm just trying to get agreement that the remove approach has caveats (which it most definitely does), some of which apply to you!

You may have workarounds for these caveats (e.g., pausing the workflow, or setting the first instance of the task to remove), and you might not be concerned by some of these caveats (e.g., if they don't apply to your particular example), but they are still caveats!

If you want to continue haggling out the caveats...

This point is void.. I've explained already:

This point doesn't apply to your specific example; however, that doesn't make it in any way "void".

You are not the only person who will ever encounter this issue. Your example is not the only one to which it will apply. Linear pipelines are a special case.


I object here too.. To create/avoid a gap I would usually just target the parentless task(s) of a cycle/glob-of-cycle point(s) if necessary.

You've misunderstood this point. The R1 tasks you're triggering may be inter-dependent, and not all of the downstreams are necessarily spawned at the same time, so multiple remove commands may be required. This is demonstrated in my worked example above; it's provable, so I don't think there's room for argument here.


Yes, this is a post spawn intervention

Point reluctantly conceded then?


I already addressed this:

You have provided a workaround for this issue using "set"; however, the workaround is an acknowledgement of this caveat. All I'm trying to do here is demonstrate that there are caveats!

However (since you mention it), you have vehemently argued against the use of "set" for these purposes above!


And again, I've already addressed this.. There are "unintended active tasks" with skip/set approaches

No, this point has not been addressed, unintended active tasks do not (and logically can not) occur with alternative approaches.

Otherwise, we're in agreement that it would be easier if we could prevent this issue from happening in the first place, which is great (all I need to achieve here).

What are your thoughts on the two alternative mechanisms I suggested above? Any other ideas?

These alternatives may also be quick to implement; though, as noted above, implementing an alternative doesn't mean we wouldn't add a --no-spawn option (or that we would remove it). It just wouldn't be the primary solution for handling this scenario.


As mentioned above, there's still the scenario where people may want to create gaps, or a patchwork of them (without removing the tasks altogether, and definitely without having the gaps considered complete).
Although I suppose you may wish to stamp out this devilish behavior 😆

I'm not 100% sure what scenario you're describing here, but I think this is exactly the use case for which skip-mode was introduced?

Nothing "devilish" about a real-world use case! We use skip-mode to toggle tasks on/off.
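As a rough sketch of that toggle (illustrative only; `my_task` is a hypothetical name, and this assumes the skip run mode introduced in Cylc 8.4):

```ini
# Hedged sketch: with skip mode, the task completes instantly without
# submitting a job, and is marked with a skip (fast-forward) badge.
[runtime]
    [[my_task]]
        run mode = skip  # remove/override this line to toggle it back on
```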

@dwsutherland
Member Author

dwsutherland commented Mar 24, 2026

I'm just trying to get agreement that the remove approach has caveats (which it most definitely does), some of which apply to you!

Of course it does, as does set/skip. For example:

  • the need to know the whole downstream graph
  • if you don't know/act on the whole downstream graph you will kick off tasks (creating active tasks; I'm not just talking about parentless active tasks kicking off, it could be anything downstream)
  • the assumption that having fake succeeded tasks all through the workflow is more desirable than not, when the tasks haven't actually performed their intended function

You may have workarounds for these caveats (e.g., pausing the workflow, or setting the first instance of the task to remove), and you might not be concerned by some of these caveats (e.g., if they don't apply to your particular example), but they are still caveats!

Of course there are caveats; I'm just asking for an alternative to the above repercussions, and the ability to handle/create gaps in situations where it's more practical to do so.

If you want to continue haggling out the caveats...

I really don't... But I have to, because there are scenarios in which --no-spawn is more practical, even if you can find caveats to argue for its non-use in some cases. All I need is one case in which it is.

You have provided a workaround for this issue using "set"; however, the workaround is an acknowledgement of this caveat. All I'm trying to do here is demonstrate that there are caveats!

However (since you mention it), you have vehemently argued against the use of "set" for these purposes above!

I'm not arguing for one or the other, just that a --no-spawn remove has its use, and in some situations makes an otherwise complicated intervention simple.
It's up to the user to weigh up whether it makes sense.

And again, I've already addressed this.. There are "unintended active tasks" with skip/set approaches

No, this point has not been addressed, unintended active tasks do not (and logically can not) occur with alternative approaches.

Yes, it has: if you don't select everything you need to set/skip (due to human error, or limitations in knowing the whole downstream graph), then something might kick off.

What are your thoughts on the two alternative mechanisms I suggested above? Any other ideas?

(image)

These alternatives may also be quick to implement; though, as noted above, implementing an alternative doesn't mean we wouldn't add a --no-spawn option (or that we would remove it). It just wouldn't be the primary solution for handling this scenario.

I'm happy with that; I'm not arguing for it as a primary solution, but to give options.

I'm not 100% sure what scenario you're describing here, but I think this is exactly the use case for which skip-mode was introduced?

Skip mode is a fake succeeded, which may be undesirable. Also, they would have to use a new flow if they later wanted to run a gap.

(final?) Note:

Again, I'm not arguing that there are no caveats to a --no-spawn remove.
I am arguing that there are some situations where --no-spawn is more desirable on a practical level (like my second example above, with the unfathomably complicated downstream), or can act as a feature, avoiding the caveats of set/skip.
And that this would help the operations team at ESNZ, who encounter this often due to our setup.

We may not even be having this conversation if isolated graph sections were in (although there may still have been an argument for it)... Kind of wish our energy was put into that instead of this 🤷‍♂️

@oliver-sanders
Member

oliver-sanders commented Mar 24, 2026

Kind of wish our energy was put into that instead of this 🤷‍♂️

Oh boi.

I'm not going to continue with this, but will instead summarise:

I suggested a couple of approaches which avoid the need for removal, and asked for feedback:

we're in agreement that it would be easier if we could prevent this issue from happening in the first place which is great (all I need to achieve here).

What are your thoughts on the two alternative mechanisms I suggested above? Any other ideas?

You seem ok with these ideas:

I'm happy with that

Please can I get your thoughts on these?

(Note, this issue is not just about R1 tasks, it can happen at any point in the workflow's history)

@dwsutherland
Member Author

dwsutherland commented Mar 25, 2026

Please can I get your thoughts on these?

Well with respect to these:

So there are two simple alternatives:

  • Consider historical instances of newly-added tasks to be completed.
  • Set the start cycle point of the recurrences of newly-added tasks to the oldest active cycle point.

(Noting here that "newly-added" means tasks added to the workflow definition with the scheduler then reloaded, and that the discussion is about how the spawning mechanism should treat them in the history prior to their introduction/some current point.)

I think the second alternative makes some assumptions that I'm not sure about, e.g. what if the newly-added task needs to be introduced a few cycles before the oldest active cycle?
Also, you'd have to work out whether they are separate subgraphs or not (because the oldest active cycle point may not be relevant to what's been added).

The first is a reasonable option; however, the task didn't "really"/actually complete, did it?

How about this for an alternative:

  • Reruns of tasks in an existing flow cannot add to the flow.

(i.e. cannot spawn the newly-added tasks off a flow that's already run the same path)

Obviously, if the outputs/states change on rerun, causing an alternative path to be followed for the first time (e.g. failure handling, etc.), then a newly-added task could spawn in that flow's new path.

In my mind this would fix the problem for both R1 and any other set of historical reruns, without the need to meddle with historical completion states (pre-ICP aside).
Also, it means that to run a newly-added task historically you'd have to be very explicit in targeting it, or set a new flow going.

By now, people may have gotten into the habit of setting existing already-run tasks to spawn new tasks, so maybe we can still accommodate this (?)

Thoughts?

@oliver-sanders
Member

Consider historical instances of newly-added tasks to be completed.

[this] is a reasonable option; however, it didn't "really"/actually complete, did it?

It didn't complete; however, Cylc effectively considers pre-initial tasks to have completed. This is the same situation as a warm start, where we have proposed that cylc workflow-state returns completed states for these tasks (#7178).

Since Cylc assumes pre-initial tasks to have completed for the purpose of task prerequisites, it is arguably inconsistent that the record states to the contrary and that this logic only applies internally to one workflow, and not externally to others.

The tasks could be marked as having run in "skip mode", which serves as an indication that they are only "simulated complete" tasks, not the result of a real submission. Skip-mode tasks display a (fast-forward) task badge in the GUI.

The upside of this approach is that these tasks can easily and intuitively be re-run (via trigger) or removed as needed.


Set the start cycle point of the recurrences of newly-added tasks to the oldest active cycle point.

I think [this] alternative makes some assumptions that I'm not sure about, e.g. what if the newly-added task needs to be introduced a few cycles before the oldest active cycle?

We could, potentially, include a mechanism to allow earlier instances of these tasks to be run (similar to the Cylc 7 cylc insert --no-check command).

However, another way to handle this issue would be to introduce the concept of a "reload point", which I suggested in #7201. This would allow the operator to decide which cycle point the reload changes apply from; newly added tasks could then be automatically spawned into the pool from that point onwards (#5949).

This issue (#7134), along with those mentioned in the last paragraph (#7201, #5949), is part of #7203, which is about enabling workflows to be safely and automatically reloaded. At present, graph changes have many caveats, which means a human operator has to supervise reloads, manually adding/removing/setting tasks in order to manage the transition. #7203 is about resolving these caveats, making reloads safe and predictable so that manual intervention is not needed, enabling the continuous deployment of workflows in automated environments. This topic will come up in a developers' discussion in the near future. This is a large part of why it's important for us to have an automated solution for this problem which works in the general case, irrespective of any manual overrides we might provide.

@dwsutherland
Member Author

Cylc effectively considers pre-initial tasks to have completed

I'm happier with pre-initial than post-initial.

"skip mode" which serves as indication that they are only "simulated complete"

Is workflow_state (testing for succeeded) not satisfied by these types of task completion? (I suppose I could test.)

introduce the concept of a "reload point"

Yes, I would be in favor of this. It would also help with the speed of reload for large workflows that are far from the original ICP.

Do have a think about the alternative I proposed; it might feel fundamental, but it kind of makes sense in a way.

@oliver-sanders
Member

Do have a think about the alternative I proposed; it might feel fundamental, but it kind of makes sense in a way.

Aah, sorry, forgot to respond to that; I didn't quite grasp the suggestion.

The historical instances of newly added tasks do not belong to any flow yet as there's no DB entry. So we would need to add output entries to mark them, say as completed, in order to assign them to a flow.

@hjoliver
Member

(I'll try to summarize and suggest the way forward IMO on Element chat first...)

@dwsutherland
Member Author

dwsutherland commented Mar 27, 2026

The historical instances of newly added tasks do not belong to any flow yet as there's no DB entry. So we would need to add output entries to mark them, say as completed, in order to assign them to a flow.

It's the rerun tasks that have flow history and that spawn the newly-added tasks; this is what I'm referring to.

@oliver-sanders
Member

oliver-sanders commented Mar 27, 2026

Sorry @dwsutherland, I don't understand what you're suggesting here, could you elaborate.

@dwsutherland
Member Author

dwsutherland commented Mar 28, 2026

Sorry @dwsutherland, I don't understand what you're suggesting here, could you elaborate.

It's really just what it says on the packet.

The main problem we are addressing is the unintended consequence of newly-added tasks being spawned by tasks that have already run (i.e. reruns of tasks that spawn the newly-added ones).

So if we take on the following rule:

  • Reruns of tasks in an existing flow cannot add to the flow.

(i.e. cannot spawn the newly-added tasks off a flow that's already run the same path)

Obviously, if the outputs/states change on rerun, causing an alternative path to be followed for the first time (e.g. failure handling, etc.), then a newly-added task could spawn in that flow's new path.

What I'm saying is: if
a => b => c
was run as flow {1} (say in cycle 1) and the workflow moved on.

We then add d (reloading the workflow):
a => b => c => d

Then a retrigger/rerun of a => b => c in cycle 1 under this rule would mean 1/d won't spawn (it will not be added to the set of flow {1} tasks, i.e. tasks that have run as flow {1}).

And to run 1/d you would need to trigger it (or set its prerequisites to satisfied), or run one of a, b or c as a new flow (i.e. {2}).

This change would appear to solve the problem.
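The proposed rule can be sketched as a toy model (not Cylc internals; all names here are illustrative, not real Cylc API): record which (task, output, flow) triples have already spawned, and make reruns in the same flow spawn nothing.

```python
# Toy model of the proposed rule:
# "don't spawn on the same output twice in the same flow".

spawned = set()  # (task_id, output, flow) triples that have already spawned


def spawn_children(task_id, output, flow, graph):
    """Return the children this output should spawn, or [] if this
    output has already spawned in this flow (e.g. on a rerun)."""
    key = (task_id, output, flow)
    if key in spawned:
        return []  # rerun in the same flow: spawn nothing new
    spawned.add(key)
    return graph.get((task_id, output), [])


# Graph after reload: a => b => c => d, where 1/d is the newly added task.
graph = {
    ("1/a", "succeeded"): ["1/b"],
    ("1/b", "succeeded"): ["1/c"],
    ("1/c", "succeeded"): ["1/d"],  # edge added by the reload
}

# Flow 1 already ran 1/c before d existed, so its output is on record:
spawned.add(("1/c", "succeeded", 1))

print(spawn_children("1/c", "succeeded", 1, graph))  # rerun in flow 1: []
print(spawn_children("1/c", "succeeded", 2, graph))  # new flow 2: ['1/d']
```

Under this model, a rerun of 1/c in flow {1} spawns nothing (so 1/d stays out of the flow), while triggering it as a new flow {2} spawns 1/d as expected.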

@oliver-sanders
Member

Right, gotcha, good idea, i.e.:

"Don't spawn on the same output twice (in the same flow)".

I think that makes sense; off the top of my head, I can't think of any situation where we would not want to do that. It certainly makes reload a bit more predictable, and you can always force task spawning with cylc trigger. We could, potentially, consider the duplicate output spawning in the status quo to be a bug.

Furthermore, there's a bunch of issues to do with the message-handling code which point towards a refactor in this code area:

But, unfortunately, in relation to "the problem of not spawning historical instances of newly added tasks", there are a couple of caveats:

  1. Group trigger utilises cylc remove under the hood, which wipes the outputs previously produced in the given flow; this would still cause historical instances to be indirectly spawned.
  2. Task ID matching could cause the historical instances to be directly spawned, e.g. cylc trigger <cycle>/<family>.

So I think we would still need a more formal mechanism of defining the "reload point" at which newly-added tasks are inserted from to solve the issue in general. However, this idea may well make sense for other reasons.


Labels

  • bug? (Not sure if this is a bug or not)
  • could be better (Not exactly a bug, but not ideal.)
  • schema change (Change to the Cylc GraphQL schema)
  • small
