
core: Use correct locks while running live migration of multiple disks of one VM#1010

Open
sermakov-orion wants to merge 1 commit intooVirt:masterfrom
sermakov-orion:bugfix/multiple-disks-live-migration

Conversation

@sermakov-orion
Contributor

@sermakov-orion sermakov-orion commented Apr 22, 2025

Live disk migration may fail with the error "The VM is performing an operation on a Snapshot".
Steps to reproduce:

  1. Create two Storage Domains.
  2. Create a VM with multiple (10 or more) disks on one Storage Domain.
  3. Run the VM.
  4. Select all disks of the VM at the "Storage -> Disks" panel and press "Move" button.
  5. Press "OK" on the "Move Disk(s)" dialog.
    Some of the disks will be moved successfully, but the migration of the others fails with the "The VM is performing an operation on a Snapshot" error message:
    (screenshot)

This is caused by the fix for #670 (https://bugzilla.redhat.com/show_bug.cgi?id=2110186).
There are two different scenarios for the LiveMigrateDiskCommand:

  1. The generic scenario, when the live migration is initiated via the UI/API. In this case the needed lock is created by MoveDiskCommand.lockVmWithWait, and it should be used in LiveMigrateDiskCommand.performNextOperation, as it was before the "Bug 2110186" fix.
  2. The "ovirt-engine restarted during the live disks migration" scenario. In this case another lock (introduced by the "Bug 2110186" fix) should be used to clean up interrupted commands.
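The two scenarios above boil down to a single branch: reuse the parent command's lock in the generic flow, or take the restart-cleanup lock when resuming. Below is a minimal, hypothetical sketch of that decision; the class, enum, and method names are invented for illustration and are not actual oVirt code:

```java
// Hypothetical sketch (invented names, NOT actual oVirt code) of the lock
// selection that LiveMigrateDiskCommand.performNextOperation has to make.
public class LockSelectionSketch {

    enum LockSource {
        // Lock already taken by the parent MoveDiskCommand.lockVmWithWait
        // in the generic UI/API flow.
        PARENT_MOVE_DISK_LOCK,
        // Lock introduced by the Bug 2110186 fix, used only when cleaning up
        // commands interrupted by an ovirt-engine restart.
        RESTART_CLEANUP_LOCK
    }

    static LockSource selectLockSource(boolean resumedAfterEngineRestart) {
        return resumedAfterEngineRestart
                ? LockSource.RESTART_CLEANUP_LOCK
                : LockSource.PARENT_MOVE_DISK_LOCK;
    }

    public static void main(String[] args) {
        System.out.println(selectLockSource(false)); // PARENT_MOVE_DISK_LOCK
        System.out.println(selectLockSource(true));  // RESTART_CLEANUP_LOCK
    }
}
```

The point of the fix is that before Bug 2110186 only the first branch existed; the fix made every run take the second path, which conflicts with the lock the parent command already holds.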

Changes introduced with this PR

  • Use locks created by MoveDiskCommand.lockVmWithWait during the generic scenario

  • Use locks introduced in the "Bug 2110186" fix during the "ovirt-engine restarted during the live disks migration" scenario

Are you the owner of the code you are sending in, or do you have permission of the owner?

y

Contributor

@JasperB-TeamBlue JasperB-TeamBlue left a comment


Looks good to me; it fixes the reported issue. I also tested moving disks via the API, and that functionality is unaffected.

@sermakov-orion sermakov-orion changed the title Use correct locks while running live migration of multiple disks of o… core: Use correct locks while running live migration of multiple disks of o… Jun 4, 2025
@sermakov-orion sermakov-orion changed the title core: Use correct locks while running live migration of multiple disks of o… core: Use correct locks while running live migration of multiple disks of one VM Jun 4, 2025
@tb3088

tb3088 commented Jun 27, 2025

Seems like a rather important fix for a very common workflow that many competing products (VMware, Hyper-V, etc.) have been handling correctly for years. How many more months have to pass to get maintainer approval? @peter-boden et al.?

…ne VM.

The live disk migration may fail with the error "The VM is performing an operation on a Snapshot". This is caused by the fix for oVirt#670 (https://bugzilla.redhat.com/show_bug.cgi?id=2110186). There are two different scenarios for the LiveMigrateDiskCommand:
1. The generic scenario, when the live migration is initiated via the UI/API. In this case the needed lock is created by MoveDiskCommand.lockVmWithWait, and it should be used in LiveMigrateDiskCommand.performNextOperation, as it was before the "Bug 2110186" fix.
2. The "ovirt-engine restarted during the live disks migration" scenario. In this case another lock (introduced by the "Bug 2110186" fix) should be used to clean up interrupted commands.

Signed-off-by: Stepan Ermakov <sermakov@orionsoft.ru>
@sandrobonazzola sandrobonazzola force-pushed the bugfix/multiple-disks-live-migration branch from 3b514a5 to fcb1779 Compare June 30, 2025 06:00
@JasperB-TeamBlue
Contributor

/ost

@github-actions

⏳ Running ost suite 'basic-suite-master' on distro 'el9stream'.

Follow the progress here.

@github-actions

😭💔 ost suite 'basic-suite-master' on distro 'el9stream' failed. (details)

@JasperB-TeamBlue
Contributor

/ost

@github-actions

github-actions bot commented Jul 1, 2025

⏳ Running ost suite 'basic-suite-master' on distro 'el9stream'.

Follow the progress here.

@github-actions

github-actions bot commented Jul 1, 2025

😭💔 ost suite 'basic-suite-master' on distro 'el9stream' failed. (details)

@sermakov-orion
Contributor Author

sermakov-orion commented Jul 2, 2025

Hi @JasperB-TeamBlue
The OST failed a couple of times here, but as far as I can see the failures are not related to my changes: podman was not able to download the Selenium browser image. What can we do to resolve this?

Member

@dupondje dupondje left a comment


Hi,

Sorry for the long wait, but I have had little time to review PRs.
I did a deeper dive into the code, and I don't think this is the correct solution for the issue.

Let me explain:
createEngineLockForSnapshotRemove()
creates two locks:

  • LockingGroup.DISK for the disk, which is disk-specific. This one always needs to be created for that particular disk, as it prevents other actions on that disk during the snapshot removal.
  • LockingGroup.VM -> a shared! lock on the VM, to, well, lock the VM :)

The lockVmWithWait in MoveDisk creates a LockingGroup.LIVE_STORAGE_MIGRATION lock, which is an exclusive lock.

I think (this should be validated in a test scenario) that because the LIVE_STORAGE_MIGRATION lock is exclusive, at least the snapshots are created/removed one by one, which is what makes the issue appear 'fixed'.
Can you check the logs for the "Failed to acquire VM lock, will retry on the next polling cycle" message during your test?
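The shared-versus-exclusive distinction above can be modeled with a minimal sketch. This is NOT the real oVirt lock manager; VmLockModel and its methods are invented for illustration, assuming the usual semantics that shared locks on a key coexist while an exclusive lock conflicts with everything on that key:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (NOT the real oVirt lock manager) of shared vs.
// exclusive locks on a single VM key.
public class VmLockModel {
    private final Map<String, Integer> sharedHolders = new HashMap<>();
    private final Map<String, Boolean> exclusiveHeld = new HashMap<>();

    // A shared lock (like the LockingGroup.VM lock from
    // createEngineLockForSnapshotRemove) coexists with other shared locks,
    // but not with an exclusive one.
    public boolean acquireShared(String key) {
        if (exclusiveHeld.getOrDefault(key, false)) {
            return false;
        }
        sharedHolders.merge(key, 1, Integer::sum);
        return true;
    }

    // An exclusive lock (like LIVE_STORAGE_MIGRATION taken by lockVmWithWait)
    // conflicts with every other lock on the same key, so commands that need
    // it run one by one.
    public boolean acquireExclusive(String key) {
        if (exclusiveHeld.getOrDefault(key, false)
                || sharedHolders.getOrDefault(key, 0) > 0) {
            return false;
        }
        exclusiveHeld.put(key, true);
        return true;
    }

    public static void main(String[] args) {
        VmLockModel vm = new VmLockModel();
        System.out.println(vm.acquireShared("vm-1"));    // true: first shared lock
        System.out.println(vm.acquireShared("vm-1"));    // true: shared locks overlap
        System.out.println(vm.acquireExclusive("vm-1")); // false: blocked by shared holders

        VmLockModel vm2 = new VmLockModel();
        System.out.println(vm2.acquireExclusive("vm-1")); // true: first exclusive lock
        System.out.println(vm2.acquireExclusive("vm-1")); // false: exclusive serializes
    }
}
```

Under these assumed semantics, snapshot removals holding only the shared VM lock could overlap, while any command that takes the exclusive LIVE_STORAGE_MIGRATION lock on the same VM is forced to run alone, which would explain the one-by-one behavior described above.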

I think the resolution could be to also acquire getLock() (which is the LIVE_STORAGE_MIGRATION lock on the VM), so having three locks.

Or to allow the code to create multiple snapshots concurrently if the disks don't overlap (which I think would be the hard way :))

Thanks
Jean-Louis

@dupondje
Member

dupondje commented Mar 9, 2026

@sermakov-orion : Any follow-up?
