core: Use correct locks while running live migration of multiple disks of one VM #1010
Conversation
JasperB-TeamBlue
left a comment
Looks good to me, fixes the issue described. Tested out the API and no disk-move functionality is affected.
Seems like a rather important fix for a very common workflow which many competing products (VMware, Hyper-V, etc.) have been handling correctly for years. How many more months have to pass to get maintainer approval? @peter-boden et al.?
…ne VM. Live disk migration may fail with the error "The VM is performing an operation on a Snapshot". This is caused by the oVirt#670 (https://bugzilla.redhat.com/show_bug.cgi?id=2110186) fix. There are two different scenarios for LiveMigrateDiskCommand:
1. Generic scenario, when the live migration is initiated via UI/API. In this case the needed lock is created by MoveDiskCommand.lockVmWithWait, and it should be used in LiveMigrateDiskCommand.performNextOperation, as it was before the "Bug 2110186" fix.
2. "ovirt-engine restarted during the live disks migration" scenario. In this case another lock (introduced by the "Bug 2110186" fix) should be used to clean up the interrupted commands.
Signed-off-by: Stepan Ermakov <sermakov@orionsoft.ru>
Force-pushed from 3b514a5 to fcb1779
/ost
⏳ Running ost suite 'basic-suite-master' on distro 'el9stream'. Follow the progress here.
😭💔 ost suite 'basic-suite-master' on distro 'el9stream' failed. (details)
/ost
⏳ Running ost suite 'basic-suite-master' on distro 'el9stream'. Follow the progress here.
😭💔 ost suite 'basic-suite-master' on distro 'el9stream' failed. (details)
Hi @JasperB-TeamBlue
dupondje
left a comment
Hi,
Sorry for the long waiting time, but I lack the time to review PRs.
I did a deeper dive into the code, and I don't think we have the correct solution for the issue here.
Let me explain:
createEngineLockForSnapshotRemove() creates 2 locks:
- LockingGroup.DISK for the disk, which is disk-specific, so it always needs to be created for that particular disk, as it prevents other actions on that disk during the snapshot removal.
- LockingGroup.VM -> a shared! lock on the VM, to, well, lock the VM :)
The lockVmWithWait in MoveDisk creates a LockingGroup.LIVE_STORAGE_MIGRATION lock, which is an exclusive lock.
I think (this should be validated in a test scenario) that because the LIVE_STORAGE_MIGRATION lock is exclusive, the snapshots are created/removed one by one, which is what makes your change appear to 'fix' the issue.
Can you check the logs for the "Failed to acquire VM lock, will retry on the next polling cycle" message during your test?
I think the resolution can either be to also acquire the getLock() (which is the LIVE_STORAGE_MIGRATION lock on the VM), so having 3 locks,
or to allow the code to create multiple snapshots concurrently if the disks don't overlap (which I think will be the hard way :))
Thanks
Jean-Louis
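The shared-vs-exclusive distinction in the review above can be sketched with a toy lock model. This is NOT oVirt code; the class and method names are illustrative. It shows why a shared VM lock plus an exclusive per-disk lock (the `createEngineLockForSnapshotRemove()` scheme) lets two disk operations run concurrently, while an exclusive VM-wide lock (like LIVE_STORAGE_MIGRATION) serializes everything:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the lock semantics described in the review.
class VmLockModel {
    private int sharedVmHolders = 0;          // holders of the shared VM lock
    private boolean exclusiveVmHeld = false;  // exclusive VM-wide lock held?
    private final Set<String> lockedDisks = new HashSet<>();

    // Snapshot removal for one disk: shared VM lock + exclusive disk lock.
    synchronized boolean tryLockForSnapshotRemove(String diskId) {
        // Fails if an exclusive VM lock is held, or the disk is already locked.
        if (exclusiveVmHeld || !lockedDisks.add(diskId)) {
            return false;
        }
        sharedVmHolders++;
        return true;
    }

    synchronized void unlockForSnapshotRemove(String diskId) {
        lockedDisks.remove(diskId);
        sharedVmHolders--;
    }

    // Exclusive VM-wide lock: conflicts with shared holders and other
    // exclusive holders alike.
    synchronized boolean tryLockVmExclusive() {
        if (exclusiveVmHeld || sharedVmHolders > 0) {
            return false;
        }
        exclusiveVmHeld = true;
        return true;
    }

    synchronized void unlockVmExclusive() {
        exclusiveVmHeld = false;
    }
}
```

Under this model, migrations of two different disks can proceed in parallel, but an exclusive VM-wide lock blocks (and is blocked by) any of them — which is the serialization the reviewer suspects is masking the bug.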
@sermakov-orion: Any follow-up?
Live disk migration may fail with the error "The VM is performing an operation on a Snapshot".
Steps to reproduce:
Some of the disks are moved successfully, but the migration of the remaining disks fails with the "The VM is performing an operation on a Snapshot" error message.
This is caused by the #670 (https://bugzilla.redhat.com/show_bug.cgi?id=2110186) fix.
There are two different scenarios for LiveMigrateDiskCommand:
1. Generic scenario, when the live migration is initiated via UI/API. In this case the needed lock is created by MoveDiskCommand.lockVmWithWait.
2. "ovirt-engine restarted during the live disks migration" scenario. In this case another lock (introduced by the "Bug 2110186" fix) should be used to clean up the interrupted commands.
Changes introduced with this PR
- Use the locks created by MoveDiskCommand.lockVmWithWait during the generic scenario
- Use the locks introduced in the "Bug 2110186" fix during the "ovirt-engine restarted during the live disks migration" scenario
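The two bullets above reduce to one decision: which path must create its own lock. A hedged sketch, not the actual PR diff — the enum and method names here are illustrative only:

```java
// Sketch of the PR's lock selection, per its description: the generic path
// reuses the lock MoveDiskCommand.lockVmWithWait already holds, so
// performNextOperation must not create a new one; only the engine-restart
// cleanup path creates the lock introduced by the Bug 2110186 fix.
class LiveMigrationLockChoice {
    enum Scenario { GENERIC, ENGINE_RESTARTED }

    // True when LiveMigrateDiskCommand must create the snapshot lock itself
    // (the original lock is gone after an ovirt-engine restart).
    static boolean mustCreateSnapshotLock(Scenario scenario) {
        return scenario == Scenario.ENGINE_RESTARTED;
    }
}
```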
Are you the owner of the code you are sending in, or do you have permission of the owner?
y