Corentin xcpng 9/xenserver rebased patchqueue#1
Corentin xcpng 9/xenserver rebased patchqueue#1CocoElAbricot wants to merge 34 commits intoCorentin-UEK8/u1/basefrom
Conversation
|
Hey Corentin, This is a big chunk of work, thanks for diving into this! As we discussed on matrix, I think it is worth re-doing the rebase following the practices laid out on:
This will allow you to record information on conflict resolutions, dropped commits, and also double check that your changes build properly, so that the review can be much smoother. I had started reviewing the branch, so I will paste my unfinished review comments here in case they are helpful when you re-do your work on top of the UEK8 branch:
Dropped because upstream commit 4b0ebbc ("net: gro: move L3 flush
Dropped because upstream commit ad16248 ("x86/xen/time: Reduce Xen
Sould be dropped because upstream commit fc42892 ("sched/fair: Bump
I am assuming XenServer folks disabled those mitigations because in the What's the rationale for not rebasing it?
Dropped, though a comment on why would be good for the record. I agree
Conflicts in your rebased branch caused by 217759b ("xen: port block
Why is this one missing on your rebased branch? It applies cleanly.
Nit: it would have been nice to not change the coding styles of lines where
It looks like you added a change from XenServer with the extra conditional
The driver changed significantly enough from the base branch that I
Why are you adding back all this code to common.c in your rebased commit I'm surprised this would build fine as we have multiple global symbols with Now, if you really needed to add a bunch of code as a pre-requisite to a
In the above hunk, please keep the annotation
The original code patched Upon closer look, upstream commit 770352e ("xen-pciback: provide a |
toolstack. Relative to the VBD, the xenstore key is: queue-NNN/kthread-pid where NNN is the queue number starting from 0. Signed-off-by: Ross Lagerwall <[email protected]>
interface. These may fail if the interface is reconfigured some time after boot due to external memory fragmentation. These buffers must be contiguous (for DMA) and they cannot be pre-allocated at boot because they are dependent on the number of queues, which is configurable (apart from wasting memory if the interface is not up). Use the __GFP_RETRY_MAYFAIL flag when performing these allocations so that direct reclaim and compaction has a better chance of working, making the allocations less likely to fail. Signed-off-by: Ross Lagerwall <[email protected]>
Commit 239c78d ("net: clear local_df when passing skb between namespaces") cleared local_df/ignore_df for tunnels. This broke, last resort, fragmentation of GRE tunneled packets. Clear ignore_df, only when packet is being injected to another namespace. Signed-off-by: Anoob Soman <[email protected]>
86b7451 to
07f59c1
Compare
|
Het @casasnovas Here is the updated patchqueue on top of UEK8/u1 Dropped commits: commit commit commit commits |
casasnovas
left a comment
There was a problem hiding this comment.
Nice work, my early comments inlined.
Can you also explain what sort of testing you did so far (build tested + dom0 boot with this kernel)?
arch/x86/xen/enlighten_pv.c
Outdated
| bool inhcall = __this_cpu_read(xen_in_preemptible_hcall); | ||
|
|
||
| __this_cpu_write(xen_in_preemptible_hcall, false); | ||
| raw_cpu_write(xen_in_preemptible_hcall, false); |
There was a problem hiding this comment.
Can you please explain why you made this change here? I don't find it in the original XS patch.
There was a problem hiding this comment.
Removed This changed to match the original patch
There was a problem hiding this comment.
I wasn't necessarily saying it should be reverted, it was more to understand the rationale as maybe it is indeed required, but we need to have an audit trail of why we thought it was.
kernel/module/sysfs.c
Outdated
| * The caller checked the pos and count against our size. | ||
| */ | ||
| memcpy(buf, bin_attr->private + pos, count); | ||
| return count; |
There was a problem hiding this comment.
I think you inadvertently added this module_notes_read function from your conflict resolution.
There was a problem hiding this comment.
Removed module_notes_read function
There was a problem hiding this comment.
Nit: I think there's now one extra new line I think (I am being picky so that we don't get unnecessary changes to decrease likeliness of conflicts with future patches).
drivers/xen/xen-pciback/pci_stub.c
Outdated
|
|
||
| list_for_each_entry(pdev, &dev->bus->devices, bus_list) { | ||
| if (pdev != dev && (!pdev->driver | ||
| || strcmp(pdev->driver->name, "pciback"))) |
There was a problem hiding this comment.
You've altered the indentation from original commit here.
There was a problem hiding this comment.
It still isn't right and changed from the original commit AFAICS.
drivers/xen/xen-pciback/pci_stub.c
Outdated
| if (dev->quarantined) | ||
| dev->unsafe = 0; | ||
| if (!xen_pv_domain()) | ||
| xen_reset_device(dev); |
There was a problem hiding this comment.
In your extra commit description you mention you've inlined these, but can you explain why?
After some digging and AIUI, upstream commit 88801d0 ("xen/pci: Add a function to reset device for xen") slightly reworked how to do a pci_reset and you're trying to mimic it here - why not simply calling pcistub_reset_device_state directly?
Also I think your inlining is not correct, the conditional for !xen_pv_domain() calling xen_reset_device() was called unconditionally, not only when the return of __pci_reset_function_locked is zero (so using the original helper instead of deleting + inlining here should also fix this manual error).
There was a problem hiding this comment.
As an extra suggestion, you could also rename pcistub_reset_device_state, to be defensive against future addition of calls to that function, i.e. by renaming it, we'll get build errors allowing us to call our own wrappers or decide to call the original one, but it won't go silently unnoticed.
There was a problem hiding this comment.
I have changed __pcistub_reset_function() and add a call to pcistub_reset_device_state() instead of deleting it.
| menuconfig MODULES | ||
| bool "Enable loadable module support" | ||
| modules | ||
| select EXECMEM |
There was a problem hiding this comment.
This was a mistake, reverted the change.
What about these two?
In:
Can you prefix your own commit descriptions with your name and enclose them in square brackets? Also, please wrap the lines so they are more readable :) Same comment for:
where you're using different styles/conventions. As another general comment when you add extra information to the commit description, explain the context and the why rather than the what (the what is usually already visible from the diff, but your intent and the information you had at the time you made a decision are important, like referencing upstream commits that lead to your conflict).
A reference to the commit changing the Example extra commit description ticking all boxes for me:
After giving it a bit of thought, we should probably go all-in with the UEK way, so maybe converting the I haven't had a chance to check each hunk and missed xs_padding to make sure there is a UEK one, will do when we agree on how to go forward. |
fallen into a state of disrepair, partly due to dead code, partly
as a result of the dynamic preemption feature whereby kernel
preemption mode is selected at kernel boot time. This patch
re-enables preemptible hypercalls and will do the following:
1) Update specific feature selection from !PREEMPTION to !PREEMPT.
The following table shows the relationship between different preemption
features and their indicators/selectors (Y = "=Y", N = "is not set",
. = absent):
| np-s | np-d | vp-s | vp-d | fp-s | fp-d
CONFIG_PREEMPT_DYNAMIC N Y N Y N Y
CONFIG_PREEMPTION . Y . Y Y Y
CONFIG_PREEMPT N N N N Y Y
CONFIG_PREEMPT_VOLUNTARY N N Y Y N N
CONFIG_PREEMPT_NONE Y Y N N N N
Unless PREEMPT is set, we need to enable the fixes.
2) Update flag access from __this_cpu_XXX() to raw_cpu_XXX().
The long-running hypercalls are flagged by way of a per-cpu variable
which is set before and cleared after the relevant calls. This elicits
a warning "BUG: using __this_cpu_write() in preemptible [00000000] code",
but xen_pv_evtchn_do_upcall() deals specifically with this.
3) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
The code will call irqentry_exit_cond_resched() if the flag (as noted
above) is set, but the dynamic preemption feature will livepatch that
function to a no-op unless full preemption is selected. The code is
therefore updated to call raw_irqentry_exit_cond_resched().
[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
following commit d30f467 ("x86/xen: Move Xen upcall handler"),
this code now lives in arch/x86/xen/enlighten_pv.c. The fix is applied
there rather than arch/x86/entry/common.c where it originally resided.]
When kernel runs as Xen PV dom0, it does not have access to the required MSRs and so cpuidle is explicitly disabled in xen_arch_setup(). Skip intel_idle initialization early if cpuidle is disabled to avoid spurious warnings reading MSRs. [ 11.706584] <TASK> [ 11.706629] ? ex_handler_msr+0x128/0x140 [ 11.706716] ? fixup_exception+0x85/0x340 [ 11.706803] ? exc_general_protection+0x136/0x400 [ 11.706907] ? asm_exc_general_protection+0x22/0x30 [ 11.707055] ? xen_do_read_msr+0x81/0xa0 [ 11.707182] xen_read_msr+0x20/0x30 [ 11.707340] intel_idle_init+0xa5c/0xc90 [ 11.707642] ? _raw_spin_unlock_irqrestore+0x18/0x20 [ 11.708159] ? __pfx_intel_idle_init+0x10/0x10 [ 11.708451] do_one_initcall+0x41/0x2e0 [ 11.708580] kernel_init_freeable+0x1c0/0x2c0 [ 11.708748] ? __pfx_kernel_init+0x10/0x10 [ 11.709013] kernel_init+0x16/0x1b0 [ 11.709193] ret_from_fork+0x2d/0x50 [ 11.709389] ? __pfx_kernel_init+0x10/0x10 [ 11.709656] ret_from_fork_asm+0x1b/0x30 [ 11.709844] </TASK> Signed-off-by: Chunjie Zhu <[email protected]>
kexec/kdump must pass acpi_rsdp (physical address of ACPI RSDP table) to the crash kernel, especially in EFI case, otherwise, the crash kernel fails to locate ACPI RSDP table. Consequently, ACPI init runs into error, and APIC mmio read page fault happens, finally crash kernel gets stuck. Signed-off-by: Chunjie Zhu <[email protected]>
4be73cd to
cc9d604
Compare
|
I have updated the commits description to have uniform style. 1e245e7 ("pv-iommu-support.patch") |
cc9d604 to
654f209
Compare
crashdump needs it to navigate around vmcore to dump driver information Signed-off-by: Chunjie Zhu <[email protected]> [ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based): re-introduce __counted_by(notes) annotation which was present on the original struct (removed in a subsequent hunk), as it allows runtime bounds checking on flexible arrays]
On systems where DMA addresses and physical addresses are not 1:1 (such as Xen PV guests), the generic dma_get_required_mask() will not return the correct mask (since the xen swiotlb does not provide a get_required_mask callback, it falls back to using 32 bits). Some device drivers (such as mpt3sas) use dma_get_required_mask() to set device DMA masks to allow them to use only 32-bit DMA addresses in hardware structures. This results in unnecessary use of the SWIOTLB if DMA addresses are more than 32-bits, impacting performance significantly. Under Xen, the required DMA mask can then always be set to 64-bits. Signed-off-by: David Vrabel <[email protected]>
The xen-pciback needs to test if a per-function reset is available so it can provide an interface for a safe bus reset instead. Signed-off-by: David Vrabel <[email protected]>
…g at any time. Any access by Dom0, DomU or Xen to the PCI device may cause a Fatal PCI Express error. Root ports can be configured to signal BIOS code that such an error has occured and the BIOS code can injected NMI interrupts after detecting the Fatal PCI Express error. The injected NMI interrupt may lead to a host crash as Xen may not be able to determine the cause of the NMI interrupt. Thus a guest with a PCI device passed through may be able to trigger a host crash. Disabling Fatal Error reporting in the PCI Express root port prevents the error from propogating to system firmware and removes that attack vector for triggering host crashes.
some devices to fallback to an alternate reset method. Signed-off-by: Ross Lagerwall <[email protected]>
"slabs", each of which is physically contiguous, in 32-bit address space,
to service DMA for devices that are limited to a 32-bit address space.
Ideally we want a total of 128M (a separate patch updates this from the
kernel default of 64M), but we are content with less, and will retry
three times, halving the allocation each time, before giving up.
A unique Xen problem results from memory being exchanged with the
hypervisor, whereby the kernel offers an amount of memory to the
hypervisor, and the hypervisor returns different memory, in this case
contiguous, low-address memory. This enables VMs (including dom0) to
remain of the exact same size, which has significant benefits in terms
of VM and memory management.
If, however, not enough of the desired memory is available, the hypervisor
will return what it's got, and we have a partial success. In this case
retrying is guaranteed to fail, since the hypervisor has already exchanged
everything it can, and we would end up giving up. To avoid this we
record how much memory we actually got by way of an extra paramemeter
("contig_pages"), and if this is enough (at time of writing we whink
down to 1/8th of the default request is OK) we take it and run.
In kernel version v4.19.19, the default number of RSS queues is determined using num_online_cpus(). However, in kernel v6.6.x, the logic is updated to use CPU topology information to determine the number of physical cores. For Xen dom0 kernels, this change introduces a problem: the dom0 kernel cannot retrieve valid CPU topology data from Xen. As a result, the default number of RSS queues is always set to 1, which degrades networking performance of guests using xen-netback. This patch introduces a fallback mechanism: if CPU topology information is unavailable, it reverts to using num_online_cpus(). To avoid excessive queue allocation, the number of RSS queues is capped at 8. Signed-off-by: Ming Lu <[email protected]>
After some recent upstream changes a lot of what used to be done in build.c is
now done in header.S and setup.ld.
The existing code lays out the PE header like this
CONFIG_EFI_MIXED=n CONFIG_EFI_MIXED=y
```
0x0000 +--------+ --- 0x0000 +--------+ ---
| | | | | |
0x1000 +--------+ | 0x1000 +--------+ |
| | | | | |
| .setup | | | .setup | |
| | | setup_size +--------+ | setup_size
| | | | .compat| |
0x4000 +--------+ --- 0x4000 +--------+ ---
| .text | | .text |
| | | |
... ...
| | | |
0xabb000 +--------+ 0xabb000 +--------+
| .data | | .data |
+--------+ +--------+
```
The .text section doesn't move and compat eats backwards into .setup. This
looks a little weird to me but ok. The code in build.c typically copies more
than 0x3000 bytes of setup data so it sure looks like putting .compat at 0x3000
is causing something to be overwritten. But I have left this behavior alone.
The changes in this PR extend setup_size defined in setup.ld by 0x1000 bytes
and then puts .sbat in the resulting gap between .setup and .text or .compat
and .text.
CONFIG_EFI_MIXED=n CONFIG_EFI_MIXED=y
```
0x0000 +--------+ --- 0x0000 +--------+ ---
| | | | | |
0x1000 +--------+ | 0x1000 +--------+ |
| | | | | |
| .setup | | | .setup | |
| | | setup_size +--------+ | setup_size
| | | | .compat| |
0x5000 +--------+ | 0x5000 +--------+ |
| .sbat | | | .sbat | |
0x6000 +--------+ --- 0x6000 +--------+ ---
| .text | | .text |
| | | |
... ...
| | | |
0xabb000 +--------+ 0xabb000 +--------+
| .data | | .data |
+--------+ +--------+
```
This change is careful to allocate new space for .sbat and to leave .compat to
continue doing what it did before.
Even though the .sbat data will be much smaller than 0x1000 bytes, the UEFI
standard apparently wants each section to be 4kB aligned with no gaps between
sections so the size of each section except the last needs to be a multiple of
0x1000.
With CONFIG_EFI_MIXED=n (XenServer default)
```
$ objdump -s -j .sbat arch/x86/boot/bzImage
0 .setup 00004000 0000000000001000 0000000000001000 00001000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .sbat 00001000 0000000000005000 0000000000005000 00005000 2**4
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .text 00ab5000 0000000000006000 0000000000006000 00006000 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
3 .data 00000400 0000000000abb000 0000000000abb000 00abb000 2**4
CONTENTS, ALLOC, LOAD, DATA
```
```
$ objdump -s -j .sbat arch/x86/boot/bzImage
arch/x86/boot/bzImage: file format pei-x86-64
Contents of section .sbat:
5000 73626174 2c312c53 42415420 56657273 sbat,1,SBAT Vers
5010 696f6e2c 73626174 2c312c68 74747073 ion,sbat,1,https
5020 3a2f2f67 69746875 622e636f 6d2f7268 ://github.com/rh
5030 626f6f74 2f736869 6d2f626c 6f622f6d boot/shim/blob/m
5040 61696e2f 53424154 2e6d640a 6c696e75 ain/SBAT.md.linu
5050 782e7873 2c312c43 6c6f7564 20536f66 x.xs,1,Cloud Sof
5060 74776172 65204772 6f75702c 6c696e75 tware Group,linu
5070 782c312c 6d61696c 746f3a73 65637572 x,1,mailto:secur
5080 65406369 74726978 2e636f6d 0a000000 [email protected]....
5090 00000000 00000000 00000000 00000000 ................
50a0 00000000 00000000 00000000 00000000 ................
50b0 00000000 00000000 00000000 00000000 ................
...
5fd0 00000000 00000000 00000000 00000000 ................
5fe0 00000000 00000000 00000000 00000000 ................
5ff0 00000000 00000000 00000000 00000000 ................
```
With CONFIG_EFI_MIXED=y
```
Sections:
Idx Name Size VMA LMA File off Algn
0 .setup 00003000 0000000000001000 0000000000001000 00001000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .compat 00001000 0000000000004000 0000000000004000 00004000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .sbat 00001000 0000000000005000 0000000000005000 00005000 2**4
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .text 00ab9000 0000000000006000 0000000000006000 00006000 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
4 .data 00000400 0000000000abf000 0000000000abf000 00abf000 2**4
CONTENTS, ALLOC, LOAD, DATA
```
I have tested that kernels built with CONFIG_EFI_MIXED=y and CONFIG_EFI_MIXED=n
both boot fine.
Signed-off-by: Kevin Lampis <[email protected]>
It can be disabled by passing lockdown_disable on the kernel command-line, only if Secure Boot is not enabled. Signed-off-by: Ross Lagerwall <[email protected]> [ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based): - 260017f `("lsm: use default hook return value in call_int_hook()") simplified the `call_int_hook` macro, so align our usage in `security_lock_kernel_down()`]
…ways enabled unless Secure Boot is disabled and sig_enforce=false is passed. This is implemented by setting CONFIG_MODULE_SIG_FORCE but allowing it to be disabled from the command-line with this patch. The command-line setting is overridden in arch_get_ima_policy() so if Secure Boot is enabled, sig_enforce will always be true. Signed-off-by: Kevin Lampis <[email protected]> Signed-off-by: Ross Lagerwall <[email protected]>
Shim is supposed to create an EFI table containing the MoK configuration which the kernel accesses via the efi_mokvar_* functions. In some cases (possibly due to Xen), this table is not available. The configuration is also made available via EFI variables and the kernel has fallback paths that use the variables if the table is not available. Fall back to using the EFI variable in this case too. With this change, MoK keys correctly get added to the machine keying which is trusted and can be used to verify modules (rather than the untrusted platform keyring). Signed-off-by: Ross Lagerwall <[email protected]>
…abled. By default the kernel does not allow to access (read and/or write) any writeable file in debugfs filesystem. However is useful for user space to access these file for statistics and debugging issues. So, if lockdown mode is enabled provide read-only access to these files. Signed-off-by: Frediano Ziglio <[email protected]>
These hypercall filter should build against these headers. If the version of Xen changes, the headers here should be updated along with the hypercall filter. Signed-off-by: Ross Lagerwall <[email protected]>
Signed-off-by: Chunjie Zhu <[email protected]> [ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based): Dropped xs_kabi_padding evry time UEK_KABI_RESERVE(n) was already present, keeped the hunks that were missing]
Signed-off-by: Chunjie Zhu <[email protected]>
The standard implementation of pci_reset_function() does not try an SBR if there are other sibling devices. This is a common configuration for multi-function devices (e.g., GPUs with a HDMI audio device function). If all the functions are co-assigned to the same domain at the same time, then it is safe to perform an SBR to reset all functions at the same time. Add a "reset" sysfs file with the same interface as the standard one (i.e., write 1 to reset) that will try an SBR if all sibling devices are unbound or bound to pciback. Note that this is weaker than the requirement for functions to be simultaneously co-assigned, but the toolstack is expected to ensure this. Signed-off-by: David Vrabel <[email protected]> [ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based): commit 88801d0 ("xen/pci: Add a function to reset device for xen") slightly reworked how to do a pci_reset, integrate these changes into __pcistub_reset_function(), then call __pcistub_reset_function() instead of pcistub_reset_device_state() ]
not expecting the LUN change. Lest it may trigger udev rules which in turn trigger more LUN change events. Signed-off-by: Gerald Elder-Vass <[email protected]>
Xen cannot determine whether a hypercall originates from userspace or kernel space. A malicious userspace may use Xen to attack the dom0 kernel. To avoid this, instead of allowing all hypercalls verbatim, block all hypercalls by default. Allow through certain hypercalls with checks. Checking involves: 1) Any pointers need to point at valid userspace memory. For this, we also need to know the size of the buffer. 2) Buffers containing pointers need to be bounced to avoid userspace changing the memory after it has been checked. After the hypercall completes, the memory needs to be bounced back. In addition, for the kexec_load call, verify the kernel to maintain Secure Boot integrity. This patch is technical debt and should be replaced by a stable hypervisor ABI and a generic mechanism for the dom0 kernel to check hypercalls made from userspace. Signed-off-by: Ross Lagerwall <[email protected]> Signed-off-by: Frediano Ziglio <[email protected]> [ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based): - 260017f `("lsm: use default hook return value in call_int_hook()") simplified the `call_int_hook` macro, so align our usage in `security_locked_down_nowarn()`]
654f209 to
01e920f
Compare
First rebase of XS9-preview patchqueue on top of linux 6.12.74