Skip to content

Corentin xcpng 9/xenserver rebased patchqueue#1

Draft
CocoElAbricot wants to merge 34 commits intoCorentin-UEK8/u1/basefrom
corentin-xcpng-9/xenserver-rebased-patchqueue
Draft

Corentin xcpng 9/xenserver rebased patchqueue#1
CocoElAbricot wants to merge 34 commits intoCorentin-UEK8/u1/basefrom
corentin-xcpng-9/xenserver-rebased-patchqueue

Conversation

@CocoElAbricot
Copy link
Copy Markdown

First rebase of XS9-preview patchqueue on top of linux 6.12.74

@casasnovas casasnovas assigned casasnovas and unassigned casasnovas Mar 19, 2026
@casasnovas casasnovas self-requested a review March 19, 2026 16:34
@casasnovas
Copy link
Copy Markdown

Hey Corentin,

This is a big chunk of work, thanks for diving into this!

As we discussed on matrix, I think it is worth re-doing the rebase following the practices laid out on:

This will allow you to record information on conflict resolutions, dropped commits, and also double check that your changes build properly, so that the review can be much smoother.

I had started reviewing the branch, so I will paste my unfinished review comments here in case they are helpful when you re-do your work on top of the UEK8 branch:

8c2cd4e net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

Dropped because upstream commit 4b0ebbc ("net: gro: move L3 flush
checks to tcp_gro_receive and udp_gro_receive_segment") is ancestor of your
rebased branch, ack.

f3240e2 x86/xen/time: Reduce Xen timer tick

Dropped because upstream commit ad16248 ("x86/xen/time: Reduce Xen
timer tick") is ancestor of your branch, ack.

97e223a sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails

Sould be dropped because upstream commit fc42892 ("sched/fair: Bump
sd->max_newidle_lb_cost when newidle balance fails") is an ancestor and the original
commit was reverted in 8134361 ("sched/fair: Revert max_newidle_lb_cost bump")

d175d44 disable-mitigations-by-default

I am assuming XenServer folks disabled those mitigations because in the
context of the dom0 kernel they don't apply and incur an unwanted
performance cost?

What's the rationale for not rebasing it?

2d854a0 expose-xsversion

Dropped, though a comment on why would be good for the record. I agree
this is likely not needed, but maybe worth checking it's not used by XAPI
for example.

d076860 blkback-kthread-pid

Conflicts in your rebased branch caused by 217759b ("xen: port block
device access to file"), all good.

1764637 tg3-alloc-repeat
a81ba45 net: Do not scrub ignore_df within the same name space

Why is this one missing on your rebased branch? It applies cleanly.

427d6df enable-fragmention-gre-packets

--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -525,8 +525,8 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,

        gre_build_header(skb, tunnel_hlen, flags, proto,
                         tunnel_id_to_key32(tun_info->key.tun_id),
-                        test_bit(IP_TUNNEL_SEQ_BIT, flags) ?
-                        htonl(atomic_fetch_inc(&tunnel->o_seqno)) : 0);
+                        test_bit(IP_TUNNEL_SEQ_BIT, flags) ? htonl(atomic_fetch_inc(&tunnel->o_seqno)) : 0);
+       skb->ignore_df = 1;

Nit: it would have been nice to not change the coding styles of lines where
you had conflicts for easier review that they haven't changed in, here the
only real change should be the extra line with ignore_df = 1.

a8bbfd1 call-kexec-before-offlining-noncrashing-cpus

--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -361,6 +367,13 @@ void panic(const char *fmt, ...)
         */
        kgdb_panic(buf);

+#ifdef CONFIG_XEN
+       if (xen_initial_domain()) {
+               xen_kexec_exec_t xke = { KEXEC_TYPE_CRASH };
+               if (HYPERVISOR_kexec_op(KEXEC_CMD_kexec, &xke))
+       pr_warn("HYPERVISOR_kexec_op failed\n");
+       }

It looks like you added a change from XenServer with the extra conditional
plus pr_warn(), why?

bb88d6e lpfc-fallback-to-sli-2

The driver changed significantly enough from the base branch that I
understand it's been dropped. On top of that the original author says it
is temporary band-aid to make a specific device work, so probably not
needed.

97379fb fix-hypercall-preemption

Why are you adding back all this code to common.c in your rebased commit
df7c38a ("fix-hypercall-preemption")? The code has been moved into
arch/x86/xen/enlighten_pv.c in d30f467 ("x86/xen: Move Xen upcall
handler").

I'm surprised this would build fine as we have multiple global symbols with
the same name that will get linked into vmlinux. And the one in
enlighten_pv.c does not contain the fix.

Now, if you really needed to add a bunch of code as a pre-requisite to a
commit you are rebasing/cherry-picking, I think it would be worth isolating
in it in their own commit with a good description.

b6653be skip cpuidle driver init if cpuidle function disabled

--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -407,6 +407,18 @@ struct klp_modinfo {
 };
 #endif

+#ifdef CONFIG_KALLSYMS
+/*
+ * /sys/module/foo/notes/.section.name gives contents of SHT_NOTE sections.
+ */
+
+struct module_notes_attrs {
+       struct kobject *dir;
+       unsigned int notes;
+       struct bin_attribute attrs[];
+};
+#endif
+

In the above hunk, please keep the annotation __counted_by(notes) which
was present on the original struct (removed in a subsequent hunk), as it
allows runtime bounds checking on flexible arrays, which is something we
might want to enable if our dom0 kernel if not already enabled.

f0c7aa5 CA-392853: fix kdump kernel cannot find ACPI RSDP
d2100f8 CA-415346: export module symbol offsets
669c430 x86,xen: correct dma_get_required_mask() for Xen PV guests
7f78a04 pci: export pci_probe_reset_function()
e16408f xen-pciback: provide a "reset" sysfs file to try harder at an SBR

The original code patched pcistub_device_release and
pcistub_put_pci_dev when yours is patching another function.

Upon closer look, upstream commit 770352e ("xen-pciback: provide a
"reset" sysfs file to try harder at an SBR") already is an ancestor of your
branch, so you should drop this rebased commit.

Vates Git Importer and others added 4 commits March 20, 2026 10:19
toolstack.  Relative to the VBD, the xenstore key is: queue-NNN/kthread-pid
where NNN is the queue number starting from 0.

Signed-off-by: Ross Lagerwall <[email protected]>
interface. These may fail if the interface is reconfigured some time
after boot due to external memory fragmentation. These buffers must be
contiguous (for DMA) and they cannot be pre-allocated at boot because
they are dependent on the number of queues, which is configurable (apart
from wasting memory if the interface is not up).  Use the
__GFP_RETRY_MAYFAIL flag when performing these allocations so that
direct reclaim and compaction has a better chance of working, making the
allocations less likely to fail.

Signed-off-by: Ross Lagerwall <[email protected]>
Commit 239c78d ("net: clear local_df when passing skb between namespaces")
cleared local_df/ignore_df for tunnels. This broke, last resort, fragmentation
of GRE tunneled packets.

Clear ignore_df, only when packet is being injected to another namespace.

Signed-off-by: Anoob Soman <[email protected]>
@CocoElAbricot CocoElAbricot force-pushed the corentin-xcpng-9/xenserver-rebased-patchqueue branch from 86b7451 to 07f59c1 Compare March 26, 2026 13:55
@CocoElAbricot CocoElAbricot changed the base branch from corentin-xcpng-9/base to Corentin-UEK8/u1/base March 26, 2026 13:56
@CocoElAbricot
Copy link
Copy Markdown
Author

Het @casasnovas Here is the updated patchqueue on top of UEK8/u1
The original patchqueue is in the branch kernel/xenserver-6.6.98-19/base

Dropped commits:
commit b3c34f5ee600 net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment
already upstream in: 4b0ebbc

commit 97e223a sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails
reverted in upstream by: 8134361 ("sched/fair: Revert max_newidle_lb_cost bump")

commit f3240e2 x86/xen/time: Reduce Xen timer tick
Dropped because upstream commit ad16248 ("x86/xen/time: Reduce Xen timer tick")

commit bb88d6e lpfc-fallback-to-sli-2
The driver changed significantly enough from the base branch that I
understand it's been dropped. On top of that the original author says it
is temporary band-aid to make a specific device work, so probably not
needed.

commits 02f7f1d, b152f10, e0fc31b, de9285f, ff47f40 and ac5a534 are part of GFS2 related patchset, have been dropped because XCP-ng don't support GFS2

Copy link
Copy Markdown

@casasnovas casasnovas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, my early comments inlined.

Can you also explain what sort of testing you did so far (build tested + dom0 boot with this kernel)?

bool inhcall = __this_cpu_read(xen_in_preemptible_hcall);

__this_cpu_write(xen_in_preemptible_hcall, false);
raw_cpu_write(xen_in_preemptible_hcall, false);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain why you made this change here? I don't find it in the original XS patch.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed This changed to match the original patch

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't necessarily saying it should be reverted, it was more to understand the rationale as maybe it is indeed required, but we need to have an audit trail of why we thought it was.

* The caller checked the pos and count against our size.
*/
memcpy(buf, bin_attr->private + pos, count);
return count;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you inadvertently added this module_notes_read function from your conflict resolution.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed module_notes_read function

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think there's now one extra new line I think (I am being picky so that we don't get unnecessary changes to decrease likeliness of conflicts with future patches).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now it should be correct


list_for_each_entry(pdev, &dev->bus->devices, bus_list) {
if (pdev != dev && (!pdev->driver
|| strcmp(pdev->driver->name, "pciback")))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've altered the indentation from original commit here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still isn't right and changed from the original commit AFAICS.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's now correct

if (dev->quarantined)
dev->unsafe = 0;
if (!xen_pv_domain())
xen_reset_device(dev);
Copy link
Copy Markdown

@casasnovas casasnovas Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your extra commit description you mention you've inlined these, but can you explain why?

After some digging and AIUI, upstream commit 88801d0 ("xen/pci: Add a function to reset device for xen") slightly reworked how to do a pci_reset and you're trying to mimic it here - why not simply calling pcistub_reset_device_state directly?

Also I think your inlining is not correct, the conditional for !xen_pv_domain() calling xen_reset_device() was called unconditionally, not only when the return of __pci_reset_function_locked is zero (so using the original helper instead of deleting + inlining here should also fix this manual error).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an extra suggestion, you could also rename pcistub_reset_device_state, to be defensive against future addition of calls to that function, i.e. by renaming it, we'll get build errors allowing us to call our own wrappers or decide to call the original one, but it won't go silently unnoticed.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed __pcistub_reset_function() and add a call to pcistub_reset_device_state() instead of deleting it.

menuconfig MODULES
bool "Enable loadable module support"
modules
select EXECMEM
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you remove this select?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a mistake, reverted the change.

@casasnovas
Copy link
Copy Markdown

casasnovas commented Mar 26, 2026

Het @casasnovas Here is the updated patchqueue on top of UEK8/u1

Dropped commits:
...

What about these two?

  • 1e245e7 ("pv-iommu-support.patch")
  • 664ea14 ("Reserve the crashkernel region in the E820. This prevents oustanding DMA")

In:

7f2b042 ("pciback-mask-root-port-comp-timeout.patch")

Can you prefix your own commit descriptions with your name and enclose them in square brackets? Also, please wrap the lines so they are more readable :)

Same comment for:

  • 75761b9 ("xen-pciback: provide a "reset" sysfs file to try harder at an SBR")
  • 9dcf6f3 ("CA-415346: export module symbol offsets")
  • b6142b9 ("At some point since kernel 4.19 the preemption of hypercalls has fallen into a s…")
  • 6de9272 ("Enable lockdown integrity by default")
  • 9b7239b ("Filter and check hypercalls from userspace")
  • 1ebe51c ("module-hash-revocation.patch") and
  • 157ef6b ("CP-46343: common data structure padding")

where you're using different styles/conventions.

As another general comment when you add extra information to the commit description, explain the context and the why rather than the what (the what is usually already visible from the diff, but your intent and the information you had at the time you made a decision are important, like referencing upstream commits that lead to your conflict).

6de9272 ("Enable lockdown integrity by default")

A reference to the commit changing the call_int_hook would have been handy during review, please add it to your commit description: 260017f ("lsm: use default hook return value in call_int_hook()").

Example extra commit description ticking all boxes for me:

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
- 260017f31a8c `("lsm: use default hook return value in call_int_hook()") simplified the `call_int_hook`
macro, so align our usage in `security_lock_kernel_down()`]

157ef6b ("CP-46343: common data structure padding")

After giving it a bit of thought, we should probably go all-in with the UEK way, so maybe converting the xs_padding pointers to the UEK macros so that we have a consistent way to add padding bytes would be prefered.

I haven't had a chance to check each hunk and missed xs_padding to make sure there is a UEK one, will do when we agree on how to go forward.

Vates Git Importer and others added 4 commits March 27, 2026 10:13
fallen into a state of disrepair, partly due to dead code, partly
as a result of the dynamic preemption feature whereby kernel
preemption mode is selected at kernel boot time.  This patch
re-enables preemptible hypercalls and will do the following:

1) Update specific feature selection from !PREEMPTION to !PREEMPT.
The following table shows the relationship between different preemption
features and their indicators/selectors (Y = "=Y", N = "is not set",
. = absent):

                            | np-s | np-d | vp-s | vp-d | fp-s | fp-d
    CONFIG_PREEMPT_DYNAMIC      N      Y      N      Y      N      Y
         CONFIG_PREEMPTION      .      Y      .      Y      Y      Y
            CONFIG_PREEMPT      N      N      N      N      Y      Y
  CONFIG_PREEMPT_VOLUNTARY      N      N      Y      Y      N      N
       CONFIG_PREEMPT_NONE      Y      Y      N      N      N      N

Unless PREEMPT is set, we need to enable the fixes.

2) Update flag access from __this_cpu_XXX() to raw_cpu_XXX().
The long-running hypercalls are flagged by way of a per-cpu variable
which is set before and cleared after the relevant calls.  This elicits
a warning "BUG: using __this_cpu_write() in preemptible [00000000] code",
but xen_pv_evtchn_do_upcall() deals specifically with this.

3) Update irqentry_exit_cond_resched() to raw_irqentry_exit_cond_resched().
The code will call irqentry_exit_cond_resched() if the flag (as noted
above) is set, but the dynamic preemption feature will livepatch that
function to a no-op unless full preemption is selected.  The code is
therefore updated to call raw_irqentry_exit_cond_resched().

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
following commit d30f467 ("x86/xen: Move Xen upcall handler"),
this code now lives in arch/x86/xen/enlighten_pv.c. The fix is applied
there rather than arch/x86/entry/common.c where it originally resided.]
When kernel runs as Xen PV dom0, it does not have access to the required MSRs
and so cpuidle is explicitly disabled in xen_arch_setup(). Skip intel_idle
initialization early if cpuidle is disabled to avoid spurious warnings reading
MSRs.

[   11.706584]  <TASK>
[   11.706629]  ? ex_handler_msr+0x128/0x140
[   11.706716]  ? fixup_exception+0x85/0x340
[   11.706803]  ? exc_general_protection+0x136/0x400
[   11.706907]  ? asm_exc_general_protection+0x22/0x30
[   11.707055]  ? xen_do_read_msr+0x81/0xa0
[   11.707182]  xen_read_msr+0x20/0x30
[   11.707340]  intel_idle_init+0xa5c/0xc90
[   11.707642]  ? _raw_spin_unlock_irqrestore+0x18/0x20
[   11.708159]  ? __pfx_intel_idle_init+0x10/0x10
[   11.708451]  do_one_initcall+0x41/0x2e0
[   11.708580]  kernel_init_freeable+0x1c0/0x2c0
[   11.708748]  ? __pfx_kernel_init+0x10/0x10
[   11.709013]  kernel_init+0x16/0x1b0
[   11.709193]  ret_from_fork+0x2d/0x50
[   11.709389]  ? __pfx_kernel_init+0x10/0x10
[   11.709656]  ret_from_fork_asm+0x1b/0x30
[   11.709844]  </TASK>

Signed-off-by: Chunjie Zhu <[email protected]>
kexec/kdump must pass acpi_rsdp (physical address of ACPI RSDP table) to the
crash kernel, especially in EFI case, otherwise, the crash kernel fails to
locate ACPI RSDP table. Consequently, ACPI init runs into error, and APIC
mmio read page fault happens, finally crash kernel gets stuck.

Signed-off-by: Chunjie Zhu <[email protected]>
@CocoElAbricot CocoElAbricot force-pushed the corentin-xcpng-9/xenserver-rebased-patchqueue branch 2 times, most recently from 4be73cd to cc9d604 Compare March 27, 2026 14:42
@CocoElAbricot
Copy link
Copy Markdown
Author

I have updated the commits description to have uniform style.

1e245e7 ("pv-iommu-support.patch")
664ea14 ("Reserve the crashkernel region in the E820. This prevents oustanding DMA")
These 2 commits are part of Xenserver specific PVIOMMU patchset, as discussed with the team earlier they are not needed in XCP-ng

@CocoElAbricot CocoElAbricot force-pushed the corentin-xcpng-9/xenserver-rebased-patchqueue branch from cc9d604 to 654f209 Compare March 27, 2026 14:51
chunjiez and others added 9 commits March 27, 2026 16:26
  crashdump needs it to navigate around vmcore to dump driver
  information

Signed-off-by: Chunjie Zhu <[email protected]>

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
re-introduce __counted_by(notes) annotation which
was present on the original struct (removed in a subsequent hunk), as it
allows runtime bounds checking on flexible arrays]
On systems where DMA addresses and physical addresses are not 1:1
(such as Xen PV guests), the generic dma_get_required_mask() will not
return the correct mask (since the xen swiotlb does not provide a
get_required_mask callback, it falls back to using 32 bits).

Some device drivers (such as mpt3sas) use dma_get_required_mask() to set
device DMA masks to allow them to use only 32-bit DMA addresses in
hardware structures.  This results in unnecessary use of the SWIOTLB if
DMA addresses are more than 32-bits, impacting performance
significantly.

Under Xen, the required DMA mask can then always be set to 64-bits.

Signed-off-by: David Vrabel <[email protected]>
The xen-pciback needs to test if a per-function reset is available so
it can provide an interface for a safe bus reset instead.

Signed-off-by: David Vrabel <[email protected]>
…g at any

time.

Any access by Dom0, DomU or Xen to the PCI device may cause a Fatal PCI Express
error. Root ports can be configured to signal BIOS code that such an error has
occured and the BIOS code can injected NMI interrupts after detecting the
Fatal PCI Express error.

The injected NMI interrupt may lead to a host crash as Xen may not be able to
determine the cause of the NMI interrupt.

Thus a guest with a PCI device passed through may be able to trigger a host
crash.

Disabling Fatal Error reporting in the PCI Express root port prevents the
error from propogating to system firmware and removes that attack vector for
triggering host crashes.
some devices to fallback to an alternate reset method.

Signed-off-by: Ross Lagerwall <[email protected]>
Vates Git Importer and others added 17 commits March 27, 2026 16:26
"slabs", each of which is physically contiguous, in 32-bit address space,
to service DMA for devices that are limited to a 32-bit address space.
Ideally we want a total of 128M (a separate patch updates this from the
kernel default of 64M), but we are content with less, and will retry
three times, halving the allocation each time, before giving up.

A unique Xen problem results from memory being exchanged with the
hypervisor, whereby the kernel offers an amount of memory to the
hypervisor, and the hypervisor returns different memory, in this case
contiguous, low-address memory.  This enables VMs (including dom0) to
remain of the exact same size, which has significant benefits in terms
of VM and memory management.

If, however, not enough of the desired memory is available, the hypervisor
will return what it's got, and we have a partial success.  In this case
retrying is guaranteed to fail, since the hypervisor has already exchanged
everything it can, and we would end up giving up.  To avoid this we
record how much memory we actually got by way of an extra paramemeter
("contig_pages"), and if this is enough (at time of writing we whink
down to 1/8th of the default request is OK) we take it and run.
In kernel version v4.19.19, the default number of RSS queues is
determined using num_online_cpus(). However, in kernel v6.6.x, the logic
is updated to use CPU topology information to determine the number of
physical cores.

For Xen dom0 kernels, this change introduces a problem: the dom0 kernel
cannot retrieve valid CPU topology data from Xen. As a result, the
default number of RSS queues is always set to 1, which degrades
networking performance of guests using xen-netback.

This patch introduces a fallback mechanism: if CPU topology information
is unavailable, it reverts to using num_online_cpus(). To avoid excessive
queue allocation, the number of RSS queues is capped at 8.

Signed-off-by: Ming Lu <[email protected]>
After some recent upstream changes a lot of what used to be done in build.c is
now done in header.S and setup.ld.

The existing code lays out the PE header like this

     CONFIG_EFI_MIXED=n                 CONFIG_EFI_MIXED=y
```
   0x0000 +--------+ ---             0x0000 +--------+ ---
          |        |  |                     |        |  |
   0x1000 +--------+  |              0x1000 +--------+  |
          |        |  |                     |        |  |
          | .setup |  |                     | .setup |  |
          |        |  | setup_size          +--------+  | setup_size
          |        |  |                     | .compat|  |
   0x4000 +--------+ ---             0x4000 +--------+ ---
          | .text  |                        | .text  |
          |        |                        |        |
             ...                               ...
          |        |                        |        |
 0xabb000 +--------+               0xabb000 +--------+
          | .data  |                        | .data  |
          +--------+                        +--------+
```
The .text section doesn't move and compat eats backwards into .setup. This
looks a little weird to me but ok. The code in build.c typically copies more
than 0x3000 bytes of setup data so it sure looks like putting .compat at 0x3000
is causing something to be overwritten. But I have left this behavior alone.

The changes in this PR extend setup_size defined in setup.ld by 0x1000 bytes
and then puts .sbat in the resulting gap between .setup and .text or .compat
and .text.

     CONFIG_EFI_MIXED=n                 CONFIG_EFI_MIXED=y
```
   0x0000 +--------+ ---             0x0000 +--------+ ---
          |        |  |                     |        |  |
   0x1000 +--------+  |              0x1000 +--------+  |
          |        |  |                     |        |  |
          | .setup |  |                     | .setup |  |
          |        |  | setup_size          +--------+  | setup_size
          |        |  |                     | .compat|  |
   0x5000 +--------+  |              0x5000 +--------+  |
          | .sbat  |  |                     | .sbat  |  |
   0x6000 +--------+ ---             0x6000 +--------+ ---
          | .text  |                        | .text  |
          |        |                        |        |
             ...                               ...
          |        |                        |        |
 0xabb000 +--------+               0xabb000 +--------+
          | .data  |                        | .data  |
          +--------+                        +--------+
```
This change is careful to allocate new space for .sbat and to leave .compat to
continue doing what it did before.

Even though the .sbat data will be much smaller than 0x1000 bytes, the UEFI
standard apparently wants each section to be 4kB aligned with no gaps between
sections so the size of each section except the last needs to be a multiple of
0x1000.

With CONFIG_EFI_MIXED=n (XenServer default)
```
$ objdump -s -j .sbat arch/x86/boot/bzImage
  0 .setup        00004000  0000000000001000  0000000000001000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .sbat         00001000  0000000000005000  0000000000005000  00005000  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .text         00ab5000  0000000000006000  0000000000006000  00006000  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  3 .data         00000400  0000000000abb000  0000000000abb000  00abb000  2**4
                  CONTENTS, ALLOC, LOAD, DATA
```

```
$ objdump -s -j .sbat arch/x86/boot/bzImage

arch/x86/boot/bzImage:     file format pei-x86-64

Contents of section .sbat:
 5000 73626174 2c312c53 42415420 56657273  sbat,1,SBAT Vers
 5010 696f6e2c 73626174 2c312c68 74747073  ion,sbat,1,https
 5020 3a2f2f67 69746875 622e636f 6d2f7268  ://github.com/rh
 5030 626f6f74 2f736869 6d2f626c 6f622f6d  boot/shim/blob/m
 5040 61696e2f 53424154 2e6d640a 6c696e75  ain/SBAT.md.linu
 5050 782e7873 2c312c43 6c6f7564 20536f66  x.xs,1,Cloud Sof
 5060 74776172 65204772 6f75702c 6c696e75  tware Group,linu
 5070 782c312c 6d61696c 746f3a73 65637572  x,1,mailto:secur
 5080 65406369 74726978 2e636f6d 0a000000  [email protected]....
 5090 00000000 00000000 00000000 00000000  ................
 50a0 00000000 00000000 00000000 00000000  ................
 50b0 00000000 00000000 00000000 00000000  ................
...
 5fd0 00000000 00000000 00000000 00000000  ................
 5fe0 00000000 00000000 00000000 00000000  ................
 5ff0 00000000 00000000 00000000 00000000  ................
 ```

With CONFIG_EFI_MIXED=y
```
Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .setup        00003000  0000000000001000  0000000000001000  00001000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .compat       00001000  0000000000004000  0000000000004000  00004000  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .sbat         00001000  0000000000005000  0000000000005000  00005000  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .text         00ab9000  0000000000006000  0000000000006000  00006000  2**4
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  4 .data         00000400  0000000000abf000  0000000000abf000  00abf000  2**4
                  CONTENTS, ALLOC, LOAD, DATA
```

I have tested that kernels built with CONFIG_EFI_MIXED=y and CONFIG_EFI_MIXED=n
both boot fine.

Signed-off-by: Kevin Lampis <[email protected]>
It can be disabled by passing lockdown_disable on the kernel
command-line, only if Secure Boot is not enabled.

Signed-off-by: Ross Lagerwall <[email protected]>

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
- 260017f `("lsm: use default hook return value in call_int_hook()") simplified the `call_int_hook`
macro, so align our usage in `security_lock_kernel_down()`]
…ways

enabled unless Secure Boot is disabled and sig_enforce=false is passed.

This is implemented by setting CONFIG_MODULE_SIG_FORCE but allowing it to be
disabled from the command-line with this patch.  The command-line setting is
overridden in arch_get_ima_policy() so if Secure Boot is enabled, sig_enforce
will always be true.

Signed-off-by: Kevin Lampis <[email protected]>
Signed-off-by: Ross Lagerwall <[email protected]>
Shim is supposed to create an EFI table containing the MoK configuration which
the kernel accesses via the efi_mokvar_* functions. In some cases (possibly due
to Xen), this table is not available. The configuration is also made available
via EFI variables and the kernel has fallback paths that use the variables if
the table is not available. Fall back to using the EFI variable in this case
too.

With this change, MoK keys correctly get added to the machine keying which is
trusted and can be used to verify modules (rather than the untrusted platform
keyring).

Signed-off-by: Ross Lagerwall <[email protected]>
…abled.

By default the kernel does not allow to access (read and/or write) any
writeable file in debugfs filesystem.
However is useful for user space to access these file for statistics and
debugging issues.
So, if lockdown mode is enabled provide read-only access to these files.

Signed-off-by: Frediano Ziglio <[email protected]>
These hypercall filter should build against these headers. If the version of
Xen changes, the headers here should be updated along with the hypercall
filter.

Signed-off-by: Ross Lagerwall <[email protected]>
Signed-off-by: Chunjie Zhu <[email protected]>

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
Dropped xs_kabi_padding evry time UEK_KABI_RESERVE(n) was already present, keeped the hunks that were missing]
The standard implementation of pci_reset_function() does not try an
SBR if there are other sibling devices.  This is a common
configuration for multi-function devices (e.g., GPUs with a HDMI audio
device function).

If all the functions are co-assigned to the same domain at the same
time, then it is safe to perform an SBR to reset all functions at the
same time.

Add a "reset" sysfs file with the same interface as the standard one
(i.e., write 1 to reset) that will try an SBR if all sibling devices
are unbound or bound to pciback.

Note that this is weaker than the requirement for functions to be
simultaneously co-assigned, but the toolstack is expected to ensure
this.

Signed-off-by: David Vrabel <[email protected]>

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
commit 88801d0 ("xen/pci: Add a function to reset device for xen") slightly reworked how to do a pci_reset, integrate these changes into __pcistub_reset_function(), then call __pcistub_reset_function() instead of pcistub_reset_device_state() ]
not expecting the LUN change. Lest it may trigger udev rules which in
turn trigger more LUN change events.

Signed-off-by: Gerald Elder-Vass <[email protected]>
Xen cannot determine whether a hypercall originates from userspace or kernel
space. A malicious userspace may use Xen to attack the dom0 kernel.

To avoid this, instead of allowing all hypercalls verbatim, block all
hypercalls by default. Allow through certain hypercalls with checks.

Checking involves:

1) Any pointers need to point at valid userspace memory. For this, we also need
   to know the size of the buffer.
2) Buffers containing pointers need to be bounced to avoid userspace changing
   the memory after it has been checked. After the hypercall completes, the
   memory needs to be bounced back.

In addition, for the kexec_load call, verify the kernel to maintain Secure Boot
integrity.

This patch is technical debt and should be replaced by a stable hypervisor ABI
and a generic mechanism for the dom0 kernel to check hypercalls made from
userspace.

Signed-off-by: Ross Lagerwall <[email protected]>
Signed-off-by: Frediano Ziglio <[email protected]>

[ Corentin: rebase from XS-9 kernel (v6.6.98 based) to UEK8 kernel (v6.12 based):
- 260017f `("lsm: use default hook return value in call_int_hook()") simplified the `call_int_hook`
macro, so align our usage in `security_locked_down_nowarn()`]
@CocoElAbricot CocoElAbricot force-pushed the corentin-xcpng-9/xenserver-rebased-patchqueue branch from 654f209 to 01e920f Compare March 27, 2026 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants