Skip to content

Diff xenserver-4.19.19-8.0.45 xenserver-4.19.19-8.0.46#2

Draft
tescande wants to merge 190 commits intokernel/xenserver-4.19.19-8.0.45/basefrom
kernel/xenserver-4.19.19-8.0.46/base
Draft

Diff xenserver-4.19.19-8.0.45 xenserver-4.19.19-8.0.46#2
tescande wants to merge 190 commits intokernel/xenserver-4.19.19-8.0.45/basefrom
kernel/xenserver-4.19.19-8.0.46/base

Conversation

@tescande
Copy link

@tescande tescande commented Mar 27, 2026

Warning

NOT intended to be merged into kernel/xenserver-4.19.19-8.0.45/base, this is solely to review the new patch-queue.

To review the new patch-set from prior version:

git-review-rebase origin/kernel/xenserver-4.19.19-8.0.45/pre-base..origin/kernel/xenserver-4.19.19-8.0.45/base origin/kernel/xenserver-4.19.19-8.0.46/pre-base..origin/kernel/xenserver-4.19.19-8.0.46/base

chunjiez and others added 30 commits February 24, 2018 23:51
…1 [PATCH] SUNRPC: Replace direct task wakeups from softirq context

  Replace the direct task wakeups from inside a softirq context with
  wakeups from a process context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
  The queue timer function, which walks the RPC queue in order to locate
  candidates for waking up is one of the current constraints against
  removing the bh-safe queue spin locks. Replace it with a delayed
  work queue, so that we can do the actual rpc task wake ups from an
  ordinary process context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
1. nbd_put takes the mutex and drops nbd->ref to 0. It then does
idr_remove and drops the mutex.

2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails
to find an existing device, so it does nbd_dev_add.

3. just before the nbd_put could call nbd_dev_remove or not finished
totally, but if nbd_dev_add try to add_disk, we can hit:

debugfs: Directory 'nbd1' with parent 'block' already present!

This patch will make sure all the disk add/remove stuff are done
by holding the nbd_index_mutex lock.

Reported-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 4f8943f ("SUNRPC: Replace direct task wakeups from
softirq context") there has been a race to the value of the sk_err if both
XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set.  In that case,
we may end up losing the sk_err value that existed when xs_error_report was
called.

Fix this by reverting to the previous behavior: instead of using SO_ERROR
to retrieve the value at a later time (which might also return sk_err_soft),
copy the sk_err value onto struct sock_xprt, and use that value to wake
pending tasks.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 4f8943f ("SUNRPC: Replace direct task wakeups from softirq context")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We already do this for the most part, except in timeout and clear_req.
For the timeout case we take the lock after we grab a ref on the config,
but that isn't really necessary because we're safe to touch the cmd at
this point, so just move the order around.

For the clear_req cause this is initiated by the user, so again is safe.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hit the following warning in production

print_req_error: I/O error, dev nbd0, sector 7213934408 flags 80700
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 25 PID: 32407 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60
Workqueue: knbd-recv recv_work [nbd]
RIP: 0010:refcount_sub_and_test_checked+0x53/0x60
Call Trace:
 blk_mq_free_request+0xb7/0xf0
 blk_mq_complete_request+0x62/0xf0
 recv_work+0x29/0xa1 [nbd]
 process_one_work+0x1f5/0x3f0
 worker_thread+0x2d/0x3d0
 ? rescuer_thread+0x340/0x340
 kthread+0x111/0x130
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x1f/0x30
---[ end trace b079c3c67f98bb7c ]---

This was preceded by us timing out everything and shutting down the
sockets for the device.  The problem is we had a request in the queue at
the same time, so we completed the request twice.  This can actually
happen in a lot of cases, we fail to get a ref on our config, we only
have one connection and just error out the command, etc.

Fix this by checking cmd->status in nbd_read_stat.  We only change this
under the cmd->lock, so we are safe to check this here and see if we've
already error'ed this command out, which would indicate that we've
completed it as well.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.

Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4 ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.

Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
bdget_disk needs to be paired with bdput to not leak a reference
on the block device inode.

Fixes: 08ba91e ("nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a race between iterating over requests in
nbd_clear_que() and completing requests in recv_work(),
which can lead to double completion of a request.

To fix it, flush the recv worker before iterating over
the requests and don't abort the completed request
while iterating.

Fixes: 96d97e1 ("nbd: clear_sock on netlink disconnect")
Reported-by: Jiang Yadong <jiangyadong@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While handling a response message from server, nbd_read_stat() will
try to get request by tag, and then complete the request. However,
this is problematic if nbd haven't sent a corresponding request
message:

t1                      t2
                        submit_bio
                         nbd_queue_rq
                          blk_mq_start_request
recv_work
 nbd_read_stat
  blk_mq_tag_to_rq
 blk_mq_complete_request
                          nbd_send_cmd

Thus add a new cmd flag 'NBD_CMD_INFLIGHT', it will be set in
nbd_send_cmd() and checked in nbd_read_stat().

Noted that this patch can't fix that blk_mq_tag_to_rq() might
return a freed request, and this will be fixed in following
patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit cddce01 ("nbd: Aovid double completion of a request")
try to fix that nbd_clear_que() and recv_work() can complete a
request concurrently. However, the problem still exists:

t1                    t2                     t3

nbd_disconnect_and_put
 flush_workqueue
                      recv_work
                       blk_mq_complete_request
                        blk_mq_complete_request_remote -> this is true
                         WRITE_ONCE(rq->state, MQ_RQ_COMPLETE)
                          blk_mq_raise_softirq
                                             blk_done_softirq
                                              blk_complete_reqs
                                               nbd_complete_rq
                                                blk_mq_end_request
                                                 blk_mq_free_request
                                                  WRITE_ONCE(rq->state, MQ_RQ_IDLE)
  nbd_clear_que
   blk_mq_tagset_busy_iter
    nbd_clear_req
                                                   __blk_mq_free_request
                                                    blk_mq_put_tag
     blk_mq_complete_request -> complete again

There are three places where request can be completed in nbd:
recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they
all hold cmd->lock before completing the request, it's easy to
avoid the problem by setting and checking a cmd flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit bfe9b9d ("cdc_ether: Improve ZTE MF823/831/910 handling")
introduces a workaround for certain ZTE modems reporting invalid MAC
addresses over CDC-ECM.
The same issue was present on their RNDIS interface,which was fixed in
commit a5a18bd ("rndis_host: Set valid random MAC on buggy devices").

However, internal modem of ZTE MF286R router, on its RNDIS interface, also
exhibits a second issue fixed already in CDC-ECM, of the device not
respecting configured random MAC address. In order to share the fixup for
this with rndis_host driver, export the workaround function, which will
be re-used in the following commit in rndis_host.

Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Bjørn Mork <bjorn@mork.no>
Cc: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Certain ZTE modems, namely: MF823. MF831, MF910, built-in modem from
MF286R, expose both CDC-ECM and RNDIS network interfaces.
They have a trait of ignoring the locally-administered MAC address
configured on the interface both in CDC-ECM and RNDIS part,
and this leads to dropping of incoming traffic by the host.
However, the workaround was only present in CDC-ECM, and MF286R
explicitly requires it in RNDIS mode.

Re-use the workaround in rndis_host as well, to fix operation of MF286R
module, some versions of which expose only the RNDIS interface. Do so by
introducing new flag, RNDIS_DRIVER_DATA_DST_MAC_FIXUP, and testing for it
in rndis_rx_fixup. This is required, as RNDIS uses frame batching, and all
of the packets inside the batch need the fixup. This might introduce a
performance penalty, because test is done for every returned Ethernet
frame.

Apply the workaround to both "flavors" of RNDIS interfaces, as older ZTE
modems, like MF823 found in the wild, report the USB_CLASS_COMM class
interfaces, while MF286R reports USB_CLASS_WIRELESS_CONTROLLER.

Suggested-by: Bjørn Mork <bjorn@mork.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reporting of bogus MAC addresses and ignoring configuration of new
destination address wasn't observed outside of a range of ZTE devices,
among which this seems to be the common bug. Align rndis_host driver
with implementation found in cdc_ether, which also limits this workaround
to ZTE devices.

Suggested-by: Bjørn Mork <bjorn@mork.no>
Cc: Kristian Evensen <kristian.evensen@gmail.com>
Cc: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Lech Perczak <lech.perczak@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rather than resetting state variables in socket state_change() callback,
do it in the sunrpc TCP connect function itself.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commands get stuck while Host NVMe-oF controller is in reconnect state.
The controller enters into reconnect state when it loses connection with
the target.  It tries to reconnect every 10 seconds (default) until
a successful reconnect or until the reconnect time-out is reached.
The default reconnect time out is 10 minutes.

Applications are expecting commands to complete with success or error
within a certain timeout (30 seconds by default).  The NVMe host is
enforcing that timeout while it is connected, but during reconnect the
timeout is not enforced and commands may get stuck for a long period or
even forever.

To fix this long delay due to the default timeout, introduce new
"fast_io_fail_tmo" session parameter.  The timeout is measured in seconds
from the controller reconnect and any command beyond that timeout is
rejected.  The new parameter value may be passed during 'connect'.
The default value of -1 means no timeout (similar to current behavior).

Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
And rename them to avoid name conflicts with the original kpatch ones.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
If handling logics of watch events are slower than the events enqueue
logic and the events can be created from the guests, the guests could
trigger memory pressure by intensively inducing the events, because it
will create a huge number of pending events that exhausting the memory.
This is known as XSA-349.

Fortunately, some watch events could be ignored, depending on its
handler callback.  For example, if the callback has interest in only one
single path, the watch wouldn't want multiple pending events.  Or, some
watches could ignore events to same path.

To let such watches to volutarily help avoiding the memory pressure
situation, this commit introduces new watch callback, 'will_handle'.  If
it is not NULL, it will be called for each new event just before
enqueuing it.  Then, if the callback returns false, the event will be
discarded.  No watch is using the callback for now, though.

Signed-off-by: Author Redacted <security@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Some code does not directly make 'xenbus_watch' object and call
'register_xenbus_watch()' but use 'xenbus_watch_path()' instead.  This
commit adds support of 'will_handle' callback in the
'xenbus_watch_path()' and it's wrapper, 'xenbus_watch_pathfmt()'.

Signed-off-by: Author Redacted <security@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
This commit adds support of the 'will_handle' watch callback for
'xen_bus_type' users.

Signed-off-by: Author Redacted <security@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
This commit adds a counter of pending messages for each watch in the
struct.  It is used to skip unnecessary pending messages lookup in
'unregister_xenbus_watch()'.  It could also be used in 'will_handle'
callback.

Signed-off-by: Author Redacted <security@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
'xenbus_backend' watches 'state' of devices, which is writable by
guests.  Hence, if guests intensively updates it, dom0 will have lots of
pending events that exhausting memory of dom0.  In other words, guests
can trigger dom0 memory pressure.  This is known as XSA-349.  However,
the watch callback of it, 'frontend_changed()', reads only 'state', so
doesn't need to have the pending events.

To avoid the problem, this commit disallows pending watch messages for
'xenbus_backend' using the 'will_handle()' watch callback.

Signed-off-by: Author Redacted <security@xen.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
[ Upstream commit d742db7 ]

On some architectures (like ARM), virt_to_gfn cannot be used for
vmalloc'd memory because of its reliance on virt_to_phys. This patch
introduces a check for vmalloc'd addresses and obtains the PFN using
vmalloc_to_pfn in that case.

Signed-off-by: Simon Leiner <simon@leiner.me>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/20200825093153.35500-1-simon@leiner.me
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1d5c76e upstream.

There's no reason to request physically contiguous memory for those
allocations.

[boris: added CC to stable]

Cc: stable@vger.kernel.org
Reported-by: Ian Jackson <ian.jackson@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 589b728 upstream.

Clang warns:

../drivers/block/xen-blkfront.c:1117:4: warning: misleading indentation;
statement is not part of the previous 'if' [-Wmisleading-indentation]
                nr_parts = PARTS_PER_DISK;
                ^
../drivers/block/xen-blkfront.c:1115:3: note: previous statement is here
                if (err)
                ^

This is because there is a space at the beginning of this line; remove
it so that the indentation is consistent according to the Linux kernel
coding style and clang no longer warns.

While we are here, the previous line has some trailing whitespace; clean
that up as well.

Fixes: c80a420 ("xen-blkfront: handle Xen major numbers other than XENVBD")
Link: ClangBuiltLinux/linux#791
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3a169c0 upstream.

Commit 1d5c76e ("xen-blkfront: switch kcalloc to kvcalloc for
large array allocation") didn't fix the issue it was meant to, as the
flags for allocating the memory are GFP_NOIO, which will lead the
memory allocation falling back to kmalloc().

So instead of GFP_NOIO use GFP_KERNEL and do all the memory allocation
in blkfront_setup_indirect() in a memalloc_noio_{save,restore} section.

Fixes: 1d5c76e ("xen-blkfront: switch kcalloc to kvcalloc for large array allocation")
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20200403090034.8753-1-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0549cd6 upstream.

This is inline with the specification described in blkif.h:

 * discard-granularity: should be set to the physical block size if
   node is not present.
 * discard-alignment, discard-secure: should be set to 0 if node not
   present.

This was detected as QEMU would only create the discard-granularity
node but not discard-alignment, and thus the setup done in
blkfront_setup_discard would fail.

Fix blkfront_setup_discard to not fail on missing nodes, and also fix
blkif_set_queue_limits to set the discard granularity to the physical
block size if none is specified in xenbus.

Fixes: ed30bf3 ('xen-blkfront: Handle discard requests.')
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-By: Arthur Borsboom <arthurborsboom@gmail.com>
Link: https://lore.kernel.org/r/20210119105727.95173-1-roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 629a5d8 upstream.

Sync include/xen/interface/io/ring.h with Xen's newest version in
order to get the RING_COPY_RESPONSE() and RING_RESPONSE_PROD_OVERFLOW()
macros.

Note that this will correct the wrong license info by adding the
missing original copyright notice.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 71b6624 upstream.

In order to avoid problems in case the backend is modifying a response
on the ring page while the frontend has already seen it, just read the
response into a local buffer in one go and then operate on that buffer
only.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210730103854.12681-2-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8f5a695 upstream.

In order to avoid a malicious backend being able to influence the local
copy of a request build the request locally first and then copy it to
the ring page instead of doing it the other way round as today.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210730103854.12681-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Vates Git Importer and others added 27 commits February 24, 2018 23:51
…RPOSES!

I've noticed a performance regression with upstream kernels when used as Dom0
under Xen. The classic kernel can utilize the whole bandwidth of a 10G NIC
(ca. 9.3 Gbps), but upstream can reach only ca. 7 Gbps. I found that it
happens because SWIOTLB has to do double buffering. The per task frag
allocator introduced in 5640f7 creates 32 kb frags, which are not contiguous
in mfn space.
This patch provides a workaround by going back to the old way. The possible
ideas came up to solve this:

* make sure Dom0 memory is contiguous: it sounds trivial, but doesn't work with
driver domains, and there are lots of situations where this is not possible.
* use PVH Dom0: so we will have IOMMU. In the future sometime.
* use IOMMU with PV Dom0: this seems to happen earlier.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
…t be in the CPU's runqueue. If there's nothing in the specified CPU's runqueue, it will return 1. But if the CPU is doing work in the softirq context, it is wrong for idle_cpu to return 1.

I observed this to be a problem with netback kicking a kthread by executing wake_up from softirq context. The Completely Fair Scheduler's select_task_rq_fair was looking for an "idle sibling" of the CPU executing it by calling select_idle_sibling, passing the executing CPU as the 'target' parameter. The first thing that select_idle_sibling does is to check whether the 'target' CPU is idle, using idle_cpu, and to return that CPU if so. Despite the executing CPU being busy in softirq context, idle_cpu was returning 1, meaning that the scheduler would consistently try to run the kthread on the same CPU as the kick came from. Given that the softirq work was on-going, this led to a multi-millisecond delay before the scheduler eventually realised it should migrate the kthread to a different CPU.

A solution to this problem would be to make idle_cpu return 0 when the CPU is running in softirq context. I haven't got a patch for that because I couldn't find an easy way of querying whether an arbitrary CPU is doing this. (Perhaps I should look at the per-CPU softirq_work_list[]...?)

Instead, the following patch is a partial solution, only handling the case when the currently-executing CPU is in softirq context. This was sufficient to solve the problem I observed.
…ints SYN

flood warnings for the DLM port:
Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies.  Check SNMP counters.

And then joining a DLM lockspace hangs:
Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds.                                                                     │
Dec 21 01:49:00 localhost kernel: [ 2285.786476]       Not tainted 4.4.0+10 #1                                                                                                         │
Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.                                                             │
Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd   D ffff88001930bc58     0 17638      1 0x00000000                                                                      │
Dec 21 01:49:00 localhost kernel: [ 2285.794615]  ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000                                                                  │
Dec 21 01:49:00 localhost kernel: [ 2285.794617]  ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000                                                                  │
Dec 21 01:49:00 localhost kernel: [ 2285.794619]  ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10
Dec 21 01:49:00 localhost kernel: [ 2285.794644]  [<ffffffff811570fe>] ? printk+0x4d/0x4f                                                                                              │
Dec 21 01:49:00 localhost kernel: [ 2285.794647]  [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20                                                           │
Dec 21 01:49:00 localhost kernel: [ 2285.794649]  [<ffffffff815a085d>] wait_for_completion+0x9d/0x110                                                                                  │
Dec 21 01:49:00 localhost kernel: [ 2285.794653]  [<ffffffff810979e0>] ? wake_up_q+0x80/0x80                                                                                           │
Dec 21 01:49:00 localhost kernel: [ 2285.794661]  [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm]                                                                             │
Dec 21 01:49:00 localhost kernel: [ 2285.794665]  [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100                                                                             │
Dec 21 01:49:00 localhost kernel: [ 2285.794670]  [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm]                                                                                  │
Dec 21 01:49:00 localhost kernel: [ 2285.794673]  [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0                                                                                  │
Dec 21 01:49:00 localhost kernel: [ 2285.794677]  [<ffffffff811b4438>] __vfs_write+0x28/0xd0                                                                                           │
Dec 21 01:49:00 localhost kernel: [ 2285.794679]  [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0                                                                                      ┤
Dec 21 01:49:00 localhost kernel: [ 2285.794681]  [<ffffffff811b4dc1>] vfs_write+0xb1/0x190                                                                                            │
Dec 21 01:49:00 localhost kernel: [ 2285.794686]  [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420                                                                                   │
Dec 21 01:49:00 localhost kernel: [ 2285.794688]  [<ffffffff811b5986>] SyS_write+0x46/0xa0                                                                                             │
Dec 21 01:49:00 localhost kernel: [ 2285.794690]  [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71

Although the join hanging forever like that is still a bug, if the SYN cookies
consistently trigger it lets try to avoid the bug by avoiding the SYN cookies.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
From: Tim Smith <tim.smith@citrix.com>
Date: Sat Sep 19 2015 06:09:42 GMT+0000

When growing a number of files on the same cluster node from different
threads (e.g. fio with 20 or so jobs), all those threads pile into
gfs2_inplace_reserve() independently looking to claim a new resource
group and after a while they all synchronise, getting through the
gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together.

When this happens, write performance drops to about 1/5 on a single
node cluster, and on multi-node clusters it drops to near zero on
some nodes. The output from "glocktop -r -H -d 1" when this happens
begins to show many processes stuck in gfs2_inplace_reserve(), waiting
on a resource group lock.

This commit introduces a module parameter which, when set to a value
of 1, will introduce some random jitter into the first two passes of
gfs2_inplace_reserve() when trying to lock a new resource group,
skipping to the next one 1/2 the time with progressively lower
probability on each attempt.

Signed-off-by: Tim Smith <tim.smith@citrix.com>
allocation can result in glocks being bounced around between hosts.
Follow the example of inodes and if we have local waiters when asked
to demote the glock on a resource group add a delay. Additionally,
track when last asked to demote a lock and when assessing resource
groups in the allocator prefer, in the first two loop iterations, not
to use resource groups where we've been asked to demote the glock
within the last second.

Signed-off-by: Mark Syms <mark.syms@citrix.com>
We triggered an assert at some point in gfs2_add_inode_blocks(), and
from the trace it was one of the calls in sweep_bh_for_rgrps(). But
we don't know which and the assert was a bit broken anyway.

After fixing the assert, add some tracking intended to detect if
something else is messing with the inode at the same time, or if
we're somehow trying to remove more blocks than are allocated.
changes to page_ops). This results in the kABI changing. Fix it by
adding a new flag which specifies whether the page_done pointer is
actually a pointer to page_ops. GFS2 uses this flag while unmodified
callers still continue to work as before.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver.
It enables drivers to create an auxiliary_device and bind an
auxiliary_driver to it.

The bus supports probe/remove shutdown and suspend/resume callbacks.
Each auxiliary_device has a unique string based id; driver binds to
an auxiliary_device based on this id through the bus.

Co-developed-by: Kiran Patil <kiran.patil@intel.com>
Co-developed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Co-developed-by: Fred Oh <fred.oh@linux.intel.com>
Co-developed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Signed-off-by: Fred Oh <fred.oh@linux.intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Martin Habets <mhabets@solarflare.com>
Link: https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ertman@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/160695681289.505290.8978295443574440604.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
No need to include slab.h in include/linux/auxiliary_bus.h, as it is not
needed there.  Move it to drivers/base/auxiliary.c instead.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Ertman <david.m.ertman@intel.com>
Cc: Fred Oh <fred.oh@linux.intel.com>
Cc: Kiran Patil <kiran.patil@intel.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Martin Habets <mhabets@solarflare.com>
Cc: Parav Pandit <parav@mellanox.com>
Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Cc: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/X8og8xi3WkoYXet9@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's an effort to move the remove() callback in the driver core to
not return an int, as nothing can be done if this function fails.  To
make that effort easier, make the aux bus remove function void to start
with so that no users have to be changed sometime in the future.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Ertman <david.m.ertman@intel.com>
Cc: Fred Oh <fred.oh@linux.intel.com>
Cc: Kiran Patil <kiran.patil@intel.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Martin Habets <mhabets@solarflare.com>
Cc: Parav Pandit <parav@mellanox.com>
Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Cc: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/X8ohB1ks1NK7kPop@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For some reason, the original aux bus patch had some really long lines
in a few places, probably due to it being a very long-lived patch in
development by many different people.  Fix that up so that the two files
all have the same length lines and function formatting styles.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Ertman <david.m.ertman@intel.com>
Cc: Fred Oh <fred.oh@linux.intel.com>
Cc: Kiran Patil <kiran.patil@intel.com>
Cc: Leon Romanovsky <leonro@nvidia.com>
Cc: Martin Habets <mhabets@solarflare.com>
Cc: Parav Pandit <parav@mellanox.com>
Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com>
Cc: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/X8oiSFTpYHw1xE/o@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the probe of the auxdrv failed, the device->driver is set to NULL.
During kernel shutdown, the bus shutdown will call auxdrv->shutdown and
cause an invalid ptr dereference. Add check to make sure device->driver is
not NULL before we proceed.

Fixes: 7de3697 ("Add auxiliary bus support")
Cc: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/160710040926.1889434.8840329810698403478.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the auxiliary device code is built into the kernel, it can be executed
before the auxiliary bus is registered. This causes bus->p to be not
allocated and triggers a NULL pointer dereference when the auxiliary bus
device gets added with bus_add_device(). Call the auxiliary_bus_init()
under driver_init() so the bus is initialized before devices.

Below is the kernel splat for the bug:
[ 1.948215] BUG: kernel NULL pointer dereference, address: 0000000000000060
[ 1.950670] #PF: supervisor read access in kernel mode
[ 1.950670] #PF: error_code(0x0000) - not-present page
[ 1.950670] PGD 0
[ 1.950670] Oops: 0000 1 SMP NOPTI
[ 1.950670] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.10.0-intel-nextsvmtest+ #2205
[ 1.950670] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1.950670] RIP: 0010:bus_add_device+0x64/0x140
[ 1.950670] Code: 00 49 8b 75 20 48 89 df e8 59 a1 ff ff 41 89 c4 85 c0 75 7b 48 8b 53 50 48 85 d2 75 03 48 8b 13 49 8b 85 a0 00 00 00 48 89 de <48> 8
78 60 48 83 c7 18 e8 ef d9 a9 ff 41 89 c4 85 c0 75 45 48 8b
[ 1.950670] RSP: 0000:ff46032ac001baf8 EFLAGS: 00010246
[ 1.950670] RAX: 0000000000000000 RBX: ff4597f7414aa680 RCX: 0000000000000000
[ 1.950670] RDX: ff4597f74142bbc0 RSI: ff4597f7414aa680 RDI: ff4597f7414aa680
[ 1.950670] RBP: ff46032ac001bb10 R08: 0000000000000044 R09: 0000000000000228
[ 1.950670] R10: ff4597f741141b30 R11: ff4597f740182a90 R12: 0000000000000000
[ 1.950670] R13: ffffffffa5e936c0 R14: 0000000000000000 R15: 0000000000000000
[ 1.950670] FS: 0000000000000000(0000) GS:ff4597f7bba00000(0000) knlGS:0000000000000000
[ 1.950670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.950670] CR2: 0000000000000060 CR3: 000000002140c001 CR4: 0000000000f71ef0
[ 1.950670] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.950670] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 1.950670] PKRU: 55555554
[ 1.950670] Call Trace:
[ 1.950670] device_add+0x3ee/0x850
[ 1.950670] __auxiliary_device_add+0x47/0x60
[ 1.950670] idxd_pci_probe+0xf77/0x1180
[ 1.950670] local_pci_probe+0x4a/0x90
[ 1.950670] pci_device_probe+0xff/0x1b0
[ 1.950670] really_probe+0x1cf/0x440
[ 1.950670] ? rdinit_setup+0x31/0x31
[ 1.950670] driver_probe_device+0xe8/0x150
[ 1.950670] device_driver_attach+0x58/0x60
[ 1.950670] __driver_attach+0x8f/0x150
[ 1.950670] ? device_driver_attach+0x60/0x60
[ 1.950670] ? device_driver_attach+0x60/0x60
[ 1.950670] bus_for_each_dev+0x79/0xc0
[ 1.950670] ? kmem_cache_alloc_trace+0x323/0x430
[ 1.950670] driver_attach+0x1e/0x20
[ 1.950670] bus_add_driver+0x154/0x1f0
[ 1.950670] driver_register+0x70/0xc0
[ 1.950670] __pci_register_driver+0x54/0x60
[ 1.950670] idxd_init_module+0xe2/0xfc
[ 1.950670] ? idma64_platform_driver_init+0x19/0x19
[ 1.950670] do_one_initcall+0x4a/0x1e0
[ 1.950670] kernel_init_freeable+0x1fc/0x25c
[ 1.950670] ? rest_init+0xba/0xba
[ 1.950670] kernel_init+0xe/0x116
[ 1.950670] ret_from_fork+0x1f/0x30
[ 1.950670] Modules linked in:
[ 1.950670] CR2: 0000000000000060
[ 1.950670] --[ end trace cd7d1b226d3ca901 ]--

Fixes: 7de3697 ("Add auxiliary bus support")
Reported-by: Jacob Pan <jacob.jun.pan@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Dave Ertman <david.m.ertman@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/20210210201611.1611074-1-dave.jiang@intel.com
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove module bits in the auxiliary bus code since the auxiliary bus
cannot be built as a module and the relevant code is not needed.

Cc: Dave Ertman <david.m.ertman@intel.com>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/161307488980.1896017.15627190714413338196.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If driver_register() returns with error we need to free the memory
allocated for auxdrv->driver.name before returning from
__auxiliary_driver_register()

Fixes: 7de3697 ("Add auxiliary bus support")
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://lore.kernel.org/r/20210713093438.3173-1-peter.ujfalusi@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The documentation for creating an auxiliary device is a 3 step not a 2
step process.  Specifically the requirements of setting the name, id,
dev.release, and dev.parent fields was not clear as a precursor to the '2
step' process documented.

Clarify by declaring this a 3 step process starting with setting the
fields of struct auxiliary_device correctly.

Also add some sample code and tie the change into the rest of the
documentation.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20211202044305.4006853-2-ira.weiny@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__auxiliary_driver_register is not intended to be called directly unless
a custom name is required.  Add documentation for this fact.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20211202044305.4006853-5-ira.weiny@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
… device

auxiliary_find_device() takes a proper get_device() reference on the
device before returning the matched device.

Users of this call should be informed that they need to properly release
this reference with put_device().

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20211202044305.4006853-7-ira.weiny@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code and documentation are more difficult to maintain when kept
separately.  This is further compounded when the standard structure
documentation infrastructure is not used.

Move the documentation into the code, use the standard documentation
infrastructure, add current documented functions, and reference the text
in the rst file.

Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20211202044305.4006853-8-ira.weiny@intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Stephen Cheng <stephen.cheng@citrix.com>
not expecting the LUN change. Lest it may trigger udev rules which in
turn trigger more LUN change events.

Signed-off-by: Gerald Elder-Vass <gerald.elder-vass@cloud.com>
Due to some backport trying to fix a possible deadlock (see CA-399062
for details) some patches which introduced many more deadlock were
merged to XS8.
Revert these patches and fix the original potential deadlock too.

These deadlock were generated by the same lock taken by no-IRQ context
and nested lock attempt on IRQ context as IRQ were not disabled on
first lock. Calling spin_lock_bh instead of spin_lock prevent IRQ
code nested to no-IRQ code.

The removal of "transport_lock" in "xs_tcp_state_change" is enabled by
commit 13f6175 ("XSI-2150: Remove added chunk causing deadlock",
specifically "SUNRPC: Move reset of TCP state variables into the
reconnect code"). A review of "xs_tcp_state_change" shows that similar
accesses through the "transport" pointer (notably for "TCP_CLOSE_WAIT"
and "TCP_CLOSE") are already performed without this lock. State bit
updates are atomic, and the work queue is adequately protected by
other synchronization mechanisms, including additional spinlocks and
CPU-local variables.
@casasnovas casasnovas changed the base branch from kernel/xenserver-4.19.19-8.0.45/base to kernel/xenserver-4.19.19-8.0.46/pre-base March 27, 2026 09:01
@casasnovas casasnovas changed the base branch from kernel/xenserver-4.19.19-8.0.46/pre-base to kernel/xenserver-4.19.19-8.0.45/base March 27, 2026 09:02
@casasnovas casasnovas self-requested a review March 27, 2026 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.