Diff xenserver-4.19.19-8.0.45 xenserver-4.19.19-8.0.46#2
Draft
tescande wants to merge 190 commits intokernel/xenserver-4.19.19-8.0.45/basefrom
Draft
Diff xenserver-4.19.19-8.0.45 xenserver-4.19.19-8.0.46#2tescande wants to merge 190 commits intokernel/xenserver-4.19.19-8.0.45/basefrom
tescande wants to merge 190 commits intokernel/xenserver-4.19.19-8.0.45/basefrom
Conversation
…1 [PATCH] SUNRPC: Replace direct task wakeups from softirq context Replace the direct task wakeups from inside a softirq context with wakeups from a process context. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
The queue timer function, which walks the RPC queue in order to locate candidates for waking up is one of the current constraints against removing the bh-safe queue spin locks. Replace it with a delayed work queue, so that we can do the actual rpc task wake ups from an ordinary process context. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
1. nbd_put takes the mutex and drops nbd->ref to 0. It then does idr_remove and drops the mutex. 2. nbd_genl_connect takes the mutex. idr_find/idr_for_each fails to find an existing device, so it does nbd_dev_add. 3. just before the nbd_put could call nbd_dev_remove or not finished totally, but if nbd_dev_add try to add_disk, we can hit: debugfs: Directory 'nbd1' with parent 'block' already present! This patch will make sure all the disk add/remove stuff are done by holding the nbd_index_mutex lock. Reported-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 4f8943f ("SUNRPC: Replace direct task wakeups from softirq context") there has been a race to the value of the sk_err if both XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set. In that case, we may end up losing the sk_err value that existed when xs_error_report was called. Fix this by reverting to the previous behavior: instead of using SO_ERROR to retrieve the value at a later time (which might also return sk_err_soft), copy the sk_err value onto struct sock_xprt, and use that value to wake pending tasks. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 4f8943f ("SUNRPC: Replace direct task wakeups from softirq context") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We already do this for the most part, except in timeout and clear_req. For the timeout case we take the lock after we grab a ref on the config, but that isn't really necessary because we're safe to touch the cmd at this point, so just move the order around. For the clear_req cause this is initiated by the user, so again is safe. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
We hit the following warning in production print_req_error: I/O error, dev nbd0, sector 7213934408 flags 80700 ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 25 PID: 32407 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60 Workqueue: knbd-recv recv_work [nbd] RIP: 0010:refcount_sub_and_test_checked+0x53/0x60 Call Trace: blk_mq_free_request+0xb7/0xf0 blk_mq_complete_request+0x62/0xf0 recv_work+0x29/0xa1 [nbd] process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 ? rescuer_thread+0x340/0x340 kthread+0x111/0x130 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 ---[ end trace b079c3c67f98bb7c ]--- This was preceded by us timing out everything and shutting down the sockets for the device. The problem is we had a request in the queue at the same time, so we completed the request twice. This can actually happen in a lot of cases, we fail to get a ref on our config, we only have one connection and just error out the command, etc. Fix this by checking cmd->status in nbd_read_stat. We only change this under the cmd->lock, so we are safe to check this here and see if we've already error'ed this command out, which would indicate that we've completed it as well. Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, after the forward channel connection goes away, backchannel operations are causing soft lockups on the server because call_transmit_status's SOFTCONN logic ignores ENOTCONN. Such backchannel Calls are aggressively retried until the client reconnects. Backchannel Calls should use RPC_TASK_NOCONNECT rather than RPC_TASK_SOFTCONN. If there is no forward connection, the server is not capable of establishing a connection back to the client, thus that backchannel request should fail before the server attempts to send it. Commit 58255a4 ("NFSD: NFSv4 callback client should use RPC_TASK_SOFTCONN") was merged several years before RPC_TASK_NOCONNECT was available. Because setup_callback_client() explicitly sets NOPING, the NFSv4.0 callback connection depends on the first callback RPC to initiate a connection to the client. Thus NFSv4.0 needs to continue to use RPC_TASK_SOFTCONN. Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> # v4.20+
bdget_disk needs to be paired with bdput to not leak a reference on the block device inode. Fixes: 08ba91e ("nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a race between iterating over requests in nbd_clear_que() and completing requests in recv_work(), which can lead to double completion of a request. To fix it, flush the recv worker before iterating over the requests and don't abort the completed request while iterating. Fixes: 96d97e1 ("nbd: clear_sock on netlink disconnect") Reported-by: Jiang Yadong <jiangyadong@bytedance.com> Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210813151330.96-1-xieyongji@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
While handling a response message from server, nbd_read_stat() will
try to get request by tag, and then complete the request. However,
this is problematic if nbd haven't sent a corresponding request
message:
t1 t2
submit_bio
nbd_queue_rq
blk_mq_start_request
recv_work
nbd_read_stat
blk_mq_tag_to_rq
blk_mq_complete_request
nbd_send_cmd
Thus add a new cmd flag 'NBD_CMD_INFLIGHT', it will be set in
nbd_send_cmd() and checked in nbd_read_stat().
Noted that this patch can't fix that blk_mq_tag_to_rq() might
return a freed request, and this will be fixed in following
patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
commit cddce01 ("nbd: Aovid double completion of a request") try to fix that nbd_clear_que() and recv_work() can complete a request concurrently. However, the problem still exists: t1 t2 t3 nbd_disconnect_and_put flush_workqueue recv_work blk_mq_complete_request blk_mq_complete_request_remote -> this is true WRITE_ONCE(rq->state, MQ_RQ_COMPLETE) blk_mq_raise_softirq blk_done_softirq blk_complete_reqs nbd_complete_rq blk_mq_end_request blk_mq_free_request WRITE_ONCE(rq->state, MQ_RQ_IDLE) nbd_clear_que blk_mq_tagset_busy_iter nbd_clear_req __blk_mq_free_request blk_mq_put_tag blk_mq_complete_request -> complete again There are three places where request can be completed in nbd: recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they all hold cmd->lock before completing the request, it's easy to avoid the problem by setting and checking a cmd flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit bfe9b9d ("cdc_ether: Improve ZTE MF823/831/910 handling") introduces a workaround for certain ZTE modems reporting invalid MAC addresses over CDC-ECM. The same issue was present on their RNDIS interface,which was fixed in commit a5a18bd ("rndis_host: Set valid random MAC on buggy devices"). However, internal modem of ZTE MF286R router, on its RNDIS interface, also exhibits a second issue fixed already in CDC-ECM, of the device not respecting configured random MAC address. In order to share the fixup for this with rndis_host driver, export the workaround function, which will be re-used in the following commit in rndis_host. Cc: Kristian Evensen <kristian.evensen@gmail.com> Cc: Bjørn Mork <bjorn@mork.no> Cc: Oliver Neukum <oliver@neukum.org> Signed-off-by: Lech Perczak <lech.perczak@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Certain ZTE modems, namely: MF823. MF831, MF910, built-in modem from MF286R, expose both CDC-ECM and RNDIS network interfaces. They have a trait of ignoring the locally-administered MAC address configured on the interface both in CDC-ECM and RNDIS part, and this leads to dropping of incoming traffic by the host. However, the workaround was only present in CDC-ECM, and MF286R explicitly requires it in RNDIS mode. Re-use the workaround in rndis_host as well, to fix operation of MF286R module, some versions of which expose only the RNDIS interface. Do so by introducing new flag, RNDIS_DRIVER_DATA_DST_MAC_FIXUP, and testing for it in rndis_rx_fixup. This is required, as RNDIS uses frame batching, and all of the packets inside the batch need the fixup. This might introduce a performance penalty, because test is done for every returned Ethernet frame. Apply the workaround to both "flavors" of RNDIS interfaces, as older ZTE modems, like MF823 found in the wild, report the USB_CLASS_COMM class interfaces, while MF286R reports USB_CLASS_WIRELESS_CONTROLLER. Suggested-by: Bjørn Mork <bjorn@mork.no> Cc: Kristian Evensen <kristian.evensen@gmail.com> Cc: Oliver Neukum <oliver@neukum.org> Signed-off-by: Lech Perczak <lech.perczak@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reporting of bogus MAC addresses and ignoring configuration of new destination address wasn't observed outside of a range of ZTE devices, among which this seems to be the common bug. Align rndis_host driver with implementation found in cdc_ether, which also limits this workaround to ZTE devices. Suggested-by: Bjørn Mork <bjorn@mork.no> Cc: Kristian Evensen <kristian.evensen@gmail.com> Cc: Oliver Neukum <oliver@neukum.org> Signed-off-by: Lech Perczak <lech.perczak@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Rather than resetting state variables in socket state_change() callback, do it in the sunrpc TCP connect function itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Commands get stuck while Host NVMe-oF controller is in reconnect state. The controller enters into reconnect state when it loses connection with the target. It tries to reconnect every 10 seconds (default) until a successful reconnect or until the reconnect time-out is reached. The default reconnect time out is 10 minutes. Applications are expecting commands to complete with success or error within a certain timeout (30 seconds by default). The NVMe host is enforcing that timeout while it is connected, but during reconnect the timeout is not enforced and commands may get stuck for a long period or even forever. To fix this long delay due to the default timeout, introduce new "fast_io_fail_tmo" session parameter. The timeout is measured in seconds from the controller reconnect and any command beyond that timeout is rejected. The new parameter value may be passed during 'connect'. The default value of -1 means no timeout (similar to current behavior). Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
And rename them to avoid name conflicts with the original kpatch ones. Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
If handling logics of watch events are slower than the events enqueue logic and the events can be created from the guests, the guests could trigger memory pressure by intensively inducing the events, because it will create a huge number of pending events that exhausting the memory. This is known as XSA-349. Fortunately, some watch events could be ignored, depending on its handler callback. For example, if the callback has interest in only one single path, the watch wouldn't want multiple pending events. Or, some watches could ignore events to same path. To let such watches to volutarily help avoiding the memory pressure situation, this commit introduces new watch callback, 'will_handle'. If it is not NULL, it will be called for each new event just before enqueuing it. Then, if the callback returns false, the event will be discarded. No watch is using the callback for now, though. Signed-off-by: Author Redacted <security@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
Some code does not directly make 'xenbus_watch' object and call 'register_xenbus_watch()' but use 'xenbus_watch_path()' instead. This commit adds support of 'will_handle' callback in the 'xenbus_watch_path()' and it's wrapper, 'xenbus_watch_pathfmt()'. Signed-off-by: Author Redacted <security@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
This commit adds support of the 'will_handle' watch callback for 'xen_bus_type' users. Signed-off-by: Author Redacted <security@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
This commit adds a counter of pending messages for each watch in the struct. It is used to skip unnecessary pending messages lookup in 'unregister_xenbus_watch()'. It could also be used in 'will_handle' callback. Signed-off-by: Author Redacted <security@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
'xenbus_backend' watches 'state' of devices, which is writable by guests. Hence, if guests intensively updates it, dom0 will have lots of pending events that exhausting memory of dom0. In other words, guests can trigger dom0 memory pressure. This is known as XSA-349. However, the watch callback of it, 'frontend_changed()', reads only 'state', so doesn't need to have the pending events. To avoid the problem, this commit disallows pending watch messages for 'xenbus_backend' using the 'will_handle()' watch callback. Signed-off-by: Author Redacted <security@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
[ Upstream commit d742db7 ] On some architectures (like ARM), virt_to_gfn cannot be used for vmalloc'd memory because of its reliance on virt_to_phys. This patch introduces a check for vmalloc'd addresses and obtains the PFN using vmalloc_to_pfn in that case. Signed-off-by: Simon Leiner <simon@leiner.me> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Link: https://lore.kernel.org/r/20200825093153.35500-1-simon@leiner.me Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 1d5c76e upstream. There's no reason to request physically contiguous memory for those allocations. [boris: added CC to stable] Cc: stable@vger.kernel.org Reported-by: Ian Jackson <ian.jackson@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 589b728 upstream. Clang warns: ../drivers/block/xen-blkfront.c:1117:4: warning: misleading indentation; statement is not part of the previous 'if' [-Wmisleading-indentation] nr_parts = PARTS_PER_DISK; ^ ../drivers/block/xen-blkfront.c:1115:3: note: previous statement is here if (err) ^ This is because there is a space at the beginning of this line; remove it so that the indentation is consistent according to the Linux kernel coding style and clang no longer warns. While we are here, the previous line has some trailing whitespace; clean that up as well. Fixes: c80a420 ("xen-blkfront: handle Xen major numbers other than XENVBD") Link: ClangBuiltLinux/linux#791 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 3a169c0 upstream. Commit 1d5c76e ("xen-blkfront: switch kcalloc to kvcalloc for large array allocation") didn't fix the issue it was meant to, as the flags for allocating the memory are GFP_NOIO, which will lead the memory allocation falling back to kmalloc(). So instead of GFP_NOIO use GFP_KERNEL and do all the memory allocation in blkfront_setup_indirect() in a memalloc_noio_{save,restore} section. Fixes: 1d5c76e ("xen-blkfront: switch kcalloc to kvcalloc for large array allocation") Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20200403090034.8753-1-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 0549cd6 upstream. This is inline with the specification described in blkif.h: * discard-granularity: should be set to the physical block size if node is not present. * discard-alignment, discard-secure: should be set to 0 if node not present. This was detected as QEMU would only create the discard-granularity node but not discard-alignment, and thus the setup done in blkfront_setup_discard would fail. Fix blkfront_setup_discard to not fail on missing nodes, and also fix blkif_set_queue_limits to set the discard granularity to the physical block size if none is specified in xenbus. Fixes: ed30bf3 ('xen-blkfront: Handle discard requests.') Reported-by: Arthur Borsboom <arthurborsboom@gmail.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-By: Arthur Borsboom <arthurborsboom@gmail.com> Link: https://lore.kernel.org/r/20210119105727.95173-1-roger.pau@citrix.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 629a5d8 upstream. Sync include/xen/interface/io/ring.h with Xen's newest version in order to get the RING_COPY_RESPONSE() and RING_RESPONSE_PROD_OVERFLOW() macros. Note that this will correct the wrong license info by adding the missing original copyright notice. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 71b6624 upstream. In order to avoid problems in case the backend is modifying a response on the ring page while the frontend has already seen it, just read the response into a local buffer in one go and then operate on that buffer only. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210730103854.12681-2-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 8f5a695 upstream. In order to avoid a malicious backend being able to influence the local copy of a request build the request locally first and then copy it to the ring page instead of doing it the other way round as today. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210730103854.12681-3-jgross@suse.com Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…RPOSES! I've noticed a performance regression with upstream kernels when used as Dom0 under Xen. The classic kernel can utilize the whole bandwidth of a 10G NIC (ca. 9.3 Gbps), but upstream can reach only ca. 7 Gbps. I found that it happens because SWIOTLB has to do double buffering. The per task frag allocator introduced in 5640f7 creates 32 kb frags, which are not contiguous in mfn space. This patch provides a workaround by going back to the old way. The possible ideas came up to solve this: * make sure Dom0 memory is contiguous: it sounds trivial, but doesn't work with driver domains, and there are lots of situations where this is not possible. * use PVH Dom0: so we will have IOMMU. In the future sometime. * use IOMMU with PV Dom0: this seems to happen earlier. Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
…t be in the CPU's runqueue. If there's nothing in the specified CPU's runqueue, it will return 1. But if the CPU is doing work in the softirq context, it is wrong for idle_cpu to return 1. I observed this to be a problem with netback kicking a kthread by executing wake_up from softirq context. The Completely Fair Scheduler's select_task_rq_fair was looking for an "idle sibling" of the CPU executing it by calling select_idle_sibling, passing the executing CPU as the 'target' parameter. The first thing that select_idle_sibling does is to check whether the 'target' CPU is idle, using idle_cpu, and to return that CPU if so. Despite the executing CPU being busy in softirq context, idle_cpu was returning 1, meaning that the scheduler would consistently try to run the kthread on the same CPU as the kick came from. Given that the softirq work was on-going, this led to a multi-millisecond delay before the scheduler eventually realised it should migrate the kthread to a different CPU. A solution to this problem would be to make idle_cpu return 0 when the CPU is running in softirq context. I haven't got a patch for that because I couldn't find an easy way of querying whether an arbitrary CPU is doing this. (Perhaps I should look at the per-CPU softirq_work_list[]...?) Instead, the following patch is a partial solution, only handling the case when the currently-executing CPU is in softirq context. This was sufficient to solve the problem I observed.
…ints SYN flood warnings for the DLM port: Dec 21 01:46:41 localhost kernel: [ 2146.516664] TCP: request_sock_TCP: Possible SYN flooding on port 21064. Sending cookies. Check SNMP counters. And then joining a DLM lockspace hangs: Dec 21 01:49:00 localhost kernel: [ 2285.780913] INFO: task xapi-clusterd:17638 blocked for more than 120 seconds. │ Dec 21 01:49:00 localhost kernel: [ 2285.786476] Not tainted 4.4.0+10 #1 │ Dec 21 01:49:00 localhost kernel: [ 2285.789043] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. │ Dec 21 01:49:00 localhost kernel: [ 2285.794611] xapi-clusterd D ffff88001930bc58 0 17638 1 0x00000000 │ Dec 21 01:49:00 localhost kernel: [ 2285.794615] ffff88001930bc58 ffff880025593800 ffff880022433800 ffff88001930c000 │ Dec 21 01:49:00 localhost kernel: [ 2285.794617] ffff88000ef4a660 ffff88000ef4a658 ffff880022433800 ffff88000ef4a000 │ Dec 21 01:49:00 localhost kernel: [ 2285.794619] ffff88001930bc70 ffffffff8159f6b4 7fffffffffffffff ffff88001930bd10 Dec 21 01:49:00 localhost kernel: [ 2285.794644] [<ffffffff811570fe>] ? printk+0x4d/0x4f │ Dec 21 01:49:00 localhost kernel: [ 2285.794647] [<ffffffff810b1741>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 │ Dec 21 01:49:00 localhost kernel: [ 2285.794649] [<ffffffff815a085d>] wait_for_completion+0x9d/0x110 │ Dec 21 01:49:00 localhost kernel: [ 2285.794653] [<ffffffff810979e0>] ? wake_up_q+0x80/0x80 │ Dec 21 01:49:00 localhost kernel: [ 2285.794661] [<ffffffffa03fa4b8>] dlm_new_lockspace+0x908/0xac0 [dlm] │ Dec 21 01:49:00 localhost kernel: [ 2285.794665] [<ffffffff810aaa60>] ? prepare_to_wait_event+0x100/0x100 │ Dec 21 01:49:00 localhost kernel: [ 2285.794670] [<ffffffffa0402e37>] device_write+0x497/0x6b0 [dlm] │ Dec 21 01:49:00 localhost kernel: [ 2285.794673] [<ffffffff811834f0>] ? handle_mm_fault+0x7f0/0x13b0 │ Dec 21 01:49:00 localhost kernel: [ 2285.794677] [<ffffffff811b4438>] __vfs_write+0x28/0xd0 │ Dec 21 01:49:00 localhost kernel: [ 2285.794679] [<ffffffff811b4b7f>] ? rw_verify_area+0x6f/0xd0 ┤ Dec 21 01:49:00 localhost kernel: [ 2285.794681] [<ffffffff811b4dc1>] vfs_write+0xb1/0x190 │ Dec 21 01:49:00 localhost kernel: [ 2285.794686] [<ffffffff8105ffc2>] ? __do_page_fault+0x302/0x420 │ Dec 21 01:49:00 localhost kernel: [ 2285.794688] [<ffffffff811b5986>] SyS_write+0x46/0xa0 │ Dec 21 01:49:00 localhost kernel: [ 2285.794690] [<ffffffff815a31ae>] entry_SYSCALL_64_fastpath+0x12/0x71 Although the join hanging forever like that is still a bug, if the SYN cookies consistently trigger it lets try to avoid the bug by avoiding the SYN cookies. Signed-off-by: Edwin Török <edvin.torok@citrix.com>
From: Tim Smith <tim.smith@citrix.com> Date: Sat Sep 19 2015 06:09:42 GMT+0000 When growing a number of files on the same cluster node from different threads (e.g. fio with 20 or so jobs), all those threads pile into gfs2_inplace_reserve() independently looking to claim a new resource group and after a while they all synchronise, getting through the gfs2_rgrp_used_recently()/gfs2_rgrp_congested() check together. When this happens, write performance drops to about 1/5 on a single node cluster, and on multi-node clusters it drops to near zero on some nodes. The output from "glocktop -r -H -d 1" when this happens begins to show many processes stuck in gfs2_inplace_reserve(), waiting on a resource group lock. This commit introduces a module parameter which, when set to a value of 1, will introduce some random jitter into the first two passes of gfs2_inplace_reserve() when trying to lock a new resource group, skipping to the next one 1/2 the time with progressively lower probability on each attempt. Signed-off-by: Tim Smith <tim.smith@citrix.com>
allocation can result in glocks being bounced around between hosts. Follow the example of inodes and if we have local waiters when asked to demote the glock on a resource group add a delay. Additionally, track when last asked to demote a lock and when assessing resource groups in the allocator prefer, in the first two loop iterations, not to use resource groups where we've been asked to demote the glock within the last second. Signed-off-by: Mark Syms <mark.syms@citrix.com>
We triggered an assert at some point in gfs2_add_inode_blocks(), and from the trace it was one of the calls in sweep_bh_for_rgrps(). But we don't know which and the assert was a bit broken anyway. After fixing the assert, add some tracking intended to detect if something else is messing with the inode at the same time, or if we're somehow trying to remove more blocks than are allocated.
changes to page_ops). This results in the kABI changing. Fix it by adding a new flag which specifies whether the page_done pointer is actually a pointer to page_ops. GFS2 uses this flag while unmodified callers still continue to work as before. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Add support for the Auxiliary Bus, auxiliary_device and auxiliary_driver. It enables drivers to create an auxiliary_device and bind an auxiliary_driver to it. The bus supports probe/remove shutdown and suspend/resume callbacks. Each auxiliary_device has a unique string based id; driver binds to an auxiliary_device based on this id through the bus. Co-developed-by: Kiran Patil <kiran.patil@intel.com> Co-developed-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Co-developed-by: Fred Oh <fred.oh@linux.intel.com> Co-developed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Kiran Patil <kiran.patil@intel.com> Signed-off-by: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Signed-off-by: Fred Oh <fred.oh@linux.intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Martin Habets <mhabets@solarflare.com> Link: https://lore.kernel.org/r/20201113161859.1775473-2-david.m.ertman@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/160695681289.505290.8978295443574440604.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
No need to include slab.h in include/linux/auxiliary_bus.h, as it is not needed there. Move it to drivers/base/auxiliary.c instead. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Ertman <david.m.ertman@intel.com> Cc: Fred Oh <fred.oh@linux.intel.com> Cc: Kiran Patil <kiran.patil@intel.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Martin Habets <mhabets@solarflare.com> Cc: Parav Pandit <parav@mellanox.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: Shiraz Saleem <shiraz.saleem@intel.com> Link: https://lore.kernel.org/r/X8og8xi3WkoYXet9@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There's an effort to move the remove() callback in the driver core to not return an int, as nothing can be done if this function fails. To make that effort easier, make the aux bus remove function void to start with so that no users have to be changed sometime in the future. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Ertman <david.m.ertman@intel.com> Cc: Fred Oh <fred.oh@linux.intel.com> Cc: Kiran Patil <kiran.patil@intel.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Martin Habets <mhabets@solarflare.com> Cc: Parav Pandit <parav@mellanox.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: Shiraz Saleem <shiraz.saleem@intel.com> Link: https://lore.kernel.org/r/X8ohB1ks1NK7kPop@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
For some reason, the original aux bus patch had some really long lines in a few places, probably due to it being a very long-lived patch in development by many different people. Fix that up so that the two files all have the same length lines and function formatting styles. Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Ertman <david.m.ertman@intel.com> Cc: Fred Oh <fred.oh@linux.intel.com> Cc: Kiran Patil <kiran.patil@intel.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Martin Habets <mhabets@solarflare.com> Cc: Parav Pandit <parav@mellanox.com> Cc: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Cc: Ranjani Sridharan <ranjani.sridharan@linux.intel.com> Cc: Shiraz Saleem <shiraz.saleem@intel.com> Link: https://lore.kernel.org/r/X8oiSFTpYHw1xE/o@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the probe of the auxdrv failed, the device->driver is set to NULL. During kernel shutdown, the bus shutdown will call auxdrv->shutdown and cause an invalid ptr dereference. Add check to make sure device->driver is not NULL before we proceed. Fixes: 7de3697 ("Add auxiliary bus support") Cc: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/160710040926.1889434.8840329810698403478.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the auxiliary device code is built into the kernel, it can be executed before the auxiliary bus is registered. This causes bus->p to be not allocated and triggers a NULL pointer dereference when the auxiliary bus device gets added with bus_add_device(). Call the auxiliary_bus_init() under driver_init() so the bus is initialized before devices. Below is the kernel splat for the bug: [ 1.948215] BUG: kernel NULL pointer dereference, address: 0000000000000060 [ 1.950670] #PF: supervisor read access in kernel mode [ 1.950670] #PF: error_code(0x0000) - not-present page [ 1.950670] PGD 0 [ 1.950670] Oops: 0000 1 SMP NOPTI [ 1.950670] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.10.0-intel-nextsvmtest+ #2205 [ 1.950670] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 1.950670] RIP: 0010:bus_add_device+0x64/0x140 [ 1.950670] Code: 00 49 8b 75 20 48 89 df e8 59 a1 ff ff 41 89 c4 85 c0 75 7b 48 8b 53 50 48 85 d2 75 03 48 8b 13 49 8b 85 a0 00 00 00 48 89 de <48> 8 78 60 48 83 c7 18 e8 ef d9 a9 ff 41 89 c4 85 c0 75 45 48 8b [ 1.950670] RSP: 0000:ff46032ac001baf8 EFLAGS: 00010246 [ 1.950670] RAX: 0000000000000000 RBX: ff4597f7414aa680 RCX: 0000000000000000 [ 1.950670] RDX: ff4597f74142bbc0 RSI: ff4597f7414aa680 RDI: ff4597f7414aa680 [ 1.950670] RBP: ff46032ac001bb10 R08: 0000000000000044 R09: 0000000000000228 [ 1.950670] R10: ff4597f741141b30 R11: ff4597f740182a90 R12: 0000000000000000 [ 1.950670] R13: ffffffffa5e936c0 R14: 0000000000000000 R15: 0000000000000000 [ 1.950670] FS: 0000000000000000(0000) GS:ff4597f7bba00000(0000) knlGS:0000000000000000 [ 1.950670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.950670] CR2: 0000000000000060 CR3: 000000002140c001 CR4: 0000000000f71ef0 [ 1.950670] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1.950670] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 1.950670] PKRU: 55555554 [ 1.950670] Call Trace: [ 1.950670] device_add+0x3ee/0x850 [ 1.950670] __auxiliary_device_add+0x47/0x60 [ 1.950670] idxd_pci_probe+0xf77/0x1180 [ 1.950670] local_pci_probe+0x4a/0x90 [ 1.950670] pci_device_probe+0xff/0x1b0 [ 1.950670] really_probe+0x1cf/0x440 [ 1.950670] ? rdinit_setup+0x31/0x31 [ 1.950670] driver_probe_device+0xe8/0x150 [ 1.950670] device_driver_attach+0x58/0x60 [ 1.950670] __driver_attach+0x8f/0x150 [ 1.950670] ? device_driver_attach+0x60/0x60 [ 1.950670] ? device_driver_attach+0x60/0x60 [ 1.950670] bus_for_each_dev+0x79/0xc0 [ 1.950670] ? kmem_cache_alloc_trace+0x323/0x430 [ 1.950670] driver_attach+0x1e/0x20 [ 1.950670] bus_add_driver+0x154/0x1f0 [ 1.950670] driver_register+0x70/0xc0 [ 1.950670] __pci_register_driver+0x54/0x60 [ 1.950670] idxd_init_module+0xe2/0xfc [ 1.950670] ? idma64_platform_driver_init+0x19/0x19 [ 1.950670] do_one_initcall+0x4a/0x1e0 [ 1.950670] kernel_init_freeable+0x1fc/0x25c [ 1.950670] ? rest_init+0xba/0xba [ 1.950670] kernel_init+0xe/0x116 [ 1.950670] ret_from_fork+0x1f/0x30 [ 1.950670] Modules linked in: [ 1.950670] CR2: 0000000000000060 [ 1.950670] --[ end trace cd7d1b226d3ca901 ]-- Fixes: 7de3697 ("Add auxiliary bus support") Reported-by: Jacob Pan <jacob.jun.pan@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/20210210201611.1611074-1-dave.jiang@intel.com Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove module bits in the auxiliary bus code since the auxiliary bus cannot be built as a module and the relevant code is not needed. Cc: Dave Ertman <david.m.ertman@intel.com> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/161307488980.1896017.15627190714413338196.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If driver_register() returns with error we need to free the memory allocated for auxdrv->driver.name before returning from __auxiliary_driver_register() Fixes: 7de3697 ("Add auxiliary bus support") Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Link: https://lore.kernel.org/r/20210713093438.3173-1-peter.ujfalusi@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The documentation for creating an auxiliary device is a 3 step not a 2 step process. Specifically the requirements of setting the name, id, dev.release, and dev.parent fields was not clear as a precursor to the '2 step' process documented. Clarify by declaring this a 3 step process starting with setting the fields of struct auxiliary_device correctly. Also add some sample code and tie the change into the rest of the documentation. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20211202044305.4006853-2-ira.weiny@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
__auxiliary_driver_register is not intended to be called directly unless a custom name is required. Add documentation for this fact. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20211202044305.4006853-5-ira.weiny@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
… device auxiliary_find_device() takes a proper get_device() reference on the device before returning the matched device. Users of this call should be informed that they need to properly release this reference with put_device(). Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20211202044305.4006853-7-ira.weiny@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The code and documentation are more difficult to maintain when kept separately. This is further compounded when the standard structure documentation infrastructure is not used. Move the documentation into the code, use the standard documentation infrastructure, add current documented functions, and reference the text in the rst file. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20211202044305.4006853-8-ira.weiny@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Stephen Cheng <stephen.cheng@citrix.com>
not expecting the LUN change. Lest it may trigger udev rules which in turn trigger more LUN change events. Signed-off-by: Gerald Elder-Vass <gerald.elder-vass@cloud.com>
Due to some backport trying to fix a possible deadlock (see CA-399062
for details) some patches which introduced many more deadlock were
merged to XS8.
Revert these patches and fix the original potential deadlock too.
These deadlock were generated by the same lock taken by no-IRQ context
and nested lock attempt on IRQ context as IRQ were not disabled on
first lock. Calling spin_lock_bh instead of spin_lock prevent IRQ
code nested to no-IRQ code.
The removal of "transport_lock" in "xs_tcp_state_change" is enabled by
commit 13f6175 ("XSI-2150: Remove added chunk causing deadlock",
specifically "SUNRPC: Move reset of TCP state variables into the
reconnect code"). A review of "xs_tcp_state_change" shows that similar
accesses through the "transport" pointer (notably for "TCP_CLOSE_WAIT"
and "TCP_CLOSE") are already performed without this lock. State bit
updates are atomic, and the work queue is adequately protected by
other synchronization mechanisms, including additional spinlocks and
CPU-local variables.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Warning
NOT intended to be merged into
kernel/xenserver-4.19.19-8.0.45/base, this is solely to review the new patch-queue.To review the new patch-set from prior version:
Note
The actual diff between the two trees can be visualized with:
https://github.com/xcp-ng/linux/compare/kernel/xenserver-4.19.19-8.0.45/base..kernel/xenserver-4.19.19-8.0.46/base