- Dec 25, 2022
-
-
Steven Rostedt (Google) authored
Due to several bugs caused by timers being re-armed after they are shutdown and just before they are freed, a new state of timers was added called "shutdown". After a timer is set to this state, then it can no longer be re-armed. The following script was run to find all the trivial locations where del_timer() or del_timer_sync() is called in the same function that the object holding the timer is freed. It also ignores any locations where the timer->function is modified between the del_timer*() and the free(), as that is not considered a "trivial" case. This was created by using a coccinelle script and the following commands: $ cat timer.cocci @@ expression ptr, slab; identifier timer, rfield; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); | - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... when strict when != ptr->timer ( kfree_rcu(ptr, rfield); | kmem_cache_free(slab, ptr); | kfree(ptr); ) $ spatch timer.cocci . > /tmp/t.patch $ patch -p1 < /tmp/t.patch Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/ Signed-off-by:
Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ] Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ] Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ] Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Dec 20, 2022
-
-
Jason A. Donenfeld authored
Convert the final two users of prandom_u32_max() that slipped in during 6.2-rc1 to use get_random_u32_below(). Then, with no more users left, we can finally remove the deprecated function. Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
Benjamin Coddington authored
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim. The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate. Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect. CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au> Suggested-by:
Guillaume Nault <gnault@redhat.com> Signed-off-by:
Benjamin Coddington <bcodding@redhat.com> Reviewed-by:
Guillaume Nault <gnault@redhat.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Guillaume Nault authored
Sockets that can be used while recursing into memory reclaim, like those used by network block devices and file systems, mustn't use current->task_frag: if the current process is already using it, then the inner memory reclaim call would corrupt the task_frag structure. To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets that mustn't use current->task_frag, assuming that those used during memory reclaim had their allocation constraints reflected in ->sk_allocation. This unfortunately doesn't cover all cases: in an attempt to remove all usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in ->sk_allocation, and used memalloc_nofs critical sections instead. This breaks the sk_page_frag() heuristic since the allocation constraints are now stored in current->flags, which sk_page_frag() can't read without risking triggering a cache miss and slowing down TCP's fast path. This patch creates a new field in struct sock, named sk_use_task_frag, which sockets with memory reclaim constraints can set to false if they can't safely use current->task_frag. In such cases, sk_page_frag() now always returns the socket's page_frag (->sk_frag). The first user is sunrpc, which needs to avoid using current->task_frag but can keep ->sk_allocation set to GFP_KERNEL otherwise. Eventually, it might be possible to simplify sk_page_frag() by only testing ->sk_use_task_frag and avoid relying on the ->sk_allocation heuristic entirely (assuming other sockets will set ->sk_use_task_frag according to their constraints in the future). The new ->sk_use_task_frag field is placed in a hole in struct sock and belongs to a cache line shared with ->sk_shutdown. Therefore it should be hot and shouldn't have negative performance impacts on TCP's fast path (sk_shutdown is tested just before the while() loop in tcp_sendmsg_locked()). Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/ Signed-off-by:
Guillaume Nault <gnault@redhat.com> Reviewed-by:
Benjamin Coddington <bcodding@redhat.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Matt Johnston authored
The unregister check could be incorrectly triggered if a netdev changes its type after register. That is possible for a tun device using TUNSETLINK ioctl, resulting in mctp unregister failing and the netdev unregister waiting forever. This was encountered by https://github.com/openthread/openthread/issues/8523 Neither check at register or unregister is required. They were added in an attempt to track down mctp_ptr being set unexpectedly, which should not happen in normal operation. Fixes: 7b1871af ("mctp: Warn if pointer is set for a wrong dev type") Signed-off-by:
Matt Johnston <matt@codeconstruct.com.au> Link: https://lore.kernel.org/r/20221215054933.2403401-1-matt@codeconstruct.com.au Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Dec 19, 2022
-
-
Eric Dumazet authored
Changheon Lee reported TCP socket leaks, with a nice repro. It seems we leak TCP sockets with the following sequence: 1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket. Each ACK will cook an skb put in error queue, from __skb_tstamp_tx(). __skb_tstamp_tx() is using skb_clone(), unless SOF_TIMESTAMPING_OPT_TSONLY was also requested. 2) If the application is also using MSG_ZEROCOPY, then we put in the error queue cloned skbs that had a struct ubuf_info attached to them. Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc() does a sock_hold(). As long as the cloned skbs are still in sk_error_queue, socket refcount is kept elevated. 3) Application closes the socket, while error queue is not empty. Since tcp_close() no longer purges the socket error queue, we might end up with a TCP socket with at least one skb in error queue keeping the socket alive forever. This bug can be (ab)used to consume all kernel memory and freeze the host. We need to purge the error queue, with proper synchronization against concurrent writers. Fixes: 24bcbe1c ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()") Reported-by:
Changheon Lee <darklight2357@icloud.com> Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Dan Carpenter sayeth[1]: The patch 5e6ef4f1: "rxrpc: Make the I/O thread take over the call and local processor work" from Jan 23, 2020, leads to the following Smatch static checker warning: net/rxrpc/io_thread.c:283 rxrpc_input_packet() warn: bool is not less than zero. Fix this (for now) by changing rxrpc_new_incoming_call() to return an int with 0 or error code rather than bool. Note that the actual return value of rxrpc_input_packet() is currently ignored. I have a separate patch to clean that up. Fixes: 5e6ef4f1 ("rxrpc: Make the I/O thread take over the call and local processor work") Reported-by:
Dan Carpenter <error27@gmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006123.html [1] Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Dan Carpenter sayeth[1]: The patch 75bfdbf2: "rxrpc: Implement an in-kernel rxperf server for testing purposes" from Nov 3, 2022, leads to the following Smatch static checker warning: net/rxrpc/rxperf.c:337 rxperf_deliver_to_call() error: uninitialized symbol 'ret'. Fix this by initialising ret to 0. The value is only used for tracing purposes in the rxperf server. Fixes: 75bfdbf2 ("rxrpc: Implement an in-kernel rxperf server for testing purposes") Reported-by:
Dan Carpenter <error27@gmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006124.html [1] Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
The rxrpc I/O thread checks to see if there's any work it needs to do, and if not, checks kthread_should_stop() before scheduling, and if it should stop, breaks out of the loop and tries to clean up and exit. This can, however, race with socket destruction, wherein outstanding calls are aborted and released from the socket and then the socket unuses the local endpoint, causing kthread_stop() to be issued. The abort is deferred to the I/O thread and the event can by issued between the I/O thread checking if there's any work to be done (such as processing call aborts) and the stop being seen. This results in the I/O thread stopping processing of events whilst call cleanup events are still outstanding, leading to connections or other objects still being around and uncleaned up, which can result in assertions being triggered, e.g.: rxrpc: AF_RXRPC: Leaked client conn 00000000e8009865 {2} ------------[ cut here ]------------ kernel BUG at net/rxrpc/conn_client.c:64! Fix this by retrieving the kthread_should_stop() indication, then checking to see if there's more work to do, and going back round the loop if there is, and breaking out of the loop only if there wasn't. This was triggered by a syzbot test that produced some other symptom[1]. Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1] Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Fix the switched parameters on rxrpc_alloc_peer() and rxrpc_get_peer(). The ref argument and the why argument got mixed. Fixes: 47c810a7 ("rxrpc: trace: Don't use __builtin_return_address for rxrpc_peer tracing") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Now that rxrpc_put_local() may call kthread_stop(), it can't be called under spinlock as it might sleep. This can cause a problem in the peer keepalive code in rxrpc as it tries to avoid dropping the peer_hash_lock from the point it needs to re-add peer->keepalive_link to going round the loop again in rxrpc_peer_keepalive_dispatch(). Fix this by just dropping the lock when we don't need it and accepting that we'll have to take it again. This code is only called about every 20s for each peer, so not very often. This allows rxrpc_put_peer_unlocked() to be removed also. If triggered, this bug produces an oops like the following, as reproduced by a syzbot reproducer for a different oops[1]: BUG: sleeping function called from invalid context at kernel/sched/completion.c:101 ... RCU nest depth: 0, expected: 0 3 locks held by kworker/u9:0/50: #0: ffff88810e74a138 ((wq_completion)krxrpcd){+.+.}-{0:0}, at: process_one_work+0x294/0x636 #1: ffff8881013a7e20 ((work_completion)(&rxnet->peer_keepalive_work)){+.+.}-{0:0}, at: process_one_work+0x294/0x636 #2: ffff88817d366390 (&rxnet->peer_hash_lock){+.+.}-{2:2}, at: rxrpc_peer_keepalive_dispatch+0x2bd/0x35f ... Call Trace: <TASK> dump_stack_lvl+0x4c/0x5f __might_resched+0x2cf/0x2f2 __wait_for_common+0x87/0x1e8 kthread_stop+0x14d/0x255 rxrpc_peer_keepalive_dispatch+0x333/0x35f rxrpc_peer_keepalive_worker+0x2e9/0x449 process_one_work+0x3c1/0x636 worker_thread+0x25f/0x359 kthread+0x1a6/0x1b5 ret_from_fork+0x1f/0x30 Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1] Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
When starting a kthread, the __kthread_create_on_node() function, as called from kthread_run(), waits for a completion to indicate that the task_struct (or failure state) of the new kernel thread is available before continuing. This does not wait, however, for the thread function to be invoked and, indeed, will skip it if kthread_stop() gets called before it gets there. If this happens, though, kthread_run() will have returned successfully, indicating that the thread was started and returning the task_struct pointer. The actual error indication is returned by kthread_stop(). Note that this is ambiguous, as the caller cannot tell whether the -EINTR error code came from kthread() or from the thread function. This was encountered in the new rxrpc I/O thread, where if the system is being pounded hard by, say, syzbot, the check of KTHREAD_SHOULD_STOP can be delayed long enough for kthread_stop() to get called when rxrpc releases a socket - and this causes an oops because the I/O thread function doesn't get started and thus doesn't remove the rxrpc_local struct from the local_endpoints list. Fix this by using a completion to wait for the thread to actually enter rxrpc_io_thread(). This makes sure the thread can't be prematurely stopped and makes sure the relied-upon cleanup is done. Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread") Reported-by:
<syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Hillf Danton <hdanton@sina.com> Link: https://lore.kernel.org/r/000000000000229f1505ef2b6159@google.com/ Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Fix rxrpc_unuse_local() to get the debug_id *after* checking to see if local is NULL. Fixes: a2cf3264 ("rxrpc: Fold __rxrpc_unuse_local() into rxrpc_unuse_local()") Reported-by:
<syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com> Signed-off-by:
David Howells <dhowells@redhat.com> Tested-by:
<syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
Fix the propagation of the security settings from sendmsg to the rxrpc_call struct. Fixes: f3441d41 ("rxrpc: Copy client call parameters into rxrpc_call earlier") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
One of the error paths in rxrpc_do_sendmsg() doesn't unlock the call mutex before returning. Fix it to do this. Note that this still doesn't get rid of the checker warning: ../net/rxrpc/sendmsg.c:617:5: warning: context imbalance in 'rxrpc_do_sendmsg' - wrong count at exit I think the interplay between the socket lock and the call's user_mutex may be too complicated for checker to analyse, especially as rxrpc_new_client_call_for_sendmsg(), which it calls, returns with the call's user_mutex if successful but unconditionally drops the socket lock. Fixes: e754eba6 ("rxrpc: Provide a cmsg to specify the amount of Tx data for a call") Signed-off-by:
David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Cong Wang authored
When TCF_EM_SIMPLE was introduced, it is supposed to be convenient for ematch implementation: https://lore.kernel.org/all/20050105110048.GO26856@postel.suug.ch/ "You don't have to, providing a 32bit data chunk without TCF_EM_SIMPLE set will simply result in allocating & copy. It's an optimization, nothing more." So if an ematch module provides ops->datalen that means it wants a complex data structure (saved in its em->data) instead of a simple u32 value. We should simply reject such a combination, otherwise this u32 could be misinterpreted as a pointer. Fixes: 1da177e4 ("Linux-2.6.12-rc2") Reported-and-tested-by:
<syzbot+4caeae4c7103813598ae@syzkaller.appspotmail.com> Reported-by:
Jun Nie <jun.nie@linaro.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Paolo Abeni <pabeni@redhat.com> Signed-off-by:
Cong Wang <cong.wang@bytedance.com> Acked-by:
Paolo Abeni <pabeni@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Dec 17, 2022
-
-
Subash Abhinov Kasiviswanathan authored
Extending the tail can have some unexpected side effects if a program uses a helper like BPF_FUNC_skb_pull_data to read partial content beyond the head skb headlen when all the skbs in the gso frag_list are linear with no head_frag - kernel BUG at net/core/skbuff.c:4219! pc : skb_segment+0xcf4/0xd2c lr : skb_segment+0x63c/0xd2c Call trace: skb_segment+0xcf4/0xd2c __udp_gso_segment+0xa4/0x544 udp4_ufo_fragment+0x184/0x1c0 inet_gso_segment+0x16c/0x3a4 skb_mac_gso_segment+0xd4/0x1b0 __skb_gso_segment+0xcc/0x12c udp_rcv_segment+0x54/0x16c udp_queue_rcv_skb+0x78/0x144 udp_unicast_rcv_skb+0x8c/0xa4 __udp4_lib_rcv+0x490/0x68c udp_rcv+0x20/0x30 ip_protocol_deliver_rcu+0x1b0/0x33c ip_local_deliver+0xd8/0x1f0 ip_rcv+0x98/0x1a4 deliver_ptype_list_skb+0x98/0x1ec __netif_receive_skb_core+0x978/0xc60 Fix this by marking these skbs as GSO_DODGY so segmentation can handle the tail updates accordingly. Fixes: 3dcbdb13 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list") Signed-off-by:
Sean Tranchetti <quic_stranche@quicinc.com> Signed-off-by:
Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Reviewed-by:
Alexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/1671084718-24796-1-git-send-email-quic_subashab@quicinc.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Jakub Kicinski authored
Take the instance lock around devlink_nl_fill() when dumping, doit takes it already. We are only dumping basic info so in the worst case we were risking data races around the reload statistics. Until the big devlink mutex was removed all relevant code was protected by it, so the missing instance lock was not exposed. Fixes: d3efc2a6 ("net: devlink: remove devlink_mutex") Reviewed-by:
Jiri Pirko <jiri@nvidia.com> Reviewed-by:
Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20221216044122.1863550-1-kuba@kernel.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Dec 16, 2022
-
-
Eelco Chaudron authored
The commit mentioned below causes the ovs_flow_tbl_lookup() function to be called with the masked key. However, it's supposed to be called with the unmasked key. This due to the fact that the datapath supports installing wider flows, and OVS relies on this behavior. For example if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/ 128.0.0.0) is allowed to be added. However, if we try to add a wildcard rule, the installation fails: $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2 $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \ ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2 ovs-vswitchd: updating flow table (File exists) The reason is that the key used to determine if the flow is already present in the system uses the original key ANDed with the mask. This results in the IP address not being part of the (miniflow) key, i.e., being substituted with an all-zero value. When doing the actual lookup, this results in the key wrongfully matching the first flow, and therefore the flow does not get installed. This change reverses the commit below, but rather than having the key on the stack, it's allocated. Fixes: 190aa3e7 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.") Signed-off-by:
Eelco Chaudron <echaudro@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
Netdevsim triggers a splat on reload, when it destroys regions with snapshots pending: WARNING: CPU: 1 PID: 787 at net/core/devlink.c:6291 devlink_region_snapshot_del+0x12e/0x140 CPU: 1 PID: 787 Comm: devlink Not tainted 6.1.0-07460-g7ae9888d6e1c #580 RIP: 0010:devlink_region_snapshot_del+0x12e/0x140 Call Trace: <TASK> devl_region_destroy+0x70/0x140 nsim_dev_reload_down+0x2f/0x60 [netdevsim] devlink_reload+0x1f7/0x360 devlink_nl_cmd_reload+0x6ce/0x860 genl_family_rcv_msg_doit.isra.0+0x145/0x1c0 This is the locking assert in devlink_region_snapshot_del(), we're supposed to be holding the region->snapshot_lock here. Fixes: 2dec18ad ("net: devlink: remove region snapshots list dependency on devlink->lock") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Reviewed-by:
Jiri Pirko <jiri@nvidia.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Dec 15, 2022
-
-
Peter Zijlstra authored
Because endlessly repeating: set_memory_ro() set_memory_x() is getting tedious. Suggested-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/Y1jek64pXOsougmz@hirez.programming.kicks-ass.net
-
There is a race resulting in alive SOCK_SEQPACKET socket may change its state from TCP_ESTABLISHED to TCP_CLOSE: unix_release_sock(peer) unix_dgram_sendmsg(sk) sock_orphan(peer) sock_set_flag(peer, SOCK_DEAD) sock_alloc_send_pskb() if !(sk->sk_shutdown & SEND_SHUTDOWN) OK if sock_flag(peer, SOCK_DEAD) sk->sk_state = TCP_CLOSE sk->sk_shutdown = SHUTDOWN_MASK After that socket sk remains almost normal: it is able to connect, listen, accept and recvmsg, while it can't sendmsg. Since this is the only possibility for alive SOCK_SEQPACKET to change the state in such way, we should better fix this strange and potentially danger corner case. Note, that we will return EPIPE here like this is normally done in sock_alloc_send_pskb(). Originally used ECONNREFUSED looks strange, since it's strange to return a specific retval in dependence of race in kernel, when user can't affect on this. Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock to fix race with unix_dgram_connect(): unix_dgram_connect(other) unix_dgram_sendmsg(sk) unix_peer(sk) = NULL unix_state_unlock(sk) unix_state_double_lock(sk, other) sk->sk_state = TCP_ESTABLISHED unix_peer(sk) = other unix_state_double_unlock(sk, other) sk->sk_state = TCP_CLOSED This patch fixes both of these races. Fixes: 83301b53 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too") Signed-off-by:
Kirill Tkhai <tkhai@ya.ru> Link: https://lore.kernel.org/r/135fda25-22d5-837a-782b-ceee50e19844@ya.ru Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Dec 13, 2022
-
-
Li Qiong authored
It is better to return the default switch case with '-EINVAL', in case new commands are added. otherwise, return a uninitialized value of ret. Signed-off-by:
Li Qiong <liqiong@nfschina.com> Reviewed-by:
Simon Horman <horms@verge.net.au> Acked-by:
Julian Anastasov <ja@ssi.bg> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
32 bit platforms without 64bit div generate the following warning: net/netfilter/ipvs/ip_vs_est.c: In function 'ip_vs_est_calc_limits': include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast 222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \ | ^~ net/netfilter/ipvs/ip_vs_est.c:694:17: note: in expansion of macro 'do_div' 694 | do_div(val, loops); | ^~~~~~ include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast 222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \ | ^~ net/netfilter/ipvs/ip_vs_est.c:700:33: note: in expansion of macro 'do_div' 700 | do_div(val, min_est); | ^~~~~~ first argument of do_div() should be unsigned. We can't just cast as do_div() updates it as well, so we need an lval. Make val unsigned in the first place, all paths check that the value they assign to this variables are non-negative already. Fixes: 705dd344 ("ipvs: use kthreads for stats estimation") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221213032037.844517-1-kuba@kernel.org Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
Dominique Martinet authored
KCSAN reported a race between writing req->status in p9_client_cb and accessing it in p9_client_rpc's wait_event. Accesses to req itself is protected by the data barrier (writing req fields, write barrier, writing status // reading status, read barrier, reading other req fields), but status accesses themselves apparently also must be annotated properly with WRITE_ONCE/READ_ONCE when we access it without locks. Follows: - error paths writing status in various threads all can notify p9_client_rpc, so these all also need WRITE_ONCE - there's a similar read loop in trans_virtio for zc case that also needs READ_ONCE - other reads in trans_fd should be protected by the trans_fd lock and lists state machine, as corresponding writers all are within trans_fd and should be under the same lock. If KCSAN complains on them we likely will have something else to fix as well, so it's better to leave them unmarked and look again if required. Link: https://lkml.kernel.org/r/20221205124756.426350-1-asmadeus@codewreck.org Reported-by:
Naresh Kamboju <naresh.kamboju@linaro.org> Suggested-by:
Marco Elver <elver@google.com> Acked-by:
Marco Elver <elver@google.com> Reviewed-by:
Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by:
Dominique Martinet <asmadeus@codewreck.org>
-
- Dec 12, 2022
-
-
Coco Li authored
IPv6/TCP and GRO stacks can build big TCP packets with an added temporary Hop By Hop header. Is GSO is not involved, then the temporary header needs to be removed in the driver. This patch provides a generic helper for drivers that need to modify their headers in place. Tested: Compiled and ran with ethtool -K eth1 tso off Could send Big TCP packets Signed-off-by:
Coco Li <lixiaoyan@google.com> Link: https://lore.kernel.org/r/20221210041646.3587757-1-lixiaoyan@google.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Now that user space can specify additional attributes of port group entries such as filter mode and source list, it makes sense to allow user space to atomically modify these attributes by replacing entries instead of forcing user space to delete the entries and add them back. Replace MDB port group entries when the 'NLM_F_REPLACE' flag is specified in the netlink message header. When a (*, G) entry is replaced, update the following attributes: Source list, state, filter mode, protocol and flags. If the entry is temporary and in EXCLUDE mode, reset the group timer to the group membership interval. If the entry is temporary and in INCLUDE mode, reset the source timers of associated sources to the group membership interval. Examples: # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static 0.00 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static 0.00 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00 dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00 # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp # bridge -d -s mdb show dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp 0.00 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp 0.00 dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp 0.00 Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Add the 'MDBE_ATTR_RTPORT' attribute to allow user space to specify the routing protocol of the MDB port group entry. Enforce a minimum value of 'RTPROT_STATIC' to prevent user space from using protocol values that should only be set by the kernel (e.g., 'RTPROT_KERNEL'). Maintain backward compatibility by defaulting to 'RTPROT_STATIC'. The protocol is already visible to user space in RTM_NEWMDB responses and notifications via the 'MDBA_MDB_EATTR_RTPROT' attribute. The routing protocol allows a routing daemon to distinguish between entries configured by it and those configured by the administrator. Once MDB flush is supported, the protocol can be used as a criterion according to which the flush is performed. Examples: # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto kernel Error: integer out of range. # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto static # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent proto zebra # bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent source_list 198.51.100.1,198.51.100.2 filter_mode include proto 250 # bridge -d mdb show dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.2 permanent filter_mode include proto 250 dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.1 permanent filter_mode include proto 250 dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode include source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto 250 dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto static Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Add new netlink attributes to the RTM_NEWMDB request that allow user space to add (*, G) with a source list and filter mode. The RTM_NEWMDB message can already dump such entries (created by the kernel) so there is no need to add dump support. However, the message contains a different set of attributes depending if it is a request or a response. The naming and structure of the new attributes try to follow the existing ones used in the response. Request: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_SET_ENTRY ] struct br_mdb_entry [ MDBA_SET_ENTRY_ATTRS ] [ MDBE_ATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBE_ATTR_SRC_LIST ] // new [ MDBE_SRC_LIST_ENTRY ] [ MDBE_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ ...] [ MDBE_ATTR_GROUP_MODE ] // new u8 Response: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_MDB ] [ MDBA_MDB_ENTRY ] [ MDBA_MDB_ENTRY_INFO ] struct br_mdb_entry [ MDBA_MDB_EATTR_TIMER ] u32 [ MDBA_MDB_EATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBA_MDB_EATTR_RTPROT ] u8 [ MDBA_MDB_EATTR_SRC_LIST ] [ MDBA_MDB_SRCLIST_ENTRY ] [ MDBA_MDB_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ MDBA_MDB_SRCATTR_TIMER ] u8 [...] [ MDBA_MDB_EATTR_GROUP_MODE ] u8 Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
In preparation for allowing user space to add (*, G) entries with a source list and associated filter mode, add the necessary plumbing to handle such requests. Extend the MDB configuration structure with a currently empty source array and filter mode that is currently hard coded to EXCLUDE. Add the source entries and the corresponding (S, G) entries before making the new (*, G) port group entry visible to the data path. Handle the creation of each source entry in a similar fashion to how it is created from the data path in response to received Membership Reports: Create the source entry, arm the source timer (if needed), add a corresponding (S, G) forwarding entry and finally mark the source entry as installed (by user space). Add the (S, G) entry by populating an MDB configuration structure and calling br_mdb_add_group_sg() as if a new entry is created by user space, with the sole difference that the 'src_entry' field is set to make sure that the group timer of such entries is never armed. Note that it is not currently possible to add more than 32 source entries to a port group entry. If this proves to be a problem we can either increase 'PG_SRC_ENT_LIMIT' or avoid forcing a limit on entries created by user space. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
User space will soon be able to install a (*, G) with a source list, prompting the creation of a (S, G) entry for each source. In this case, the group timer of the (S, G) entry should never be set. Solve this by adding a new field to the MDB configuration structure that denotes whether the (S, G) corresponds to a source or not. The field will be set in a subsequent patch where br_mdb_add_group_sg() is called in order to create a (S, G) entry for each user provided source. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
There are a few places where the bridge driver differentiates between (S, G) entries installed by the kernel (in response to Membership Reports) and those installed by user space. One of them is when deleting an (S, G) entry corresponding to a source entry that is being deleted. While user space cannot currently add a source entry to a (*, G), it can add an (S, G) entry that later corresponds to a source entry created by the reception of a Membership Report. If this source entry is later deleted because its source timer expired or because the (*, G) entry is being deleted, the bridge driver will not delete the corresponding (S, G) entry if it was added by user space as permanent. This is going to be a problem when the ability to install a (*, G) with a source list is exposed to user space. In this case, when user space installs the (*, G) as permanent, then all the (S, G) entries corresponding to its source list will also be installed as permanent. When user space deletes the (*, G), all the source entries will be deleted and the expectation is that the corresponding (S, G) entries will be deleted as well. Solve this by introducing a new source entry flag denoting that the entry was installed by user space. When the entry is deleted, delete the corresponding (S, G) entry even if it was installed by user space as permanent, as the flag tells us that it was installed in response to the source entry being created. The flag will be set in a subsequent patch where source entries are created in response to user requests. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Expose __br_multicast_del_group_src() which is symmetric to br_multicast_new_group_src() and does not remove the installed {S, G} forwarding entry, unlike br_multicast_del_group_src(). The function will be used in the error path when user space was able to add a new source entry, but failed to install a corresponding forwarding entry. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Currently, new group source entries are only created in response to received Membership Reports. Subsequent patches are going to allow user space to install (*, G) entries with a source list. As a preparatory step, expose br_multicast_new_group_src() so that it could later be invoked from the MDB code (i.e., br_mdb.c) that handles RTM_NEWMDB messages. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Subsequent patches will add memory allocations in br_mdb_config_init() as the MDB configuration structure will include a linked list of source entries. This memory will need to be freed regardless if br_mdb_add() succeeded or failed. As a preparation for this change, add a centralized error path where the memory will be freed. Note that br_mdb_del() already has one error path and therefore does not require any changes. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Subsequent patches are going to add additional validation functions and netlink policies. Some of these functions will need to perform parsing using nla_parse_nested() and the new policies. In order to keep all the policies next to each other, move the current policy to before the validation functions. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
When the bridge is using IGMP version 3 or MLD version 2, it handles the addition of (*, G) and (S, G) entries differently. When a new (S, G) port group entry is added, all the (*, G) EXCLUDE ports need to be added to the port group of the new entry. Similarly, when a new (*, G) EXCLUDE port group entry is added, the port needs to be added to the port group of all the matching (S, G) entries. Subsequent patches will create more differences between both entry types. Namely, filter mode and source list can only be specified for (*, G) entries. Given the current and future differences between both entry types, handle the addition of each entry type in a different function, thereby avoiding the creation of one complex function. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries cannot be set from user space. Instead, it is set by the kernel according to the entry type: (*, G) entries are treated as EXCLUDE and (S, G) entries are treated as INCLUDE. This allows the kernel to derive the entry type from its filter mode. Subsequent patches will allow user space to set the filter mode of (*, G) entries, making the current assumption incorrect. As a preparation, remove the current assumption and instead determine the entry type from its key, which is a more direct way. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Acked-by:
Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Vladimir Oltean authored
If dsa_tag_8021q_setup() fails, for example due to the inability of the device to install a VLAN, the tag_8021q context of the switch will leak. Make sure it is freed on the error path. Fixes: 328621f6 ("net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register") Signed-off-by:
Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20221209235242.480344-1-vladimir.oltean@nxp.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Xin Long authored
Similar to Bonding and Team, to prevent ipv6 addrconf with IFF_NO_ADDRCONF in slave_dev->priv_flags for slave ports is also needed in net failover. Note that dev_open(slave_dev) is called in .slave_register, which is called after the IFF_NO_ADDRCONF flag is set in failover_slave_register(). Signed-off-by:
Xin Long <lucien.xin@gmail.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-