Skip to content
Snippets Groups Projects
  1. Dec 25, 2022
    • Steven Rostedt (Google)'s avatar
      treewide: Convert del_timer*() to timer_shutdown*() · 292a089d
      Steven Rostedt (Google) authored
      Due to several bugs caused by timers being re-armed after they are
      shutdown and just before they are freed, a new state of timers was added
      called "shutdown".  After a timer is set to this state, then it can no
      longer be re-armed.
      
      The following script was run to find all the trivial locations where
      del_timer() or del_timer_sync() is called in the same function that the
      object holding the timer is freed.  It also ignores any locations where
      the timer->function is modified between the del_timer*() and the free(),
      as that is not considered a "trivial" case.
      
      This was created by using a coccinelle script and the following
      commands:
      
          $ cat timer.cocci
          @@
          expression ptr, slab;
          identifier timer, rfield;
          @@
          (
          -       del_timer(&ptr->timer);
          +       timer_shutdown(&ptr->timer);
          |
          -       del_timer_sync(&ptr->timer);
          +       timer_shutdown_sync(&ptr->timer);
          )
            ... when strict
                when != ptr->timer
          (
                  kfree_rcu(ptr, rfield);
          |
                  kmem_cache_free(slab, ptr);
          |
                  kfree(ptr);
          )
      
          $ spatch timer.cocci . > /tmp/t.patch
          $ patch -p1 < /tmp/t.patch
      
      Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
      
      
      Signed-off-by: default avatarSteven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
      Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
      Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      292a089d
  2. Dec 20, 2022
    • Jason A. Donenfeld's avatar
      prandom: remove prandom_u32_max() · 3c202d14
      Jason A. Donenfeld authored
      
      Convert the final two users of prandom_u32_max() that slipped in during
      6.2-rc1 to use get_random_u32_below().
      
      Then, with no more users left, we can finally remove the deprecated
      function.
      
      Signed-off-by: default avatarJason A. Donenfeld <Jason@zx2c4.com>
      3c202d14
    • Benjamin Coddington's avatar
      Treewide: Stop corrupting socket's task_frag · 98123866
      Benjamin Coddington authored
      
      Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
      GFP_NOIO flag on sk_allocation which the networking system uses to decide
      when it is safe to use current->task_frag.  The results of this are
      unexpected corruption in task_frag when SUNRPC is involved in memory
      reclaim.
      
      The corruption can be seen in crashes, but the root cause is often
      difficult to ascertain as a crashing machine's stack trace will have no
      evidence of being near NFS or SUNRPC code.  I believe this problem to
      be much more pervasive than reports to the community may indicate.
      
      Fix this by having kernel users of sockets that may corrupt task_frag due
      to reclaim set sk_use_task_frag = false.  Preemptively correcting this
      situation for users that still set sk_allocation allows them to convert to
      memalloc_nofs_save/restore without the same unexpected corruptions that are
      sure to follow, unlikely to show up in testing, and difficult to bisect.
      
      CC: Philipp Reisner <philipp.reisner@linbit.com>
      CC: Lars Ellenberg <lars.ellenberg@linbit.com>
      CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Josef Bacik <josef@toxicpanda.com>
      CC: Keith Busch <kbusch@kernel.org>
      CC: Christoph Hellwig <hch@lst.de>
      CC: Sagi Grimberg <sagi@grimberg.me>
      CC: Lee Duncan <lduncan@suse.com>
      CC: Chris Leech <cleech@redhat.com>
      CC: Mike Christie <michael.christie@oracle.com>
      CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
      CC: "Martin K. Petersen" <martin.petersen@oracle.com>
      CC: Valentina Manea <valentina.manea.m@gmail.com>
      CC: Shuah Khan <shuah@kernel.org>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      CC: David Howells <dhowells@redhat.com>
      CC: Marc Dionne <marc.dionne@auristor.com>
      CC: Steve French <sfrench@samba.org>
      CC: Christine Caulfield <ccaulfie@redhat.com>
      CC: David Teigland <teigland@redhat.com>
      CC: Mark Fasheh <mark@fasheh.com>
      CC: Joel Becker <jlbec@evilplan.org>
      CC: Joseph Qi <joseph.qi@linux.alibaba.com>
      CC: Eric Van Hensbergen <ericvh@gmail.com>
      CC: Latchesar Ionkov <lucho@ionkov.net>
      CC: Dominique Martinet <asmadeus@codewreck.org>
      CC: Ilya Dryomov <idryomov@gmail.com>
      CC: Xiubo Li <xiubli@redhat.com>
      CC: Chuck Lever <chuck.lever@oracle.com>
      CC: Jeff Layton <jlayton@kernel.org>
      CC: Trond Myklebust <trond.myklebust@hammerspace.com>
      CC: Anna Schumaker <anna@kernel.org>
      CC: Steffen Klassert <steffen.klassert@secunet.com>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      
      Suggested-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Reviewed-by: default avatarGuillaume Nault <gnault@redhat.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      98123866
    • Guillaume Nault's avatar
      net: Introduce sk_use_task_frag in struct sock. · fb87bd47
      Guillaume Nault authored
      Sockets that can be used while recursing into memory reclaim, like
      those used by network block devices and file systems, mustn't use
      current->task_frag: if the current process is already using it, then
      the inner memory reclaim call would corrupt the task_frag structure.
      
      To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
      that mustn't use current->task_frag, assuming that those used during
      memory reclaim had their allocation constraints reflected in
      ->sk_allocation.
      
      This unfortunately doesn't cover all cases: in an attempt to remove all
      usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
      ->sk_allocation, and used memalloc_nofs critical sections instead.
      This breaks the sk_page_frag() heuristic since the allocation
      constraints are now stored in current->flags, which sk_page_frag()
      can't read without risking triggering a cache miss and slowing down
      TCP's fast path.
      
      This patch creates a new field in struct sock, named sk_use_task_frag,
      which sockets with memory reclaim constraints can set to false if they
      can't safely use current->task_frag. In such cases, sk_page_frag() now
      always returns the socket's page_frag (->sk_frag). The first user is
      sunrpc, which needs to avoid using current->task_frag but can keep
      ->sk_allocation set to GFP_KERNEL otherwise.
      
      Eventually, it might be possible to simplify sk_page_frag() by only
      testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
      heuristic entirely (assuming other sockets will set ->sk_use_task_frag
      according to their constraints in the future).
      
      The new ->sk_use_task_frag field is placed in a hole in struct sock and
      belongs to a cache line shared with ->sk_shutdown. Therefore it should
      be hot and shouldn't have negative performance impacts on TCP's fast
      path (sk_shutdown is tested just before the while() loop in
      tcp_sendmsg_locked()).
      
      Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
      
      
      Signed-off-by: default avatarGuillaume Nault <gnault@redhat.com>
      Reviewed-by: default avatarBenjamin Coddington <bcodding@redhat.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fb87bd47
    • Matt Johnston's avatar
      mctp: Remove device type check at unregister · b389a902
      Matt Johnston authored
      The unregister check could be incorrectly triggered if a netdev
      changes its type after register. That is possible for a tun device
      using TUNSETLINK ioctl, resulting in mctp unregister failing
      and the netdev unregister waiting forever.
      
      This was encountered by https://github.com/openthread/openthread/issues/8523
      
      
      
      Neither check at register or unregister is required. They were added in
      an attempt to track down mctp_ptr being set unexpectedly, which should
      not happen in normal operation.
      
      Fixes: 7b1871af ("mctp: Warn if pointer is set for a wrong dev type")
      Signed-off-by: default avatarMatt Johnston <matt@codeconstruct.com.au>
      Link: https://lore.kernel.org/r/20221215054933.2403401-1-matt@codeconstruct.com.au
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b389a902
  3. Dec 19, 2022
    • Eric Dumazet's avatar
      net: stream: purge sk_error_queue in sk_stream_kill_queues() · e0c8bccd
      Eric Dumazet authored
      
      Changheon Lee reported TCP socket leaks, with a nice repro.
      
      It seems we leak TCP sockets with the following sequence:
      
      1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
      
         Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
         __skb_tstamp_tx() is using skb_clone(), unless
         SOF_TIMESTAMPING_OPT_TSONLY was also requested.
      
      2) If the application is also using MSG_ZEROCOPY, then we put in the
         error queue cloned skbs that had a struct ubuf_info attached to them.
      
         Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
         does a sock_hold().
      
         As long as the cloned skbs are still in sk_error_queue,
         socket refcount is kept elevated.
      
      3) Application closes the socket, while error queue is not empty.
      
      Since tcp_close() no longer purges the socket error queue,
      we might end up with a TCP socket with at least one skb in
      error queue keeping the socket alive forever.
      
      This bug can be (ab)used to consume all kernel memory
      and freeze the host.
      
      We need to purge the error queue, with proper synchronization
      against concurrent writers.
      
      Fixes: 24bcbe1c ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
      Reported-by: default avatarChangheon Lee <darklight2357@icloud.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e0c8bccd
    • David Howells's avatar
      rxrpc: Fix the return value of rxrpc_new_incoming_call() · 31d35a02
      David Howells authored
      
      Dan Carpenter sayeth[1]:
      
        The patch 5e6ef4f1: "rxrpc: Make the I/O thread take over the
        call and local processor work" from Jan 23, 2020, leads to the
        following Smatch static checker warning:
      
      	net/rxrpc/io_thread.c:283 rxrpc_input_packet()
      	warn: bool is not less than zero.
      
      Fix this (for now) by changing rxrpc_new_incoming_call() to return an int
      with 0 or error code rather than bool.  Note that the actual return value
      of rxrpc_input_packet() is currently ignored.  I have a separate patch to
      clean that up.
      
      Fixes: 5e6ef4f1 ("rxrpc: Make the I/O thread take over the call and local processor work")
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006123.html
      
       [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      31d35a02
    • David Howells's avatar
      rxrpc: rxperf: Fix uninitialised variable · 11e1706b
      David Howells authored
      
      Dan Carpenter sayeth[1]:
      
        The patch 75bfdbf2: "rxrpc: Implement an in-kernel rxperf server
        for testing purposes" from Nov 3, 2022, leads to the following Smatch
        static checker warning:
      
      	net/rxrpc/rxperf.c:337 rxperf_deliver_to_call()
      	error: uninitialized symbol 'ret'.
      
      Fix this by initialising ret to 0.  The value is only used for tracing
      purposes in the rxperf server.
      
      Fixes: 75bfdbf2 ("rxrpc: Implement an in-kernel rxperf server for testing purposes")
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006124.html
      
       [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      11e1706b
    • David Howells's avatar
      rxrpc: Fix I/O thread stop · 743d1768
      David Howells authored
      
      The rxrpc I/O thread checks to see if there's any work it needs to do, and
      if not, checks kthread_should_stop() before scheduling, and if it should
      stop, breaks out of the loop and tries to clean up and exit.
      
      This can, however, race with socket destruction, wherein outstanding calls
      are aborted and released from the socket and then the socket unuses the
      local endpoint, causing kthread_stop() to be issued.  The abort is deferred
      to the I/O thread and the event can by issued between the I/O thread
      checking if there's any work to be done (such as processing call aborts)
      and the stop being seen.
      
      This results in the I/O thread stopping processing of events whilst call
      cleanup events are still outstanding, leading to connections or other
      objects still being around and uncleaned up, which can result in assertions
      being triggered, e.g.:
      
          rxrpc: AF_RXRPC: Leaked client conn 00000000e8009865 {2}
          ------------[ cut here ]------------
          kernel BUG at net/rxrpc/conn_client.c:64!
      
      Fix this by retrieving the kthread_should_stop() indication, then checking
      to see if there's more work to do, and going back round the loop if there
      is, and breaking out of the loop only if there wasn't.
      
      This was triggered by a syzbot test that produced some other symptom[1].
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/
      
       [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      743d1768
    • David Howells's avatar
      rxrpc: Fix switched parameters in peer tracing · c838f1a7
      David Howells authored
      
      Fix the switched parameters on rxrpc_alloc_peer() and rxrpc_get_peer().
      The ref argument and the why argument got mixed.
      
      Fixes: 47c810a7 ("rxrpc: trace: Don't use __builtin_return_address for rxrpc_peer tracing")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c838f1a7
    • David Howells's avatar
      rxrpc: Fix locking issues in rxrpc_put_peer_locked() · 608aecd1
      David Howells authored
      
      Now that rxrpc_put_local() may call kthread_stop(), it can't be called
      under spinlock as it might sleep.  This can cause a problem in the peer
      keepalive code in rxrpc as it tries to avoid dropping the peer_hash_lock
      from the point it needs to re-add peer->keepalive_link to going round the
      loop again in rxrpc_peer_keepalive_dispatch().
      
      Fix this by just dropping the lock when we don't need it and accepting that
      we'll have to take it again.  This code is only called about every 20s for
      each peer, so not very often.
      
      This allows rxrpc_put_peer_unlocked() to be removed also.
      
      If triggered, this bug produces an oops like the following, as reproduced
      by a syzbot reproducer for a different oops[1]:
      
      BUG: sleeping function called from invalid context at kernel/sched/completion.c:101
      ...
      RCU nest depth: 0, expected: 0
      3 locks held by kworker/u9:0/50:
       #0: ffff88810e74a138 ((wq_completion)krxrpcd){+.+.}-{0:0}, at: process_one_work+0x294/0x636
       #1: ffff8881013a7e20 ((work_completion)(&rxnet->peer_keepalive_work)){+.+.}-{0:0}, at: process_one_work+0x294/0x636
       #2: ffff88817d366390 (&rxnet->peer_hash_lock){+.+.}-{2:2}, at: rxrpc_peer_keepalive_dispatch+0x2bd/0x35f
      ...
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4c/0x5f
       __might_resched+0x2cf/0x2f2
       __wait_for_common+0x87/0x1e8
       kthread_stop+0x14d/0x255
       rxrpc_peer_keepalive_dispatch+0x333/0x35f
       rxrpc_peer_keepalive_worker+0x2e9/0x449
       process_one_work+0x3c1/0x636
       worker_thread+0x25f/0x359
       kthread+0x1a6/0x1b5
       ret_from_fork+0x1f/0x30
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/
      
       [1]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      608aecd1
    • David Howells's avatar
      rxrpc: Fix I/O thread startup getting skipped · 8fbcc833
      David Howells authored
      
      When starting a kthread, the __kthread_create_on_node() function, as called
      from kthread_run(), waits for a completion to indicate that the task_struct
      (or failure state) of the new kernel thread is available before continuing.
      
      This does not wait, however, for the thread function to be invoked and,
      indeed, will skip it if kthread_stop() gets called before it gets there.
      
      If this happens, though, kthread_run() will have returned successfully,
      indicating that the thread was started and returning the task_struct
      pointer.  The actual error indication is returned by kthread_stop().
      
      Note that this is ambiguous, as the caller cannot tell whether the -EINTR
      error code came from kthread() or from the thread function.
      
      This was encountered in the new rxrpc I/O thread, where if the system is
      being pounded hard by, say, syzbot, the check of KTHREAD_SHOULD_STOP can be
      delayed long enough for kthread_stop() to get called when rxrpc releases a
      socket - and this causes an oops because the I/O thread function doesn't
      get started and thus doesn't remove the rxrpc_local struct from the
      local_endpoints list.
      
      Fix this by using a completion to wait for the thread to actually enter
      rxrpc_io_thread().  This makes sure the thread can't be prematurely
      stopped and makes sure the relied-upon cleanup is done.
      
      Fixes: a275da62 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
      Reported-by: default avatar <syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: Hillf Danton <hdanton@sina.com>
      Link: https://lore.kernel.org/r/000000000000229f1505ef2b6159@google.com/
      
      
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8fbcc833
    • David Howells's avatar
      rxrpc: Fix NULL deref in rxrpc_unuse_local() · eaa02390
      David Howells authored
      
      Fix rxrpc_unuse_local() to get the debug_id *after* checking to see if
      local is NULL.
      
      Fixes: a2cf3264 ("rxrpc: Fold __rxrpc_unuse_local() into rxrpc_unuse_local()")
      Reported-by: default avatar <syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatar <syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      eaa02390
    • David Howells's avatar
      rxrpc: Fix security setting propagation · fdb99487
      David Howells authored
      
      Fix the propagation of the security settings from sendmsg to the rxrpc_call
      struct.
      
      Fixes: f3441d41 ("rxrpc: Copy client call parameters into rxrpc_call earlier")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fdb99487
    • David Howells's avatar
      rxrpc: Fix missing unlock in rxrpc_do_sendmsg() · 4feb2c44
      David Howells authored
      
      One of the error paths in rxrpc_do_sendmsg() doesn't unlock the call mutex
      before returning.  Fix it to do this.
      
      Note that this still doesn't get rid of the checker warning:
      
         ../net/rxrpc/sendmsg.c:617:5: warning: context imbalance in 'rxrpc_do_sendmsg' - wrong count at exit
      
      I think the interplay between the socket lock and the call's user_mutex may
      be too complicated for checker to analyse, especially as
      rxrpc_new_client_call_for_sendmsg(), which it calls, returns with the
      call's user_mutex if successful but unconditionally drops the socket lock.
      
      Fixes: e754eba6 ("rxrpc: Provide a cmsg to specify the amount of Tx data for a call")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4feb2c44
    • Cong Wang's avatar
      net_sched: reject TCF_EM_SIMPLE case for complex ematch module · 9cd3fd20
      Cong Wang authored
      When TCF_EM_SIMPLE was introduced, it is supposed to be convenient
      for ematch implementation:
      
      https://lore.kernel.org/all/20050105110048.GO26856@postel.suug.ch/
      
      
      
      "You don't have to, providing a 32bit data chunk without TCF_EM_SIMPLE
      set will simply result in allocating & copy. It's an optimization,
      nothing more."
      
      So if an ematch module provides ops->datalen that means it wants a
      complex data structure (saved in its em->data) instead of a simple u32
      value. We should simply reject such a combination, otherwise this u32
      could be misinterpreted as a pointer.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-and-tested-by: default avatar <syzbot+4caeae4c7103813598ae@syzkaller.appspotmail.com>
      Reported-by: default avatarJun Nie <jun.nie@linaro.org>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarCong Wang <cong.wang@bytedance.com>
      Acked-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      9cd3fd20
  4. Dec 17, 2022
  5. Dec 16, 2022
    • Eelco Chaudron's avatar
      openvswitch: Fix flow lookup to use unmasked key · 68bb1010
      Eelco Chaudron authored
      
      The commit mentioned below causes the ovs_flow_tbl_lookup() function
      to be called with the masked key. However, it's supposed to be called
      with the unmasked key. This due to the fact that the datapath supports
      installing wider flows, and OVS relies on this behavior. For example
      if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider
      flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/
      128.0.0.0) is allowed to be added.
      
      However, if we try to add a wildcard rule, the installation fails:
      
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
      $ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
        ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
      ovs-vswitchd: updating flow table (File exists)
      
      The reason is that the key used to determine if the flow is already
      present in the system uses the original key ANDed with the mask.
      This results in the IP address not being part of the (miniflow) key,
      i.e., being substituted with an all-zero value. When doing the actual
      lookup, this results in the key wrongfully matching the first flow,
      and therefore the flow does not get installed.
      
      This change reverses the commit below, but rather than having the key
      on the stack, it's allocated.
      
      Fixes: 190aa3e7 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.")
      
      Signed-off-by: default avatarEelco Chaudron <echaudro@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      68bb1010
    • Jakub Kicinski's avatar
      devlink: hold region lock when flushing snapshots · b4cafb3d
      Jakub Kicinski authored
      
      Netdevsim triggers a splat on reload, when it destroys regions
      with snapshots pending:
      
        WARNING: CPU: 1 PID: 787 at net/core/devlink.c:6291 devlink_region_snapshot_del+0x12e/0x140
        CPU: 1 PID: 787 Comm: devlink Not tainted 6.1.0-07460-g7ae9888d6e1c #580
        RIP: 0010:devlink_region_snapshot_del+0x12e/0x140
        Call Trace:
         <TASK>
         devl_region_destroy+0x70/0x140
         nsim_dev_reload_down+0x2f/0x60 [netdevsim]
         devlink_reload+0x1f7/0x360
         devlink_nl_cmd_reload+0x6ce/0x860
         genl_family_rcv_msg_doit.isra.0+0x145/0x1c0
      
      This is the locking assert in devlink_region_snapshot_del(),
      we're supposed to be holding the region->snapshot_lock here.
      
      Fixes: 2dec18ad ("net: devlink: remove region snapshots list dependency on devlink->lock")
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      Reviewed-by: default avatarJiri Pirko <jiri@nvidia.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b4cafb3d
  6. Dec 15, 2022
  7. Dec 13, 2022
  8. Dec 12, 2022
    • Coco Li's avatar
      IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver · 89300468
      Coco Li authored
      
      IPv6/TCP and GRO stacks can build big TCP packets with an added
      temporary Hop By Hop header.
      
      Is GSO is not involved, then the temporary header needs to be removed in
      the driver. This patch provides a generic helper for drivers that need
      to modify their headers in place.
      
      Tested:
      Compiled and ran with ethtool -K eth1 tso off
      Could send Big TCP packets
      
      Signed-off-by: default avatarCoco Li <lixiaoyan@google.com>
      Link: https://lore.kernel.org/r/20221210041646.3587757-1-lixiaoyan@google.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      89300468
    • Ido Schimmel's avatar
      bridge: mcast: Support replacement of MDB port group entries · 61f21835
      Ido Schimmel authored
      
      Now that user space can specify additional attributes of port group
      entries such as filter mode and source list, it makes sense to allow
      user space to atomically modify these attributes by replacing entries
      instead of forcing user space to delete the entries and add them back.
      
      Replace MDB port group entries when the 'NLM_F_REPLACE' flag is
      specified in the netlink message header.
      
      When a (*, G) entry is replaced, update the following attributes: Source
      list, state, filter mode, protocol and flags. If the entry is temporary
      and in EXCLUDE mode, reset the group timer to the group membership
      interval. If the entry is temporary and in INCLUDE mode, reset the
      source timers of associated sources to the group membership interval.
      
      Examples:
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static     0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static     0.00
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static     0.00
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00
      
       # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
       # bridge -d -s mdb show
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp     0.00
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp     0.00
       dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp     0.00
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      61f21835
    • Ido Schimmel's avatar
      bridge: mcast: Allow user space to specify MDB entry routing protocol · 1d7b66a7
      Ido Schimmel authored
      
      Add the 'MDBE_ATTR_RTPORT' attribute to allow user space to specify the
      routing protocol of the MDB port group entry. Enforce a minimum value of
      'RTPROT_STATIC' to prevent user space from using protocol values that
      should only be set by the kernel (e.g., 'RTPROT_KERNEL'). Maintain
      backward compatibility by defaulting to 'RTPROT_STATIC'.
      
      The protocol is already visible to user space in RTM_NEWMDB responses
      and notifications via the 'MDBA_MDB_EATTR_RTPROT' attribute.
      
      The routing protocol allows a routing daemon to distinguish between
      entries configured by it and those configured by the administrator. Once
      MDB flush is supported, the protocol can be used as a criterion
      according to which the flush is performed.
      
      Examples:
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto kernel
       Error: integer out of range.
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto static
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent proto zebra
      
       # bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent source_list 198.51.100.1,198.51.100.2 filter_mode include proto 250
      
       # bridge -d mdb show
       dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.2 permanent filter_mode include proto 250
       dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.1 permanent filter_mode include proto 250
       dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode include source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto 250
       dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra
       dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto static
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1d7b66a7
    • Ido Schimmel's avatar
      bridge: mcast: Allow user space to add (*, G) with a source list and filter mode · 6afaae6d
      Ido Schimmel authored
      
      Add new netlink attributes to the RTM_NEWMDB request that allow user
      space to add (*, G) with a source list and filter mode.
      
      The RTM_NEWMDB message can already dump such entries (created by the
      kernel) so there is no need to add dump support. However, the message
      contains a different set of attributes depending if it is a request or a
      response. The naming and structure of the new attributes try to follow
      the existing ones used in the response.
      
      Request:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_SET_ENTRY ]
      	struct br_mdb_entry
      [ MDBA_SET_ENTRY_ATTRS ]
      	[ MDBE_ATTR_SOURCE ]
      		struct in_addr / struct in6_addr
      	[ MDBE_ATTR_SRC_LIST ]		// new
      		[ MDBE_SRC_LIST_ENTRY ]
      			[ MDBE_SRCATTR_ADDRESS ]
      				struct in_addr / struct in6_addr
      		[ ...]
      	[ MDBE_ATTR_GROUP_MODE ]	// new
      		u8
      
      Response:
      
      [ struct nlmsghdr ]
      [ struct br_port_msg ]
      [ MDBA_MDB ]
      	[ MDBA_MDB_ENTRY ]
      		[ MDBA_MDB_ENTRY_INFO ]
      			struct br_mdb_entry
      		[ MDBA_MDB_EATTR_TIMER ]
      			u32
      		[ MDBA_MDB_EATTR_SOURCE ]
      			struct in_addr / struct in6_addr
      		[ MDBA_MDB_EATTR_RTPROT ]
      			u8
      		[ MDBA_MDB_EATTR_SRC_LIST ]
      			[ MDBA_MDB_SRCLIST_ENTRY ]
      				[ MDBA_MDB_SRCATTR_ADDRESS ]
      					struct in_addr / struct in6_addr
      				[ MDBA_MDB_SRCATTR_TIMER ]
      					u8
      			[...]
      		[ MDBA_MDB_EATTR_GROUP_MODE ]
      			u8
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6afaae6d
    • Ido Schimmel's avatar
      bridge: mcast: Add support for (*, G) with a source list and filter mode · b1c8fec8
      Ido Schimmel authored
      
      In preparation for allowing user space to add (*, G) entries with a
      source list and associated filter mode, add the necessary plumbing to
      handle such requests.
      
      Extend the MDB configuration structure with a currently empty source
      array and filter mode that is currently hard coded to EXCLUDE.
      
      Add the source entries and the corresponding (S, G) entries before
      making the new (*, G) port group entry visible to the data path.
      
      Handle the creation of each source entry in a similar fashion to how it
      is created from the data path in response to received Membership
      Reports: Create the source entry, arm the source timer (if needed), add
      a corresponding (S, G) forwarding entry and finally mark the source
      entry as installed (by user space).
      
      Add the (S, G) entry by populating an MDB configuration structure and
      calling br_mdb_add_group_sg() as if a new entry is created by user
      space, with the sole difference that the 'src_entry' field is set to
      make sure that the group timer of such entries is never armed.
      
      Note that it is not currently possible to add more than 32 source
      entries to a port group entry. If this proves to be a problem we can
      either increase 'PG_SRC_ENT_LIMIT' or avoid forcing a limit on entries
      created by user space.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b1c8fec8
    • Ido Schimmel's avatar
      bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source · 079afd66
      Ido Schimmel authored
      
      User space will soon be able to install a (*, G) with a source list,
      prompting the creation of a (S, G) entry for each source.
      
      In this case, the group timer of the (S, G) entry should never be set.
      
      Solve this by adding a new field to the MDB configuration structure that
      denotes whether the (S, G) corresponds to a source or not.
      
      The field will be set in a subsequent patch where br_mdb_add_group_sg()
      is called in order to create a (S, G) entry for each user provided
      source.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      079afd66
    • Ido Schimmel's avatar
      bridge: mcast: Add a flag for user installed source entries · a01ecb17
      Ido Schimmel authored
      
      There are a few places where the bridge driver differentiates between
      (S, G) entries installed by the kernel (in response to Membership
      Reports) and those installed by user space. One of them is when deleting
      an (S, G) entry corresponding to a source entry that is being deleted.
      
      While user space cannot currently add a source entry to a (*, G), it can
      add an (S, G) entry that later corresponds to a source entry created by
      the reception of a Membership Report. If this source entry is later
      deleted because its source timer expired or because the (*, G) entry is
      being deleted, the bridge driver will not delete the corresponding (S,
      G) entry if it was added by user space as permanent.
      
      This is going to be a problem when the ability to install a (*, G) with
      a source list is exposed to user space. In this case, when user space
      installs the (*, G) as permanent, then all the (S, G) entries
      corresponding to its source list will also be installed as permanent.
      When user space deletes the (*, G), all the source entries will be
      deleted and the expectation is that the corresponding (S, G) entries
      will be deleted as well.
      
      Solve this by introducing a new source entry flag denoting that the
      entry was installed by user space. When the entry is deleted, delete the
      corresponding (S, G) entry even if it was installed by user space as
      permanent, as the flag tells us that it was installed in response to the
      source entry being created.
      
      The flag will be set in a subsequent patch where source entries are
      created in response to user requests.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      a01ecb17
    • Ido Schimmel's avatar
      bridge: mcast: Expose __br_multicast_del_group_src() · 083e3534
      Ido Schimmel authored
      
      Expose __br_multicast_del_group_src() which is symmetric to
      br_multicast_new_group_src() and does not remove the installed {S, G}
      forwarding entry, unlike br_multicast_del_group_src().
      
      The function will be used in the error path when user space was able to
      add a new source entry, but failed to install a corresponding forwarding
      entry.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      083e3534
    • Ido Schimmel's avatar
      bridge: mcast: Expose br_multicast_new_group_src() · fd0c6961
      Ido Schimmel authored
      
      Currently, new group source entries are only created in response to
      received Membership Reports. Subsequent patches are going to allow user
      space to install (*, G) entries with a source list.
      
      As a preparatory step, expose br_multicast_new_group_src() so that it
      could later be invoked from the MDB code (i.e., br_mdb.c) that handles
      RTM_NEWMDB messages.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      fd0c6961
    • Ido Schimmel's avatar
      bridge: mcast: Add a centralized error path · 160dd931
      Ido Schimmel authored
      
      Subsequent patches will add memory allocations in br_mdb_config_init()
      as the MDB configuration structure will include a linked list of source
      entries. This memory will need to be freed regardless if br_mdb_add()
      succeeded or failed.
      
      As a preparation for this change, add a centralized error path where the
      memory will be freed.
      
      Note that br_mdb_del() already has one error path and therefore does not
      require any changes.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      160dd931
    • Ido Schimmel's avatar
      bridge: mcast: Place netlink policy before validation functions · 1870a2d3
      Ido Schimmel authored
      
      Subsequent patches are going to add additional validation functions and
      netlink policies. Some of these functions will need to perform parsing
      using nla_parse_nested() and the new policies.
      
      In order to keep all the policies next to each other, move the current
      policy to before the validation functions.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      1870a2d3
    • Ido Schimmel's avatar
      bridge: mcast: Split (*, G) and (S, G) addition into different functions · 6ff1e68e
      Ido Schimmel authored
      
      When the bridge is using IGMP version 3 or MLD version 2, it handles the
      addition of (*, G) and (S, G) entries differently.
      
      When a new (S, G) port group entry is added, all the (*, G) EXCLUDE
      ports need to be added to the port group of the new entry. Similarly,
      when a new (*, G) EXCLUDE port group entry is added, the port needs to
      be added to the port group of all the matching (S, G) entries.
      
      Subsequent patches will create more differences between both entry
      types. Namely, filter mode and source list can only be specified for (*,
      G) entries.
      
      Given the current and future differences between both entry types,
      handle the addition of each entry type in a different function, thereby
      avoiding the creation of one complex function.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      6ff1e68e
    • Ido Schimmel's avatar
      bridge: mcast: Do not derive entry type from its filter mode · b63e3065
      Ido Schimmel authored
      
      Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries
      cannot be set from user space. Instead, it is set by the kernel
      according to the entry type: (*, G) entries are treated as EXCLUDE and
      (S, G) entries are treated as INCLUDE. This allows the kernel to derive
      the entry type from its filter mode.
      
      Subsequent patches will allow user space to set the filter mode of (*,
      G) entries, making the current assumption incorrect.
      
      As a preparation, remove the current assumption and instead determine
      the entry type from its key, which is a more direct way.
      
      Signed-off-by: default avatarIdo Schimmel <idosch@nvidia.com>
      Acked-by: default avatarNikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      b63e3065
    • Vladimir Oltean's avatar
      net: dsa: tag_8021q: avoid leaking ctx on dsa_tag_8021q_register() error path · e0954930
      Vladimir Oltean authored
      
      If dsa_tag_8021q_setup() fails, for example due to the inability of the
      device to install a VLAN, the tag_8021q context of the switch will leak.
      Make sure it is freed on the error path.
      
      Fixes: 328621f6 ("net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register")
      Signed-off-by: default avatarVladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20221209235242.480344-1-vladimir.oltean@nxp.com
      
      
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      e0954930
    • Xin Long's avatar
      net: failover: use IFF_NO_ADDRCONF flag to prevent ipv6 addrconf · cb54d392
      Xin Long authored
      
      Similar to Bonding and Team, to prevent ipv6 addrconf with
      IFF_NO_ADDRCONF in slave_dev->priv_flags for slave ports
      is also needed in net failover.
      
      Note that dev_open(slave_dev) is called in .slave_register,
      which is called after the IFF_NO_ADDRCONF flag is set in
      failover_slave_register().
      
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
      cb54d392
Loading