- Sep 23, 2021
-
-
Ido Schimmel authored
syzkaller discovered memory leaks [1] that can be reduced to the following commands: # ip nexthop add id 1 blackhole # devlink dev reload pci/0000:06:00.0 As part of the reload flow, mlxsw will unregister its netdevs and then unregister from the nexthop notification chain. Before unregistering from the notification chain, mlxsw will receive delete notifications for nexthop objects using netdevs registered by mlxsw or their uppers. mlxsw will not receive notifications for nexthops using netdevs that are not dismantled as part of the reload flow. For example, the blackhole nexthop above that internally uses the loopback netdev as its nexthop device. One way to fix this problem is to have listeners flush their nexthop tables after unregistering from the notification chain. This is error-prone as evident by this patch and also not symmetric with the registration path where a listener receives a dump of all the existing nexthops. Therefore, fix this problem by replaying delete notifications for the listener being unregistered. This is symmetric to the registration path and also consistent with the netdev notification chain. The above means that unregister_nexthop_notifier(), like register_nexthop_notifier(), will have to take RTNL in order to iterate over the existing nexthops and that any callers of the function cannot hold RTNL. This is true for mlxsw and netdevsim, but not for the VXLAN driver. To avoid a deadlock, change the latter to unregister its nexthop listener without holding RTNL, making it symmetric to the registration path. [1] unreferenced object 0xffff88806173d600 (size 512): comm "syz-executor.0", pid 1290, jiffies 4295583142 (age 143.507s) hex dump (first 32 bytes): 41 9d 1e 60 80 88 ff ff 08 d6 73 61 80 88 ff ff A..`......sa.... 08 d6 73 61 80 88 ff ff 01 00 00 00 00 00 00 00 ..sa............ backtrace: [<ffffffff81a6b576>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline] [<ffffffff81a6b576>] slab_post_alloc_hook+0x96/0x490 mm/slab.h:522 [<ffffffff81a716d3>] slab_alloc_node mm/slub.c:3206 [inline] [<ffffffff81a716d3>] slab_alloc mm/slub.c:3214 [inline] [<ffffffff81a716d3>] kmem_cache_alloc_trace+0x163/0x370 mm/slub.c:3231 [<ffffffff82e8681a>] kmalloc include/linux/slab.h:591 [inline] [<ffffffff82e8681a>] kzalloc include/linux/slab.h:721 [inline] [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_group_create drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4918 [inline] [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_new drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5054 [inline] [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_event+0x59a/0x2910 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5239 [<ffffffff813ef67d>] notifier_call_chain+0xbd/0x210 kernel/notifier.c:83 [<ffffffff813f0662>] blocking_notifier_call_chain kernel/notifier.c:318 [inline] [<ffffffff813f0662>] blocking_notifier_call_chain+0x72/0xa0 kernel/notifier.c:306 [<ffffffff8384b9c6>] call_nexthop_notifiers+0x156/0x310 net/ipv4/nexthop.c:244 [<ffffffff83852bd8>] insert_nexthop net/ipv4/nexthop.c:2336 [inline] [<ffffffff83852bd8>] nexthop_add net/ipv4/nexthop.c:2644 [inline] [<ffffffff83852bd8>] rtm_new_nexthop+0x14e8/0x4d10 net/ipv4/nexthop.c:2913 [<ffffffff833e9a78>] rtnetlink_rcv_msg+0x448/0xbf0 net/core/rtnetlink.c:5572 [<ffffffff83608703>] netlink_rcv_skb+0x173/0x480 net/netlink/af_netlink.c:2504 [<ffffffff833de032>] rtnetlink_rcv+0x22/0x30 net/core/rtnetlink.c:5590 [<ffffffff836069de>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<ffffffff836069de>] netlink_unicast+0x5ae/0x7f0 net/netlink/af_netlink.c:1340 [<ffffffff83607501>] netlink_sendmsg+0x8e1/0xe30 net/netlink/af_netlink.c:1929 [<ffffffff832fde84>] sock_sendmsg_nosec net/socket.c:704 [inline] [<ffffffff832fde84>] sock_sendmsg net/socket.c:724 [inline] [<ffffffff832fde84>] ____sys_sendmsg+0x874/0x9f0 net/socket.c:2409 [<ffffffff83304a44>] ___sys_sendmsg+0x104/0x170 net/socket.c:2463 [<ffffffff83304c01>] __sys_sendmsg+0x111/0x1f0 net/socket.c:2492 [<ffffffff83304d5d>] __do_sys_sendmsg net/socket.c:2501 [inline] [<ffffffff83304d5d>] __se_sys_sendmsg net/socket.c:2499 [inline] [<ffffffff83304d5d>] __x64_sys_sendmsg+0x7d/0xc0 net/socket.c:2499 Fixes: 2a014b20 ("mlxsw: spectrum_router: Add support for nexthop objects") Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Reviewed-by:
Petr Machata <petrm@nvidia.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 22, 2021
-
-
Eric Dumazet authored
syzbot complained in neigh_reduce(), because rcu_read_lock_bh() is treated differently than rcu_read_lock() WARNING: suspicious RCU usage 5.13.0-rc6-syzkaller #0 Not tainted ----------------------------- include/net/addrconf.h:313 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/0:0/5: #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic64_set include/asm-generic/atomic-instrumented.h:856 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: atomic_long_set include/asm-generic/atomic-long.h:41 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:617 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline] #0: ffff888011064d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x871/0x1600 kernel/workqueue.c:2247 #1: ffffc90000ca7da8 ((work_completion)(&port->wq)){+.+.}-{0:0}, at: process_one_work+0x8a5/0x1600 kernel/workqueue.c:2251 #2: ffffffff8bf795c0 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x1da/0x3130 net/core/dev.c:4180 stack backtrace: CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.13.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events ipvlan_process_multicast Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 __in6_dev_get include/net/addrconf.h:313 [inline] __in6_dev_get include/net/addrconf.h:311 [inline] neigh_reduce drivers/net/vxlan.c:2167 [inline] vxlan_xmit+0x34d5/0x4c30 drivers/net/vxlan.c:2919 __netdev_start_xmit include/linux/netdevice.h:4944 [inline] netdev_start_xmit include/linux/netdevice.h:4958 [inline] xmit_one net/core/dev.c:3654 [inline] dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3670 __dev_queue_xmit+0x2133/0x3130 net/core/dev.c:4246 ipvlan_process_multicast+0xa99/0xd70 drivers/net/ipvlan/ipvlan_core.c:287 process_one_work+0x98d/0x1600 kernel/workqueue.c:2276 worker_thread+0x64c/0x1120 kernel/workqueue.c:2422 kthread+0x3b1/0x4a0 kernel/kthread.c:313 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Fixes: f564f45c ("vxlan: add ipv6 proxy support") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 31, 2021
-
-
Paolo Abeni authored
When passing up an UDP GSO packet with L4 aggregation, there is no need to segment it at the vxlan level. We can propagate the packet untouched and let it be segmented later, if needed. Introduce an helper to allow let the UDP socket to accept any L4 aggregation and use it in the vxlan driver. v1 -> v2: - updated to use the newly introduced UDP socket 'accept*' fields Signed-off-by:
Paolo Abeni <pabeni@redhat.com> Reviewed-by:
Willem de Bruijn <willemb@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 26, 2021
-
-
Antoine Tenart authored
When the interface is part of a bridge or an Open vSwitch port and a packet exceed a PMTU estimate, an ICMP reply is sent to the sender. When using the external mode (collect metadata) the source and destination addresses are reversed, so that Open vSwitch can match the packet against an existing (reverse) flow. But inverting the source and destination addresses in the shared ip_tunnel_info will make following packets of the flow to use a wrong destination address (packets will be tunnelled to itself), if the flow isn't updated. Which happens with Open vSwitch, until the flow times out. Fixes this by uncloning the skb's ip_tunnel_info before inverting its source and destination addresses, so that the modification will only be made for the PTMU packet, not the following ones. Fixes: fc68c995 ("vxlan: Support for PMTU discovery on directly bridged links") Tested-by:
Eelco Chaudron <echaudro@redhat.com> Reviewed-by:
Eelco Chaudron <echaudro@redhat.com> Signed-off-by:
Antoine Tenart <atenart@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 14, 2021
-
-
Sanjana Srinidhi authored
Added a blank line after structure declaration. This is done to maintain code uniformity. Signed-off-by:
Sanjana Srinidhi <sanjanasrinidhi1810@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Feb 23, 2021
-
-
Taehee Yoo authored
The debug check must be done after unregister_netdevice_many() call -- the hlist_del_rcu() for this is done inside .ndo_stop. This is the same with commit 0fda7600 ("geneve: move debug check after netdev unregister") Test commands: ip netns del A ip netns add A ip netns add B ip netns exec B ip link add vxlan0 type vxlan vni 100 local 10.0.0.1 \ remote 10.0.0.2 dstport 4789 srcport 4789 4789 ip netns exec B ip link set vxlan0 netns A ip netns exec A ip link set vxlan0 up ip netns del B Splat looks like: [ 73.176249][ T7] ------------[ cut here ]------------ [ 73.178662][ T7] WARNING: CPU: 4 PID: 7 at drivers/net/vxlan.c:4743 vxlan_exit_batch_net+0x52e/0x720 [vxlan] [ 73.182597][ T7] Modules linked in: vxlan openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 mlx5_core nfp mlxfw ixgbevf tls sch_fq_codel nf_tables nfnetlink ip_tables x_tables unix [ 73.190113][ T7] CPU: 4 PID: 7 Comm: kworker/u16:0 Not tainted 5.11.0-rc7+ #838 [ 73.193037][ T7] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 73.196986][ T7] Workqueue: netns cleanup_net [ 73.198946][ T7] RIP: 0010:vxlan_exit_batch_net+0x52e/0x720 [vxlan] [ 73.201509][ T7] Code: 00 01 00 00 0f 84 39 fd ff ff 48 89 ca 48 c1 ea 03 80 3c 1a 00 0f 85 a6 00 00 00 89 c2 48 83 c2 02 49 8b 14 d4 48 85 d2 74 ce <0f> 0b eb ca e8 b9 51 db dd 84 c0 0f 85 4a fe ff ff 48 c7 c2 80 bc [ 73.208813][ T7] RSP: 0018:ffff888100907c10 EFLAGS: 00010286 [ 73.211027][ T7] RAX: 000000000000003c RBX: dffffc0000000000 RCX: ffff88800ec411f0 [ 73.213702][ T7] RDX: ffff88800a278000 RSI: ffff88800fc78c70 RDI: ffff88800fc78070 [ 73.216169][ T7] RBP: ffff88800b5cbdc0 R08: fffffbfff424de61 R09: fffffbfff424de61 [ 73.218463][ T7] R10: ffffffffa126f307 R11: fffffbfff424de60 R12: ffff88800ec41000 [ 73.220794][ T7] R13: ffff888100907d08 R14: ffff888100907c50 R15: ffff88800fc78c40 [ 73.223337][ T7] FS: 0000000000000000(0000) GS:ffff888114800000(0000) knlGS:0000000000000000 [ 73.225814][ T7] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 73.227616][ T7] CR2: 0000562b5cb4f4d0 CR3: 0000000105fbe001 CR4: 00000000003706e0 [ 73.229700][ T7] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 73.231820][ T7] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 73.233844][ T7] Call Trace: [ 73.234698][ T7] ? vxlan_err_lookup+0x3c0/0x3c0 [vxlan] [ 73.235962][ T7] ? ops_exit_list.isra.11+0x93/0x140 [ 73.237134][ T7] cleanup_net+0x45e/0x8a0 [ ... ] Fixes: 57b61127 ("vxlan: speedup vxlan tunnels dismantle") Signed-off-by:
Taehee Yoo <ap420073@gmail.com> Link: https://lore.kernel.org/r/20210221154552.11749-1-ap420073@gmail.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jan 19, 2021
-
-
Xin Long authored
Some protocol HW GSO requires fraglist supported by the device, like SCTP. Without NETIF_F_FRAGLIST set in the dev features of vxlan, it would have to do SW GSO before the packets enter the driver, even when the vxlan dev and lower dev (like veth) both have the feature of NETIF_F_GSO_SCTP. So this patch is to add it for vxlan. Signed-off-by:
Xin Long <lucien.xin@gmail.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Jan 07, 2021
-
-
Jakub Kicinski authored
udp_tunnel_nic handles REGISTER and UNREGISTER event, now that all drivers use that infra we can drop the event handling in the tunnel drivers. Reviewed-by:
Alexander Duyck <alexanderduyck@fb.com> Reviewed-by:
Jacob Keller <jacob.e.keller@intel.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Dec 10, 2020
-
-
Antonio Quartulli authored
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by:
Antonio Quartulli <a@unstable.cc> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Dec 03, 2020
-
-
Zhang Changzhong authored
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: 0ce1822c ("vxlan: add adjacent link to limit depth level") Reported-by:
Hulk Robot <hulkci@huawei.com> Signed-off-by:
Zhang Changzhong <zhangchangzhong@huawei.com> Link: https://lore.kernel.org/r/1606903122-2098-1-git-send-email-zhangchangzhong@huawei.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Dec 01, 2020
-
-
Sven Eckelmann authored
While vxlan doesn't need any extra tailroom, the lowerdev might need it. In that case, copy it over to reduce the chance for additional (re)allocations in the transmit path. Signed-off-by:
Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20201126125247.1047977-2-sven@narfation.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Sven Eckelmann authored
It was observed that sending data via batadv over vxlan (on top of wireguard) reduced the performance massively compared to raw ethernet or batadv on raw ethernet. A check of perf data showed that the vxlan_build_skb was calling all the time pskb_expand_head to allocate enough headroom for: min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len + VXLAN_HLEN + iphdr_len; But the vxlan_config_apply only requested needed headroom for: lowerdev->hard_header_len + VXLAN6_HEADROOM or VXLAN_HEADROOM So it completely ignored the needed_headroom of the lower device. The first caller of net_dev_xmit could therefore never make sure that enough headroom was allocated for the rest of the transmit path. Cc: Annika Wickert <annika.wickert@exaring.de> Signed-off-by:
Sven Eckelmann <sven@narfation.org> Tested-by:
Annika Wickert <aw@awlnx.space> Link: https://lore.kernel.org/r/20201126125247.1047977-1-sven@narfation.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 10, 2020
-
-
Heiner Kallweit authored
Replace ip_tunnel_get_stats64() with the new identical core function dev_get_tstats64(). Signed-off-by:
Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 06, 2020
-
-
Ido Schimmel authored
This will be used by the next patch which extends the function to replay all the existing nexthops to the notifier block being registered. Device drivers will be able to pass extack to the function since it is passed to them upon reload from devlink. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Reviewed-by:
David Ahern <dsahern@gmail.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Ido Schimmel authored
Convert the sole listener of the nexthop notification chain (the VXLAN driver) to the new notification info. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 03, 2020
-
-
Ido Schimmel authored
The nexthop notification chain is a per-namespace chain and not a global one like the netdev notification chain. Therefore, a single (global) listener cannot be registered to all these chains simultaneously as it will result in list corruptions whenever listeners are registered / unregistered. Instead, register a different listener in each namespace. Currently this is not an issue because only the VXLAN driver registers a listener to this chain, but this is going to change with netdevsim and mlxsw also registering their own listeners. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Reviewed-by:
David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201101113926.705630-1-idosch@idosch.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Oct 06, 2020
-
-
Fabian Frederick authored
use new helper for netstats settings Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 26, 2020
-
-
Jakub Kicinski authored
This reverts commit 546c044c. Nothing prevents user from sending frames to "external" VxLAN devices. In fact kernel itself may generate icmp chatter. This is fine, such frames should be dropped. The point of the "missing encapsulation" warning was that frames with missing encap should not make it into vxlan_xmit_one(). And vxlan_xmit() drops them cleanly, so let it just do that. Without this revert the warning is triggered by the udp_tunnel_nic.sh test, but the minimal repro is: $ ip link add vxlan0 type vxlan \ group 239.1.1.1 \ dev lo \ dstport 1234 \ external $ ip li set dev vxlan0 up [ 419.165981] vxlan0: Missing encapsulation instructions [ 419.166551] WARNING: CPU: 0 PID: 1041 at drivers/net/vxlan.c:2889 vxlan_xmit+0x15c0/0x1fc0 [vxlan] Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 25, 2020
-
-
Fabian Frederick authored
Since commit aab8cc36 ("vxlan: add support for underlay in non-default VRF") vxlan_find_sock() also checks if socket is assigned to the right level 3 master device when lower device is not in the default VRF. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Fabian Frederick authored
rtnl_configure_link is always checked if < 0 for error code. Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Fabian Frederick authored
vxlan_xmit_one() was only called from vxlan_xmit() without rdst and info was already tested. Emit warning in that function instead Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Fabian Frederick authored
small optimization around checking as it's being done in all receptions Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Fabian Frederick authored
call vxlan_remcsum() before md filling in vxlan_rcv() Signed-off-by:
Fabian Frederick <fabf@skynet.be> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 05, 2020
-
-
Hangbin Liu authored
This reverts commit 71130f29. In commit 71130f29 ("vxlan: fix tos value before xmit") we want to make sure the tos value are filtered by RT_TOS() based on RFC1349. 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | PRECEDENCE | TOS | MBZ | +-----+-----+-----+-----+-----+-----+-----+-----+ But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like 0 1 2 3 4 5 6 7 +-----+-----+-----+-----+-----+-----+-----+-----+ | DS FIELD, DSCP | ECN FIELD | +-----+-----+-----+-----+-----+-----+-----+-----+ So with IPTOS_TOS_MASK 0x1E RT_TOS(tos) ((tos)&IPTOS_TOS_MASK) the first 3 bits DSCP info will get lost. To take all the DSCP info in xmit, we should revert the patch and just push all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later. Fixes: 71130f29 ("vxlan: fix tos value before xmit") Signed-off-by:
Hangbin Liu <liuhangbin@gmail.com> Acked-by:
Guillaume Nault <gnault@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 04, 2020
-
-
Stefano Brivio authored
If the interface is a bridge or Open vSwitch port, and we can't forward a packet because it exceeds the local PMTU estimate, trigger an ICMP or ICMPv6 reply to the sender, using the same interface to forward it back. If metadata collection is enabled, reverse destination and source addresses, so that Open vSwitch is able to match this packet against the existing, reverse flow. v2: Use netif_is_any_bridge_port() (David Ahern) Signed-off-by:
Stefano Brivio <sbrivio@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Stefano Brivio authored
It's currently possible to bridge Ethernet tunnels carrying IP packets directly to external interfaces without assigning them addresses and routes on the bridged network itself: this is the case for UDP tunnels bridged with a standard bridge or by Open vSwitch. PMTU discovery is currently broken with those configurations, because the encapsulation effectively decreases the MTU of the link, and while we are able to account for this using PMTU discovery on the lower layer, we don't have a way to relay ICMP or ICMPv6 messages needed by the sender, because we don't have valid routes to it. On the other hand, as a tunnel endpoint, we can't fragment packets as a general approach: this is for instance clearly forbidden for VXLAN by RFC 7348, section 4.3: VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may fragment encapsulated VXLAN packets due to the larger frame size. The destination VTEP MAY silently discard such VXLAN fragments. The same paragraph recommends that the MTU over the physical network accomodates for encapsulations, but this isn't a practical option for complex topologies, especially for typical Open vSwitch use cases. Further, it states that: Other techniques like Path MTU discovery (see [RFC1191] and [RFC1981]) MAY be used to address this requirement as well. Now, PMTU discovery already works for routed interfaces, we get route exceptions created by the encapsulation device as they receive ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and we already rebuild those messages with the appropriate MTU and route them back to the sender. Add the missing bits for bridged cases: - checks in skb_tunnel_check_pmtu() to understand if it's appropriate to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and RFC 4443 section 2.4 for ICMPv6. This function is already called by UDP tunnels - a new function generating those ICMP or ICMPv6 replies. We can't reuse icmp_send() and icmp6_send() as we don't see the sender as a valid destination. This doesn't need to be generic, as we don't cover any other type of ICMP errors given that we only provide an encapsulation function to the sender While at it, make the MTU check in skb_tunnel_check_pmtu() accurate: we might receive GSO buffers here, and the passed headroom already includes the inner MAC length, so we don't have to account for it a second time (that would imply three MAC headers on the wire, but there are just two). This issue became visible while bridging IPv6 packets with 4500 bytes of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50 bytes of encapsulation headroom, we would advertise MTU as 3950, and we would reject fragmented IPv6 datagrams of 3958 bytes size on the wire. We're exclusively dealing with network MTU here, though, so we could get Ethernet frames up to 3964 octets in that case. v2: - moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern) - split IPv4/IPv6 functions (David Ahern) Signed-off-by:
Stefano Brivio <sbrivio@redhat.com> Reviewed-by:
David Ahern <dsahern@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 01, 2020
-
-
Taehee Yoo authored
When vxlan interface is deleted, all fdbs are deleted by vxlan_flush(). vxlan_flush() flushes fdbs but it doesn't delete fdb, which contains all-zeros-mac because it is deleted by vxlan_uninit(). But vxlan_uninit() deletes only the fdb, which contains both all-zeros-mac and default vni. So, the fdb, which contains both all-zeros-mac and non-default vni will not be deleted. Test commands: ip link add vxlan0 type vxlan dstport 4789 external ip link set vxlan0 up bridge fdb add to 00:00:00:00:00:00 dst 172.0.0.1 dev vxlan0 via lo \ src_vni 10000 self permanent ip link del vxlan0 kmemleak reports as follows: unreferenced object 0xffff9486b25ced88 (size 96): comm "bridge", pid 2151, jiffies 4294701712 (age 35506.901s) hex dump (first 32 bytes): 02 00 00 00 ac 00 00 01 40 00 09 b1 86 94 ff ff ........@....... 46 02 00 00 00 00 00 00 a7 03 00 00 12 b5 6a 6b F.............jk backtrace: [<00000000c10cf651>] vxlan_fdb_append.part.51+0x3c/0xf0 [vxlan] [<000000006b31a8d9>] vxlan_fdb_create+0x184/0x1a0 [vxlan] [<0000000049399045>] vxlan_fdb_update+0x12f/0x220 [vxlan] [<0000000090b1ef00>] vxlan_fdb_add+0x12a/0x1b0 [vxlan] [<0000000056633c2c>] rtnl_fdb_add+0x187/0x270 [<00000000dd5dfb6b>] rtnetlink_rcv_msg+0x264/0x490 [<00000000fc44dd54>] netlink_rcv_skb+0x4a/0x110 [<00000000dff433e7>] netlink_unicast+0x18e/0x250 [<00000000b87fb421>] netlink_sendmsg+0x2e9/0x400 [<000000002ed55153>] ____sys_sendmsg+0x237/0x260 [<00000000faa51c66>] ___sys_sendmsg+0x88/0xd0 [<000000006c3982f1>] __sys_sendmsg+0x4e/0x80 [<00000000a8f875d2>] do_syscall_64+0x56/0xe0 [<000000003610eefa>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 unreferenced object 0xffff9486b1c40080 (size 128): comm "bridge", pid 2157, jiffies 4294701754 (age 35506.866s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 f8 dc 42 b2 86 94 ff ff ..........B..... 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk backtrace: [<00000000a2981b60>] vxlan_fdb_create+0x67/0x1a0 [vxlan] [<0000000049399045>] vxlan_fdb_update+0x12f/0x220 [vxlan] [<0000000090b1ef00>] vxlan_fdb_add+0x12a/0x1b0 [vxlan] [<0000000056633c2c>] rtnl_fdb_add+0x187/0x270 [<00000000dd5dfb6b>] rtnetlink_rcv_msg+0x264/0x490 [<00000000fc44dd54>] netlink_rcv_skb+0x4a/0x110 [<00000000dff433e7>] netlink_unicast+0x18e/0x250 [<00000000b87fb421>] netlink_sendmsg+0x2e9/0x400 [<000000002ed55153>] ____sys_sendmsg+0x237/0x260 [<00000000faa51c66>] ___sys_sendmsg+0x88/0xd0 [<000000006c3982f1>] __sys_sendmsg+0x4e/0x80 [<00000000a8f875d2>] do_syscall_64+0x56/0xe0 [<000000003610eefa>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 3ad7a4b1 ("vxlan: support fdb and learning in COLLECT_METADATA mode") Signed-off-by:
Taehee Yoo <ap420073@gmail.com> Acked-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 29, 2020
-
-
Ido Schimmel authored
The commit cited below removed the RCU read-side critical section from rtnl_fdb_dump() which means that the ndo_fdb_dump() callback is invoked without RCU protection. This results in the following warning [1] in the VXLAN driver, which relied on the callback being invoked from an RCU read-side critical section. Fix this by calling rcu_read_lock() in the VXLAN driver, as already done in the bridge driver. [1] WARNING: suspicious RCU usage 5.8.0-rc4-custom-01521-g481007553ce6 #29 Not tainted ----------------------------- drivers/net/vxlan.c:1379 RCU-list traversed in non-reader section!! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by bridge/166: #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: netlink_dump+0xea/0x1090 stack backtrace: CPU: 1 PID: 166 Comm: bridge Not tainted 5.8.0-rc4-custom-01521-g481007553ce6 #29 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x100/0x184 lockdep_rcu_suspicious+0x153/0x15d vxlan_fdb_dump+0x51e/0x6d0 rtnl_fdb_dump+0x4dc/0xad0 netlink_dump+0x540/0x1090 __netlink_dump_start+0x695/0x950 rtnetlink_rcv_msg+0x802/0xbd0 netlink_rcv_skb+0x17a/0x480 rtnetlink_rcv+0x22/0x30 netlink_unicast+0x5ae/0x890 netlink_sendmsg+0x98a/0xf40 __sys_sendto+0x279/0x3b0 __x64_sys_sendto+0xe6/0x1a0 do_syscall_64+0x54/0xa0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fe14fa2ade0 Code: Bad RIP value. RSP: 002b:00007fff75bb5b88 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00005614b1ba0020 RCX: 00007fe14fa2ade0 RDX: 000000000000011c RSI: 00007fff75bb5b90 RDI: 0000000000000003 RBP: 00007fff75bb5b90 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00005614b1b89160 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fixes: 5e6d2435 ("bridge: netlink dump interface at par with brctl") Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Reviewed-by:
Jiri Pirko <jiri@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 10, 2020
-
-
Jakub Kicinski authored
Cater to devices which: (a) may want to sleep in the callbacks; (b) only have IPv4 support; (c) need all the programming to happen while the netdev is up. Drivers attach UDP tunnel offload info struct to their netdevs, where they declare how many UDP ports of various tunnel types they support. Core takes care of tracking which ports to offload. Use a fixed-size array since this matches what almost all drivers do, and avoids a complexity and uncertainty around memory allocations in an atomic context. Make sure that tunnel drivers don't try to replay the ports when new NIC netdev is registered. Automatic replays would mess up reference counting, and will be removed completely once all drivers are converted. v4: - use a #define NULL to avoid build issues with CONFIG_INET=n. Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 25, 2020
-
-
Roopa Prabhu authored
This patch fixes last saved fdb index in fdb dump handler when handling fdb's with nhid. Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries") Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 10, 2020
-
-
David Ahern authored
vxlan driver should be using helpers to access nexthop struct internals. Remove open check if whether nexthop is multipath in favor of the existing nexthop_is_multipath helper. Add a new helper, nexthop_has_v4, to cover the need to check has_v4 in a group. Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries") Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David Ahern <dsahern@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Ahern authored
fdb nexthops are marked with a flag. For standalone nexthops, a flag was added to the nh_info struct. For groups that flag was added to struct nexthop when it should have been added to the group information. Fix by removing the flag from the nexthop struct and adding a flag to nh_group that mirrors nh_info and is really only a caching of the individual types. Add a helper, nexthop_is_fdb, for use by the vxlan code and fixup the internal code to use the flag from either nh_info or nh_group. v2 - propagate fdb_nh in remove_nh_grp_entry Fixes: 38428d68 ("nexthop: support for fdb ecmp nexthops") Cc: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David Ahern <dsahern@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 09, 2020
-
-
Cong Wang authored
The dynamic key update for addr_list_lock still causes troubles, for example the following race condition still exists: CPU 0: CPU 1: (RCU read lock) (RTNL lock) dev_mc_seq_show() netdev_update_lockdep_key() -> lockdep_unregister_key() -> netif_addr_lock_bh() because lockdep doesn't provide an API to update it atomically. Therefore, we have to move it back to static keys and use subclass for nest locking like before. In commit 1a33e10e ("net: partially revert dynamic lockdep key changes"), I already reverted most parts of commit ab92d68f ("net: core: add generic lockdep keys"). This patch reverts the rest and also part of commit f3b0a18b ("net: remove unnecessary variables and callback"). After this patch, addr_list_lock changes back to using static keys and subclasses to satisfy lockdep. Thanks to dev->lower_level, we do not have to change back to ->ndo_get_lock_subclass(). And hopefully this reduces some syzbot lockdep noises too. Reported-by:
<syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com> Cc: Taehee Yoo <ap420073@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by:
Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jun 01, 2020
-
-
Roopa Prabhu authored
fix dereference of nexthop group in fdb nexthop group update validation path. Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries") Reported-by:
Ido Schimmel <idosch@idosch.org> Suggested-by:
Ido Schimmel <idosch@idosch.org> Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
When proxy mode is enabled the vxlan device might reply to Neighbor Solicitation (NS) messages on behalf of remote hosts. In case the NS message includes the "Source link-layer address" option [1], the vxlan device will use the specified address as the link-layer destination address in its reply. To avoid an infinite loop, break out of the options parsing loop when encountering an option with length zero and disregard the NS message. This is consistent with the IPv6 ndisc code and RFC 4886 which states that "Nodes MUST silently discard an ND packet that contains an option with length zero" [2]. [1] https://tools.ietf.org/html/rfc4861#section-4.3 [2] https://tools.ietf.org/html/rfc4861#section-4.6 Fixes: 4b29dba9 ("vxlan: fix nonfunctional neigh_reduce()") Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Acked-by:
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- May 31, 2020
-
-
Roopa Prabhu authored
- remove fdb from nh_list before the rcu grace period - protect fdb->vdev with rcu - hold spin lock before destroying fdb Fixes: c7cdbe2e ("vxlan: support for nexthop notifiers") Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by:
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Roopa Prabhu authored
NDA_NH_ID represents a remote ip or a group of remote ips. It allows use of nexthop groups in lieu of a remote ip or a list of remote ips supported by the fdb api. Current code ignores the other remote ip attrs when NDA_NH_ID is specified. In the spirit of strict checking, This commit adds a check to explicitly return an error on incorrect usage. Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries") Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- May 25, 2020
-
-
Ido Schimmel authored
vxlan_fdb_info() is not always called with RTNL held or from an RCU read-side critical section. For example, in the following call path: vxlan_cleanup() vxlan_fdb_destroy() vxlan_fdb_notify() __vxlan_fdb_notify() vxlan_fdb_info() The use of rtnl_dereference() can therefore result in the following splat [1]. Fix this by dereferencing the nexthop under RCU read-side critical section. [1] [May24 22:56] ============================= [ +0.004676] WARNING: suspicious RCU usage [ +0.004614] 5.7.0-rc5-custom-16219-g201392003491 #2772 Not tainted [ +0.007116] ----------------------------- [ +0.004657] drivers/net/vxlan.c:276 suspicious rcu_dereference_check() usage! [ +0.008164] other info that might help us debug this: [ +0.009126] rcu_scheduler_active = 2, debug_locks = 1 [ +0.007504] 5 locks held by bash/6892: [ +0.004392] #0: ffff8881d47e3410 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: __do_execve_file.isra.27+0x392/0x23c0 [ +0.011795] #1: ffff8881d47e34b0 (&sig->exec_update_mutex){+.+.}-{3:3}, at: flush_old_exec+0x510/0x2030 [ +0.010947] #2: ffff8881a141b0b0 (ptlock_ptr(page)#2){+.+.}-{2:2}, at: unmap_page_range+0x9c0/0x2590 [ +0.010585] #3: ffff888230009d50 ((&vxlan->age_timer)){+.-.}-{0:0}, at: call_timer_fn+0xe8/0x800 [ +0.010192] #4: ffff888183729bc8 (&vxlan->hash_lock[h]){+.-.}-{2:2}, at: vxlan_cleanup+0x133/0x4a0 [ +0.010382] stack backtrace: [ +0.005103] CPU: 1 PID: 6892 Comm: bash Not tainted 5.7.0-rc5-custom-16219-g201392003491 #2772 [ +0.009675] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016 [ +0.010155] Call Trace: [ +0.002775] <IRQ> [ +0.002313] dump_stack+0xfd/0x178 [ +0.003895] lockdep_rcu_suspicious+0x14a/0x153 [ +0.005157] vxlan_fdb_info+0xe39/0x12a0 [ +0.004775] __vxlan_fdb_notify+0xb8/0x160 [ +0.004672] vxlan_fdb_notify+0x8e/0xe0 [ +0.004370] vxlan_fdb_destroy+0x117/0x330 [ +0.004662] vxlan_cleanup+0x1aa/0x4a0 [ +0.004329] call_timer_fn+0x1c4/0x800 [ +0.004357] run_timer_softirq+0x129d/0x17e0 [ +0.004762] __do_softirq+0x24c/0xaef [ +0.004232] irq_exit+0x167/0x190 [ +0.003767] smp_apic_timer_interrupt+0x1dd/0x6a0 [ +0.005340] apic_timer_interrupt+0xf/0x20 [ +0.004620] </IRQ> Fixes: 1274e1cc ("vxlan: ecmp support for mac fdb entries") Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Reported-by:
Amit Cohen <amitc@mellanox.com> Acked-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- May 22, 2020
-
-
Roopa Prabhu authored
vxlan driver registers for nexthop add/del notifiers to cleanup fdb entries pointing to such nexthops. Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Roopa Prabhu authored
Todays vxlan mac fdb entries can point to multiple remote ips (rdsts) with the sole purpose of replicating broadcast-multicast and unknown unicast packets to those remote ips. E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be load balanced to remote switches (vteps) belonging to the same multi-homed ethernet segment (E-VPN multihoming is analogous to multi-homed LAG implementations, but with the inter-switch peerlink replaced with a vxlan tunnel). In other words it needs support for mac ecmp. Furthermore, for faster convergence, E-VPN multihoming needs the ability to update fdb ecmp nexthops independent of the fdb entries. New route nexthop API is perfect for this usecase. This patch extends the vxlan fdb code to take a nexthop id pointing to an ecmp nexthop group. Changes include: - New NDA_NH_ID attribute for fdbs - Use the newly added fdb nexthop groups - makes vxlan rdsts and nexthop handling code mutually exclusive - since this is a new use-case and the requirement is for ecmp nexthop groups, the fdb add and update path checks that the nexthop is really an ecmp nexthop group. This check can be relaxed in the future, if we want to introduce replication fdb nexthop groups and allow its use in lieu of current rdst lists. - fdb update requests with nexthop id's only allowed for existing fdb's that have nexthop id's - learning will not override an existing fdb entry with nexthop group - I have wrapped the switchdev offload code around the presence of rdst [1] E-VPN RFC https://tools.ietf.org/html/rfc7432 [2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365 [3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf Includes a null check fix in vxlan_xmit from Nikolay v2 - Fixed build issue: Reported-by:
kbuild test robot <lkp@intel.com> Signed-off-by:
Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-