1. 27 Sep, 2018 2 commits
  2. 29 May, 2018 1 commit
  3. 14 Mar, 2018 1 commit
    • Sabrina Dubroca's avatar
      ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu · d52e5a7e
      Sabrina Dubroca authored
      Prior to the rework of PMTU information storage in commit
      2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer."),
      when a PMTU event advertising a PMTU smaller than
      net.ipv4.route.min_pmtu was received, we would disable setting the DF
      flag on packets by locking the MTU metric, and set the PMTU to
      Since then, we don't disable DF, and set PMTU to
      net.ipv4.route.min_pmtu, so the intermediate router that has this link
      with a small MTU will have to drop the packets.
      This patch reestablishes pre-2.6.39 behavior by splitting
      rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
      rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
      and is checked in ip_dont_fragment().
      One possible workaround is to set net.ipv4.route.min_pmtu to a value low
      enough to accommodate the lowest MTU encountered.
      Fixes: 2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer.")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: default avatarStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 16 Feb, 2018 1 commit
    • Xin Long's avatar
      xfrm: reuse uncached_list to track xdsts · 510c321b
      Xin Long authored
      In early time, when freeing a xdst, it would be inserted into
      dst_garbage.list first. Then if it's refcnt was still held
      somewhere, later it would be put into dst_busy_list in
      When one dev was being unregistered, the dev of these dsts in
      dst_busy_list would be set with loopback_dev and put this dev.
      So that this dev's removal wouldn't get blocked, and avoid the
      kmsg warning:
        kernel:unregister_netdevice: waiting for veth0 to become \
        free. Usage count = 2
      However after Commit 52df157f ("xfrm: take refcnt of dst
      when creating struct xfrm_dst bundle"), the xdst will not be
      freed with dst gc, and this warning happens.
      To fix it, we need to find these xdsts that are still held by
      others when removing the dev, and free xdst's dev and set it
      with loopback_dev.
      But unfortunately after flow_cache for xfrm was deleted, no
      list tracks them anymore. So we need to save these xdsts
      somewhere to release the xdst's dev later.
      To make this easier, this patch is to reuse uncached_list to
      track xdsts, so that the dev refcnt can be released in the
      event NETDEV_UNREGISTER process of fib_netdev_notifier.
      Thanks to Florian, we could move forward this fix quickly.
      Fixes: 52df157f ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
      Reported-by: default avatarJianlin Shi <jishi@redhat.com>
      Reported-by: default avatarHangbin Liu <liuhangbin@gmail.com>
      Tested-by: default avatarEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: default avatarXin Long <lucien.xin@gmail.com>
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
  5. 15 Feb, 2018 1 commit
  6. 25 Jan, 2018 1 commit
  7. 01 Oct, 2017 1 commit
    • Paolo Abeni's avatar
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni authored
      The UDP early demux can leverate the rx dst cache even for
      multicast unconnected sockets.
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry  from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      Disabling the dst cache for unconnected multicast socket will
      cause large performace regression, nearly reducing by half the
      max ingress tput.
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successful fetching the related cached dst entry.
      This still gives a measurable, but limited performance
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  8. 22 Sep, 2017 1 commit
    • Eric Dumazet's avatar
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet authored
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      In net-next we will convert dst atomic_t to refcount_t for peace of
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: default avatarPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 03 Sep, 2017 1 commit
  10. 18 Jun, 2017 1 commit
    • Wei Wang's avatar
      ipv4: call dst_hold_safe() properly · 9df16efa
      Wei Wang authored
      This patch checks all the calls to
      dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
      dst_hold_safe() is needed to avoid double free issue if dst
      gc is removed and dst_release() directly destroys dst when
      dst->__refcnt drops to 0.
      In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
      UDP and other similar protocols always hold refcount for
      skb->_skb_refdst. So both paths seem to be safe.
      In rx path, as it is lockless and skb_dst_set_noref() is likely to be
      used, dst_hold_safe() should always be used when trying to hold dst.
      In the routing code, if dst is held during an rcu protected session, it
      is necessary to call dst_hold_safe() as the current dst might be in its
      rcu grace period.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  11. 26 May, 2017 2 commits
  12. 09 May, 2017 1 commit
  13. 08 May, 2017 1 commit
  14. 21 Mar, 2017 1 commit
    • Nikolay Aleksandrov's avatar
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov authored
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case, the flow dissector's field consistentification should handle the
      address order thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
      Signed-off-by: default avatarNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  15. 04 Nov, 2016 1 commit
    • Lorenzo Colitti's avatar
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti authored
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: default avatarLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  16. 11 Sep, 2016 1 commit
  17. 11 Apr, 2016 1 commit
    • David Ahern's avatar
      net: vrf: Fix dst reference counting · 9ab179d8
      David Ahern authored
      Vivek reported a kernel exception deleting a VRF with an active
      connection through it. The root cause is that the socket has a cached
      reference to a dst that is destroyed. Converting the dst_destroy to
      dst_release and letting proper reference counting kick in does not
      work as the dst has a reference to the device which needs to be released
      as well.
      I talked to Hannes about this at netdev and he pointed out the ipv4 and
      ipv6 dst handling has dst_ifdown for just this scenario. Rather than
      continuing with the reinvented dst wheel in VRF just remove it and
      leverage the ipv4 and ipv6 versions.
      Fixes: 193125db ("net: Introduce VRF device driver")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 07 Apr, 2016 1 commit
  19. 17 Feb, 2016 1 commit
  20. 05 Jan, 2016 1 commit
    • David Ahern's avatar
      net: Propagate lookup failure in l3mdev_get_saddr to caller · b5bdacf3
      David Ahern authored
      Commands run in a vrf context are not failing as expected on a route lookup:
          root@kenny:~# ip ro ls table vrf-red
          unreachable default
          root@kenny:~# ping -I vrf-red -c1 -w1
          ping: Warning: source address might be selected on device other than vrf-red.
          PING ( from vrf-red: 56(84) bytes of data.
          --- ping statistics ---
          2 packets transmitted, 0 received, 100% packet loss, time 999ms
      Since the vrf table does not have a route for the ping
      should have failed. The saddr lookup causes a full VRF table lookup.
      Propogating a lookup failure to the user allows the command to fail as
          root@kenny:~# ping -I vrf-red -c1 -w1
          connect: No route to host
      Signed-off-by: default avatarDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  21. 07 Oct, 2015 2 commits
  22. 05 Oct, 2015 1 commit
  23. 30 Sep, 2015 2 commits
  24. 25 Sep, 2015 1 commit
  25. 17 Sep, 2015 1 commit
  26. 15 Sep, 2015 1 commit
  27. 01 Sep, 2015 1 commit
  28. 20 Aug, 2015 1 commit
  29. 14 Aug, 2015 3 commits
  30. 21 Jul, 2015 1 commit
  31. 15 Jan, 2015 1 commit
    • Eric Dumazet's avatar
      ipv4: per cpu uncached list · 5055c371
      Eric Dumazet authored
      RAW sockets with hdrinc suffer from contention on rt_uncached_lock
      One solution is to use percpu lists, since most routes are destroyed
      by the cpu that created them.
      It is unclear why we even have to put these routes in uncached_list,
      as all outgoing packets should be freed when a device is dismantled.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Fixes: caacf05e ("ipv4: Properly purge netdev references on uncached routes.")
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  32. 24 Mar, 2014 1 commit
  33. 13 Jan, 2014 1 commit
    • Hannes Frederic Sowa's avatar
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa authored
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  34. 06 Dec, 2013 1 commit