  1. Dec 08, 2021
  2. Dec 02, 2021
  3. Oct 26, 2021
    • vrf: run conntrack only in context of lower/physdev for locally generated packets · 8c9c296a
      Florian Westphal authored
      
      The VRF driver invokes netfilter for output+postrouting hooks so that users
      can create rules that check for 'oif $vrf' rather than lower device name.
      
      This is a problem when NAT rules are configured.
      
      To avoid any conntrack involvement in round 1, tag skbs as 'untracked'
      to prevent conntrack from picking them up.
      
      This gets cleared before the packet gets handed to the ip stack so
      conntrack will be active on the second iteration.
      
      One remaining issue is that a rule like
      
        output ... oif $vrfname notrack
      
      won't propagate to the second round because we can't tell
      'notrack set via ruleset' and 'notrack set by vrf driver' apart.
      However, this isn't a regression: the 'notrack' removal happens
      instead of unconditional nf_reset_ct().
      I'd also like to avoid leaking more vrf specific conditionals into the
      netfilter infra.
      
      For ingress, conntrack has already been done before the packet makes it
      to the vrf driver. With this patch, egress does connection tracking with
      the lower/physical device as well.
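
The two-round behaviour described in this message can be sketched as a toy model. This is only an illustration in Python, not kernel code: the `Skb` class, `conntrack_hook()`, `vrf_egress()` and the device name `eth0` are hypothetical names invented for the sketch. It also shows the stated limitation that a ruleset `notrack` from round 1 is cleared together with the driver's own tag.

```python
# Toy model of the two-round egress described above; this is not kernel code.
from dataclasses import dataclass, field


@dataclass
class Skb:
    untracked: bool = False
    tracked_on: list = field(default_factory=list)  # devices conntrack ran on


def conntrack_hook(skb, dev):
    """Conntrack skips any packet that carries the 'untracked' tag."""
    if not skb.untracked:
        skb.tracked_on.append(dev)


def vrf_egress(skb, vrf_dev, lower_dev, ruleset_notrack=False):
    # Round 1: hooks run in the context of the VRF device.
    skb.untracked = True               # tag set by the modelled VRF driver
    if ruleset_notrack:
        skb.untracked = True           # 'oif $vrf notrack' sets the same tag
    conntrack_hook(skb, vrf_dev)
    # The tag is cleared before the packet is handed to the IP stack, so a
    # ruleset 'notrack' from round 1 is lost together with the driver's tag.
    skb.untracked = False
    # Round 2: hooks run again in the context of the lower device.
    conntrack_hook(skb, lower_dev)
    return skb


print(vrf_egress(Skb(), "vrf_1", "eth0").tracked_on)                        # ['eth0']
print(vrf_egress(Skb(), "vrf_1", "eth0", ruleset_notrack=True).tracked_on)  # ['eth0']
```

In both calls conntrack only ever sees the packet on the lower device, which is the outcome the message aims for.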
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Oct 20, 2021
    • vrf: Revert "Reset skb conntrack connection..." · 55161e67
      Eugene Crosser authored
      This reverts commit 09e856d5.
      
      When an interface is enslaved in a VRF, prerouting conntrack hook is
      called twice: once in the context of the original input interface, and
      once in the context of the VRF interface. If no special precautions are
      taken, this leads to creation of two conntrack entries instead of one,
      and breaks SNAT.
      
      Commit above was intended to avoid creation of extra conntrack entries
      when input interface is enslaved in a VRF. It did so by resetting
      conntrack related data associated with the skb when it enters VRF context.
      
      However, it breaks netfilter operation. Imagine a use case where conntrack
      zone must be assigned based on the original input interface, rather than
      VRF interface (that would make original interfaces indistinguishable). One
      could create netfilter rules similar to these:
      
              chain rawprerouting {
                      type filter hook prerouting priority raw;
                      iif realiface1 ct zone set 1 return
                      iif realiface2 ct zone set 2 return
              }
      
      This works before the mentioned commit, but not after: zone assignment
      is "forgotten", and any subsequent NAT or filtering that is dependent
      on the conntrack zone does not work.
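
Why the zone is "forgotten" can be sketched with a deliberately simplified model. It assumes that a packet keeps the zone chosen in the first prerouting pass unless its conntrack data is reset, in which case the decision is made again with only the VRF interface visible. The function names, the `vrf0` interface name and the default zone value of 0 are assumptions of this sketch, not kernel code.

```python
# Simplified model of the double prerouting pass; names are hypothetical.
DEFAULT_ZONE = 0


def prerouting_raw(iif, zone_rules):
    """Apply 'iif X ct zone set N' rules; unmatched packets stay in zone 0."""
    return zone_rules.get(iif, DEFAULT_ZONE)


def zone_after_vrf_rcv(orig_iif, vrf_iif, zone_rules, reset_on_vrf_rcv):
    # Pass 1: prerouting in the context of the original input interface.
    ct_zone = prerouting_raw(orig_iif, zone_rules)
    if reset_on_vrf_rcv:
        # Modelled effect of 09e856d5: conntrack data on the skb is dropped,
        # so pass 2 decides again, this time seeing only the VRF interface.
        ct_zone = prerouting_raw(vrf_iif, zone_rules)
    return ct_zone


rules = {"realiface1": 1, "realiface2": 2}
print(zone_after_vrf_rcv("realiface1", "vrf0", rules, reset_on_vrf_rcv=False))  # 1
print(zone_after_vrf_rcv("realiface1", "vrf0", rules, reset_on_vrf_rcv=True))   # 0
```

With the reset in place, both real interfaces end up in the same zone, which is the indistinguishability the message complains about.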
      
      Here is a reproducer script that demonstrates the difference in behaviour.
      
      ==========
      #!/bin/sh
      
      # This script demonstrates unexpected change of nftables behaviour
      # caused by commit 09e856d5 "vrf: Reset skb conntrack
      # connection on VRF rcv"
      # https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=09e856d54bda5f288ef8437a90ab2b9b3eab83d1
      
      
      #
      # Before the commit, it was possible to assign conntrack zone to a
      # packet (or mark it for `notracking`) in the prerouting chain, raw
      # priority, based on the `iif` (interface from which the packet
      # arrived).
      # After the change, if the interface is enslaved in a VRF, such
      # assignment is lost. Instead, assignment based on the `iif` matching
      # the VRF master interface is honored. Thus it is impossible to
      # distinguish packets based on the original interface.
      #
      # This script demonstrates this change of behaviour: conntrack zone 1
      # or 2 is assigned depending on the match with the original interface
      # or the vrf master interface. It can be observed that conntrack entry
      # appears in different zone in the kernel versions before and after
      # the commit.
      
      IPIN=172.30.30.1
      IPOUT=172.30.30.2
      PFXL=30
      
      ip li sh vein >/dev/null 2>&1 && ip li del vein
      ip li sh tvrf >/dev/null 2>&1 && ip li del tvrf
      nft list table testct >/dev/null 2>&1 && nft delete table testct
      
      ip li add vein type veth peer veout
      ip li add tvrf type vrf table 9876
      ip li set veout master tvrf
      ip li set vein up
      ip li set veout up
      ip li set tvrf up
      /sbin/sysctl -w net.ipv4.conf.veout.accept_local=1
      /sbin/sysctl -w net.ipv4.conf.veout.rp_filter=0
      ip addr add $IPIN/$PFXL dev vein
      ip addr add $IPOUT/$PFXL dev veout
      
      nft -f - <<__END__
      table testct {
      	chain rawpre {
      		type filter hook prerouting priority raw;
      		iif { veout, tvrf } meta nftrace set 1
      		iif veout ct zone set 1 return
      		iif tvrf ct zone set 2 return
      		notrack
      	}
      	chain rawout {
      		type filter hook output priority raw;
      		notrack
      	}
      }
      __END__
      
      uname -rv
      conntrack -F
      ping -W 1 -c 1 -I vein $IPOUT
      conntrack -L
      
      Signed-off-by: Eugene Crosser <crosser@average.org>
      Acked-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Aug 16, 2021
    • vrf: Reset skb conntrack connection on VRF rcv · 09e856d5
      Lahav Schlesinger authored
      
      To fix the "reverse-NAT" for replies.
      
      When a packet is sent over a VRF, the POST_ROUTING hooks are called
      twice: Once from the VRF interface, and once from the "actual"
      interface the packet will be sent from:
      1) First SNAT: l3mdev_l3_out() -> vrf_l3_out() -> .. -> vrf_output_direct()
           This causes the POST_ROUTING hooks to run.
      2) Second SNAT: 'ip_output()' calls POST_ROUTING hooks again.
      
      Similarly for replies, first ip_rcv() calls PRE_ROUTING hooks, and
      second vrf_l3_rcv() calls them again.
      
      As an example, consider the following SNAT rule:
      > iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j SNAT --to-source 2.2.2.2 -o vrf_1
      
      In this case sending over a VRF will create 2 conntrack entries.
      The first is from the VRF interface, which performs the IP SNAT.
      The second will run the SNAT, but since the "expected reply" will remain
      the same, conntrack randomizes the source port of the packet:
      e.g. with a socket bound to 1.1.1.1:10000, sending to 3.3.3.3:53, the conntrack
      rules are:
      udp      17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=0 bytes=0 mark=0 use=1
      udp      17 29 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
      
      i.e. First SNAT IP from 1.1.1.1 --> 2.2.2.2, and second the src port is
      SNAT-ed from 10000 --> 61033.
      
      But when a reply is sent (3.3.3.3:53 -> 2.2.2.2:61033) only the later
      conntrack entry is matched:
      udp      17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=1 bytes=49 mark=0 use=1
      udp      17 28 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1
      
      And a "port 61033 unreachable" ICMP packet is sent back.
      
      The issue is that when PRE_ROUTING hooks are called from vrf_l3_rcv(),
      the skb already has a conntrack flow attached to it, which means
      nf_conntrack_in() will not resolve the flow again.
      
      This means only the dest port is "reverse-NATed" (61033 -> 10000) but
      the dest IP remains 2.2.2.2, and since the socket is bound to 1.1.1.1 it's
      not received.
      This can be verified by logging the 4-tuple of the packet in '__udp4_lib_rcv()'.
      
      The fix is then to reset the flow when skb is received on a VRF, to let
      conntrack resolve the flow again (which now will hit the earlier flow).
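
The reply path can be replayed with the two conntrack entries listed earlier in this message. The sketch below is a toy Python model, not kernel code: `reverse_nat()` and `receive_reply()` are hypothetical helpers, and each entry is reduced to its original and expected-reply 4-tuples (src IP, src port, dst IP, dst port).

```python
# Toy replay of the reply path using the two entries from the listing above.
entries = [
    # created second, on the real interface: the source port was randomized
    {"orig": ("2.2.2.2", 10000, "3.3.3.3", 53),
     "reply": ("3.3.3.3", 53, "2.2.2.2", 61033)},
    # created first, on the VRF interface: performs the IP SNAT
    {"orig": ("1.1.1.1", 10000, "3.3.3.3", 53),
     "reply": ("3.3.3.3", 53, "2.2.2.2", 10000)},
]


def reverse_nat(pkt, table):
    """Match pkt against the reply tuples and restore the original source
    of the matching entry as the packet's destination."""
    for entry in table:
        if entry["reply"] == pkt:
            src_ip, src_port, _, _ = entry["orig"]
            return (pkt[0], pkt[1], src_ip, src_port)
    return pkt


def receive_reply(pkt, table, reset_on_vrf_rcv):
    pkt = reverse_nat(pkt, table)        # PRE_ROUTING called from ip_rcv()
    if reset_on_vrf_rcv:
        # With the flow reset, PRE_ROUTING from vrf_l3_rcv() resolves again.
        pkt = reverse_nat(pkt, table)
    return pkt


reply = ("3.3.3.3", 53, "2.2.2.2", 61033)
print(receive_reply(reply, entries, reset_on_vrf_rcv=False))
# ('3.3.3.3', 53, '2.2.2.2', 10000): only the port is restored
print(receive_reply(reply, entries, reset_on_vrf_rcv=True))
# ('3.3.3.3', 53, '1.1.1.1', 10000): matches the socket bound to 1.1.1.1
```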
      
      To reproduce (without the fix, "Got pkt_to_nat_port" will not be printed
        when running 'bash ./repro.sh'):
        $ cat run_in_A1.py
        import logging
        logging.getLogger("scapy.runtime").setLevel(logging.ERROR)
        from scapy.all import *
        import argparse
      
        def get_packet_to_send(udp_dst_port, msg_name):
            return Ether(src='11:22:33:44:55:66', dst=iface_mac)/ \
                IP(src='3.3.3.3', dst='2.2.2.2')/ \
                UDP(sport=53, dport=udp_dst_port)/ \
                Raw(f'{msg_name}\x0012345678901234567890')
      
        parser = argparse.ArgumentParser()
        parser.add_argument('-iface_mac', dest="iface_mac", type=str, required=True,
                            help="From run_in_A3.py")
        parser.add_argument('-socket_port', dest="socket_port", type=str,
                            required=True, help="From run_in_A3.py")
        parser.add_argument('-v1_mac', dest="v1_mac", type=str, required=True,
                            help="From script")
      
        args, _ = parser.parse_known_args()
        iface_mac = args.iface_mac
        socket_port = int(args.socket_port)
        v1_mac = args.v1_mac
      
        print(f'Source port before NAT: {socket_port}')
      
        while True:
            pkts = sniff(iface='_v0', store=True, count=1, timeout=10)
            if 0 == len(pkts):
                print('Something failed, rerun the script :(', flush=True)
                break
            pkt = pkts[0]
            if not pkt.haslayer('UDP'):
                continue
      
            pkt_sport = pkt.getlayer('UDP').sport
            print(f'Source port after NAT: {pkt_sport}', flush=True)
      
            pkt_to_send = get_packet_to_send(pkt_sport, 'pkt_to_nat_port')
            sendp(pkt_to_send, '_v0', verbose=False) # Will not be received
      
            pkt_to_send = get_packet_to_send(socket_port, 'pkt_to_socket_port')
            sendp(pkt_to_send, '_v0', verbose=False)
            break
      
        $ cat run_in_A2.py
        import socket
        import netifaces
      
        print(f"{netifaces.ifaddresses('e00000')[netifaces.AF_LINK][0]['addr']}",
              flush=True)
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                     str('vrf_1' + '\0').encode('utf-8'))
        s.connect(('3.3.3.3', 53))
        print(f'{s.getsockname()[1]}', flush=True)
        s.settimeout(5)
      
        while True:
            try:
                # Periodically send in order to keep the conntrack entry alive.
                s.send(b'a'*40)
                resp = s.recvfrom(1024)
                msg_name = resp[0].decode('utf-8').split('\0')[0]
                print(f"Got {msg_name}", flush=True)
            except Exception as e:
                pass
      
        $ cat repro.sh
        ip netns del A1 2> /dev/null
        ip netns del A2 2> /dev/null
        ip netns add A1
        ip netns add A2
      
        ip -n A1 link add _v0 type veth peer name _v1 netns A2
        ip -n A1 link set _v0 up
      
        ip -n A2 link add e00000 type bond
        ip -n A2 link add lo0 type dummy
        ip -n A2 link add vrf_1 type vrf table 10001
        ip -n A2 link set vrf_1 up
        ip -n A2 link set e00000 master vrf_1
      
        ip -n A2 addr add 1.1.1.1/24 dev e00000
        ip -n A2 link set e00000 up
        ip -n A2 link set _v1 master e00000
        ip -n A2 link set _v1 up
        ip -n A2 link set lo0 up
        ip -n A2 addr add 2.2.2.2/32 dev lo0
      
        ip -n A2 neigh add 1.1.1.10 lladdr 77:77:77:77:77:77 dev e00000
        ip -n A2 route add 3.3.3.3/32 via 1.1.1.10 dev e00000 table 10001
      
        ip netns exec A2 iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j \
      	SNAT --to-source 2.2.2.2 -o vrf_1
      
        sleep 5
        ip netns exec A2 python3 run_in_A2.py > x &
        XPID=$!
        sleep 5
      
        IFACE_MAC=`sed -n 1p x`
        SOCKET_PORT=`sed -n 2p x`
        V1_MAC=`ip -n A2 link show _v1 | sed -n 2p | awk '{print $2}'`
        ip netns exec A1 python3 run_in_A1.py -iface_mac ${IFACE_MAC} -socket_port \
                ${SOCKET_PORT} -v1_mac ${V1_MAC}
        sleep 5
      
        kill -9 $XPID
        wait $XPID 2> /dev/null
        ip netns del A1
        ip netns del A2
        tail x -n 2
        rm x
        set +x
      
      Fixes: 73e20b76 ("net: vrf: Add support for PREROUTING rules on vrf device")
      Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210815120002.2787653-1-lschlesinger@drivenets.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. Aug 06, 2021
  7. Aug 03, 2021
  8. Jun 21, 2021
    • vrf: do not push non-ND strict packets with a source LLA through packet taps again · 603113c5
      Antoine Tenart authored
      
      Non-ND strict packets with a source LLA go through the packet taps
      again, while non-ND strict packets with other source addresses do not,
      and we can see a clone of those packets on the vrf interface (we should
      not). This is due to a series of changes:
      
      Commit 6f12fa77[1] stopped non-ND strict packets from being pushed
      through the packet taps again. This changed with commit 205704c6[2] for
      those packets having a source LLA, as they need a lookup with the orig_iif.
      
      The issue now is that those packets do not skip to the end of the
      'vrf_ip6_rcv' function (as the ones without a source LLA do) and instead
      go through the check to call packet taps again. This check was changed
      by commit 6f12fa77[1] and does not exclude non-strict packets anymore.
      Packets matching
      'need_strict && !is_ndisc && is_ll_src' are now being sent through the
      packet taps again. This can be seen by dumping packets on the vrf
      interface.
      
      Fix this by having the same code path for all non-ND strict packets and
      selectively looking up with the orig_iif for those with a source LLA. This
      effectively reverts to the pre-205704c6[2] condition, which
      should also be easier to maintain.
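
The change in the tap decision can be written as a small truth-table check. This is one reading of the message, under the assumption that the early exit covered 'need_strict && !is_ndisc && !is_ll_src' before the fix and 'need_strict && !is_ndisc' after it, and that ND packets never go through the taps again. Loopback handling is left out, and the function names are hypothetical.

```python
# Truth-table sketch of the tap decision; an interpretation, not kernel code.
from itertools import product


def taps_rerun_before(need_strict, is_ndisc, is_ll_src):
    early_exit = need_strict and not is_ndisc and not is_ll_src
    return (not early_exit) and (not is_ndisc)


def taps_rerun_after(need_strict, is_ndisc, is_ll_src):
    early_exit = need_strict and not is_ndisc   # same path for all non-ND strict
    return (not early_exit) and (not is_ndisc)


# Flag combinations (need_strict, is_ndisc, is_ll_src) whose outcome changes.
changed = [flags for flags in product([False, True], repeat=3)
           if taps_rerun_before(*flags) != taps_rerun_after(*flags)]
print(changed)  # [(True, False, True)]
```

Only the combination named in the message changes behaviour; every other combination is handled the same before and after.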
      
      [1] 6f12fa77 ("vrf: mark skb for multicast or link-local as enslaved to VRF")
      [2] 205704c6 ("vrf: packets with lladdr src needs dst at input with orig_iif when needs strict")
      
      Fixes: 205704c6 ("vrf: packets with lladdr src needs dst at input with orig_iif when needs strict")
      Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. Jun 08, 2021
    • vrf: fix maximum MTU · 9bb392f6
      Nicolas Dichtel authored
      
      My initial goal was to fix the default MTU, which is set to 65536, i.e. above
      the maximum defined in the driver: 65535 (ETH_MAX_MTU).
      
      In fact, it seems more consistent, wrt min_mtu, to set the max_mtu to
      IP6_MAX_MTU (65535 + sizeof(struct ipv6hdr)) and use it by default.
      
      Let's also, for consistency, set the mtu in vrf_setup(). This function
      calls ether_setup(), which sets the mtu to 1500. Thus, the whole mtu config
      is done in the same function.
      
      Before the patch:
      $ ip link add blue type vrf table 1234
      $ ip link list blue
      9: blue: <NOARP,MASTER> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
          link/ether fa:f5:27:70:24:2a brd ff:ff:ff:ff:ff:ff
      $ ip link set dev blue mtu 65535
      $ ip link set dev blue mtu 65536
      Error: mtu greater than device maximum.
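
The numbers involved can be checked with a few lines of arithmetic. The constants below restate values from this message, plus the fixed 40-byte IPv6 header size for sizeof(struct ipv6hdr); `mtu_allowed()` is a hypothetical helper that only models the "mtu greater than device maximum" check.

```python
# Arithmetic behind the MTU limits quoted above.
ETH_MAX_MTU = 0xFFFF                   # 65535, the old max_mtu
IPV6_HDR_LEN = 40                      # sizeof(struct ipv6hdr)
IP6_MAX_MTU = 65535 + IPV6_HDR_LEN     # the new max_mtu and default
OLD_DEFAULT_MTU = 65536                # default shown by 'ip link list blue'


def mtu_allowed(mtu, max_mtu):
    """Model of the upper-bound check applied when setting the MTU."""
    return mtu <= max_mtu


print(IP6_MAX_MTU)                                # 65575
print(mtu_allowed(OLD_DEFAULT_MTU, ETH_MAX_MTU))  # False: default exceeded the max
print(mtu_allowed(OLD_DEFAULT_MTU, IP6_MAX_MTU))  # True
```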
      
      Fixes: 5055376a ("net: vrf: Fix ping failed when vrf mtu is set to 0")
      CC: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. Jun 02, 2021
  11. Apr 14, 2021
  12. Dec 09, 2020
  13. Dec 05, 2020
  14. Dec 04, 2020
    • vrf: add mac header for tunneled packets when sniffer is attached · 04893908
      Andrea Mayer authored
      
      Before this patch, a sniffer attached to a VRF used as the receiving
      interface of L3 tunneled packets detects them as malformed packets and
      it complains about that (i.e.: tcpdump shows bogus packets).
      
      The reason is that a tunneled L3 packet does not carry any L2
      information and when the VRF is set as the receiving interface of a
      decapsulated L3 packet, no mac header is currently set or valid.
      Therefore, the purpose of this patch consists of adding a MAC header to
      any packet which is directly received on the VRF interface ONLY IF:
      
       i) a sniffer is attached on the VRF and ii) the mac header is not set.
      
      In this case, the mac address of the VRF is copied in both the
      destination and the source address of the ethernet header. The protocol
      type is set either to IPv4 or IPv6, depending on which L3 packet is
      received.
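
The header described above can be illustrated outside the kernel. The sketch below builds a 14-byte Ethernet header in Python with the same layout: the VRF MAC in both address fields and an EtherType chosen from the IP version nibble. `add_mac_header()` is a hypothetical name, and the MAC address and packet bytes are example values.

```python
# Userspace illustration of the synthesized Ethernet header; not kernel code.
import struct

ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD


def add_mac_header(l3_packet: bytes, vrf_mac: bytes) -> bytes:
    """Prepend an Ethernet header whose source and destination are both the
    VRF MAC and whose protocol type follows the L3 packet's IP version."""
    version = l3_packet[0] >> 4
    proto = {4: ETH_P_IP, 6: ETH_P_IPV6}[version]
    return struct.pack("!6s6sH", vrf_mac, vrf_mac, proto) + l3_packet


vrf_mac = bytes.fromhex("faf52770242a")      # example address
ipv4_packet = b"\x45" + bytes(19)            # minimal 20-byte IPv4 header
frame = add_mac_header(ipv4_packet, vrf_mac)
print(frame[:14].hex())  # faf52770242afaf52770242a0800
```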
      
      Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. Nov 24, 2020
  16. Nov 12, 2020
    • vrf: Fix fast path output packet handling with async Netfilter rules · 9e2b7fa2
      Martin Willi authored
      
      VRF devices use an optimized direct path on output if a default qdisc
      is involved, calling Netfilter hooks directly. This path, however, does
      not consider Netfilter rules completing asynchronously, such as with
      NFQUEUE. The Netfilter okfn() is called for asynchronously accepted
      packets, but the VRF never passes that packet down the stack to send
      it out over the slave device. Using the slower redirect path for this
      seems not feasible, as we do not know beforehand if a Netfilter hook
      has asynchronously completing rules.
      
      Fix the use of asynchronously completing Netfilter rules in OUTPUT and
      POSTROUTING by using a special completion function that additionally
      calls dst_output() to pass the packet down the stack. Also, slightly
      adjust the use of nf_reset_ct() so that it is called in the asynchronous
      case, too.
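
The difference between the synchronous and asynchronous cases can be modelled in a few lines. This is a rough Python sketch, not the driver's code: it assumes a hook that returns 1 when the packet is accepted immediately and otherwise parks it until a later reinject calls the completion function. All names are hypothetical, and `wire.append()` stands in for dst_output().

```python
# Toy model of synchronous vs. asynchronous Netfilter completion.
def nf_hook(skb, okfn, queue):
    """Return 1 when the packet is accepted synchronously.

    A packet marked 'nfqueue' is parked instead; okfn runs later on reinject.
    """
    if skb["nfqueue"]:
        queue.append((skb, okfn))
        return 0
    return 1


def reinject_all(queue):
    """Queued packets were accepted: call their completion function."""
    while queue:
        skb, okfn = queue.pop(0)
        okfn(skb)


def vrf_direct_output(skb, okfn, queue, wire):
    if nf_hook(skb, okfn, queue) == 1:
        wire.append(skb["id"])       # synchronous path: caller sends the packet


def run(okfn_factory):
    queue, wire = [], []
    okfn = okfn_factory(wire)
    vrf_direct_output({"id": "sync", "nfqueue": False}, okfn, queue, wire)
    vrf_direct_output({"id": "async", "nfqueue": True}, okfn, queue, wire)
    reinject_all(queue)
    return wire


def old_okfn(wire):
    return lambda skb: None                     # completes without sending


def new_okfn(wire):
    return lambda skb: wire.append(skb["id"])   # also passes the packet down


print(run(old_okfn))  # ['sync']
print(run(new_okfn))  # ['sync', 'async']
```

With the old completion function the queued packet is accepted but never leaves; with the new one it is sent after reinjection.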
      
      Fixes: dcdd43c4 ("net: vrf: performance improvements for IPv4")
      Fixes: a9ec54d1 ("net: vrf: performance improvements for IPv6")
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Link: https://lore.kernel.org/r/20201106073030.3974927-1-martin@strongswan.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. Jul 24, 2020
    • vrf: Handle CONFIG_SYSCTL not set · 1b6687e3
      David Ahern authored
      
      Randy reported compile failure when CONFIG_SYSCTL is not set/enabled:
      
      ERROR: modpost: "sysctl_vals" [drivers/net/vrf.ko] undefined!
      
      Fix by splitting out the sysctl init and cleanup into helpers that
      can be set to do nothing when CONFIG_SYSCTL is disabled. In addition,
      move vrf_strict_mode and vrf_strict_mode_change to above
      vrf_shared_table_handler (code move only) and wrap all of it
      in the ifdef CONFIG_SYSCTL.
      
      Update the strict mode tests to check for the existence of the
      /proc/sys entry.
      
      Fixes: 33306f1a ("vrf: add sysctl parameter for strict mode")
      Cc: Andrea Mayer <andrea.mayer@uniroma2.it>
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. Jun 21, 2020
  19. May 04, 2020
  20. Apr 23, 2020
  21. Apr 22, 2020
  22. Mar 24, 2020
  23. Oct 24, 2019
    • net: core: add generic lockdep keys · ab92d68f
      Taehee Yoo authored
      
      Some interface types can be nested
      (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc.).
      These interface types should set a lockdep class because, without a lockdep
      class key, lockdep always warns about nonexistent circular locking.
      
      In the current code, these interfaces have their own lockdep class keys and
      manage them themselves, so there is a lot of duplicated code around
      /drivers/net and /net/.
      This patch adds new generic lockdep keys and some helper functions for them.
      
      This patch makes the following changes:
      a) Add lockdep class keys in struct net_device
         - qdisc_running, xmit, addr_list, qdisc_busylock
         - these keys are used as dynamic lockdep keys.
      b) When a net_device is being allocated, lockdep keys are registered.
         - alloc_netdev_mqs()
      c) When a net_device is being freed, lockdep keys are unregistered.
         - free_netdev()
      d) Add generic lockdep key helper functions
         - netdev_register_lockdep_key()
         - netdev_unregister_lockdep_key()
         - netdev_update_lockdep_key()
      e) Remove unnecessary generic lockdep macros and functions
      f) Remove unnecessary lockdep code from each interface.
      
      After this patch, interface modules don't need to maintain
      their own lockdep keys.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. Oct 01, 2019
    • netfilter: drop bridge nf reset from nf_reset · 895b5c9f
      Florian Westphal authored
      
      commit 174e2381
      ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
      recycle always drop skb extensions.  The additional skb_ext_del() that is
      performed via nf_reset on napi skb recycle is not needed anymore.
      
      Most nf_reset() calls in the stack are there so queued skb won't block
      'rmmod nf_conntrack' indefinitely.
      
      This removes the skb_ext_del from nf_reset, and renames it to a more
      fitting nf_reset_ct().
      
      In a few selected places, add a call to skb_ext_reset to make sure that
      no active extensions remain.
      
      I am submitting this for "net", because we're still early in the release
      cycle.  The patch applies to net-next too, but I think the rename causes
      needless divergence between those trees.
      
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  25. Sep 27, 2019
  26. Jul 21, 2019
  27. Jun 26, 2019
  28. Jun 23, 2019
    • ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42
      Wei Wang authored
      
      For the tx path, in most cases, we still have to take a refcnt on the dst
      because the caller is caching the dst somewhere. But it is still
      beneficial to make use of the RT6_LOOKUP_F_DST_NOREF flag while doing the
      route lookup, because this flag prevents manipulating the refcnt on
      net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
      routing table. The null_entry is a shared object and constant updates on
      it cause false sharing.
      
      We converted the current major lookup function ip6_route_output_flags()
      to make use of RT6_LOOKUP_F_DST_NOREF.
      
      Together with the change in the rx path, we see a noticeable performance
      boost:
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
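
For scale, the two rates quoted above work out to roughly a 45% gain. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
# Relative gain implied by the two packet rates quoted above.
before_mpps = 3.78
after_mpps = 5.50

speedup = after_mpps / before_mpps
gain_percent = (after_mpps - before_mpps) / before_mpps * 100

print(f"{speedup:.3f}x")       # 1.455x
print(f"{gain_percent:.1f}%")  # 45.5%
```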
      
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. May 30, 2019
  30. Apr 24, 2019
  31. Apr 08, 2019
    • ipv4: Add helpers for neigh lookup for nexthop · 5c9f7c1d
      David Ahern authored
      
      A common theme in the output path is looking up a neigh entry for a
      nexthop, either the gateway in an rtable or a fallback to the daddr
      in the skb:
      
              nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      To allow the nexthop to be an IPv6 address we need to consider the
      family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
      on it.
      
      To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
      added in an earlier patch which handles:
      
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      And then add a second one, ip_neigh_for_gw, that calls either
      ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.
      
      Update the output paths in the VRF driver and core v4 code to use
      ip_neigh_for_gw simplifying the family based lookup and making both
      ready for a v6 nexthop.
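
The dispatch the helpers perform can be sketched in Python. The function names are borrowed from this message, but the signatures, the dict-based neighbour tables and the route representation are assumptions of the sketch, not the kernel API.

```python
# Toy model of the family-based neighbour lookup; not the kernel API.
import socket

arp_tbl, nd_tbl = {}, {}


def _lookup_or_create(table, dev, addr):
    # "Look up, and create on a miss", as in the snippets above.
    return table.setdefault((dev, addr), {"dev": dev, "addr": addr})


def ip_neigh_gw4(dev, addr):
    return _lookup_or_create(arp_tbl, dev, addr)


def ip_neigh_gw6(dev, addr):
    return _lookup_or_create(nd_tbl, dev, addr)


def ip_neigh_for_gw(rt, dev, daddr):
    """Choose the neighbour table from the address family of the gateway."""
    if rt["gw_family"] == socket.AF_INET:
        return ip_neigh_gw4(dev, rt["gw"])
    if rt["gw_family"] == socket.AF_INET6:
        return ip_neigh_gw6(dev, rt["gw"])
    return ip_neigh_gw4(dev, daddr)    # no gateway: fall back to the daddr


neigh = ip_neigh_for_gw({"gw_family": socket.AF_INET6, "gw": "fe80::1"},
                        "eth0", "10.0.0.2")
print(neigh["addr"], ("eth0", "fe80::1") in nd_tbl)  # fe80::1 True
```

A route without a gateway would be passed with `gw_family` set to `None`, which takes the fallback branch.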
      
      ipv4_neigh_lookup has a different need - the potential to resolve a
      passed in address in addition to any gateway in the rtable or skb. Since
      this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 directly. The
      difference between __neigh_create used by the helpers and neigh_create
      called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
      and bump the refcnt on the neigh entry.
      
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • neighbor: Add skip_cache argument to neigh_output · 0353f282
      David Ahern authored
      
      A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
      entry will exist in the v6 ndisc table and the cached header will contain
      the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
      use the v6 neighbor entry, neigh_output needs to skip the cached header
      and just use the output callback for the neigh entry.
      
      A future patchset can look at expanding the hh_cache to handle 2
      protocols. For now, IPv6 gateways with an IPv4 route will take the
      extra overhead of generating the header.
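
Why the cached header has to be skipped can be shown with a small model. In this Python sketch the neighbour entry carries a cached Ethernet header whose protocol field is IPv6. The `neigh_output()` here only mirrors the idea of the kernel function: its signature, the dict layout and the MAC address are assumptions for illustration.

```python
# Toy model of the cached-header fast path and the skip_cache bypass.
import struct

ETH_P_IP, ETH_P_IPV6 = 0x0800, 0x86DD


def eth_header(proto, mac=b"\x02\x00\x00\x00\x00\x01"):
    return struct.pack("!6s6sH", mac, mac, proto)


def neigh_output(neigh, payload, proto, skip_cache):
    if not skip_cache and neigh["connected"] and neigh["hh_cache"]:
        return neigh["hh_cache"] + payload     # fast path: reuse cached header
    return neigh["output"](payload, proto)     # slow path: build a header


# Neighbour entry from the v6 ndisc table: its cached header says IPv6.
v6_neigh = {
    "connected": True,
    "hh_cache": eth_header(ETH_P_IPV6),
    "output": lambda payload, proto: eth_header(proto) + payload,
}

ipv4_payload = b"\x45" + bytes(19)
cached = neigh_output(v6_neigh, ipv4_payload, ETH_P_IP, skip_cache=False)
built = neigh_output(v6_neigh, ipv4_payload, ETH_P_IP, skip_cache=True)
print(cached[12:14].hex(), built[12:14].hex())  # 86dd 0800
```

The fast path stamps the IPv4 packet with the IPv6 protocol type, while the slow path produces the correct one at the cost of building the header each time.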
      
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Fix ping failed when vrf mtu is set to 0 · 5055376a
      Miaohe Lin authored
      
      When the mtu of a vrf device is set to 0, ping fails. So I think we
      should limit the vrf mtu to a reasonable range to solve this problem.
      I set dev->min_mtu to IPV6_MIN_MTU, so it works for both ipv4 and ipv6.
      Leaving dev->max_mtu at 0 could be confusing, so I set dev->max_mtu
      to ETH_MAX_MTU.
      
      Here are the steps to reproduce:
      
      1.Config vrf interface and set mtu to 0:
      3: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
      master vrf1 state UP mode DEFAULT group default qlen 1000
          link/ether 52:54:00:9e:dd:c1 brd ff:ff:ff:ff:ff:ff
      
      2.Ping peer:
      3: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel
      master vrf1 state UP group default qlen 1000
          link/ether 52:54:00:9e:dd:c1 brd ff:ff:ff:ff:ff:ff
          inet 10.0.0.1/16 scope global enp4s0
             valid_lft forever preferred_lft forever
      connect: Network is unreachable
      
      3.Set mtu to default value, ping works:
      PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
      64 bytes from 10.0.0.2: icmp_seq=1 ttl=64 time=1.88 ms
      
      Fixes: ad49bc63 ("net: vrf: remove MTU limits for vrf device")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. Mar 28, 2019
  33. Feb 21, 2019
  34. Dec 06, 2018
  35. Nov 08, 2018