Skip to content

[th/platform-ip6-multipath-routes] platform: support IPv6 mulitpath routes and fix cache inconsistency

Thomas Haller requested to merge th/platform-ip6-multipath-routes into main

Background:

Route are hard. We have NMPlatform which is a cache of netlink objects. That means, we have a hash table and we cache objects based on some identity (nmp_object_id_equal()). So those objects must have some immutable, indistinguishable properties that determine whether an object is the same a different one.

For routes and routing rules, this identifying property is basically a subset of the attributes (but not all!). That makes it very hard, because tomorrow kernel could add an attribute that becomes part of the identity, and NetworkManager wouldn't recognize it, resulting in cache inconsistency by wrongly thinking two different routes are one and the same. Anyway.

The other point is that we rely on netlink events to maintain the cache. So when we receive a RTM_NEWROUTE we add the object to the cache, and delete it during RTM_DELROUTE. When you do ip route replace, kernel might replace a (different!) route, but only send one RTM_NEWROUTE message. We handle that by somehow finding the route that was replaced/deleted. It's ugly. Did I say, that routes are hard?

Also, for IPv4 routes, multipath attributes are just a part of the routes identity. That is, you add two different routes that only differ by their multipath list, and then kernel does as you would expect. NetworkManager does not support IPv4 multihop routes and just ignores them. As the multipath attribute is part of the identity of an IPv4 route, we can get away with pretending that multipath IPv4 routes don't exist.

Not so for IPv6. When you add (ip route append) an IPv6 route that is identical to an exist route -- except their multipath attribute -- then it behaves as if the existing route was modified and the result is the merged route with more next-hops. Note that in this case kernel will only send a RTM_NEWROUTE message with the full multipath list. If we would treat the multipath list as part of the route's identity, this would be as if kernel deleted one routes and created a different one (the merged one), but only sending one notification. It would be nightmare to find out which route was thereby replaced. Likewise, when you delete a route, then kernel will "subtract" the next-hop and sent a RTM_DELROUTE notification only about the next-hop that was deleted. To handle that, you would have to find the full multihop route, and replace it with the subtracted result.

NetworkManager so far ignored IPv6 routes with more than one next-hop, this means you can start with one single-hop route (that NetworkManger sees and has in the platform cache). Then you create a similar route (only differing by the next-hop). Kernel will merge the routes, but not notify NetworkManager that the single-hop route is not longer a single-hop route. This can easily cause a cache inconsistency and subtle bugs. For IPv6 we MUST handle multihop routes.

Kernels behavior makes little sense, if you expect that routes have an immutable identity and want to get notifications about addition/removal. We can however make sense by it by pretending that all IPv6 routes are single-hop! With only the twist that a single RTM_NEWROUTE notification might notify about multiple routes at the same time. This is what the patch does.

The Patch

Now one RTM_NEWROUTE message can contain multiple IPv6 routes (NMPObject). That would mean that nmp_object_new_from_nl() needs to return a list of objects. But it's not implemented that way. Instead, we still call nmp_object_new_from_nl(), and the parsing code can indicate that there is something more, indicating the caller to call nmp_object_new_from_nl() again in a loop to fetch more objects.

In practice, I think all RTM_DELROUTE messages for IPv6 routes are single-hop. Still, we implement it to handle also multi-hop messages the same way.

Note that we just parse the netlink message again from scratch. The alternative would be to parse the first object once, and then clone the object and only update the next-hop. That would be more efficient, but probably harder to understand/implement.

https://bugzilla.redhat.com/show_bug.cgi?id=1837254#c20

Merge request reports