platform: add small backoff time before resync
Summary
Add a small backoff time before doing a resync when we start losing Netlink messages due to receiving too many of them.
Resolves: https://issues.redhat.com/browse/RHEL-29902
Purpose
If the socket's RX buffer is full, it's probably because another process is making a lot of changes very quickly, faster than we can process them. Let's give the writer a little time to finish:
- Avoid contending for the kernel's RTNL lock, so we don't make the whole situation even worse and the writer can finish earlier.
- Avoid having to resync again and again because we try to resync while the writer is still making quick changes and we are not yet able to catch up.
This won't help if this situation takes a long time or is continuous, but that's unlikely to happen, and if it does, it's the writer's fault for starving the whole system.
There is no need to progressively increase the backoff time for the same reason: if this situation lasts a long time, it's the writer's fault.
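The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this merge request: `handle_read_error()`, `sleep_msec()`, `count_resync()` and the 200 ms value are hypothetical names and numbers chosen for the example. The only thing it relies on is that a netlink socket reports an overflowed RX buffer as `ENOBUFS`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical fixed delay; the value used by the actual patch is not stated here. */
#define RESYNC_BACKOFF_MSEC 200

/* Sleep for a fixed number of milliseconds, resuming if interrupted by a signal. */
static void
sleep_msec(long msec)
{
    struct timespec ts = {
        .tv_sec  = msec / 1000,
        .tv_nsec = (msec % 1000) * 1000000L,
    };

    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}

/* React to a failed netlink read. On ENOBUFS (messages were lost because the
 * RX buffer overflowed) wait a fixed time so the writer can finish, then
 * resync once. The delay is deliberately not increased on repeated overflows.
 * Returns true if a resync was performed. */
static bool
handle_read_error(int recv_errno, long backoff_msec, void (*resync)(void *), void *user_data)
{
    assert(resync != NULL);

    if (recv_errno != ENOBUFS)
        return false;

    sleep_msec(backoff_msec);
    resync(user_data);
    return true;
}

/* Example resync callback (hypothetical): only counts how often it ran. */
static void
count_resync(void *user_data)
{
    ++*(int *) user_data;
}
```

A caller would invoke something like `handle_read_error(errno, RESYNC_BACKOFF_MSEC, count_resync, &counter)` after a failed `recv()` on the netlink socket. Keeping the delay constant matches the reasoning above: a longer or growing delay would only matter if the writer kept flooding the socket for a long time.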
Note that after f6411ed9 ('platform: dump only selected route protocols') the consequences of doing a resync are way less catastrophic because the routes' dump is quite small. However, it can still be noticeable in some circumstances that I couldn't determine exactly, but seem related to lock contention in the kernel.
In NMCI's ipv6_ignore_nonstatic_routes test, the step Execute "ip -b ... sometimes took quite a long time. In the worst cases, it was even causing the next step of the test to fail. Now, with this patch, this step always takes ~4s. Far fewer "platform-linux: netlink[rtnl]: read: too many netlink events. Need to resynchronize platform cache" messages are seen in the logs now, too.
Checklist
Please read https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/blob/main/CONTRIBUTING.md before opening the merge request. In particular, check that:
- the subject for all commits is concise and explicative
- the message for all commits explains the reason for the change
- the source is properly formatted
- any relevant documentation is up to date
- you have added unit tests if applicable
- the NEWS file is updated when the change deserves to be mentioned, for example for new features, behavior changes, API deprecations, etc.