src/dns/nm-dns-plugin.h · 93d5efb4869886a8e8e7a1c05877c9cdf5c6e2d8 · NetworkManager / NetworkManager

dns: move ratelimiting and restart from NMDnsManager to NMDnsDnsmasq · 93d5efb4
Thomas Haller authored Sep 01, 2019
Note that the only DNS plugin that actually emits the FAILED signal was
NMDnsDnsmasq. Let's not handle restart, retry and rate-limiting by
NMDnsManager but by NMDnsDnsmasq itself.

There are three goals here:

(1) we want that when dnsmasq (infrequently) crashes, that we always keep
  retrying. A random crash should be automatically resolved and
  eventually dnsmasq should be working again.
  Note that we anyway cannot fully detect whether something is wrong.
  OK, we detect crashes, but if dnsmasq just gets catatonic, it's just
  as broken. Point being: our ability to detect non-working dnsmasq is limited.

(2) when dnsmasq keeps crashing all the time, then rate limit the retry.
  Of course, at this point there is already something seriously wrong,
  but we shouldn't kill the system by respawning the process without rate
  limiting.

(3) previously, when NMDnsManager noticed that the pluging was broken
  (and rate-limiting kicked in), it would temporarily disable the plugin.
  Basically, that meant to write the real name servers to /etc/resolv.conf
  directly, instead of setting localhost. This partly conflicts with
  (1), because we want to retry and recover automatically. So what good
  is it to notice a problem, resort to plain /etc/resolv.conf for a
  short time, and then run into the issues again? If something is really
  broken, there is no way but to involve the user to investigate and
  fix the issue. Hence, we don't need to concern NMDnsManager with this either.
  The only thing that the manager notices is when the dnsmasq binary is not
  available. In that case, update() fails right away, and the manager falls back
  to configure the name servers in /etc/resolv.conf directly.

Also, change the backoff time from 5 minutes to 1 minute (twice the
burst interval). There is not particularly strong reason for either
choice, I think that if the ratelimit kicks in, then something is
already so wrong that it doesn't matter either way. Anyway, also 60
seconds is long enough to not kill the machine otherwise.
93d5efb4