DNS failure when response is truncated
I had an issue where the captive portal page failed to load for my work's Wi-Fi network. The page would load on Windows computers and iPhones. It would not load on Android phones or on my Linux laptop with NetworkManager. I got the "Hotspot Login" popup with a message that said 'Error resolving "captiveportal.mycompany.com": Temporary failure in name resolution'.
I configured dns=systemd-resolved, and then the captive portal page loaded. It would not load without the systemd-resolved backend.
When it was failing to load I ran 3 commands:
$ nslookup captiveportal.mycompany.com
$ host captiveportal.mycompany.com
$ dig captiveportal.mycompany.com
nslookup said "Truncated, retrying in TCP mode", then "connection timed out; no servers could be reached". host said "connection timed out; no servers could be reached". dig was able to resolve captiveportal.mycompany.com. Its output also had extra data in the Authority and Additional sections: NS records and A records pointing at the name servers. dig showed "MSG SIZE rcvd: 959".
The DNS server used was internal to our network and only supports UDP. Because the message was too long, the server set the truncated (TC) bit in its responses. When the client saw the truncated bit, it retried the query over TCP port 53, and that request timed out. I verified this using Wireshark.
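For anyone who wants to check a captured response by hand, here is a minimal Python sketch of how a client can detect the truncated bit. It assumes you already have the raw bytes of the DNS message (for example, the UDP payload exported from a capture). The function name is_truncated is my own and not taken from any resolver's code.

```python
import struct

# TC (truncation) bit within the 16-bit flags word of the DNS header.
# Header layout per RFC 1035 section 4.1.1: ID, flags, then four counts.
TC_FLAG = 0x0200


def is_truncated(message: bytes) -> bool:
    """Return True if the raw DNS message has the TC bit set."""
    if len(message) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # Network byte order: 16-bit transaction ID followed by 16-bit flags.
    _txid, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_FLAG)
```

A response with flags 0x8380 (QR, TC, RD, RA set) is reported as truncated, while 0x8180 (the same without TC) is not.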
I fixed the issue by shortening the response the DNS server sent back; most of it was unnecessary. We also could have fixed the issue by supporting TCP DNS. RFC 1123 section 6.1.3.2 says:
DNS resolvers and recursive servers MUST support UDP, and SHOULD support TCP, for sending (non-zone-transfer) queries. Specifically, a DNS resolver or server that is sending a non-zone-transfer query MUST send a UDP query first. If the Answer section of the response is truncated and if the requester supports TCP, it SHOULD try the query again using TCP.
So I consider our DNS server to be in the wrong for only supporting UDP.
I'm filing this issue just so that the NetworkManager devs are aware of this behavior. This might be NetworkManager working as expected. But it seemed odd that everything worked on Windows and with systemd-resolved, but not with NetworkManager's default DNS handling.
The question for DNS clients is: if you receive a truncated DNS response that contains the A record you asked for, and the subsequent TCP DNS query fails, should you or should you not make use of that A record?
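To make the two possible answers concrete, here is a rough Python sketch of the choice. It is not how NetworkManager, glibc, or systemd-resolved is implemented. The names resolve_a, query_udp, query_tcp, and use_truncated_answer are all hypothetical, and the two query callables stand in for real transport code.

```python
from typing import Callable, List, Tuple

# Hypothetical parsed reply: (TC bit set?, A-record addresses in the Answer section)
Reply = Tuple[bool, List[str]]


def resolve_a(
    name: str,
    query_udp: Callable[[str], Reply],
    query_tcp: Callable[[str], Reply],
    use_truncated_answer: bool = True,
) -> List[str]:
    """Resolve A records: UDP first, TCP retry on truncation, optional fallback."""
    truncated, addresses = query_udp(name)
    if not truncated:
        return addresses
    try:
        # The retry over TCP that RFC 1123 says the requester SHOULD attempt.
        _tc, tcp_addresses = query_tcp(name)
        return tcp_addresses
    except OSError:
        # Timeouts and refused connections are OSError subclasses in Python 3.
        # Lenient policy: keep the A records the truncated UDP reply did carry.
        if use_truncated_answer and addresses:
            return addresses
        # Strict policy: treat the failed TCP retry as a resolution failure.
        raise
```

With use_truncated_answer=True the client keeps whatever A records arrived in the truncated UDP reply when the TCP retry fails. With False it raises, which is the failure I saw with nslookup and host.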
I hope that makes sense. Thanks for reading.