[systemd-devel] Resolver times out resending with same transaction ID

Wed Mar 29 12:19:51 UTC 2023

This report led me to run a few checks, and indeed: what
systemd-resolved does with NXDOMAIN responses from clearly well-behaved
servers is plain terrible. It should stop doing this as soon as
possible. Instead of caching the negative response, it repeats every
query that results in an NXDOMAIN response. Not once, to detect whether
the workaround is needed, but for every single nonexistent name, even
for repeated queries.
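
To illustrate what caching a negative response means here, a minimal
sketch follows. It is not systemd-resolved code: the names neg_cache,
neg_cache_put and neg_cache_hit are hypothetical, names are assumed to
be already lowercased, and the only protocol rule it relies on is the
RFC 2308 one that the negative TTL is the smaller of the SOA record's
TTL and its MINIMUM field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEG_CACHE_SLOTS 64
#define NEG_NAME_MAX 256

/* One remembered NXDOMAIN answer (hypothetical layout). */
struct neg_entry {
    char name[NEG_NAME_MAX];
    uint64_t expires_at; /* absolute time, in seconds */
    bool used;
};

struct neg_cache {
    struct neg_entry slots[NEG_CACHE_SLOTS];
};

/* RFC 2308: negative TTL = min(TTL of the SOA record, SOA MINIMUM). */
static uint32_t neg_ttl(uint32_t soa_ttl, uint32_t soa_minimum) {
    return soa_ttl < soa_minimum ? soa_ttl : soa_minimum;
}

/* Remember that `name` does not exist until now + ttl. */
static void neg_cache_put(struct neg_cache *c, const char *name,
                          uint64_t now, uint32_t ttl) {
    if (strlen(name) >= NEG_NAME_MAX)
        return; /* too long for this sketch, do not cache */

    /* Fall back to slot 0 if nothing better is found. */
    struct neg_entry *slot = &c->slots[0];
    for (size_t i = 0; i < NEG_CACHE_SLOTS; i++) {
        struct neg_entry *e = &c->slots[i];
        if (e->used && strcmp(e->name, name) == 0) {
            slot = e; /* refresh the existing entry */
            break;
        }
        if (!e->used || e->expires_at <= now)
            slot = e; /* free or expired slot can be reused */
    }

    strcpy(slot->name, name);
    slot->expires_at = now + ttl;
    slot->used = true;
}

/* True if a still-valid NXDOMAIN for `name` is cached, so a repeated
 * query can be answered without sending anything to the server. */
static bool neg_cache_hit(const struct neg_cache *c, const char *name,
                          uint64_t now) {
    for (size_t i = 0; i < NEG_CACHE_SLOTS; i++) {
        const struct neg_entry *e = &c->slots[i];
        if (e->used && e->expires_at > now && strcmp(e->name, name) == 0)
            return true;
    }
    return false;
}
```

With a cache along these lines, a second lookup of the same nonexistent
name within the negative TTL would not reach the network at all, let
alone be sent twice.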

I created issue 26967 [1] requesting that it stop doing such weird
things. Aruba support was able to identify the failing software
versions and when they were fixed. I think this is exactly the kind of
workaround DNS Flag Day 2019 was about. Please stop doing it by default.

Regards,
Petr

[1] https://github.com/systemd/systemd/issues/26967

On 3/21/23 06:32, Vince Del Vecchio wrote:
> Hi all,
>
> I recently observed reverse IPv4 address lookups timing out on a newly
> configured host.  (Ubuntu 22.04LTS, systemd 249.11-0ubuntu3.7).  I
> tracked the problem to the DVE-2018-0001 mitigation code.
>
> An example:
>
> $ resolvectl query 151.101.1.164
> 151.101.1.164: resolve call failed: All attempts to contact name servers or networks failed
>
> tcpdump shows (in relevant part):
>   00:00:00.000000 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ [1au] PTR? 164.1.101.151.in-addr.arpa. (55)
>   00:00:00.021127 IP 8.8.8.8.53 > 192.168.1.48.35911: 26417 NXDomain 0/1/1 (115)
>   00:00:00.021252 IP 192.168.1.48.35911 > 8.8.8.8.53: 26417+ PTR? 164.1.101.151.in-addr.arpa. (44)
>
> The first query gets an "NXDOMAIN", which is the correct answer for
> this address.
>
> However, NXDOMAIN triggers the DVE-2018-0001 mitigation code to send a
> revised query without EDNS OPT (confirmed in debug log).  I **never see
> a response to this revised query**.
>
> If there is only a single DNS server, the resolver resends the OPT-less
> query after a timeout, and *that* gets an NXDOMAIN which is returned.
> However, if there are multiple DNS servers (e.g. 8.8.8.8 8.8.4.4), on
> timing out, it sends another query with EDNS to the next server, and
> the three-packet sequence repeats several times until it gives up.
>
> Since the server *will* respond to a retransmit after 5s, my guess is
> that the server, or maybe something in the network, is dropping
> close-in-time requests with the same transaction id.  I tried a few
> public DNS servers that (surprisingly?) all behaved the same.  I
> haven't found a simple way to rule out a firewall, router or my ISP.
>
> Regardless, my thought is that resending a slightly different query
> after we did get a response should not use the same transaction id.  I
> patched systemd as follows and the problem goes away:
>
> --- a/src/resolve/resolved-dns-transaction.c
> +++ b/src/resolve/resolved-dns-transaction.c
> @@ -1312,6 +1312,7 @@ void dns_transaction_process_reply(DnsTransaction *t, DnsPacket *p, bool encrypt
>                             FORMAT_DNS_RCODE(DNS_PACKET_RCODE(p)),
>                             dns_server_feature_level_to_string(t->clamp_feature_level_nxdomain));
>   
> +                dns_transaction_shuffle_id(t);
>                   dns_transaction_retry(t, false /* use the same server */);
>                   return;
>           }
>                   return;
>           }
>
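
The idea behind the patch above, picking a new transaction ID before
the revised query goes out, could be sketched standalone roughly as
follows. This is not the systemd implementation: choose_fresh_id,
rand16_fn and xorshift_rand16 are hypothetical names, the xorshift
generator only keeps the example deterministic (a real resolver needs a
cryptographically secure source), and the loop assumes far fewer than
65536 IDs are in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Caller-supplied source of random 16-bit values (hypothetical). */
typedef uint16_t (*rand16_fn)(void *state);

/* Demo-only generator: xorshift32, NOT suitable for real DNS IDs.
 * The state must be seeded with a nonzero value. */
static uint16_t xorshift_rand16(void *state) {
    uint32_t x = *(uint32_t *) state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *(uint32_t *) state = x;
    return (uint16_t) (x >> 16);
}

/* True if `id` is already used by another pending transaction. */
static bool id_in_use(const uint16_t *in_use, size_t n, uint16_t id) {
    for (size_t i = 0; i < n; i++)
        if (in_use[i] == id)
            return true;
    return false;
}

/* Pick an ID for the revised query that differs from the ID of the
 * query just answered and from every other pending transaction. */
static uint16_t choose_fresh_id(uint16_t old_id,
                                const uint16_t *in_use, size_t n,
                                rand16_fn rnd, void *state) {
    for (;;) {
        uint16_t id = rnd(state);
        if (id != old_id && !id_in_use(in_use, n, id))
            return id;
    }
}
```

In the trace above, the EDNS query and the OPT-less follow-up both
carry ID 26417; with a step like this, the follow-up would go out under
a different ID.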
>
> A few questions:
>
> - Does anyone else see this?
>
> - Does this look like a reasonable fix?  Any thoughts on whether the
> one other place where dns_transaction_retry(..., false) is called to
> retry the same server with a lower feature level (SERVFAIL etc) should
> do the same?
>
> - Any other issues with the patch?  Or would it be reasonable to (add
> comments and) submit a pull request?
>
> -Vince Del Vecchio
>
-- 
Petr Menšík
Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB