Resolver stops working and returns SERVFAIL until restarted
After some period of normal operation, knot-resolver stops resolving any domains and returns SERVFAIL for all DNS queries. I have the following configuration:
# cat /etc/knot-resolver/kresd.conf
user('knot-resolver','knot-resolver')
cache.size = 300 * MB
net.ipv6 = false
modules = {
    'hints > iterate', -- Load /etc/hosts and allow custom root hints
    'stats',           -- Track internal statistics
    'predict',         -- Prefetch expiring/frequent records
}
-- minimum TTL = 2 minutes
cache.min_ttl(120)
dofile("/etc/knot-resolver/knot-aliases-alt.conf")
policy.add(
    policy.suffix(
        policy.STUB({'127.0.0.4'}),
        policy.todnames(blocked_hosts)
    )
)
# cat /etc/knot-resolver/knot-aliases-alt.conf
blocked_hosts = {
"0000a-fast-proxy.de.",
"002cc20.icu.",
"007ingyenletoltes.hu.",
"007rc.biz.",
"007slots.com.",
"00seeds.com.",
"010119azino777.com.",
"010119azino777.ru.",
…
"zzzes.ru.",
"zzztorrent.net.",
"zzzz1.live.",
"zzzz2.live.",
}
Both normal recursive queries and queries for names in blocked_hosts, which should be forwarded to 127.0.0.4, fail.
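For anyone trying to reproduce this, a rough way to check both paths is to query one ordinary name and one name from blocked_hosts and compare the response codes. The sketch below assumes `dig` is installed and that kresd listens on 127.0.0.1; the `RESOLVER` variable and the helper names `dns_rcode` and `probe` are made up for illustration.

```shell
#!/bin/sh
# Extract the response code (NOERROR, SERVFAIL, ...) from dig output on stdin.
# Prints nothing if the output contains no header, e.g. after a timeout.
dns_rcode() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query one name and print "<name> <rcode>".
# RESOLVER is an assumption: set it to the address kresd actually listens on.
probe() {
    resolver="${RESOLVER:-127.0.0.1}"
    rcode=$(dig +time=2 +tries=1 "@$resolver" "$1" A 2>/dev/null | dns_rcode)
    printf '%s %s\n' "$1" "${rcode:-NO_RESPONSE}"
}
```

Running `probe example.com` exercises the recursive path, and `probe 007slots.com` exercises the STUB path to 127.0.0.4. When the resolver is in the broken state, both would be expected to print SERVFAIL.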
I've just enabled verbose logging to monitor the issue, but the log output seems to be heavily buffered: new entries show up in journalctl in bursts, one large batch every 30 seconds or so. I'm not sure whether this is some kind of caching and is to be expected, or whether it points to a locking problem. It even triggered the watchdog once:
systemd[1]: kresd@1.service: Watchdog timeout (limit 10s)!
systemd[1]: kresd@1.service: Killing process 23036 (kresd) with signal SIGABRT.
systemd[1]: kresd@1.service: Main process exited, code=killed, status=6/ABRT
systemd[1]: kresd@1.service: Unit entered failed state.
systemd[1]: kresd@1.service: Failed with result 'watchdog'.
systemd[1]: kresd@1.service: Service hold-off time over, scheduling restart.
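One rough way to quantify the bursts is to look for long silences between consecutive journal lines. The sketch below reads `journalctl -o short-unix` output, where the first field is an epoch timestamp, and reports every gap above a threshold. The helper name `log_gaps` and the default threshold of 5 seconds are illustrative assumptions. Note that journald timestamps record when a line was received, so this shows when output arrived, not necessarily when kresd produced it.

```shell
#!/bin/sh
# Report journal lines that arrive more than N seconds after the previous line.
# Input: `journalctl -o short-unix` output (epoch timestamp in the first field).
# Usage: log_gaps [N]   (N defaults to 5 seconds, an arbitrary choice)
log_gaps() {
    awk -v limit="${1:-5}" '
        # Skip anything that does not start with a numeric timestamp,
        # such as the "-- Logs begin at ..." header line.
        $1 !~ /^[0-9]+(\.[0-9]+)?$/ { next }
        seen && ($1 - prev) > limit {
            printf "gap %.1fs before: %s\n", $1 - prev, $0
        }
        { prev = $1; seen = 1 }
    '
}
```

For example, `journalctl -u kresd@1.service -o short-unix --since "1 hour ago" | log_gaps 10` lists every silence longer than the 10 s watchdog limit seen above.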
The issue happens irregularly. It used to work fine for weeks, but in the last 3 days it has happened 3 times. Sometimes it takes dozens of hours, sometimes only several minutes. I did not change the configuration, and I updated the software only after the second occurrence. It happens on 4.1.0.
Right now I'm running verbose logging and will update this issue when it happens again.