[systemd-devel] Securing bind with systemd methods (was: bind-mount of /run/systemd for chrooted bind9/named)

Petr Menšík pemensik at redhat.com
Thu Jul 20 16:25:07 UTC 2023


BIND wants to read the ephemeral port range to use for outgoing queries. We
have such special quirks bind-mounted into the BIND chroot. But without
SELinux-like protection that might not be needed.

Consider bind-mounting /proc/sys/net/ipv4/ip_local_port_range read-only into the chroot.
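
An untested sketch of such a mount as a drop-in, assuming the chroot lives at
/var/local/chroot/bind as in the unit quoted below. The target path inside
the chroot is an assumption, and the mount point may have to exist
beforehand. How this interacts with ProcSubset=pid and
ProtectKernelTunables=true from that unit would need to be verified.

```ini
[Service]
# Assumption: chroot path taken from the quoted unit; adjust to your setup.
# Exposes the host's ephemeral port range read-only inside the chroot.
BindReadOnlyPaths=/proc/sys/net/ipv4/ip_local_port_range:/var/local/chroot/bind/proc/sys/net/ipv4/ip_local_port_range
```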

We also have /etc/rndc.{conf,key} mounted into the chroot. Check named -V
for your build paths, but I would want them in the chroot too.
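
A hedged sketch for the rndc files, again assuming the chroot at
/var/local/chroot/bind from the quoted unit. The /etc/bind/ source paths are
only an assumption for a Debian-style build; use whatever named -V reports
for yours. The leading "-" makes systemd ignore a source path that does not
exist.

```ini
[Service]
# Assumption: Debian-style locations; check `named -V` for the real paths.
BindReadOnlyPaths=-/etc/bind/rndc.conf:/var/local/chroot/bind/etc/bind/rndc.conf
BindReadOnlyPaths=-/etc/bind/rndc.key:/var/local/chroot/bind/etc/bind/rndc.key
```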

Cheers,

Petr

On 7/17/23 14:44, Marc Haber wrote:
> Hi,
>
> I'm back. This is my first try at doing a decent systemd unit for bind 9
> / named chrooted with named's own features, making the chroot minimal
> and code-free.
>
> Here we go (this has been merged from various plug-in/overrides files, I
> don't guarantee correct syntax). I have interspersed my
> comments/questions as # comments. If one of the suggested improvements
> warrants filing of an issue, let me know and I'll write well-explained
> issues that are able to stand for themselves.
>
> The first phase of writing this unit was done with systemd 253 on Debian
> unstable, the second phase was on a production machine running Debian
> stable, systemd 252.
>
> [Unit]
> Description=BIND Domain Name Server
> Documentation=man:named(8)
> After=network.target network-online.target
> Wants=nss-lookup.target network-online.target
> Before=nss-lookup.target
> StartLimitIntervalSec=90s
> StartLimitBurst=5
>
> [Service]
> Type=notify
> ExecStart=/usr/sbin/named -f -u bind -c /etc/bind/named.conf -t /var/local/chroot/bind
> # named(8): In routine operation, signals should not be used to control
> # the nameserver; rndc should be used instead. We're following
> # upstream's advice here.
> ExecReload=/usr/sbin/rndc reload
> ExecStop=/usr/sbin/rndc stop
> Restart=on-failure
> RestartSec=5s
> # I'd rather not have / as working directory and this looks the most
> # sensible
> WorkingDirectory=/var/local/chroot/bind
> # Setting RootDirectory=/ results in service failure ("too many
> # symlinks"), repeated StartLimitBurst times. I think this should be
> # special-cased with a more descriptive error message if RootDirectory=/
> # is unwanted. I'd like to explain why I tried that - a lot of the
> # sandboxing directives only apply (or make sense) if RootDirectory
> # is set or a service is being chrooted, my service is chrooting itself
> # and I wanted systemd to know about that and enable those directives
> # that only work in the RootDirectory set case. If I'm not making sense
> # here, then it's a docs issue ;-)
> #RootDirectory=/
> ProtectProc=invisible
> ProcSubset=pid
> BindReadOnlyPaths=/run/systemd/notify:/var/local/chroot/bind/run/systemd/notify
> BindReadOnlyPaths=/usr/share/dns:/var/local/chroot/bind/usr/share/dns
> User=bind
> Group=bind
> UMask=077
> # This means that my non-root service gets those three capabilities and
> # is unable to obtain more, right? Would this warrant its own
> # configuration directive like "service has those capabilities, not
> # more, not less than that"?
> CapabilityBoundingSet=cap_net_admin cap_net_bind_service cap_sys_chroot
> AmbientCapabilities=  cap_net_admin cap_net_bind_service cap_sys_chroot
> NoNewPrivileges=true
> # Haven't investigated the AppArmor profiles that come with bind yet
> #AppArmorProfile
> ProtectSystem=strict
> ProtectHome=yes
> # {Runtime,Cache,Configuration}Directory cannot be used
> # because our bind chroots itself and those directives only
> # create directories under the standard paths. This makes those
> # directives useless in the case where a service chroots itself and
> # needs its Cache, Configuration etc inside the chroot. Maybe it
> # makes sense to adapt the functionality to support this case?
> #RuntimeDirectory=bind
> ReadWritePaths=/var/local/chroot/bind/run
> #CacheDirectory=bind
> ReadWritePaths=/var/local/chroot/bind/var/cache/bind
> #ConfigurationDirectory=bind
> ReadOnlyPaths=/
> InaccessiblePaths=-/lost+found
> NoExecPaths=/
> # /lib is necessary here, or execve will fail without indication for
> # reason - that was a surprise and hard to debug because even strace
> # didn't point me towards the real issue
> ExecPaths=/usr/sbin/named /usr/sbin/rndc /lib
> PrivateTmp=true
> PrivateDevices=true
> PrivateIPC=true
> # enabling PrivateUsers=true causes bind to not bind to its ports and
> # log "couldn't add command channel 127.0.0.1#953: permission denied"
> # What do PrivateUsers have to do with binding to ports?
> ProtectHostname=true
> ProtectClock=true
> ProtectKernelTunables=true
> ProtectKernelModules=true
> ProtectKernelLogs=true
> ProtectControlGroups=true
> # if AF_UNIX is mentioned in systemd.exec(5), maybe mentioning
> # AF_NETLINK would also be in order? This was also one of the
> # solutions I had to pull from an strace.
> RestrictAddressFamilies=AF_NETLINK AF_UNIX AF_INET AF_INET6
> RestrictNamespaces=~user pid net uts mnt cgroup ipc
> LockPersonality=true
> MemoryDenyWriteExecute=true
> RestrictRealtime=true
> RestrictSUIDSGID=true
> RemoveIPC=true
> # My first version of SystemCallFilter was like ~@mount ~@swap
> # ~@resources etc, which didn't work. Reading the docs with a computer
> # scientist's mind ("informatiker") gave a hint, but I think this is
> # hard to understand for people who haven't had formal training. But I
> # also understand that this is hard to change without changing semantics
> # for existing units, so maybe a few examples in systemd.exec(5) might ease
> # this - the SystemCallFilter chapter in systemd.exec(5) is already long
> # though. @raw-ip isn't available in systemd 252, so I had to template
> # that in my ansible. And setuid is setuid32 on 32 bit archs like armhf,
> # so I had to template _that_ for my Banana Pi.
> SystemCallFilter=~@mount @swap @raw-ip @resources @reboot @privileged @obsolete @module @debug @cpu-emulation @clock
> SystemCallFilter=chroot setuid
> SystemCallArchitectures=native
>
> [Install]
> WantedBy=multi-user.target
> # strangely, this alias only holds if the unit is enabled. If the unit
> # is disabled, the alias is not available which was kind of a surprise.
> Alias=bind9.service
>
> Generally, the error messages I received during the debugging phase were
> not very helpful. I frequently had to resort to strace -p 1 to find out
> what exactly went wrong trying to start named.
>
> For example, there is no exact feedback when the daemon is being
> terminated because of a SystemCallFilter violation, I'd like the system
> call in question to be part of the log.
>
> The same applies to directives regarding sandboxing, when paths are
> given in the directive. My way to debug was either randomly removing
> some of the directives to narrow down the possible error range, or
> stracing again to find out what my daemon tried before it was
> terminated.
>
> Those things might be out of scope for systemd, I simply don't know.
>
> With this unit, systemd-analyze security named is now down to "1.9 OK",
> I think it was > 9 with the standard unit.
>
> Thanks for your help, I wanted to give something back. I'll probably
> suggest this unit for the Debian package once it has reached some
> stability.
>
> Greetings
> Marc
>
-- 
Petr Menšík
Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB


