[systemd-devel] How to disable seccomp in systemd-nspawn?

Lennart Poettering lennart at poettering.net
Sun Aug 16 14:47:55 UTC 2020


On Sun, 16.08.20 15:01, Steve Dodd (steved424 at gmail.com) wrote:

> On Sun, 16 Aug 2020 at 14:54, Lennart Poettering <lennart at poettering.net>
> wrote:
>
>
> > > I've just been bitten by this - last time I looked into a similar
> > > problem, it seemed the calling code was confused by getting EPERM
> > > instead of ENOSYS. Could we distinguish between these two cases and
> > > generate the right error code? It would save a lot of aggro when
> > > working with containers..
> >
>
>
> > Which error to return is a bit of a bikeshedding thing.
> >
> > We return EPERM because this is about sandboxing for us, i.e. access
> > control. And we want to communicate that correctly to payloads, so we
> > say so.
> >
> > ENOSYS would be something we'd return if we'd pretend that something
> > isn't available even though it is.
> >
>
> I'm assuming we don't actually check what's available on the host kernel..
> All the problems I've hit around this have been new syscalls which libc
> tests for by checking for ENOSYS - if it gets that, it falls back to a
> different implementation. If it gets EPERM, however, it just assumes the
> operation failed and returns to caller, which leaves poor users like me and
> the OP scratching their heads :)

Hmm, well, no one knows what seccomp filters people install with the
myriad of seccomp-using tools we have these days.

I think it would be wise to do fallback logic for EPERM too. It's
the error that nspawn has used since day #1, basically. I am a bit
puzzled that no one noticed this before: afaik glibc test cases, at
least on Fedora (which most glibc upstream devs work on), run in
nspawn, so how did no one notice?
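
To make the idea concrete, here is a rough sketch of that kind of
fallback logic. It is Python only for brevity, not glibc code, and the
names call_with_fallback, new_impl and old_impl are made up for
illustration:

```python
import errno

# Errors treated as "the new syscall is not usable here": ENOSYS from a
# kernel that lacks it, EPERM from a seccomp sandbox such as nspawn's.
FALLBACK_ERRNOS = frozenset({errno.ENOSYS, errno.EPERM})


def call_with_fallback(new_impl, old_impl, *args):
    """Try new_impl first; on ENOSYS or EPERM retry with old_impl.

    Both arguments are hypothetical callables wrapping the new and the
    old syscall. Any other error is passed through unchanged.
    """
    try:
        return new_impl(*args)
    except OSError as exc:
        if exc.errno not in FALLBACK_ERRNOS:
            raise
        return old_impl(*args)
```

If the EPERM was a genuine permission problem rather than a filter,
the old implementation would presumably fail the same way, so the
caller still sees the real error.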

I also think glibc should probably continue to use the old syscalls if
possible and only use the new syscalls when the old ones won't
do... After all, needlessly using new syscalls won't just trip up
things here, but all across the board where people decode/track
syscalls, even in strace or so...

> A rule of thumb might be to return ENOSYS for anything libseccomp doesn't
> know about - is it possible to look things up that way around?

libseccomp doesn't allow us to install filters for syscalls it doesn't
know about anyway, iirc...

Not sure I follow though? Why would that help?

> Another useful thing might be to allow whitelisting by syscall number -
> again don't know if seccomp allows this. Would allow easier work arounds in
> cases like this without having to go off and backport libseccomp...

Syscall numbers are highly arch-dependent. We currently don't support
that because you cannot reasonably express this in unit files, as
they'd become very much arch-dependent then.

That said, I'd be happy to review/merge a patch that adds a syntax
where you could spell out SystemCallFilter=x86-64:345 for example,
i.e. specify arch plus syscall nr. But it's still ugly, since it would
result in different filters on different archs.
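
Purely as an illustration of what parsing such an entry could look
like (this syntax does not exist today, and parse_filter_entry is a
made-up name, written in Python for brevity):

```python
def parse_filter_entry(entry):
    """Parse one hypothetical SystemCallFilter= entry.

    "x86-64:345" becomes ("x86-64", 345); a plain syscall name such as
    "open" becomes (None, "open"), i.e. no architecture restriction.
    """
    arch, sep, number = entry.partition(":")
    if not sep:
        # No colon: an ordinary, arch-independent syscall name.
        return (None, entry)
    if not arch or not (number.isascii() and number.isdigit()):
        raise ValueError(f"invalid arch:nr entry: {entry!r}")
    return (arch, int(number))
```

The architecture has to travel with the number because the same
number can refer to different syscalls on different archs.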

> Third thing on my wishlist might be a log entry for denied syscalls
> somewhere ..

Hmm, this would make a ton of sense. We currently have a "log" seccomp
action, but it will just log and allow anyway. We'd need another
action that would log and refuse. Please file an RFE, or even better
prep a PR for this!
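
A toy model of the difference, just to pin down the semantics. This is
Python for brevity and not how seccomp is implemented; the
LOG_AND_REFUSE action and all names here are hypothetical:

```python
import errno
import logging
from enum import Enum

log = logging.getLogger("seccomp-model")


class Action(Enum):
    ALLOW = "allow"
    LOG = "log"                   # today's "log" action: log, then allow
    ERRNO = "errno"               # refuse with EPERM, nothing logged
    LOG_AND_REFUSE = "log+errno"  # hypothetical: log, then refuse


def apply_action(action, syscall_name):
    """Return 0 if the call may proceed, or -EPERM if it is refused."""
    if action in (Action.LOG, Action.LOG_AND_REFUSE):
        log.warning("syscall %s matched action %s", syscall_name, action.value)
    if action in (Action.ERRNO, Action.LOG_AND_REFUSE):
        return -errno.EPERM
    return 0
```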

Lennart

--
Lennart Poettering, Berlin

