[systemd-devel] [PATCH] loopback setup in unprivileged containers

Stéphane Graber stephane.graber at canonical.com
Sun Dec 28 09:18:58 PST 2014


On Sun, Dec 28, 2014 at 01:48:23PM +0100, Tom Gundersen wrote:
> Hi Martin,
> 
> On Sat, Dec 27, 2014 at 7:27 PM, Martin Pitt <martin.pitt at ubuntu.com> wrote:
> > I'm forwarding a patch for the loopback setup from Stéphane. I already
> > pushed one part of it as http://cgit.freedesktop.org/systemd/systemd/commit/?id=58a489c
> > which is trivial and obvious, but the other part isn't.
> 
> Thanks for that fix!
> 
> I had a look at this code again, and it turns out that the whole
> address checking is not really needed any longer, and can be
> simplified quite a bit. I'd like to push the attached patch if no one
> objects.
> 
> > Stéphane Graber <stgraber at ubuntu.com> wrote:
> >> Attached is a pretty simple patch/workaround to fix the massive CPU
> >> usage of systemd in unprivileged containers.
> >>
> >> LXC provides each container with an already-UP loopback device. systemd
> >> will attempt to bring it up regardless of its current state, and doing so
> >> gets it into a broken codepath somewhere deep in the netlink handling
> >> code of systemd.
> 
> Hi Stéphane,
> 
> I was not able to reproduce this. Is it reproducible for you using
> nspawn? If not, could you point me to how to reproduce it with LXC, or
> even better give some more details about the failure you see "deep in
> the netlink handling"? Is it 100% reproducible, and are you able to
> get a backtrace? This really sounds like something we need to fix at
> its root.

Hi,

My host system doesn't have nspawn, so I can't easily test it that way.
It was also my understanding that nspawn doesn't support user namespaces
and uid/gid mappings, which are what I'm working with here.

Now, as far as I could tell, the problem occurs when reading a response
back over netlink: I'd end up in a recvmsg loop that kept spinning and
only returned once the timeout for the operation was reached.


To reproduce, I'm using current LXC from git plus a patch to make
cap-dropping work in unprivileged containers, combined with the current
git of lxcfs (which in turn depends on a recent cgmanager).
So not a trivial setup to reproduce at this point...


I have however reverted to an unpatched systemd and traced the container
boot, so hopefully that'll give you a clue as to what's going on.

That's on Ubuntu 15.04 with systemd 218 and a 3.13 kernel.

The attached strace was made right from the start of systemd all the way
until I got a login prompt, so it's quite massive (190MB uncompressed)
but should contain everything.

At the time systemd gets started, the "ip link show" output is:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
95: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:16:3e:42:35:ca brd ff:ff:ff:ff:ff:ff

And the "ip addr show" output is:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
97: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 00:16:3e:92:13:4c brd ff:ff:ff:ff:ff:ff
    inet6 fe80::216:3eff:fe92:134c/64 scope link tentative 
       valid_lft forever preferred_lft forever

> 
> > The fix is to always check whether the loopback is ready to use before
> > doing anything.
> 
> The workaround looks fine (i.e., it will give the correct behaviour),
> but I'd really prefer that we don't do this upstream, but rather fix
> the underlying problem.

Agreed. All netlink operations should work as expected even in a user
namespace (that is, uid 0 in the user namespace will have all the usual
privileges against the network namespace that it owns, if any).

I've also never run into that kind of problem when manually using the
iproute tools, a dhcp client or various other daemons that make use of
netlink, so if there's a kernel bug somewhere there, it's hidden in a corner.

> 
> Cheers,
> 
> Tom

> From 13139185a50c286769810e3e7979cfcf51c48ee9 Mon Sep 17 00:00:00 2001
> From: Tom Gundersen <teg at jklm.no>
> Date: Sun, 28 Dec 2014 13:38:23 +0100
> Subject: [PATCH] core: loopback - simplify check_loopback()
> 
> We no longer configure the addresses on the loopback interface, but simply bring it up
> and let the kernel do the rest. Also change the check to only check if the interface
> is up, rather than checking for the IPv4 loopback address.
> ---
>  src/core/loopback-setup.c | 42 ++++++++++++++++++------------------------
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/src/core/loopback-setup.c b/src/core/loopback-setup.c
> index ab6335c..0d7d00c 100644
> --- a/src/core/loopback-setup.c
> +++ b/src/core/loopback-setup.c
> @@ -56,30 +56,24 @@ static int start_loopback(sd_rtnl *rtnl) {
>          return 0;
>  }
>  
> -static int check_loopback(void) {
> +static bool check_loopback(sd_rtnl *rtnl) {
> +        _cleanup_rtnl_message_unref_ sd_rtnl_message *req = NULL, *reply = NULL;
> +        unsigned flags;
>          int r;
> -        _cleanup_close_ int fd = -1;
> -        union {
> -                struct sockaddr sa;
> -                struct sockaddr_in in;
> -        } sa = {
> -                .in.sin_family = AF_INET,
> -                .in.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
> -        };
> -
> -        /* If we failed to set up the loop back device, check whether
> -         * it might already be set up */
> -
> -        fd = socket(AF_INET, SOCK_DGRAM|SOCK_NONBLOCK|SOCK_CLOEXEC, 0);
> -        if (fd < 0)
> -                return -errno;
> -
> -        if (bind(fd, &sa.sa, sizeof(sa.in)) >= 0)
> -                r = 1;
> -        else
> -                r = errno == EADDRNOTAVAIL ? 0 : -errno;
> -
> -        return r;
> +
> +        r = sd_rtnl_message_new_link(rtnl, &req, RTM_GETLINK, LOOPBACK_IFINDEX);
> +        if (r < 0)
> +                return false;
> +
> +        r = sd_rtnl_call(rtnl, req, 0, &reply);
> +        if (r < 0)
> +                return false;
> +
> +        r = sd_rtnl_message_link_get_flags(reply, &flags);
> +        if (r < 0)
> +                return false;
> +
> +        return flags & IFF_UP;
>  }
>  
>  int loopback_setup(void) {
> @@ -92,7 +86,7 @@ int loopback_setup(void) {
>  
>          r = start_loopback(rtnl);
>          if (r == -EPERM) {
> -                if (check_loopback() < 0)
> +                if (!check_loopback(rtnl))
>                          return log_warning_errno(EPERM, "Failed to configure loopback device: %m");
>          } else if (r < 0)
>                  return log_warning_errno(r, "Failed to configure loopback device: %m");
> -- 
> 2.2.0
> 


-- 
Stéphane Graber
Ubuntu developer
http://www.canonical.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: systemd-strace.xz
Type: application/octet-stream
Size: 303192 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/systemd-devel/attachments/20141228/f7433bfe/attachment-0001.obj>

