[systemd-devel] [PATCH 3/3] core: support Distribute=n to distribute to n SO_REUSEPORT workers

Zbigniew Jędrzejewski-Szmek zbyszek at in.waw.pl
Thu Nov 14 11:05:10 PST 2013


On Thu, Nov 14, 2013 at 07:46:16PM +0100, Lennart Poettering wrote:
> On Thu, 14.11.13 19:31, Zbigniew Jędrzejewski-Szmek (zbyszek at in.waw.pl) wrote:
> 
> > 
> > On Thu, Nov 14, 2013 at 07:10:51PM +0100, Lennart Poettering wrote:
> > > When the first connection comes in we'd spawn as many instances as
> > > configured in Distribute=
> > Hm, that seems like a big penalty. Why not instead:
> > - when the first connection comes in, start one worker, keep listening
> > - when the second connection comes in, start one worker, keep listening
> > ...
> > - when the n-th connection comes in, start one worker, stop listening
> > 
> > This way at least we don't have more workers than connections, and
> > it staggers the launching of workers a bit, avoiding overload.
> 
> Well, I don't see how we could make this work, either with SO_REUSEPORT
> or with simple duplicated sockets. After all, in this case systemd
> doesn't accept the connections, it just watches the original listening
> fd for the first time it gets POLLIN on it. That's all. From that there
> is no way to determine how many connections are currently in progress,
> i.e. how many connections the other processes which share the fd are
> handling.
> 
> If SO_REUSEPORT is used, then I'd expect PID 1 to hand the listening
> socket it used itself to the first instance it spawned plus a new
> socket that is bound to the same address to the second
Stop here. Instead of starting a second instance right now, keep
listening on the socket. When new connections come in, they might be
scheduled to the first instance, or to systemd. If we get one, we start
another instance, hand it this socket, then open yet another socket
bound to the same address and start listening on that. With m worker
processes started, we will receive 1/(m+1) of new connections.
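
To make this concrete, here is a minimal sketch of such a manager loop
in plain C -- not actual systemd code; the port number and the
Distribute= value are made up, and a real implementation would exec the
service and pass the fd via sd_listen_fds() instead of forking an
accept loop:

  #include <netinet/in.h>
  #include <poll.h>
  #include <stdint.h>
  #include <sys/socket.h>
  #include <unistd.h>

  /* Open a listening TCP socket with SO_REUSEPORT (Linux >= 3.9), so
   * that several such sockets can bind the same address and the kernel
   * spreads incoming connections across them. */
  static int make_reuseport_listener(uint16_t port) {
          int one = 1, fd = socket(AF_INET, SOCK_STREAM, 0);
          struct sockaddr_in sa = { 0 };

          if (fd < 0)
                  return -1;
          sa.sin_family = AF_INET;
          sa.sin_addr.s_addr = htonl(INADDR_ANY);
          sa.sin_port = htons(port);
          if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0 ||
              bind(fd, (struct sockaddr*) &sa, sizeof(sa)) < 0 ||
              listen(fd, SOMAXCONN) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;
  }

  int main(void) {
          const int n = 4;                /* pretend Distribute=4 */
          int m = 0, fd = make_reuseport_listener(2222);

          while (fd >= 0 && m < n) {
                  struct pollfd p = { .fd = fd, .events = POLLIN };

                  /* Wait until the kernel routes a connection to *our*
                   * socket; with m workers sharing the port this happens
                   * for roughly 1/(m+1) of all new connections. */
                  if (poll(&p, 1, -1) < 0)
                          break;

                  if (fork() == 0)        /* worker m: inherits fd */
                          for (;;) {
                                  int c = accept(fd, NULL, NULL);
                                  if (c >= 0)
                                          close(c);   /* serve it here */
                          }

                  /* Hand the hot socket to the worker, open a fresh one
                   * bound to the same port, and keep watching that one.
                   * After the n-th worker, stop listening altogether. */
                  close(fd);
                  fd = ++m < n ? make_reuseport_listener(2222) : -1;
          }
          return 0;
  }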

Now things get a bit more complicated, depending on how long the
connections live. But let's assume that they are short... Then each of
the m workers started so far (m <= n) handles approx. 1/m of the
connections.

If connections live long, then the first worker gets more than a 1/m
share, the second one a bit less, etc. In the limiting case where
connections live "forever", i.e. much longer than the average time
between connections (e.g. ssh), and the number of connections is small
enough that these initial conditions matter, the first process will
have 1 + 1/2 + 1/3 + 1/4 + ... + 1/n ~= ln(n) + 1/2 connections, the
second will have 1/2 + 1/3 + ... + 1/n ~= ln(n) - 1/2, the third will
have 1/3 + ... + 1/n ~= ln(n) - 1, etc. If we don't like this
unevenness, we could start by opening n SO_REUSEPORT sockets in the
beginning, and then activating workers on demand. The downside is the
large number of sockets.
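
(The sums above are easy to check numerically; a throwaway sketch, with
n = 16 picked arbitrarily, compile with -lm:

  #include <math.h>
  #include <stdio.h>

  /* Expected connection count of worker j in the "connections live
   * forever" model: worker j takes a 1/k share of connection k from
   * the moment it exists onwards, i.e. sum_{k=j}^{n} 1/k. */
  int main(void) {
          const int n = 16;
          double tail = 0, expected[4] = { 0 };

          for (int j = n; j >= 1; j--) {
                  tail += 1.0 / j;
                  if (j <= 3)
                          expected[j] = tail;
          }
          printf("worker 1: %.3f vs ln(n)+1/2 = %.3f\n", expected[1], log(n) + 0.5);
          printf("worker 2: %.3f vs ln(n)-1/2 = %.3f\n", expected[2], log(n) - 0.5);
          printf("worker 3: %.3f vs ln(n)-1   = %.3f\n", expected[3], log(n) - 1.0);
          return 0;
  }

It prints 3.381/2.381/1.881 against 3.273/2.273/1.773, so the ln(n)
approximations are rough but in the right ballpark.)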

, and so on for
> all others. Now, if PID 1 keeps watching that original fd, it will get a
> wakeup only when the kernel decides to deliver an incoming connection to
> the fd the first instance is using, and I doubt that is particularly
> useful information. If SO_REUSEPORT is not used, then I'd expect PID 1
> to hand the listening socket to all instances. If it then kept watching
> it, it would get even worse information: it will in the worst case
> wake up with every incoming connection, and in the best case miss a
> number of them, and again without any chance to determine how many
> incoming connections there are...
> 
> The only thing we could do is to parse /proc/net/tcp and count how many
> connections are active bound to the same local address/port. But yikes,
> that'd be ugly and inefficient.
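
For the record, that counting is doable -- a quick untested sketch that
counts ESTABLISHED IPv4 entries for a given local port by parsing
/proc/net/tcp (ugly indeed, and IPv6 would need /proc/net/tcp6 on top):

  #include <stdio.h>

  /* Each line of /proc/net/tcp looks like
   *   sl local_address rem_address st ...
   * with addresses as hex ip:port and st as a hex TCP state
   * (0x01 == TCP_ESTABLISHED). */
  static int count_established(unsigned port) {
          FILE *f = fopen("/proc/net/tcp", "re");
          char line[512];
          int n = 0;

          if (!f)
                  return -1;
          fgets(line, sizeof(line), f);   /* skip the header line */
          while (fgets(line, sizeof(line), f)) {
                  unsigned ip, lport, state;

                  if (sscanf(line, " %*d: %x:%x %*x:%*x %x",
                             &ip, &lport, &state) == 3 &&
                      lport == port && state == 0x01)
                          n++;
          }
          fclose(f);
          return n;
  }

  int main(void) {
          printf("%d\n", count_established(22));  /* e.g. sshd */
          return 0;
  }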
>
> Maybe one day the kernel adds SO_GETCONCURRENT or so, which would tell us
> how many connection sockets are bound to the same local address/port as
> the socket we'd call it on. Only then could we do such load
> management...

Zbyszek

