[systemd-devel] Starting units when a port is available for connections

Lennart Poettering lennart at poettering.net
Thu Jun 18 11:18:56 PDT 2015


On Wed, 27.05.15 19:09, Adam Zegelin (adam at instaclustr.com) wrote:

Heya,

> I’ve successfully managed to set the service type to “notify” and
> modify C* to call sd_notify() when it is ready to accept client
> connections.  Further experimentation reveals that this is not an
> ideal solution. C* can take a long time (minutes to _hours_) to
> reach the point where it will accept client connections/queries. The
> default startup timeout is 90s, which causes the service to be
> marked failed if exceeded, hence C*, with its long startup times,
> will often never get the chance to transition to “active”.

You could increase the timeout with TimeoutStartSec= for your service.
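
For example (untested sketch; the drop-in path and the value are just
placeholders), a drop-in could raise the limit for cassandra.service:

	# /etc/systemd/system/cassandra.service.d/timeout.conf (example path)
	[Service]
	# Allow up to 4h for start-up; pick whatever bound fits your cluster,
	# or use 0 to disable the start-up timeout entirely.
	TimeoutStartSec=4h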

> Part of the issue for me is trying to define what “active”
> means. The man page, for “Type=forking” services, says: "The parent
> process is expected to exit when start-up is complete and all
> communication channels are set up”. I’m assuming for “notify”
> services, sd_notify() should be called when "start-up is complete
> and all communication channels are set up”. Even if this takes
> hours?

Yes, that should be fine.
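
For reference, a minimal sketch of such a unit (values and the
ExecStart= path are illustrative, not a recommendation):

	[Service]
	Type=notify
	# Only accept READY=1 from the main process.
	NotifyAccess=main
	# C* start-up can take hours; 0 disables the start-up timeout.
	TimeoutStartSec=0
	# Placeholder path; run C* in the foreground so systemd tracks it.
	ExecStart=/usr/sbin/cassandra -f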

> Cassandra exposes a number of inet ports of interest:

> - Client connection ports for running queries via Cassandra Query
>   Language (CQL)/Thrift (RPC) — this is what most clients use to
>   query the database (i.e., to run `SELECT * FROM …` style queries)

> - JMX (Java Management Extensions) for performing management
>   operations — the C* and 3rd-party management tools use this to
>   call management functions and to collect statistics/metrics about
>   the JVM and C*.
> 
> The JMX socket is available a few seconds after the process is running.
> 
> The CQL/Thrift ports can take far longer to become available —
> sometimes hours after the process starts. Cassandra only starts
> listening on these ports once it has joined the cluster of nodes &
> has synchronised its state. State synchronisation may require
> bootstrapping & copying large amounts of data across the network and
> hence take a long time to complete.
> 
> Currently my dependent C* client units simply spin-wait, attempting
> to establish a connection to C*. This seems like duplicated effort
> and makes these services more complex than they need to be.
> 
> My original thought was to just disable the startup timeout on the
> C*, but that means the unit will stay “activating” for a long
> time. Also means that JMX clients, which can establish connections
> almost immediately, would have their startup deferred unnecessarily.

From systemd's PoV there's nothing wrong with having a service that
stays in "starting" for a long time. It's not typical, but as long as
you bump the timeout for the service it's perfectly fine.
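
If you do that, a client unit that orders itself after
cassandra.service won't be started before C* has sent READY=1, e.g.
(sketch; the client unit and its ExecStart= path are hypothetical):

	# hypothetical my-cql-client.service
	[Unit]
	Requires=cassandra.service
	After=cassandra.service

	[Service]
	ExecStart=/opt/bin/my-cql-client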

> Ideally I’d like to be able to write units that can depend on
> individual ports being available from a process — i.e, when the CQL
> port is available, start the client unit(s) and when JMX is
> available, start a monitoring service. Is this possible with
> systemd?

Not really. A service has only a single state machine: it knows one
activating state and one active state, and that is not configurable by
services.

> Alternatively, I was thinking that I could write some kind of simple
> process/script that attempts a connection, and exits with failure if
> the connection cannot be established, or success if it can. I’d then
> write a unit file, e.g. `cassandra-cql-port.service`:
>
> 	[Unit]
> 	# not really sure what combo of Wants/Requires/Requisite/BindsTo/PartOf/Before/After is needed
> 	Requisite=cassandra.service
> 
> 	[Service]
> 	Type=oneshot
> 	RemainAfterExit=true
> 	ExecStart=/opt/bin/watch-port 9042
> 	Restart=on-failure
> 	RestartSec=1min
> 	StartLimitInterval=0
> 
> My client units could then want/require this unit. Is this a valid
> approach?

Yes, that can work. In systemd we ship a tool,
"systemd-networkd-wait-online", that works like that: it simply waits
for some condition (specifically, whether the network is up) to become
true. Services can then order themselves after it if they need to.
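
Translated to your case that could look roughly like this (untested
sketch, reusing your watch-port script as the thing that blocks until
the port answers):

	# hypothetical cassandra-cql-wait.service
	[Unit]
	Requires=cassandra.service
	After=cassandra.service

	[Service]
	Type=oneshot
	RemainAfterExit=yes
	# watch-port should poll until 9042 accepts connections, then exit 0.
	ExecStart=/opt/bin/watch-port 9042
	# Waiting can take hours, so disable the start-up timeout.
	TimeoutStartSec=0

Clients that need CQL would then add Requires= and After= on
cassandra-cql-wait.service, analogous to how services order themselves
after network-online.target. Note that Restart= doesn't combine with
Type=oneshot, so it's easier to let the script itself do the polling
instead of relying on unit restarts.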

> Or am I walking down the wrong path to use systemd to manage this?

Beyond a pattern like that, systemd does not offer a mechanism that
nicely covers what you are trying to do.

Generally, given that this is a networked setup, I'd always recommend
writing the clients so that they can deal with servers being
temporarily unavailable, and then starting everything right away,
instead of trying to map this onto systemd's dependency logic. That
way things are most robust, and it doesn't matter how things are
distributed across multiple systems.

Lennart

-- 
Lennart Poettering, Red Hat

