[systemd-devel] [RFC/PATCH] journal over the network

Mon Nov 19 18:35:30 PST 2012

On Tue, Nov 20, 2012 at 02:21:54AM +0100, Lennart Poettering wrote:
> On Mon, 19.11.12 01:21, Zbigniew Jędrzejewski-Szmek (zbyszek at in.waw.pl) wrote:
> 
> Heya,
> 
> I like your work!
Thanks :)

> 
> > The program (called systemd-journal-remoted now, but I'd be happy to
> > hear suggestions for a better name) listens on sockets (either from
> 
> Since this is also useful when run on the command line I'd really prefer
> to drop the "d" suffix, i.e. "systemd-journal-remote" sounds like a good
> name for it.
OK.

> > socket activation, or specified on the command line with --listen=),
> > or reads stdin (if given --stdin), or uses curl to receive events from
> > a systemd-journal-gatewayd instance (with --url=). So it can be used
> > as a server, or as a standalone binary.
> 
> What precisely does --listen= speak?
It just reads a pure 'export' stream.
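
For reference, a minimal sketch of walking such a stream, assuming the
documented export framing: text fields are "NAME=value\n", binary fields
are "NAME\n" followed by a 64-bit little-endian size, the data, and "\n",
and every entry ends with an empty line. count_export_entries() is a
hypothetical helper for illustration, not code from the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count complete entries in an export-format buffer.
 * Returns the number of entries, or -1 if the input is truncated or malformed. */
static int count_export_entries(const unsigned char *buf, size_t len) {
        size_t pos = 0;
        int entries = 0, fields = 0;

        while (pos < len) {
                const unsigned char *nl = memchr(buf + pos, '\n', len - pos);
                size_t line_len;

                if (!nl)
                        return -1;              /* line without a newline */
                line_len = (size_t)(nl - (buf + pos));

                if (line_len == 0) {            /* empty line ends an entry */
                        if (fields > 0)
                                entries++;
                        fields = 0;
                        pos++;
                        continue;
                }

                if (memchr(buf + pos, '=', line_len)) {
                        /* text field: NAME=value\n */
                        pos += line_len + 1;
                } else {
                        /* binary field: NAME\n, 64-bit LE size, data, \n */
                        uint64_t size = 0;
                        int i;

                        pos += line_len + 1;
                        if (len - pos < 8)
                                return -1;
                        for (i = 7; i >= 0; i--)
                                size = (size << 8) | buf[pos + i];
                        pos += 8;
                        if (size >= len - pos)  /* need data plus trailing \n */
                                return -1;
                        if (buf[pos + size] != '\n')
                                return -1;
                        pos += size + 1;
                }
                fields++;
        }

        /* an entry that is not followed by an empty line counts as truncated */
        return fields == 0 ? entries : -1;
}
```

A real receiver would parse incrementally from the socket instead of
requiring the whole stream in one buffer; this only shows the framing.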

> My intention was to speak only HTTP for all of this, so that we can
> nicely work through firewalls.
Yeah, that's probably more useful than a raw stream for normal purposes,
since it allows for authentication and whatnot.

> > Messages must be in the export format. They are parsed and stored
> > into a journal file. The journal file is /var/log/journal/external-*.journal
> > by default, but this can be overridden by commandline options
> > (--output).
> 
> Sounds good!
> 
> I think it would make sense to drop things into
> /var/log/journal/<hostname>/*.journal by default. The hostname would
> have to be determined from the URL the user specified on the command
> line. Ideally we'd use the machine ID here, but since the machine ID is
> hardly something the user should specify on the command line (and we
> cannot just take the machine ID supplied from the other side, because we
> probably should not trust that and hence allow it to tell us to
> overwrite another host's data), the hostname is the next best
> thing. Currently libsystemd-journal will ignore directories that are
> not machine IDs when browsing, but we could easily drop that limitation.
So it seems that this mapping (url/source/whatever -> .journal path)
will require some thought.

I'd imagine that people will want to use this most often as a syslogd
replacement, i.e. launch systemd-journal-remote on a central host, and
then let all other hosts stream messages live. In this case we know
only two things: _MACHINE_ID specified remotely, and the remote
IP:PORT and thus hostname. Actually, I thought that since all those
things are "unreliable" (IP only to some extent, but still), they
wouldn't be used to determine the output file, and all output would go
into one .journal.

I remember that samba does (did?) something like what you suggest, and
kept separate logs based on the information under control of the
connecting host. On a host connected to the internet this would lead
to hundreds of log files.

In addition, .journal files have a fairly big overhead: ~180kB for an
"empty" file. This overhead might be unwanted if there are many
sources.

Maybe there's no one answer, and choices will have to be provided.

> > Push mode is not implemented... (but it would be a separate program
> > anyway).
> 
> My intention was actually to keep this in the same tool. So that we'd
> have for input and output:
> 
> A) HTTP GET
> B) HTTP POST
> C) SSH PULL (would invoke "journalctl -o export" via ssh)
> D) SSH PUSH (would invoke systemd-journal-remote via ssh)
> E) A directory for direct read access (which would allow us to merge multiple files into one with this tool)
> F) A directory for direct write access (which is of course the default)
Also useful:
B1) socket listen() without HTTP
B2) HTTPS POST (I'm assuming that POST means to listen)
E1) a specific file for read access
F1) a specific file for write access

B1, F, F1 are implemented; A is implemented but ugly (curl).
E and E1 would require pulling in journalctl functionality.

> We should always require that either E or F is used, but in any
> combination with any of the others.
I think it is useful to allow the output directory to be implicit
(e.g. /var/log/journal/<hostname>/remote.journal can be used).

> > Examples:
> >   journalctl -o export | systemd-journal-remoted --stdin -o /tmp/dir/
> 
> Sounds pretty cool. Pretty close to what I'd have in mind.
> 
> To make this even shorter I'd suggest though that we take two normal
> args for source and dest, and that "-" is used as stdin/stdout
> respectively, and the dest can be omitted:

It started this way during development, but I'm not sure it will
always be clear what is meant:
B, B1, and B2 can also come from socket activation, thus not appearing on
the command line, but output might still be specified.
OTOH, there might be multiple sources, and the implicit output dir.
So I think that explicit --output/-o is better.
Sources as positional arguments might work, as long as they can
be distinguished.
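
To make the "can be distinguished" part concrete, here is a rough sketch
of how positional sources might be classified. The SourceType enum and
classify_source() are hypothetical names used only for illustration, and
the rules ("-" is stdin, an http:// or https:// prefix is HTTP, user@host
without a slash is SSH, everything else is a path) are assumptions, not
what the patch does:

```c
#include <string.h>

/* Hypothetical source kinds, for illustration only. */
typedef enum {
        SOURCE_STDIN,
        SOURCE_HTTP,
        SOURCE_SSH,
        SOURCE_PATH
} SourceType;

/* Guess the transport from the shape of a positional argument. */
static SourceType classify_source(const char *arg) {
        if (strcmp(arg, "-") == 0)
                return SOURCE_STDIN;

        if (strncmp(arg, "http://", 7) == 0 || strncmp(arg, "https://", 8) == 0)
                return SOURCE_HTTP;

        /* user@host with no slash: assume an SSH source */
        if (strchr(arg, '@') && !strchr(arg, '/'))
                return SOURCE_SSH;

        /* everything else is treated as a file or directory */
        return SOURCE_PATH;
}
```

The weak spot is visible right away: a relative file called "a@b" would
be taken for an SSH source, which is the kind of ambiguity I'm worried
about.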

> Hence:
>         journalctl -o export | systemd-journal-remote - /tmp/dir
> Or:
>         systemd-journal-remote http://some.host:19531/entries?boot
> Or:
>         systemd-journal-remote http://some.host:19531/entries?boot /tmp/dir
> Or:
>         systemd-journal-remote /var/log/journal /tmp/dir
> 
> And so on...
>
> >   remote-127.0.0.1~2000.journal
> >   remote-multiple.journal
> >   remote-stdin.journal
> >   remote-http~~~some~host~19531~entries.journal
> > 
> > The goal was to have names containing the port number, so that it is
> > possible to run multiple instances without conflict.
> 
> I'd always try to separate the "base name" out of a host spec. I.e. the
> actual hostname of it. So that people can swap protocols as they
> wish.
> 
> For example, I'd envision that people often begin with just pulling
> things via SSH, but later on end up using HTTP more frequently, and
> hence this should write to the same dir in /var/log/journal by default:
> 
> systemd-journal-remote lennart@somehost
> systemd-journal-remote http://somehost:19531/entries?boot
In pull mode, the hostname is under our control, so this is easy
and safe.
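
A minimal sketch of that base-name extraction, assuming sources look like
user@host or http://host:port/path and that the default output is
/var/log/journal/<hostname>/remote.journal. source_to_journal_path() is a
hypothetical helper, and IPv6 literals such as [::1] are not handled:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a pull source to a default journal path.
 * Returns 0 on success, -1 if no usable hostname is found or the
 * output buffer is too small. */
static int source_to_journal_path(const char *spec, char *out, size_t n) {
        const char *host = spec, *p;
        size_t host_len;
        int r;

        /* skip a "scheme://" prefix if there is one */
        p = strstr(spec, "://");
        if (p)
                host = p + 3;

        /* the authority part ends at the path or query */
        host_len = strcspn(host, "/?");

        /* drop "user@" */
        p = memchr(host, '@', host_len);
        if (p) {
                host_len -= (size_t)(p + 1 - host);
                host = p + 1;
        }

        /* drop ":port" */
        p = memchr(host, ':', host_len);
        if (p)
                host_len = (size_t)(p - host);

        /* refuse empty names and anything starting with a dot ("..", ".") */
        if (host_len == 0 || host[0] == '.')
                return -1;

        r = snprintf(out, n, "/var/log/journal/%.*s/remote.journal",
                     (int)host_len, host);
        if (r < 0 || (size_t)r >= n)
                return -1;
        return 0;
}
```

With this, both "lennart@somehost" and "http://somehost:19531/entries?boot"
end up in /var/log/journal/somehost/remote.journal, so swapping protocols
keeps the same output file.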

> Hmm, also, thinking about it I think we should only use the "base" URL
> for the HTTP transport, and let the "/entries?boot" stuff be an
> implementation detail we implicitly append.
Agreed.
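
Something along these lines should do, as a sketch only:
build_entries_url() is a hypothetical helper name, and the suffix is the
"/entries?boot" from the examples above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append the gatewayd entries path to a base URL.
 * Returns 0 on success, -1 if the result does not fit into out. */
static int build_entries_url(const char *base, char *out, size_t n) {
        size_t len = strlen(base);
        int r;

        /* avoid a double slash if the user wrote "http://host:19531/" */
        while (len > 0 && base[len - 1] == '/')
                len--;

        r = snprintf(out, n, "%.*s/entries?boot", (int)len, base);
        if (r < 0 || (size_t)r >= n)
                return -1;
        return 0;
}
```

The user would then only ever type http://some.host:19531.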

> > static int spawn_curl(char* url) {
> >         int r;
> >         char argv0[] = "curl";
> >         char argv1[] = "-HAccept: application/vnd.fdo.journal";
> >         char argv2[] = "--silent";
> >         char argv3[] = "--show-error";
> >         char* argv[] = {argv0, argv1, argv2, argv3, url, NULL};
> > 
> >         r = spawn_child("curl", argv);
> >         if (r < 0)
> >                 log_error("Failed to spawn curl: %m");
> >         return r;
> > }
> 
> My intention here was to use libneon, which is quite OK as HTTP client
> library, and includes proxy support, and TLS and whatnot.
> 
> I am a bit conservative about pulling curl into this low level tool
> (after all it includes a full gopher client!). I also want to be very
> careful to only support HTTP, SSH and "file" as transports, and not any
> random FTP or whatnot people might want to throw at this.
> 
> Otherwise looks pretty OK! Good work!
Happy to hear that. I want to work out the general principles of the
interface and bring it into mergeable shape to get some testing.
I'd rather leave the simple curl pull implementation for now, since
the change to libneon should not be visible to the users.

I guess that writing a man-page is in order...

Zbyszek