[systemd-devel] [ANNOUNCE] Journal File Format Documentation

Tue Oct 23 09:33:58 PDT 2012

On Tue, 23.10.12 18:48, Ciprian Dorin Craciun (ciprian.craciun at gmail.com) wrote:

> > - We needed something with in-line compression, and where we can add
> >   stuff like FSS to
> 
>     Ok. I agree that there are very few libraries that fit here. All I
> can think of making into here would be BerkeleyDB (but it fails other
> requirements you've listed below).

I used BDB in one other project and it makes me shudder. The constant
ABI breakages and disk corruptions were awful. Heck, yum as one user of
it breaks every second Fedora release with a BDB error where the usual
recipe is to remove the yum BDB database so that it is regenerated on
next invocation. I am pretty much through with BDB.

> > - The database should be typeless, and index by all fields, rather than
> >   require fixed schemas. It should be efficient with large and binary data.
> 
>     One thing bothers me: why should it index all fields? (For example
> indexing by UID, executable, service, etc. makes sense, but I don't
> think indexing by message is that worthwhile... Moreover by PID or
> coredump (which I think it is hinted is stored in the journal) doesn't
> make too much sense either...)

Sure, not all fields make sense, but many many do, and we don't really
know in advance which ones will, as the vocabulary can be extended by
anybody. For example, if Apache decided to do structured logging for
errors indexed by vserver, then I'd love that, but of course the vserver
match would be unknown to everybody else. And that's cool.

So, the idea here is really to just index everything, and make it cheap,
so that indexing something that never made sense to index costs next to
nothing.

In the journal file format, indexing field objects that are only
referenced once is practically free: instead of storing an offset to
the bsearch object we use for indexing, we just store the offset of the
referencing entry in-line.

> > - It should not require file locks or communication between multiple
> >   readers or between readers and the writer. This is primarily a
> >   question of security (we cannot allow users to lock out root or the
> >   writer from accessing the logs by taking a lock) and network
> >   transparency (file locks on network FS are very very flaky), but also
> >   performance.
> 
>     From what I see this is the best reason for the current proposal.
> Indeed no embedded database library (that I know of) allows both
> reading and writing at the same time from multiple processes without
> locking. (Except maybe DJB's CDB and the TinyCDB implementation, but
> that wouldn't fit the bill here.)
> 
>     Maybe this should go at the top of that document as describing
>     "why?".

I have now added a link to this very thread to the first section of the
document.

>     Although I partially agree about this increased flexibility,
> having a custom format means it is very easy to just start "adding
> features", thus accumulating cruft... Thus maybe a general purpose
> system would have limited this tendency...

Well, we really try hard to stay focussed, and want the journal to do
the things it does well, but not become a super-flexible store-anything
database. Of course, we want to pave the way for future extensions, and
that's why we built extensibility into the format (and there's a section
about that in the spec).

>     BTW, a little bit off-topic:
>     * why didn't you try to implement this journal system as a
> standalone library, that could have been reused by other systems
> independently of systemd; (I know this was answered in [2], and that
> the focus is on systemd, but it seems it took quite a lot of work, and
> it's a pity it can't be individually reused);

Well, for starters we want to be careful where we guarantee interface
stability and where not. For example, the C API gives you a very
high-level view on the journals, where interleaving of multiple files is
hidden. However, if we'd split this out then we'd have to expose much
more guts of the implementation, and provide stable API for that, and
that's something I don't want. We want the ability to change the
internals of the implementation around and guarantee stability only at
the most high level C API that hides it all.

The other thing is simply that the stuff is really integrated with each
other. The journal sources are small because we reuse a lot of internal
C APIs of systemd, and the format exposes a lot of things that are
specific to systemd, for example the vocabulary of well-known fields is
closely bound to systemd.

Also, we believe in systemd, in the journal, and in tight
integration between the two, as we consider logging an essential facet
of service management. It would be against our goals here to separate
them out and turn them back into non-integrated components.

>     * how do you intend to implement something resembling syslog's log
> centralization?

The network model existing since day one is one where we rely on
existing file sharing infrastructure to transfer/collect files. I.e. use
NFS, SMB, FTP, WebDAV, SCP, rsync, whatever suits you, and make them
available at one spot, and "journalctl -m" will interleave them as necessary.
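
As a rough sketch of what interleaving means here, the Python snippet
below merges two already time-ordered entry lists into one stream. The
entry layout (a dict with a "ts" key) and the function name interleave
are assumptions made for illustration; the real implementation works on
journal files and may use more ordering information than a single
timestamp.

```python
import heapq

def interleave(*journals):
    """Merge time-ordered entry lists into one time-ordered list."""
    # heapq.merge assumes each input is already sorted by the key.
    return list(heapq.merge(*journals, key=lambda entry: entry["ts"]))

# Hypothetical entries collected from two hosts.
host_a = [
    {"ts": 1, "host": "a", "msg": "boot"},
    {"ts": 4, "host": "a", "msg": "login"},
]
host_b = [
    {"ts": 2, "host": "b", "msg": "boot"},
    {"ts": 3, "host": "b", "msg": "cron"},
]

merged = interleave(host_a, host_b)
for entry in merged:
    print(entry["ts"], entry["host"], entry["msg"])
```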

I am currently working on getting log syncing via both a PUSH and PULL
model done. This will be based on existing protocols and standards as
much as we can (SSH or HTTP/HTTPS as transport, and JSON and more as
payload), and is flexible for others to hook into. For example, I think
it would be cool if Graylog2 and similar software would just pull the
data out of the journal on its own, simply via HTTP/JSON. We make
everything available to make this smooth, i.e. we provide clients with
stable cursors which they can use to restart operation.
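
To make the cursor idea concrete, here is a small self-contained Python
simulation of a PULL client that resumes from a saved cursor. Everything
in it is hypothetical: FakeJournalServer, its pull() method, the "cursor"
key and the batch size stand in for whatever the final protocol looks
like, and no real network or journal API is involved.

```python
import json

class FakeJournalServer:
    """Stand-in for a log source that serves JSON batches (hypothetical)."""

    def __init__(self, messages):
        self._entries = [
            {"cursor": "c%d" % i, "MESSAGE": m} for i, m in enumerate(messages)
        ]

    def pull(self, after_cursor=None, limit=2):
        # Return up to `limit` entries that follow `after_cursor`.
        start = 0
        if after_cursor is not None:
            cursors = [e["cursor"] for e in self._entries]
            start = cursors.index(after_cursor) + 1
        return json.dumps(self._entries[start:start + limit])


def sync(server, state):
    """Pull batches until caught up; state["cursor"] records progress."""
    received = []
    while True:
        batch = json.loads(server.pull(after_cursor=state.get("cursor")))
        if not batch:
            return received
        for entry in batch:
            received.append(entry["MESSAGE"])
            # Persisting this value is what lets a client restart later.
            state["cursor"] = entry["cursor"]


server = FakeJournalServer(["boot", "login", "cron", "shutdown"])

state = {}                          # fresh client, no cursor yet
first_run = sync(server, state)

resumed = {"cursor": "c1"}          # client restarted after seeing "login"
second_run = sync(server, resumed)
```

The point of the sketch is only that the client, not the server, holds
the position, so a consumer can stop and restart without losing or
duplicating entries.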

Lennart

-- 
Lennart Poettering - Red Hat, Inc.