[systemd-devel] Fragile journal interleaving

Lennart Poettering lennart at poettering.net
Tue Dec 12 16:09:10 UTC 2017


On Mo, 11.12.17 00:36, Uoti Urpala (uoti.urpala at pp1.inet.fi) wrote:

> The code that tries to determine which of two journal entries is
> "earlier" (similar logic in journal_file_compare_locations() and
> compare_with_location()) does this by trying different progressively
> less reliable ways to compare - first by seqnum if seqnum id matches,
> then by monotonic time if boot id matches, then by realtime timestamp.
> Unfortunately, while this method could be said to give the "right"
> answer more often than just comparing by the always-available realtime
> timestamp, it does not give a valid ordering relation. As in, unless
> you can already rely on all those comparisons giving the same
> consistent order (when you could probably just use realtime anyway),
> it's possible that you get a relation like A<B<C<A.
> 
> For example, these would have such a relation (assume that the omitted
> monotonic values never determine order):
> A: seqnum_id=0, seqnum=0, time=2
> B: seqnum_id=0, seqnum=1, time=0
> C: seqnum_id=1, seqnum=0, time=1
> 
> A<B by seqnum, B<C by time, C<A by time.
> 
> Such ordering means that the code interleaving entries from different
> files can give rather arbitrary results depending on exactly which
> order the entries are tested in. It triggers at least one issue that I
> consider a clear bug: there's code in next_beyond_location() which
> skips the next entry in a file if it's not in the expected direction
> from the previous globally iterated entry, and this can discard valid
> entries. A comment there says it's meant to discard duplicate entries
> which were somehow recorded in multiple journal files (which I'd assume
> to compare equal), but it also discards non-duplicate entries which
> compare backwards from the previously shown one.

Note that two entries will only compare as fully identical if their
"xor_hash" is equal too. The xor_hash is the XOR combination of the
hashes of all of the entry's fields. That means realistically only
records that actually are identical should be considered as such.
 
> The reason I originally looked into this code was a set of old journal
> files for which plain "journalctl" only showed 5 days of entries
> despite the files containing months of data. This was triggered by a
> user journal file which matched seqnum ids with some system journal
> files, but had smaller seqnums from a much later time (the earliest
> logs were old enough that the file could have been initialized by
> journald before commit cbd671772c, so the exact same way to trigger
> issues probably won't occur with files created by later versions). The
> code searching for earlier entries first found the early system
> journal, then moved to the user journal because it had smaller seqnums,
> and finally moved to some other file that started 5 days before the end
> (earlier than the user journal and with a different seqnum id - there
> didn't happen to be other earlier files later in the iteration order).
> After printing one entry from there, the next_beyond_location()
> "duplicate" code then discarded all the earlier valid entries.
> 
> I'm not sure about the best way to fix these issues - are there
> supposed to be any guarantees about interleaving? At the very least the
> "duplicate" check should be fixed to not throw away arbitrary amounts
> of valid entries. Any other code relying on assumptions about valid
> ordering? Is the interleaving order supposed to be stable independent
> of things like in which order the files are iterated over?

So, the current "duplicate" check is not just that, actually: it's
also a check that guarantees that we progress monotonically and never
go backwards in time, even when journal files are added or removed
while we iterate through them.

I am not entirely sure what we can do here. Maybe this can work: beef
up the comparison logic so that it returns not just
smaller/equal/larger but also a special value "ambiguous". And when
that is returned we don't enforce monotonicity strictly but instead go
record-by-record, if you follow what I mean?

Lennart

-- 
Lennart Poettering, Red Hat
