[systemd-devel] [PATCH] journal: Fix sd_journal_enumerate_unique skipping values

Zbigniew Jędrzejewski-Szmek zbyszek at in.waw.pl
Thu Oct 9 20:46:20 PDT 2014


On Wed, Oct 08, 2014 at 08:58:53PM +0200, Jan Janssen wrote:
> On 2014-10-08 14:59, Zbigniew Jędrzejewski-Szmek wrote:
> >On Wed, Oct 08, 2014 at 08:24:49AM +0200, Jan Janssen wrote:
> >>
> >>
> >>>Gesendet: Mittwoch, 08. Oktober 2014 um 01:40 Uhr
> >>>Von: "Zbigniew Jędrzejewski-Szmek" <zbyszek at in.waw.pl>
> >>>An: "Jan Janssen" <medhefgo at web.de>
> >>>Cc: systemd-devel at lists.freedesktop.org
> >>>Betreff: Re: [systemd-devel] [PATCH] journal: Fix sd_journal_enumerate_unique skipping values
> >>>
> >>>On Mon, Oct 06, 2014 at 06:57:38PM +0200, Zbigniew Jędrzejewski-Szmek wrote:
> >>>>On Mon, Oct 06, 2014 at 06:36:34PM +0200, Jan Janssen wrote:
> >>>>>*bump*
> >>>>Sorry, I'll look into this.
> >>>
> >>>Doesn't work. Both without or with your other patch
> >>>sd_journal_enumerate_unique I get bogus results on my test case. It
> >>>seems the issue is more complicated.
> >>>
> >>
> >>That's odd. Care to elaborate what bogus results means? Are you even affected by the
> >>bug in question without the patch?
> >
> >Yes, I have a VM where I get a smaller number from -F _BOOT_ID than from --list-boots
> >(w/o your patches), and then the same smaller number with one or two of your patches.
> >So results become consistent, but equally bad.
> >
> >Of course I can't know if this is exactly the same bug, but it certainly looks
> >like it.
> >
> 
> Sounds like maybe one of those calls end up interleaving journals
> from different machines?
> 
> Also, does removing the call to journal_file_object_release() in
> sd_journal_enumerate_unique() improve things or not? How about
> moving it after the if(found) where it was before the patch?
> 
> I'd love to investigate this, but I sadly don't have any journals
> that triggers this :(
Hi,

I now pushed your patch to add release_cookie. I found that the
problem I saw was with sd_j_enumerate_unique getting confused by
removal of corrupted files. See the commit message for longer
explanation. You were right that there's a problem with interleaving
sd_j_e_u and sd_j_next in list_boots(), because this triggered the
issue. I decided to fix this and still allow interleaving.

To reproduce my issue, it is enough to have a file which is corrupted
and this corruption gets detected in sd_j_next(), but the file is in
good enough shape to be opened and pass sd_j_enumerate_unique().
If the file is not the first in hashmap, the problem should occur.

I was inclined to merge your other patch (to remove the interleaving,
especially that it makes the code shorter and simpler), but ultimately
decided against that. It makes the listing noticably slower (1.5 → 1.9 s
in my case). I imagine that on a slower disk this would be even more
visible, as we essentially iterate over the file list twice.

Please test, but note that 'journalctl -F_BOOT_ID' might return *more*
items than 'journalctl --list-boots' because the latter will remove
the files for which it detects corruption. So it is even possible
that simply looking at the journal with normal sequential access
(i.e. plain 'journalctl') can show partial data for boots that
--list-boots will not show. I'm not sure how significant this is
and what to do with that.

Zbyszek


More information about the systemd-devel mailing list