[systemd-devel] [PATCH] readahead: read /usr files last for rotational media, skip /var
Lennart Poettering
lennart at poettering.net
Mon Oct 10 16:33:47 PDT 2011
On Fri, 30.09.11 11:54, Paolo Bonzini (bonzini at gnu.org) wrote:
> When enabling readahead on my system (which has a 5400rpm hard drive)
> "systemd-analyze blame" output is like this:
>
> 19507ms udev.service
> 18336ms fedora-storage-init.service
> 13254ms var-lock.mount
> 12960ms var-run.mount
> 12871ms media.mount
>
> This matches visual feedback from systemd's boot log (the
> "Starting..." and "Started..." messages on the console appeared quite
> slowly). "systemd-analyze plot" shows that the serialization point
> is udev.
>
> Basically, readahead-replay is starving udev and everything else running
> early in the boot process. udev cannot simply load the modules and
> programs it needs; rather, it has to wait for readahead to fetch them.
> The pack file shows things such as libX11, libgio and libglib very close
> to the beginning of the file, while kernel modules are more towards
> the end. The problem is that updated kernel modules are often installed
> months after the root partition was formatted, while large files might
> be installed at the beginning of the drive and stay there forever.
>
> The attached file adds a simple heuristic to readahead-collect: break the
> files into two groups, reading first the files that are not in /usr, and
> then those that are in /usr. This is far from perfect, as it may delay
> some files and still load others too early. It may delay some files
> because systemd will read from /usr/lib/binfmt.d early at startup (this
> sounds clearly wrong, since /usr may not even be mounted at that point!).
> Similarly, it will not delay loading GLib (which has to be in /lib because
> some programs in /sbin use it) even though in my case it is not needed.
>
> Still, it was enough to save 5 more seconds, bringing the total to 20.
> "systemd-analyze blame" was also more satisfying:
>
> 8154ms fedora-storage-init.service
> 7067ms udev.service
> 6064ms var-lock.mount
> 6057ms var-run.mount
> 6043ms media.mount
>
> A better heuristic, perhaps involving some kind of topological sort,
> would likely double the size of readahead-collect, so I went for the
> low-hanging fruit.
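For readers following along, here is a minimal sketch of the two-group
heuristic described in the quoted message. It is not the attached patch
(which is a change to readahead-collect, written in C); the function name
split_usr_last and its parameters are hypothetical. The subject line also
mentions skipping /var, but the message does not say how, so the
skip_prefixes parameter is only a guess at that behaviour.

```python
from typing import Iterable, List, Tuple


def split_usr_last(
    paths: Iterable[str],
    late_prefix: str = "/usr/",
    skip_prefixes: Tuple[str, ...] = (),
) -> List[str]:
    """Stable partition of a pack list (hypothetical helper).

    Paths outside late_prefix come first, paths under it come last.
    The relative order inside each group is preserved. Paths under any
    of skip_prefixes are dropped (an assumption based on the subject line).
    """
    early: List[str] = []
    late: List[str] = []
    for path in paths:
        if any(path.startswith(skip) for skip in skip_prefixes):
            continue  # dropped from the pack entirely
        if path.startswith(late_prefix):
            late.append(path)
        else:
            early.append(path)
    return early + late


if __name__ == "__main__":
    # Illustrative paths only, not taken from a real pack file.
    pack = [
        "/usr/lib/libX11.so.6",
        "/lib/modules/example.ko",
        "/var/cache/example",
        "/sbin/udevd",
    ]
    for path in split_usr_last(pack, skip_prefixes=("/var/",)):
        print(path)
```

Because the partition is stable, whatever order the collector already chose
(on-disk order or order of use) is kept within each of the two groups.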
Hmpf. I can't say I am a particular fan of changes with hardcoded rules
like this. readahead currently strictly loads files in the order they are
stored on disk (determined with FS_IOC_FIEMAP, only on rotating media), or
in the order they are used (on SSD).
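The ordering rule can be sketched roughly as follows. This is an
illustration only, not systemd's code: the Entry record and its fields are
hypothetical, and disk_offset stands in for whatever physical location a
FIEMAP query would report for a file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    path: str
    first_use: int    # rank in which the file was first touched during boot
    disk_offset: int  # physical start of the file's data (assumed from FIEMAP)


def replay_order(entries: List[Entry], rotational: bool) -> List[Entry]:
    """Order entries for replay.

    Rotating media: sort by on-disk position to keep head movement low.
    SSD: seeks are cheap, so replay in the order the files were used.
    """
    if rotational:
        return sorted(entries, key=lambda e: e.disk_offset)
    return sorted(entries, key=lambda e: e.first_use)


if __name__ == "__main__":
    # Made-up numbers: a late-used library sitting at the start of the disk.
    entries = [
        Entry("/lib/modules/example.ko", first_use=0, disk_offset=900_000),
        Entry("/usr/lib/libX11.so.6", first_use=1, disk_offset=1_000),
    ]
    for entry in replay_order(entries, rotational=True):
        print(entry.path)
```

With the made-up numbers above, the rotational ordering reads the library
before the kernel module even though the module is needed first, which is
the situation the original report describes.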
Normally this should really do the right thing for you unless a stream
of late-used stuff for some reason ends up at the beginning of the disk.
It would be interesting to figure out for your specific file system how
the files are laid out on disk there.
Note that systemd's readahead implementation is far from ideal. Other
implementations go much further. For example Ubuntu's ureadahead includes
an ext3 parser and not only looks at the location of files on disk but
also at that of directories. This gives them a strategic advantage, but I
am strictly against adding any knowledge of low-level file systems into our
own systemd implementation, simply because I want to be able to maintain
the code. (That said, I think ext4 actually supports FIEMAP on
directories nowadays too, so I'd be happy to merge a patch for that,
which should fix this problem.) Also systemd at boot opens quite a few
of its small unit files before starting the readahead logic. It might
make sense to spawn readahead earlier to cover that as well, but then it
would become a special process and it would probably be better not to
have too many of those, especially given that the whole concept of
readahead is primarily something to deal with hardware that is more of
yesteryear than of the future (i.e. rotating media).
I guess what I am trying to say here: there's a lot of stuff to optimize
here, and before we add arbitrary rules like the one you suggest I'd
very much prefer to see other optimizations done, and most importantly
figure out why exactly the simple rule "follow order on disk" doesn't
work for you.
(Of course, in an ideal world we'd probably not have any readahead-replay
process, but simply reorder things on disk according to what we
measured, which we actually do for btrfs.)
Lennart
--
Lennart Poettering - Red Hat, Inc.