[systemd-devel] Improve boot-time of systemd

Lennart Poettering lennart at poettering.net
Tue Mar 29 08:13:29 PDT 2011


On Tue, 29.03.11 03:20, fykcee1 at gmail.com (fykcee1 at gmail.com) wrote:

> 2011/3/28 Lennart Poettering <lennart at poettering.net>:
> > On Sun, 20.03.11 05:28, fykcee1 at gmail.com (fykcee1 at gmail.com) wrote:
> >> Current readahead implementation has some problems:
> >> 1. It can't separate *real* block read requests from all read
> >> requests (which include extra blocks read by the kernel's
> >> readahead logic)
> >
> > Shouldn't make a big difference, since on replay we turn off additional
> > kernel-side readahead.
> >
> > However, it is true that the file will only ever increase, never
> > decrease in size.
> For collect, it can't filter out:
> 1. Kernel-side readahead, whether initiated by the kernel itself
> (when there is no /.readahead data) or by the replay process.

That is true. But is that really a problem? Usually kernel readahead
should be a useful optimization which shouldn't hurt much. And we will
only apply it once, during the original run. It will not be done again
on replay, since we disable kernel-side readahead explicitly then.
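
To illustrate what I mean by disabling it, a minimal sketch (function
name is mine, not the actual replay code): POSIX_FADV_RANDOM is the
documented way to turn off the kernel's readahead heuristic for a
single file descriptor.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Sketch: open a file with the kernel readahead heuristic
 * disabled, so that only the reads we explicitly issue happen. */
static int open_without_readahead(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return -1;

        /* offset/len of 0/0 covers the whole file; note that
         * posix_fadvise() returns the error number directly
         * instead of setting errno */
        int r = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
        if (r != 0)
                fprintf(stderr, "posix_fadvise: %s\n", strerror(r));

        return fd;
}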

> 2. Written blocks of files (opened as "r+", "w+", "a"). The written
> blocks already reside in memory at boot time.

Actually, now that I am looking into this, it might be possible to
distinguish read and write accesses to files, by using
FAN_CLOSE_NOWRITE/FAN_CLOSE_WRITE instead of FAN_OPEN. I do wonder
though why the API isn't symmetric here (close events distinguish
read-only from write access, open events don't)...
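
Something like this is what I have in mind (an untested sketch of the
fanotify calls, not readahead-collect itself; it needs CAP_SYS_ADMIN):

#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

/* Sketch: watch the root mount and distinguish files closed after
 * read-only access from files that were actually written to. */
int main(void) {
        int fan = fanotify_init(FAN_CLOEXEC, O_RDONLY);
        if (fan < 0) {
                perror("fanotify_init");
                return 1;
        }

        if (fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_MOUNT,
                          FAN_CLOSE_NOWRITE | FAN_CLOSE_WRITE,
                          AT_FDCWD, "/") < 0) {
                perror("fanotify_mark");
                return 1;
        }

        for (;;) {
                struct fanotify_event_metadata buf[64];
                ssize_t n = read(fan, buf, sizeof(buf));
                if (n <= 0)
                        break;

                struct fanotify_event_metadata *m = buf;
                while (FAN_EVENT_OK(m, n)) {
                        if (m->mask & FAN_CLOSE_WRITE)
                                printf("written: pid=%d\n", m->pid);
                        else if (m->mask & FAN_CLOSE_NOWRITE)
                                printf("only read: pid=%d\n", m->pid);

                        /* each event carries an open fd for the file */
                        if (m->fd >= 0)
                                close(m->fd);
                        m = FAN_EVENT_NEXT(m, n);
                }
        }
        return 0;
}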

> IMHO, the kernel lacks an API to report each *real* read request.
> e.g. it could be done by tracking each read syscall (mmap seems not
> easy to handle, though).

The kernel has quite a number of APIs, for example there is blktrace,
and there are the newer syscall tracing APIs. But fanotify is actually
the most useful of all of them.

> >> 2. It just gives advice on how to do the kernel's readahead, which
> >> causes the first read of a file to take more time.
> >
> > Hmm?
> posix_fadvise(...) may make each read do more readahead (more than
> the kernel's own heuristic would), and thus take more time. e.g.
> * Without replay, someone reads part A of file X --> does some work
> --> reads part B of file X.
> * With replay, parts A and B of file X are read in one go, so more
> I/O is in flight at once. Other services may spend more time waiting
> for I/O. (This can be observed in the bootchart diagram.)

The idea of readahead is to load as many IO requests into the kernel as
possible, so that the IO elevator can decide what to read when, and
reorder things as it sees fit.
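
i.e. the replay side boils down to something like this (a sketch; the
real code replays per-page ranges from the pack file rather than whole
files):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: queue readahead for one file and return immediately.
 * POSIX_FADV_WILLNEED only *submits* the request; the elevator
 * decides when and in what order the blocks are actually read. */
static void queue_readahead(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
                return;

        int r = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        if (r != 0)
                fprintf(stderr, "fadvise %s: %s\n", path, strerror(r));

        /* closing the fd does not cancel the queued readahead */
        close(fd);
}

Calling that in a tight loop over every file in the pack list hands
the kernel the whole boot read set at once, which is the point.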

> BTW, does posix_fadvise apply globally or just for the process which
> calls it?

The kernel caches each block only once: the page cache is global and
shared by all processes, so readahead triggered this way benefits
everyone, not just the caller.

> > We do that too. We use "idle" on SSD, and "realtime" on HDD.
> Why "realtime" on HDD?

Because on HDD seeks are very expensive. The idea of readahead is to
rearrange our reads so that no seeks happen, i.e. we read things
linearly in one big chunk. If accesses of other processes are
interleaved with this then disk access will be practically random and
the seeks will hurt.

On SSD seeks are basically free, hence all we do is tell the kernel
early what might be needed later, so that it reads it when it has
nothing else to do.
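
In code terms, picking the class is a single ioprio_set() call; glibc
has no wrapper for it, so roughly (constants as in linux/ioprio.h,
helper name is mine):

#include <sys/syscall.h>
#include <unistd.h>

/* Constants as defined by the kernel (linux/ioprio.h). */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) \
        (((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT    1
#define IOPRIO_CLASS_IDLE  3
#define IOPRIO_WHO_PROCESS 1

/* Sketch: put the calling process into the given I/O class;
 * a "who" of 0 means the calling process. */
static int set_io_class(int class, int prio) {
        return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                       IOPRIO_PRIO_VALUE(class, prio));
}

/* i.e. set_io_class(IOPRIO_CLASS_IDLE, 0) on SSD,
 *      set_io_class(IOPRIO_CLASS_RT, 0) on HDD. */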

> BTW, according to my test, the "idle" class is not really *idle*, see
> the attachment. That means 'replay' will always impact everyone
> else's I/O. With 'replay' in the idle I/O class on an HDD, other
> processes' I/O performance drops by half in my test.

That's probably something to fix in the elevator in the kernel?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

