[Pixman] [PATCH 1/5] ARMv6: Lay the groundwork for later patches in the series

Wed Jan 23 08:34:24 PST 2013

On Wed, 23 Jan 2013 13:53:44 -0000, Siarhei Siamashka <siarhei.siamashka at gmail.com> wrote:
> More comments describing the input arguments for the "preload_line"
> macro would be definitely welcome. And also some high level overview
> for the "narrow", "medium" and "wide" cases.

Because the alignment of pixel data to cachelines, and even the number of
cachelines per row can vary from row to row, and because of the need to
preload each scanline once and only once, this prefetch strategy treats
each row of pixels independently. When a pixel row is long enough, there
are three distinct phases of prefetch:
* an inner loop section, where each time a cacheline of data is
   processed, another cacheline is preloaded (the exact distance ahead is
   determined empirically using profiling results from lowlevel-blt-bench)
* a leading section, where enough cachelines are preloaded to ensure no
   cachelines escape being preloaded when the inner loop starts
* a trailing section, where a limited number (0 or more) of cachelines
   are preloaded to deal with data (if any) that hangs off the end of the
   last iteration of the inner loop, plus any trailing bytes that were not
   enough to make up one whole iteration of the inner loop
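
The three phases can be modelled with a short C sketch. This is only a
simplified illustration, not the ARMv6 assembly from the patch: the
function names, the "count" array and the idea of passing the number of
inner loop iterations as an argument are all assumptions made for the
example. It records how often each cacheline of one row would be
preloaded, so the "once and only once" property can be checked.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (hypothetical, not the Pixman code): a row spans
 * n_lines cachelines, the inner loop runs inner_iters times and stays
 * "distance" cachelines ahead of the data it is processing.
 * count[i] is incremented each time cacheline i is preloaded. */
static void simulate_row_preload(size_t n_lines, size_t distance,
                                 size_t inner_iters, unsigned *count)
{
    size_t next = 0; /* next cacheline that has not been preloaded yet */

    /* The wide case guarantees the row is long enough for this. */
    assert(distance + inner_iters <= n_lines);

    /* Leading section: cover the cachelines the inner loop will need
     * before its own preloads start arriving. */
    while (next < distance)
        count[next++]++;

    /* Inner loop: each cacheline processed triggers one preload,
     * "distance" cachelines ahead. */
    for (size_t i = 0; i < inner_iters; i++)
        count[next++]++; /* this is cacheline i + distance */

    /* Trailing section: 0 or more cachelines left over at the end. */
    while (next < n_lines)
        count[next++]++;
}

/* Returns 1 if every cacheline was preloaded exactly once. */
static int each_line_preloaded_once(const unsigned *count, size_t n_lines)
{
    for (size_t i = 0; i < n_lines; i++)
        if (count[i] != 1)
            return 0;
    return 1;
}
```

For example, a row spanning 7 cachelines with a distance of 3 and 3
inner loop iterations preloads 3 cachelines in the leading section, 3 in
the inner loop and 1 in the trailing section.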

There are (in general) three distinct code paths, selected according to
how long the pixel row is. If it is long enough that there is at least
one iteration of the inner loop (as described above), this is the "wide"
case. If it is shorter than that, but enough bytes are output that there
is at least one 16-byte-long, 16-byte-aligned write to the destination
(the optimum type of write), this is the "medium" case. If it is not even
that long, this is the "narrow" case, and there is no attempt to align
writes to 16-byte boundaries. In the "medium" and "narrow" cases, all the
cachelines containing data from the pixel row are prefetched up-front.
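
As a rough sketch of that selection in C (the real code is ARM assembly
macros, so the enum, the function name and the "wide_threshold"
parameter here are assumptions for illustration only): the narrow limit
of 30 bytes comes from the preload_line description in this message, and
31 bytes is the shortest row that must contain an aligned 16-byte block
whatever its start address.

```c
#include <assert.h>
#include <stddef.h>

enum row_case { ROW_NARROW, ROW_MEDIUM, ROW_WIDE };

/* Hypothetical helper: classify a row by the number of bytes written to
 * the destination. wide_threshold is the smallest byte count that gets
 * at least one inner loop iteration; it is an input here because it
 * depends on the prefetch distance chosen for the routine. */
static enum row_case classify_row(size_t dst_bytes, size_t wide_threshold)
{
    assert(wide_threshold > 31);

    if (dst_bytes >= wide_threshold)
        return ROW_WIDE;

    /* Up to 15 bytes may be needed to reach 16-byte alignment, then 16
     * more for the aligned write: 15 + 16 = 31. */
    if (dst_bytes >= 31)
        return ROW_MEDIUM;

    return ROW_NARROW;
}
```

With a threshold of 191 bytes, for instance, a 30-byte row is narrow,
rows of 31 to 190 bytes are medium, and anything longer is wide.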

If the above two paragraphs are useful, I could add them as comments
(maybe in the next patch series I do, rather than resending the whole of
the current one for the sake of some comments?)

The arguments for preload_line:
"narrow_case" - means that the macro was invoked from the "narrow" code
   path rather than the "medium" one. In the narrow case, the row of
   pixels is known to output no more than 30 bytes, so (assuming the
   source pixels are no wider than the destination pixels) the data
   cannot straddle more than two 32-byte cachelines, meaning there's no
   need for a loop.
"bpp" - number of bits per pixel in the channel (source, mask or
   destination) that's being preloaded, or 0 if this channel is not used
   for reading
"bpp_shift" - log2 of ("bpp"/8) (except if "bpp"=0 of course)
"base" - base address register of channel to preload (SRC, MASK or DST)
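
To make the relationship between "bpp" and "bpp_shift" concrete, here is
a small C sketch. The helper names are hypothetical and it assumes bpp
is 8, 16 or 32; in the patch itself these are assembler macro arguments
supplied by the caller, not values computed at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: log2 of the pixel size in bytes.
 * Only meaningful for bpp of 8, 16 or 32 (bpp = 0 means the channel is
 * not read, so there is nothing to shift). */
static unsigned bpp_to_shift(unsigned bpp)
{
    unsigned shift = 0;

    assert(bpp == 8 || bpp == 16 || bpp == 32);
    for (unsigned bytes = bpp / 8; bytes > 1; bytes >>= 1)
        shift++;
    return shift;
}

/* Converting a pixel count to a byte count is then a single shift. */
static size_t row_bytes(size_t pixels, unsigned bpp_shift)
{
    return pixels << bpp_shift;
}
```

So 8 bpp gives a shift of 0, 16 bpp a shift of 1, and 32 bpp a shift of
2.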

> How many iterations does this loop typically run? If this tries to
> preload the whole scanline (as the name of the macro implies), then
> we may have some problems after 3rd iteration.

Assuming source and destination have the same pixel depth (for
simplicity), the threshold between the "medium" case where whole-line
preloading is used, and the "wide" case where it isn't, is
(prefetch_distance+3)*32 - 1 bytes. Commonly, prefetch_distance turns
out to be 2 or 3, so the threshold is at most (3+3)*32 - 1 = 191 bytes.
Worst case, then, we're talking about whole-line preloading of up to and
including 190 bytes, which can span 7 cachelines if you're unlucky. That
is per channel being preloaded, of course (up to 3).
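
The cacheline counts quoted in this message can be checked with a few
lines of C. This is just arithmetic on addresses, assuming 32-byte
cachelines as above; the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_SHIFT 5 /* log2 of the 32-byte cacheline size */

/* Number of cachelines touched by nbytes of data starting at addr. */
static size_t cachelines_spanned(uintptr_t addr, size_t nbytes)
{
    assert(nbytes > 0);
    uintptr_t first = addr >> CACHELINE_SHIFT;
    uintptr_t last = (addr + nbytes - 1) >> CACHELINE_SHIFT;
    return (size_t)(last - first) + 1;
}

/* Worst case over every possible start offset within a cacheline. */
static size_t worst_case_cachelines(size_t nbytes)
{
    size_t worst = 0;

    for (uintptr_t offset = 0; offset < 32; offset++) {
        size_t n = cachelines_spanned(offset, nbytes);
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

A 190-byte row that starts on a cacheline boundary touches 6 cachelines,
but if it starts on the last byte of a cacheline it touches 7. A 30-byte
row never touches more than 2.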

I can think of at least one set of profiling results indicating that
even this is better than no preloading. You may recall that in an earlier
patch, where I was accidentally using the wrong base register in
preload_line, I briefly had an option to disable preload_line altogether
because it appeared to be causing a regression. Once I fixed the base
register, re-enabling preload_line was advantageous, so I removed the
option to disable it.

Ben