[PATCH 08/21] udl-kms: avoid prefetch
Alexey Brodkin
Alexey.Brodkin at synopsys.com
Wed Jun 6 12:04:24 UTC 2018
Hi Mikulas,
On Tue, 2018-06-05 at 11:30 -0400, Mikulas Patocka wrote:
>
> On Tue, 5 Jun 2018, Alexey Brodkin wrote:
>
> > Hi Mikulas,
> >
> > On Sun, 2018-06-03 at 16:41 +0200, Mikulas Patocka wrote:
> > > Modern processors can detect linear memory accesses and prefetch data
> > > automatically, so there's no need to use prefetch.
> >
> > Not each and every CPU that's capable of running Linux has prefetch
> > functionality :)
> >
> > Still read-on...
> >
> > > Signed-off-by: Mikulas Patocka <mpatocka at redhat.com>
> > >
> > > ---
> > > drivers/gpu/drm/udl/udl_transfer.c | 7 -------
> > > 1 file changed, 7 deletions(-)
> > >
> > > Index: linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c
> > > ===================================================================
> > > --- linux-4.16.12.orig/drivers/gpu/drm/udl/udl_transfer.c 2018-05-31 14:48:12.000000000 +0200
> > > +++ linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c 2018-05-31 14:48:12.000000000 +0200
> > > @@ -13,7 +13,6 @@
> > > #include <linux/module.h>
> > > #include <linux/slab.h>
> > > #include <linux/fb.h>
> > > -#include <linux/prefetch.h>
> > > #include <asm/unaligned.h>
> > >
> > > #include <drm/drmP.h>
> > > @@ -51,9 +50,6 @@ static int udl_trim_hline(const u8 *bbac
> > > int start = width;
> > > int end = width;
> > >
> > > - prefetch((void *) front);
> > > - prefetch((void *) back);
> >
> > AFAIK the prefetcher fetches new data according to observed history, i.e. based on the
> > previously seen access pattern it tries to get the next batch of data.
> >
> > But the code above is in the very beginning of the data processing routine where
> > prefetcher doesn't yet have any history to know what and where to prefetch.
> >
> > So I'd say this particular usage is good.
> > At least those prefetches shouldn't hurt: typically they compile to
> > just one instruction each if the CPU supports prefetch, or to nothing
> > if the CPU/compiler doesn't.
>
> See this post https://lwn.net/Articles/444336/ where they measured that
> prefetch hurts performance. Prefetch shouldn't be used unless you have a
> proof that it improves performance.
>
> The problem is that the prefetch instruction causes stalls in the pipeline
> when it encounters TLB miss and the automatic prefetcher doesn't.
Wow, thanks for the link.
I didn't know about that subtle issue with prefetch instructions on ARM and x86.
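Interestingly, the case that article measured is pointer chasing, where the hardware prefetcher has no linear pattern to detect, so an explicit prefetch of node->next looked like the textbook use. A userspace sketch of that pattern (my illustration, following the article's description):

```c
#include <stddef.h>

/* Pointer-chasing loop of the kind the LWN article discusses: the
 * hardware prefetcher cannot guess node->next, so such loops were once
 * decorated with an explicit prefetch. The article's measurements
 * showed even this can lose: if 'next' lies in a page with no TLB
 * entry, the prefetch instruction stalls on the page-table walk,
 * whereas omitting the hint lets the core keep executing. */
struct node {
	struct node *next;
	int val;
};

static int list_sum(const struct node *n)
{
	int total = 0;

	while (n) {
		__builtin_prefetch(n->next);	/* the contested hint */
		total += n->val;
		n = n->next;
	}
	return total;
}
```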
So OK, in the case of UDL these prefetches don't make much sense anyway, I guess, and there's
something worse still: see what I've got from a WandBoard Quad running the kmscube [1] application
with the help of the perf utility:
--------------------------->8-------------------------
# Overhead Command Shared Object Symbol
# ........ ....... ....................... ........................................
#
92.93% kmscube [kernel.kallsyms] [k] udl_render_hline
2.51% kmscube [kernel.kallsyms] [k] __divsi3
0.33% kmscube [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.22% kmscube [kernel.kallsyms] [k] lock_acquire
0.19% kmscube [kernel.kallsyms] [k] _raw_spin_unlock_irq
0.17% kmscube [kernel.kallsyms] [k] udl_handle_damage
0.12% kmscube [kernel.kallsyms] [k] v7_dma_clean_range
0.11% kmscube [kernel.kallsyms] [k] l2c210_clean_range
0.06% kmscube [kernel.kallsyms] [k] __memzero
--------------------------->8-------------------------
That said, it's not even USB 2.0 that is the bottleneck but the
computations in udl_render_hline().
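To give an idea of why that path is CPU-bound: the hline code repacks pixels and run-length-compresses them in software, touching every pixel. A simplified userspace sketch of that kind of work (my illustration of the general scheme only, not the driver's actual code or wire format):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified sketch, NOT the driver's code: walk a line of 16bpp
 * pixels and run-length encode repeats, emitting each pixel value
 * big-endian. Every pixel is loaded, compared and re-packed by the
 * CPU, which is why a routine like this dominates a profile long
 * before the USB link saturates. */
static size_t rle_line16(const uint16_t *pix, size_t npix, uint8_t *out)
{
	size_t i = 0, o = 0;

	while (i < npix) {
		uint16_t p = pix[i];
		size_t run = 1;

		while (i + run < npix && pix[i + run] == p && run < 255)
			run++;
		out[o++] = (uint8_t)run;	/* repeat count */
		out[o++] = (uint8_t)(p >> 8);	/* pixel value, */
		out[o++] = (uint8_t)(p & 0xff);	/* big-endian */
		i += run;
	}
	return o;	/* bytes emitted */
}
```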
[1] https://cgit.freedesktop.org/mesa/kmscube/
-Alexey