[PATCH] x86: Add an explicit barrier() to clflushopt()

Mon Jan 11 13:05:06 PST 2016

On Mon, Jan 11, 2016 at 12:11:05PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 3:28 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> >
> > Bizarrely,
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index 6000ad7..cf074400 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -141,6 +141,7 @@ void clflush_cache_range(void *vaddr, unsigned int size)
> >         for (; p < vend; p += clflush_size)
> >                 clflushopt(p);
> >
> > +       clflushopt(vend-1);
> >         mb();
> >  }
> >  EXPORT_SYMBOL_GPL(clflush_cache_range);
> >
> > works like a charm.
> 
> Have you checked all your callers? If the above makes a difference, it
> really sounds like the caller has passed in a size of zero, resulting
> in no cache flush, because the caller had incorrect ranges. The
> additional clflushopt now flushes the previous cacheline that wasn't
> flushed correctly before.
> 
> That "size was zero" thing would explain why changing the loop to "p
> <= vend" also fixes things for you.

This is on top of HPA's suggestion to do the size==0 check up front.
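
To make the size==0 and "p <= vend" arithmetic concrete, here is a minimal userspace sketch of the loop bounds only. It flushes nothing and is not the kernel function: the name lines_visited(), the "inclusive" switch, and the rounding of p down to a cacheline boundary are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: count how many cachelines a loop of the form
 *     for (; p < vend; p += line)
 * would visit. "inclusive" selects the "p <= vend" variant instead.
 * Assumes "line" is a power of two and that p starts at vaddr
 * rounded down to a line boundary.
 */
size_t lines_visited(uintptr_t vaddr, size_t size, size_t line, int inclusive)
{
	uintptr_t p = vaddr & ~(uintptr_t)(line - 1); /* round down to line start */
	uintptr_t vend = vaddr + size;                /* one past the last byte */
	size_t n = 0;

	for (; inclusive ? p <= vend : p < vend; p += line)
		n++;

	return n;
}
```

With a 64-byte line and a line-aligned vaddr, size 0 visits no lines under "p < vend" and exactly one under "p <= vend", which is the effect described above. For a non-zero, line-aligned size the "<=" variant visits one extra line past the end of the range.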

> IOW, just how sure are you that all the ranges are correct?

All our callers are of the pattern:

memcpy(dst, vaddr, size)
clflush_cache_range(dst, size)

or 

clflush_cache_range(vaddr, size)
memcpy(dst, vaddr, size)

I am reasonably confident that the ranges are sane. I've tried to verify
that we do the clflushes by forcing them. However, if I clflush the
whole object instead of the cachelines around the copies, the tests pass.
(Since that flushes up to a couple of megabytes instead of a few hundred
bytes, it is hard to draw any conclusions about what the bug might be.)

I can narrow down the principal buggy path by doing the clflush(vend-1)
in the callers at least.
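
As a sanity check on why the extra flush of vend-1 looks redundant on paper, a small userspace sketch of the address arithmetic follows. It only models which cacheline vend-1 falls in; line_base() and loop_covers_last_byte() are hypothetical helper names, and the round-down of the start address is an assumption about the loop setup.

```c
#include <stddef.h>
#include <stdint.h>

/* Start address of the cacheline containing addr ("line" is a power of two). */
uintptr_t line_base(uintptr_t addr, size_t line)
{
	return addr & ~(uintptr_t)(line - 1);
}

/*
 * Hypothetical check: does a loop of the form
 *     for (p = line_base(vaddr); p < vend; p += line)
 * already visit the cacheline that contains byte vend-1?
 * Returns 1 if it does, 0 if it does not.
 */
int loop_covers_last_byte(uintptr_t vaddr, size_t size, size_t line)
{
	uintptr_t vend = vaddr + size;
	uintptr_t target = line_base(vend - 1, line);
	uintptr_t p;

	for (p = line_base(vaddr, line); p < vend; p += line)
		if (p == target)
			return 1;

	return 0;
}
```

In this model, any size > 0 makes the loop visit the line holding vend-1, so the additional clflushopt(vend-1) would be a second flush of the last line rather than a flush of a new one. Only size == 0 with a line-aligned vaddr leaves that line (then the preceding one) unvisited.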

The problem is that the tests that fail are those looking for bugs in
the coherency code, which may just as well be caused by the GPU writing
into those ranges at the same time as the CPU is trying to read them. I've
looked into timing and tried adding udelay()s or uncached mmio along the
suspect paths, but that didn't change the presentation. (Having a udelay
fix the issue is usually a good indicator of a GPU write that hasn't landed
before the CPU read.)

The bug only affects a couple of recent non-coherent platforms; earlier
Atoms and older Core seem unaffected. That may also mean that it is the
GPU flush instruction that changed between platforms and isn't working
(at least not as we intended).

Thanks for everyone's help and suggestions,
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre