[Mesa-dev] [PATCH 2/5] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

Tue Apr 10 15:33:18 UTC 2018

Chris Wilson <chris at chris-wilson.co.uk> writes:

> Quoting Chris Wilson (2018-04-05 20:54:54)
> > Quoting Scott D Phillips (2018-04-03 21:05:42)

[...]

> > Ok, was hoping to see how you choose to use the streaming load, but I
> > guess that's the next patch.
> > 
> > Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
>
> Oh, one point Eric Anholt mentioned on another thread about movntdqa is
> that stale data inside the internal buffer is not automatically
> invalidated. We may need to emit explicit mfence before the copies if we
> are in doubt. A single mfence per tiled-copy is probably not enough to
> worry about optimising away.

Looking around, I found this erratum about movntdqa not honoring the
ordering guarantees of locked instructions (VLP31 in the PDF):

https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-n3520-j2850-celeron-n2920-n2820-n2815-n2806-j1850-j1750-spec-update.pdf

So I added this code near the top of tiled_to_linear():

if (mem_copy == (mem_copy_fn)_mesa_streaming_load_memcpy) {
   /* Various Atom processors have errata where the movntdqa instruction
    * (which is used in _mesa_streaming_load_memcpy) may incorrectly be
    * reordered before locked instructions. To work around that, we put an
    * lfence here to manually wait for preceding loads to be completed.
    */
   __builtin_ia32_lfence();
}

By my hazy understanding, an mfence won't suffice here, since the
erratum specifically calls for an lfence. Do I have that right, or
should this be an mfence?
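
For reference, here is a minimal standalone sketch of the same fence
placement, written with intrinsics instead of the GCC builtin. The names
fenced_streaming_copy and copy_maybe_streaming are hypothetical and not
part of Mesa or this patch. It assumes a GCC- or Clang-compatible
compiler on x86, and it takes no position on whether lfence or mfence is
the right fence for the erratum. On ordinary write-back memory movntdqa
behaves like a normal load, so this only illustrates where the fence
sits relative to the streaming loads.

```c
#include <emmintrin.h>   /* _mm_lfence, _mm_storeu_si128 (SSE2) */
#include <smmintrin.h>   /* _mm_stream_load_si128 (SSE4.1) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not Mesa's _mesa_streaming_load_memcpy.
 * Assumes src is 16-byte aligned; len may be any size.
 */
__attribute__((target("sse4.1")))
static void
fenced_streaming_copy(void *dst, const void *src, size_t len)
{
   char *d = dst;
   const char *s = src;

   /* One fence per copy: wait for preceding loads to complete before
    * any movntdqa is issued.
    */
   _mm_lfence();

   while (len >= 16) {
      /* movntdqa load from the aligned source, unaligned store to dst. */
      __m128i v = _mm_stream_load_si128((__m128i *)s);
      _mm_storeu_si128((__m128i *)d, v);
      d += 16;
      s += 16;
      len -= 16;
   }

   /* Copy the remaining tail (fewer than 16 bytes) normally. */
   memcpy(d, s, len);
}

/* Hypothetical wrapper: use the streaming path only when the CPU has
 * SSE4.1 and the source is 16-byte aligned, otherwise plain memcpy.
 */
static void
copy_maybe_streaming(void *dst, const void *src, size_t len)
{
   if (__builtin_cpu_supports("sse4.1") && ((uintptr_t)src & 15) == 0)
      fenced_streaming_copy(dst, src, len);
   else
      memcpy(dst, src, len);
}
```

The fence is issued once per call rather than once per 16-byte load,
which matches the "single fence per tiled-copy" cost discussed above.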