[Mesa-dev] [PATCH 09/13] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

Matt Turner mattst88 at gmail.com
Tue May 22 23:46:35 UTC 2018


On Mon, Apr 30, 2018 at 4:34 PM, Scott D Phillips
<scott.d.phillips at intel.com> wrote:
> Matt Turner <mattst88 at gmail.com> writes:
>
>> On Mon, Apr 30, 2018 at 10:25 AM, Scott D Phillips
>> <scott.d.phillips at intel.com> wrote:
>>> The reference for MOVNTDQA says:
>>>
>>>     For WC memory type, the nontemporal hint may be implemented by
>>>     loading a temporary internal buffer with the equivalent of an
>>>     aligned cache line without filling this data to the cache.
>>>     [...] Subsequent MOVNTDQA reads to unread portions of the WC
>>>     cache line will receive data from the temporary internal
>>>     buffer if data is available.
>>>
>>> This hidden cache-line-sized temporary buffer can improve read
>>> performance from WC maps.
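>>>
>>> To illustrate (not part of the patch): reading one 64-byte WC cache
>>> line as four consecutive aligned streaming loads means only the
>>> first movntdqa needs to touch memory; the other three can be served
>>> from the internal buffer:
>>>
>>>     __m128i a = _mm_stream_load_si128((__m128i *)src + 0); /* loads the WC line */
>>>     __m128i b = _mm_stream_load_si128((__m128i *)src + 1); /* may hit the buffer */
>>>     __m128i c = _mm_stream_load_si128((__m128i *)src + 2); /* may hit the buffer */
>>>     __m128i d = _mm_stream_load_si128((__m128i *)src + 3); /* may hit the buffer */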
>>>
>>> v2: Add mfence at start of tiled_to_linear for streaming loads (Chris)
>>>
>>> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>
>>
>> I think I understand the mechanics of this change. Let me tell you
>> what I think is going on and you can confirm my understanding.
>>
>> We want to inline a movntdqa-using memcpy function into the tiling
>> functions for performance reasons. Since movntdqa is an SSE4.1
>> instruction, we must split out intel_tiled_memcpy.c into its own
>> convenience library so it can be built with -msse4.1.
>>
>> Because the existing _mesa_streaming_load_memcpy is compiled into its
>> own object file, we can't use it directly (and we might not want to
>> use it directly anyway, since it calls _mm_mfence(), which we can
>> instead do just once in the calling function). So we implement a
>> simple movntdqa memcpy capable of handling only 16-byte or 64-byte
>> copies.
>>
>> Somewhat oddly, we pass in _mesa_streaming_load_memcpy (which we
>> cannot inline) and then actually call our simpler memcpy. It's safe to
>> only protect the SSE4.1-using code with #if defined(USE_SSE41) because
>> _mesa_streaming_load_memcpy is only passed if cpu_has_sse4_1 is true
>> (in the next patch).
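>>
>> Roughly, i.e. (sketch only, names as I understand the next patch):
>>
>>     mem_copy = cpu_has_sse4_1 ? _mesa_streaming_load_memcpy : memcpy;
>>
>> so the movntdqa path can only ever be reached on SSE4.1 hardware.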
>>
>> Did I miss anything?
>
> You've got it exactly.
>
>>> ---
>>>  src/mesa/drivers/dri/i965/Makefile.am          |  7 +++
>>>  src/mesa/drivers/dri/i965/Makefile.sources     |  6 ++-
>>>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 62 ++++++++++++++++++++++++++
>>>  src/mesa/drivers/dri/i965/meson.build          | 18 ++++++--
>>>  4 files changed, 88 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
>>> index 889d4c68a2b..ff47add93f4 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.am
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.am
>>> @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
>>>
>>>  noinst_LTLIBRARIES = \
>>>         libi965_dri.la \
>>> +       libintel_tiled_memcpy.la \
>>>         $(I965_PERGEN_LIBS)
>>>
>>> +libintel_tiled_memcpy_la_SOURCES = \
>>> +       $(intel_tiled_memcpy_FILES)
>>> +libintel_tiled_memcpy_la_CFLAGS = \
>>> +       $(AM_CFLAGS) $(SSE41_CFLAGS)
>>> +
>>>  libi965_dri_la_SOURCES = \
>>>         $(i965_FILES) \
>>>         $(i965_oa_GENERATED_FILES)
>>> @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
>>>         $(top_builddir)/src/intel/compiler/libintel_compiler.la \
>>>         $(top_builddir)/src/intel/blorp/libblorp.la \
>>>         $(I965_PERGEN_LIBS) \
>>> +       libintel_tiled_memcpy.la \
>>>         $(LIBDRM_LIBS)
>>>
>>>  BUILT_SOURCES = $(i965_oa_GENERATED_FILES)
>>> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
>>> index 5e53d874d88..a15bba50eec 100644
>>> --- a/src/mesa/drivers/dri/i965/Makefile.sources
>>> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
>>> @@ -112,11 +112,13 @@ i965_FILES = \
>>>         intel_tex_image.c \
>>>         intel_tex_obj.h \
>>>         intel_tex_validate.c \
>>> -       intel_tiled_memcpy.c \
>>> -       intel_tiled_memcpy.h \
>>>         intel_upload.c \
>>>         libdrm_macros.h
>>>
>>> +intel_tiled_memcpy_FILES = \
>>> +       intel_tiled_memcpy.c \
>>> +       intel_tiled_memcpy.h
>>> +
>>>  i965_gen4_FILES = \
>>>         genX_blorp_exec.c \
>>>         genX_state_upload.c
>>> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> index 7c6bde990d6..fac5427d2ed 100644
>>> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
>>> @@ -36,6 +36,10 @@
>>>  #include "brw_context.h"
>>>  #include "intel_tiled_memcpy.h"
>>>
>>> +#if defined(USE_SSE41)
>>> +#include "main/streaming-load-memcpy.h"
>>> +#include <smmintrin.h>
>>> +#endif
>>>  #if defined(__SSSE3__)
>>>  #include <tmmintrin.h>
>>>  #elif defined(__SSE2__)
>>> @@ -213,6 +217,31 @@ rgba8_copy_aligned_src(void *dst, const void *src, size_t bytes)
>>>     return dst;
>>>  }
>>>
>>> +#if defined(USE_SSE41)
>>> +static ALWAYS_INLINE void *
>>> +_memcpy_streaming_load(void *dest, const void *src, size_t count)
>>> +{
>>> +   if (count == 16) {
>>
>> After inlining and some optimization, the compiler will know for sure
>> what count is, right? Curious if it will always be able to optimize
>> away one of these branches.
>
> Exactly. We go to some lengths to make the count constant in the fast
> paths, which makes it constant here once this is inlined. The
> generated assembly for the inner loop winds up being 4 movntdqa
> instructions in a row, just like we want.
>
>>> +      __m128i val = _mm_stream_load_si128((__m128i *)src);
>>> +      _mm_store_si128((__m128i *)dest, val);
>>> +      return dest;
>>> +   } else if (count == 64) {
>>> +      __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
>>> +      __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
>>> +      __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
>>> +      __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
>>> +      _mm_store_si128(((__m128i *)dest) + 0, val0);
>>> +      _mm_store_si128(((__m128i *)dest) + 1, val1);
>>> +      _mm_store_si128(((__m128i *)dest) + 2, val2);
>>> +      _mm_store_si128(((__m128i *)dest) + 3, val3);
>>> +      return dest;
>>> +   } else {
>>> +      assert(count < 64); /* and (count < 16) for ytiled */
>>> +      return memcpy(dest, src, count);
>>
>> I would just make this unreachable().
>
> This is another bit of a wart. This case actually is reached: the
> mem_copy_align16 functions in the tiling/detiling functions are only
> meant to guarantee that the source and destination are 16-byte aligned;
> they guarantee nothing about the length.
>
> So the mem_copy_align16 function will get called for the remnant on the
> right side of the surface, which is 16-byte aligned but whose width is
> not a multiple of 16 bytes.
>
> We could also handle this by doing a streaming load and then a
> maskmovdqu masked store. I suppose we know that the extra bytes read from
> the 16-aligned load couldn't cross a page boundary. I just punted though
> and used the system memcpy.
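>
> For reference, that alternative would look roughly like (sketch only;
> count < 16, src 16-byte aligned):
>
>    __m128i val  = _mm_stream_load_si128((__m128i *)src);
>    __m128i mask = _mm_loadu_si128((__m128i *)(mask_table + 16 - count));
>    _mm_maskmoveu_si128(val, mask, (char *)dest);
>
> with mask_table being 16 bytes of 0xff followed by 16 bytes of 0x00,
> so only the first count bytes of the loaded line get stored.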

Cool. That all sounds good to me. Have a

Reviewed-by: Matt Turner <mattst88 at gmail.com>

