[Mesa-dev] [PATCH 09/13] i965/tiled_memcpy: inline movntdqa loads in tiled_to_linear

Matt Turner mattst88 at gmail.com
Mon Apr 30 21:58:51 UTC 2018


On Mon, Apr 30, 2018 at 10:25 AM, Scott D Phillips
<scott.d.phillips at intel.com> wrote:
> The reference for MOVNTDQA says:
>
>     For WC memory type, the nontemporal hint may be implemented by
>     loading a temporary internal buffer with the equivalent of an
>     aligned cache line without filling this data to the cache.
>     [...] Subsequent MOVNTDQA reads to unread portions of the WC
>     cache line will receive data from the temporary internal
>     buffer if data is available.
>
> This hidden cache line sized temporary buffer can improve the
> read performance from wc maps.
>
> v2: Add mfence at start of tiled_to_linear for streaming loads (Chris)
>
> Reviewed-by: Chris Wilson <chris at chris-wilson.co.uk>

I think I understand the mechanics of this change. Let me tell you
what I think is going on and you can confirm my understanding.

We want to inline a movntdqa-using memcpy function into the tiling
functions for performance reasons. Since movntdqa is an SSE4.1
instruction, we must split out intel_tiled_memcpy.c into its own
convenience library so it can be built with -msse4.1.

Because the existing _mesa_streaming_load_memcpy is compiled into its
own object file, we can't use it directly (and we might not want to
use it directly anyway, since it calls _mm_mfence() we can just do it
once at the very end of the calling function). So we implement a
simple memcpy with movntdqa capable of handling only 16-bytes or
64-bytes.

Somewhat oddly, we pass in _mesa_streaming_load_memcpy (which we
cannot inline) and then actually call our simpler memcpy. It's safe to
only protect the SSE4.1-using code with #if defined(USE_SSE41) because
_mesa_streaming_load_memcpy is only passed if cpu_has_sse4_1 is true
(in the next patch).

Did I miss anything?

> ---
>  src/mesa/drivers/dri/i965/Makefile.am          |  7 +++
>  src/mesa/drivers/dri/i965/Makefile.sources     |  6 ++-
>  src/mesa/drivers/dri/i965/intel_tiled_memcpy.c | 62 ++++++++++++++++++++++++++
>  src/mesa/drivers/dri/i965/meson.build          | 18 ++++++--
>  4 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/Makefile.am b/src/mesa/drivers/dri/i965/Makefile.am
> index 889d4c68a2b..ff47add93f4 100644
> --- a/src/mesa/drivers/dri/i965/Makefile.am
> +++ b/src/mesa/drivers/dri/i965/Makefile.am
> @@ -92,8 +92,14 @@ libi965_gen11_la_CFLAGS = $(AM_CFLAGS) -DGEN_VERSIONx10=110
>
>  noinst_LTLIBRARIES = \
>         libi965_dri.la \
> +       libintel_tiled_memcpy.la \
>         $(I965_PERGEN_LIBS)
>
> +libintel_tiled_memcpy_la_SOURCES = \
> +       $(intel_tiled_memcpy_FILES)
> +libintel_tiled_memcpy_la_CFLAGS = \
> +       $(AM_CFLAGS) $(SSE41_CFLAGS)
> +
>  libi965_dri_la_SOURCES = \
>         $(i965_FILES) \
>         $(i965_oa_GENERATED_FILES)
> @@ -104,6 +110,7 @@ libi965_dri_la_LIBADD = \
>         $(top_builddir)/src/intel/compiler/libintel_compiler.la \
>         $(top_builddir)/src/intel/blorp/libblorp.la \
>         $(I965_PERGEN_LIBS) \
> +       libintel_tiled_memcpy.la
>         $(LIBDRM_LIBS)
>
>  BUILT_SOURCES = $(i965_oa_GENERATED_FILES)
> diff --git a/src/mesa/drivers/dri/i965/Makefile.sources b/src/mesa/drivers/dri/i965/Makefile.sources
> index 5e53d874d88..a15bba50eec 100644
> --- a/src/mesa/drivers/dri/i965/Makefile.sources
> +++ b/src/mesa/drivers/dri/i965/Makefile.sources
> @@ -112,11 +112,13 @@ i965_FILES = \
>         intel_tex_image.c \
>         intel_tex_obj.h \
>         intel_tex_validate.c \
> -       intel_tiled_memcpy.c \
> -       intel_tiled_memcpy.h \
>         intel_upload.c \
>         libdrm_macros.h
>
> +intel_tiled_memcpy_FILES = \
> +       intel_tiled_memcpy.c \
> +       intel_tiled_memcpy.h
> +
>  i965_gen4_FILES = \
>         genX_blorp_exec.c \
>         genX_state_upload.c
> diff --git a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> index 7c6bde990d6..fac5427d2ed 100644
> --- a/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> +++ b/src/mesa/drivers/dri/i965/intel_tiled_memcpy.c
> @@ -36,6 +36,10 @@
>  #include "brw_context.h"
>  #include "intel_tiled_memcpy.h"
>
> +#if defined(USE_SSE41)
> +#include "main/streaming-load-memcpy.h"
> +#include <smmintrin.h>
> +#endif
>  #if defined(__SSSE3__)
>  #include <tmmintrin.h>
>  #elif defined(__SSE2__)
> @@ -213,6 +217,31 @@ rgba8_copy_aligned_src(void *dst, const void *src, size_t bytes)
>     return dst;
>  }
>
> +#if defined(USE_SSE41)
> +static ALWAYS_INLINE void *
> +_memcpy_streaming_load(void *dest, const void *src, size_t count)
> +{
> +   if (count == 16) {

After inlining and some optimization, the compiler will know for sure
what count is, right? Curious if it will always be able to optimize
away one of these branches.

> +      __m128i val = _mm_stream_load_si128((__m128i *)src);
> +      _mm_store_si128((__m128i *)dest, val);
> +      return dest;
> +   } else if (count == 64) {
> +      __m128i val0 = _mm_stream_load_si128(((__m128i *)src) + 0);
> +      __m128i val1 = _mm_stream_load_si128(((__m128i *)src) + 1);
> +      __m128i val2 = _mm_stream_load_si128(((__m128i *)src) + 2);
> +      __m128i val3 = _mm_stream_load_si128(((__m128i *)src) + 3);
> +      _mm_store_si128(((__m128i *)dest) + 0, val0);
> +      _mm_store_si128(((__m128i *)dest) + 1, val1);
> +      _mm_store_si128(((__m128i *)dest) + 2, val2);
> +      _mm_store_si128(((__m128i *)dest) + 3, val3);
> +      return dest;
> +   } else {
> +      assert(count < 64); /* and (count < 16) for ytiled */
> +      return memcpy(dest, src, count);

I would just make this unreachable().


More information about the mesa-dev mailing list