[Mesa-dev] [PATCH][RFC] mesa/main: Clamp rgba with streamed sse

Mon Nov 3 00:57:17 PST 2014

On 31.10.2014 20:30, Roland Scheidegger wrote:
> Am 31.10.2014 um 18:17 schrieb Matt Turner:
>> On Fri, Oct 31, 2014 at 3:13 AM, Juha-Pekka Heikkila
>> <juhapekka.heikkila at gmail.com> wrote:
>>> Signed-off-by: Juha-Pekka Heikkila <juhapekka.heikkila at gmail.com>
>>> ---
>>>  src/mesa/main/colormac.h      | 20 +++++++++++++++
>>>  src/mesa/main/pixeltransfer.c | 59 ++++++++++++++++++++++++++++++++-----------
>>>  2 files changed, 64 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/src/mesa/main/colormac.h b/src/mesa/main/colormac.h
>>> index c8adca6..da5e094 100644
>>> --- a/src/mesa/main/colormac.h
>>> +++ b/src/mesa/main/colormac.h
>>> @@ -51,6 +51,26 @@ _mesa_unclamped_float_rgba_to_ubyte(GLubyte dst[4], const GLfloat src[4])
>>>
>>>
>>>  /**
>>> + * Clamp four float values to [min,max]
>>> + */
>>> +#if defined(__SSE2__) && defined(__GNUC__)
>>> +static inline void
>>> +_mesa_clamp_float_rgba(GLfloat src[4], GLfloat result[4], const float min,
>>> +                       const float max)
>>> +{
>>> +    __m128  operand, minval, maxval;
>>> +
>>> +    operand = _mm_loadu_ps(src);
>>
>> Surely 128-bit pixels will be 128-bit aligned? I think we can use an
>> aligned load here.
> I can't see why that would be the case (sure in some cases it might be
> but I don't see anything which would actually guarantee that). temp
> images themselves don't seem to have any forced alignment (that could be
> fixed though I didn't check fi there's other paths into it which can't
> be easily fixed).

On SNB 64-bit build with my standard build options I did see the pixels
aligned at 64-bit at worst thus I have unaligned load/store here.

> 
>>
>>> +    minval = _mm_set1_ps(min);
>>> +    maxval = _mm_set1_ps(max);
>>> +    operand = _mm_max_ps(operand, minval);
>>> +    operand = _mm_min_ps(operand, maxval);
>>> +    _mm_storeu_ps(result, operand);
>>
>> And an aligned store here.
>>
>>> +}
>>> +#endif
>>> +
>>> +
>>> +/**
>>>   * \name Generic color packing macros.  All inputs should be GLubytes.
>>>   *
>>>   * \todo We may move these into texstore.h at some point.
>>> diff --git a/src/mesa/main/pixeltransfer.c b/src/mesa/main/pixeltransfer.c
>>> index 8bbeeb8..e16eb59 100644
>>> --- a/src/mesa/main/pixeltransfer.c
>>> +++ b/src/mesa/main/pixeltransfer.c
>>> @@ -35,7 +35,7 @@
>>>  #include "pixeltransfer.h"
>>>  #include "imports.h"
>>>  #include "mtypes.h"
>>> -
>>> +#include "x86/common_x86_asm.h"
>>>
>>>  /*
>>>   * Apply scale and bias factors to an array of RGBA pixels.
>>> @@ -89,16 +89,34 @@ _mesa_map_rgba( const struct gl_context *ctx, GLuint n, GLfloat rgba[][4] )
>>>     const GLfloat *bMap = ctx->PixelMaps.BtoB.Map;
>>>     const GLfloat *aMap = ctx->PixelMaps.AtoA.Map;
>>>     GLuint i;
>>> -   for (i=0;i<n;i++) {
>>> -      GLfloat r = CLAMP(rgba[i][RCOMP], 0.0F, 1.0F);
>>> -      GLfloat g = CLAMP(rgba[i][GCOMP], 0.0F, 1.0F);
>>> -      GLfloat b = CLAMP(rgba[i][BCOMP], 0.0F, 1.0F);
>>> -      GLfloat a = CLAMP(rgba[i][ACOMP], 0.0F, 1.0F);
>>> -      rgba[i][RCOMP] = rMap[F_TO_I(r * rscale)];
>>> -      rgba[i][GCOMP] = gMap[F_TO_I(g * gscale)];
>>> -      rgba[i][BCOMP] = bMap[F_TO_I(b * bscale)];
>>> -      rgba[i][ACOMP] = aMap[F_TO_I(a * ascale)];
>>> +
>>> +#if defined(__SSE2__) && defined(__GNUC__)
>>> +   if (cpu_has_xmm2) {
>>
>> #ifdef __SSE2__ means the compiler is free to use SSE2 instructions
>> whenever it pleases. That's not what you want here, if you're also
>> doing runtime checking (cpu_has_xmm2).
>>
>> The typical way to do this is to put the function containing SSE
>> instructions in a separate file that is compiled with -msse2. See
>> streaming-load-memcpy.c for example. gcc has a way to mark specific
>> functions, but since we have to compile with MSVC...
>>
>> I think you just want copy-and-paste the SSE4.1 testing code in
>> configure.ac for SSE2 and then wrap these uses in #ifdef USE_SSE2.
>>
>>> +      for (i=0;i<n;i++) {
>>> +         GLfloat rgba_temp[4];
>>> +         _mesa_clamp_float_rgba(rgba[i], rgba_temp, 0.0F, 1.0F);
>>> +         rgba[i][RCOMP] = rMap[F_TO_I(rgba_temp[RCOMP] * rscale)];
>>> +         rgba[i][GCOMP] = gMap[F_TO_I(rgba_temp[GCOMP] * gscale)];
>>> +         rgba[i][BCOMP] = bMap[F_TO_I(rgba_temp[BCOMP] * bscale)];
>>> +         rgba[i][ACOMP] = aMap[F_TO_I(rgba_temp[ACOMP] * ascale)];
>>
>> Oh, but we shouldn't be bothering to store the floats back to memory
>> anyway. We should just do this part with SSE as well.
> 
> Indeed the mul scale and F_TO_I look like obvious candidates for
> vectorization (though the table part unfortunately not).

I'll rework this patch, it'll be a bit different looking with these
suggested changes.

/Juha-Pekka