[Mesa-dev] [PATCH 22/22] meta/blit: Use GL_EXT_shader_samples_identical in MSAA-SS resolve blit

Fri Feb 19 20:54:34 UTC 2016

Am 19.02.2016 um 11:57 schrieb Iago Toral:
> On Thu, 2016-02-18 at 16:13 -0800, Ian Romanick wrote:
>> On 02/18/2016 01:47 PM, Ian Romanick wrote:
>>> On 02/18/2016 08:44 AM, Neil Roberts wrote:
>>>> I made a pathological test case (attached) which repeatedly renders into
>>>> an MSAA FBO and then blits it to the screen and measures the framerate.
>>>> It checks it with a range of different sample counts. The rendering is
>>>> done either by rendering two triangles to fill the framebuffer or by
>>>> calling glClear.
>>>>
>>>> The percentage increase in framerate after applying the patch is like
>>>> this:
>>>>
>>>> With triangles to fill buffer:
>>>> 	
>>>> 16      62,27%
>>>> 8       48,59%
>>>> 4       27,72%
>>>> 2       5,34%
>>>> 0       0,58%
>>>>         
>>>> With glClear:   
>>>>         
>>>> 16      -5,20%
>>>> 8       -7,08%
>>>> 4       -2,45%
>>>> 2       -20,76%
>>>> 0       3,71%
>>>
>>> I find this result interesting.  In the samples=0 case, the before and
>>> after shaders should be identical.  I dug into this a bit, and I think
>>> the problem is the previous patch.  Using do { ... } while (false) in
>>> the macro makes the compiler generate some really, really bad code.
>>> Just deleting those two lines makes a huge difference.
>>>
>>> With do { ... } while (false):
>>>
>>> SIMD8 shader: 228 instructions. 4 loops. 1554 cycles. 0:0 spills:fills.
>>> Promoted 0 constants. Compacted 3648 to 2192 bytes (40%)
>>>
>>> Without:
>>>
>>> SIMD8 shader: 159 instructions. 0 loops. 380 cycles. 0:0 spills:fills.
>>> Promoted 0 constants. Compacted 2544 to 1600 bytes (37%)
>>> SIMD16 shader: 159 instructions. 0 loops. 898 cycles. 0:0 spills:fills.
>>> Promoted 0 constants. Compacted 2544 to 1600 bytes (37%)
>>>
>>> This is for samples=16 and sampler2DMS.  The cycle counts are bogus
>>> because the 4 loops have their cycle counts statically multiplied by
>>> 10... even though the loop will only execute once.
>>>
>>> So... I'll try some more experiments.  I also wonder about changing the
>>> SAMPLES > 1 to SAMPLES > 2.
>>
>> This is weird, and I'm not sure what to make of it.  I ran your test at
>> a bunch of commits on my branch.  Getting rid of gl_FragColor was the
>> big win.  The GL_EXT_shader_samples_identical patch is still the big
>> loss.  I'm assuming that causes enough register pressure over the
>> previous commit to lose SIMD16.  The following patch just removes the
>> "do {" and "} while (false)".  That gets back some of the losses,
>> but not all of it.
> 
> I don't know much about this, but using shader_samples_identical should
> only give a benefit if we actually get identical samples, otherwise it
> means more work, right? I noticed that the test renders a quad with
> random colors for each vertex that will be interpolated across the
> region, so could it be that we are hitting few cases of identical
> samples? Shouldn't the results be expected to be better if we rendered a
> flat color instead?

Unless you run that shader drawing random colors at per-sample frequency
(using per-sample interpolation), you still get identical samples (for
each pixel). Except at the edges of the quad.
(I didn't actually look at the test...)

Roland

> 
>> The raw per-commit data is below.
>>
>> 5c7f974 Android: disable unused-parameter warning
>> With triangles to fill buffer:
>>
>> 8,419.603882
>> 4,633.834045
>> 2,595.628113
>> 0,795.798157
>>
>> With glClear:
>>
>> 8,450.937958
>> 4,594.000610
>> 2,641.354553
>> 0,767.400818
>>
>> 89b8c0a meta/blit: Replace the gl_FragColor write
>> With triangles to fill buffer:
>>
>> 8,597.943054
>> 4,803.083862
>> 2,942.862549
>> 0,1129.432983
>>
>> With glClear:
>>
>> 8,643.500671
>> 4,980.392151
>> 2,982.704407
>> 0,1116.569946
>>
>> 7025d43 meta/blit: Don't dynamically select the MSAA "merge" function
>> With triangles to fill buffer:
>>
>> 8,599.412598
>> 4,954.745056
>> 2,933.184021
>> 0,1129.305420
>>
>> With glClear:
>>
>> 8,582.106079
>> 4,965.810303
>> 2,975.419434
>> 0,1169.180420
>>
>> b895a4d meta/blit: Use a single, master shader template for MSAA-SS blits
>> With triangles to fill buffer:
>>
>> 8,591.786011
>> 4,932.835815
>> 2,910.498047
>> 0,1072.501099
>>
>> With glClear:
>>
>> 8,623.441406
>> 4,952.562378
>> 2,941.442261
>> 0,1081.314819
>>
>> 197e4be meta/blit: Use GL_EXT_shader_samples_identical in MSAA-SS resolve blit
>> With triangles to fill buffer:
>>
>> 8,559.252869
>> 4,795.355103
>> 2,781.494202
>> 0,941.619568
>>
>> With glClear:
>>
>> 8,577.400513
>> 4,806.321533
>> 2,811.030029
>> 0,957.579224
>>
>> ec01613 fix loop nonsense derp
>> With triangles to fill buffer:
>>
>> 8,559.221558
>> 4,788.767944
>> 2,786.287170
>> 0,1078.632324
>>
>> With glClear:
>>
>> 8,575.771545
>> 4,956.205750
>> 2,813.206482
>> 0,1111.111084
>>
>>> I will probably also make a patch to fix the damage done by the do { ...
>>> } while (false).  That's just comedy.
>>>
>>>> It seems like a pretty convincing win for the triangle case but the
>>>> clear case makes it slightly worse. Presumably this is because we don't
>>>> do anything to detect the value stored in the MCS buffer when doing a
>>>> fast clear so the fast path isn't taken and the shader being more
>>>> complicated makes it slower.
>>>>
>>>> Not sure if we want to try and do anything about that because
>>>> potentially the cleared pixels aren't very common in a framebuffer from
>>>> a real use case so it might not really matter.
>>>>
>>>> Currently we don't use SIMD16 for 16x MSAA because we can't allocate the
>>>> registers well enough to make it worthwhile. This patch makes that
>>>> problem a bit more interesting because even if we end up spilling a lot
>>>> it might still be worth doing SIMD16 because the cases where the spilled
>>>> instructions are hit would be much less common.
>>>>
>>>> - Neil
>>>>
>>>> Ian Romanick <idr at freedesktop.org> writes:
>>>>
>>>>> From: Ian Romanick <ian.d.romanick at intel.com>
>>>>>
>>>>> Somewhat surprisingly, this didn't have any affect on performance in the
>>>>> benchmarks that Martin tried for me.
>>>>>
>>>>> Signed-off-by: Ian Romanick <ian.d.romanick at intel.com>
>>>>> ---
>>>>>  src/mesa/drivers/common/meta_blit.c | 10 +++++++++-
>>>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/common/meta_blit.c b/src/mesa/drivers/common/meta_blit.c
>>>>> index 28aabd3..c0ec51f 100644
>>>>> --- a/src/mesa/drivers/common/meta_blit.c
>>>>> +++ b/src/mesa/drivers/common/meta_blit.c
>>>>> @@ -530,6 +530,7 @@ setup_glsl_msaa_blit_shader(struct gl_context *ctx,
>>>>>           fs_source = ralloc_asprintf(mem_ctx,
>>>>>                                       "#version 130\n"
>>>>>                                       "#extension GL_ARB_texture_multisample: require\n"
>>>>> +                                     "#extension GL_EXT_shader_samples_identical: enable\n"
>>>>>                                       "#define gvec4 %svec4\n"
>>>>>                                       "uniform %ssampler2DMS%s texSampler;\n"
>>>>>                                       "in %s texCoords;\n"
>>>>> @@ -569,7 +570,14 @@ setup_glsl_msaa_blit_shader(struct gl_context *ctx,
>>>>>                                       "   i%s tc = i%s(texCoords);\n"
>>>>>                                       "   int i;\n"
>>>>>                                       "\n"
>>>>> -                                     "   for (i = 0; i < SAMPLES; i++)\n"
>>>>> +                                     "   S[0] = texelFetch(texSampler, tc, 0);\n"
>>>>> +                                     "#if defined(GL_EXT_shader_samples_identical) && SAMPLES > 1\n"
>>>>> +                                     "   if (textureSamplesIdenticalEXT(texSampler, tc)) {\n"
>>>>> +                                     "      emit2(S[0]);\n"
>>>>> +                                     "      return;\n"
>>>>> +                                     "   }\n"
>>>>> +                                     "#endif\n"
>>>>> +                                     "   for (i = 1; i < SAMPLES; i++)\n"
>>>>>                                       "      S[i] = texelFetch(texSampler, tc, i);\n"
>>>>>                                       "\n"
>>>>>                                       "   REDUCE(s16, s32);\n"
>>>>> -- 
>>>>> 2.5.0
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 
> 
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>