[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Mon Sep 16 01:12:40 PDT 2013

On Mon, Sep 16, 2013 at 3:50 AM, Mark Mueller <markkmueller at gmail.com> wrote:
>
>
>
> On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype441 at gmail.com> wrote:
>>
>> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>>>
>>> From: Chia-I Wu <olv at lunarg.com>
>>>
>>> Consider only the top-left and top-right pixels to approximate DDX in a
>>> 2x2
>>> subspan, unless the application or the user requests a more accurate
>>> approximation.  This results in a less accurate approximation.  However,
>>> it
>>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
>>> 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image quality
>>> difference observed.
>>>
>>> No piglit gpu.tests regressions (tested with v1)
>>>
>>> I failed to come up with an explanation for the performance difference,
>>> as the
>>> change does not affect Ivy Bridge.  If anyone has the insight, please
>>> kindly
>>> enlighten me.  Performance differences may also be observed on other
>>> games
>>> that call textureGrad and dFdx.
>>>
>>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>> Update
>>>     comments.
>>
>>
>> I'm not entirely comfortable making a change that has a known negative
>> impact on computational accuracy (even one that leads to such an impressive
>> performance improvement) when we don't have any theories as to why the
>> performance improvement happens, or why the improvement doesn't apply to Ivy
>> Bridge.  In my experience, making changes to the codebase without
>> understanding why they improve things almost always leads to improvements
>> that are brittle, since it's likely that the true source of the improvement
>> is a coincidence that will be wiped out by some future change (or won't be
>> relevant to client programs other than this particular benchmark).  Having a
>> theory as to why the performance improvement happens would help us be
>> confident that we're applying the right fix under the right circumstances.
> There's another angle to approach this and that is to develop a simple test
> case that will show the different results across a range of computational
> accuracy and run the test on proprietary drivers for the same hardware to
> determine what settings they are using.
Yes, I have a little test.  On Windows, rendering with texture2D() or
textureGrad() does not have a noticeable impact on the performance.
But I am not sure how to change dFdx() accuracy from the shaders.  On
Linux, I did that by modifying the driver.

>
>>
>>
>> For example, here's one theory as to why we might be seeing an
>> improvement: perhaps Haswell's sample_d processing is smart enough to
>> realize that when all the gradient values within a sub-span are the same,
>> that means that all of the sampling for the sub-span will come from the same
>> LOD, and that allows it to short-cut some expensive step in the LOD
>> calculation.  Perhaps the same improvement isn't seen on Ivy Bridge because
>> Ivy Bridge's sample_d processing logic is less sophisticated, so it's unable
>> to perform the optimization.  If this is the case, then conditioning the
>> optimization on brw->is_haswell (as you've done) makes sense.
>>
>> Another possible explanation for the Haswell vs Ivy Bridge difference is
>> that perhaps Ivy Bridge, being a lower-performing chip, has other
>> bottlenecks that make the optimization irrelevant for this particular
>> benchmark, but potentially still useful for other benchmarks.  For instance,
>> maybe when this benchmark executes on Ivy Bridge, the texture that's being
>> sampled from is located in sufficiently distant memory that optimizing the
>> sample_d's memory accesses makes no difference, since the bottleneck is the
>> speed with which the texture can be read into cache, rather than the speed
>> of operation of sample_d.  If this explanation is correct, then it might be
>> worth applying the optimization to both Ivy Bridge and Haswell (and perhaps
>> Sandy Bridge as well), since it might conceivably benefit those other chips
>> when running applications that place less cache pressure on the chip.
>
>
> This scenario is where I'd place my bets, especially given that the numbers
> are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail as
> is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra and
> ultimate levels at medium and high resolutions, the results were all
> essentially the same at comparable resolutions and quality levels.
Isn't Bay Trail based on Ivy Bridge?
>
> I don't see any justification to tie this change to just Haswell hardware.
> There's all sorts of reasons why doing that sounds like a big mistake. In
> fact, another _explanation_ to add to your list is maybe there's another
> is_haswell test elsewhere in the driver that is responsible for the
> performance anomaly.
I had a look at all 'if (brw->is_haswell)' paths but nothing special
came out to me.  It could as well be some place missing a 'if
(brw->is_haswell)'.

>
>>
>> Another possibile explanation is that Haswell has a bug in its sample_d
>> logic which causes it to be slow under some conditions, and this
>> lower-accuracy DDX computation happens to work around it.  If that's the
>> case, we might want to consider not using sample_d at all on Haswell, and
>> instead calculating the LOD in the shader and using sample_l instead.  If
>> this is the correct explanation, then that might let us have faster
>> performance without sacrificing DDX accuracy.
>>
>> A final possible explanation for the performance improvement is that
>> perhaps for some reason sample_d performs more optimally when the DDX and
>> DDY computations have similar accuracies to each other.  Before your patch,
>> our computation of DDX was more accurate than DDY; your patch decreases the
>> accuracy of DDX to match DDY.  If this explanation is correct, then a better
>> solution would probably be to improve the accuracy of DDY to make it
>> comparable to DDX, rather than the other way around.
>>
>> Before we land this patch, can we do some experiments to try to figure out
>> which of these explanations (if any) is correct?
>>
>>>
>>>
>>> Signed-off-by: Chia-I Wu <olv at lunarg.com>
>>> ---
>>>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
>>>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
>>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40
>>> ++++++++++++++++++++++++-------
>>>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
>>>  4 files changed, 38 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c
>>> b/src/mesa/drivers/dri/i965/brw_context.c
>>> index 4fcc9fb..1cdfb9d 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>>>     brw_draw_init( brw );
>>>
>>>     brw->precompile = driQueryOptionb(&brw->optionCache,
>>> "shader_precompile");
>>> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache,
>>> "accurate_derivative");
>>>
>>>     ctx->Const.ContextFlags = 0;
>>>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h
>>> b/src/mesa/drivers/dri/i965/brw_context.h
>>> index c566bba..8bfc54a 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>>> @@ -964,6 +964,7 @@ struct brw_context
>>>     bool always_flush_cache;
>>>     bool disable_throttling;
>>>     bool precompile;
>>> +   bool accurate_derivative;
>>>
>>>     driOptionCache optionCache;
>>>     /** @} */
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>> index bfb3d33..69aeab1 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct
>>> brw_reg dst, struct brw_reg src
>>>   *
>>>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>>>   *
>>> - * and we're trying to produce:
>>> + * Ideally, we want to produce:
>>>   *
>>>   *           DDX                     DDY
>>>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
>>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct
>>> brw_reg dst, struct brw_reg src
>>>   *
>>>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same
>>> result
>>>   * for each pair, and vertstride = 2 jumps us 2 elements after
>>> processing a
>>> - * pair. But for DDY, it's harder, as we want to produce the pairs
>>> swizzled
>>> - * between each other.  We could probably do it like ddx and swizzle the
>>> right
>>> - * order later, but bail for now and just produce
>>> + * pair.  But the ideal approximation of DDX may impose a huge
>>> performance
>>> + * cost on sample_d.  As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>>> + *
>>> + * For DDY, it's harder, as we want to produce the pairs swizzled
>>> between each
>>> + * other.  We could probably do it like ddx and swizzle the right order
>>> later,
>>> + * but bail for now and just produce
>>>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>>>   */
>>>  void
>>>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct
>>> brw_reg src)
>>>  {
>>> +   unsigned vstride, width;
>>> +
>>> +   /* Produce accurate result only when requested.  We emit only one
>>> +    * instruction for either case, but the problem is the result may
>>> affect
>>> +    * how fast sample_d executes.
>>> +    *
>>> +    * Since the performance difference is only observed on Haswell,
>>> ignore the
>>> +    * hints on other GENs for now.
>>> +    */
>>> +   if (!brw->is_haswell ||
>>> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>>> +       brw->accurate_derivative) {
>>> +      vstride = BRW_VERTICAL_STRIDE_2;
>>> +      width = BRW_WIDTH_2;
>>> +   }
>>> +   else {
>>> +      vstride = BRW_VERTICAL_STRIDE_4;
>>> +      width = BRW_WIDTH_4;
>>> +   }
>>> +
>>>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>>>                                  BRW_REGISTER_TYPE_F,
>>> -                                BRW_VERTICAL_STRIDE_2,
>>> -                                BRW_WIDTH_2,
>>> +                                vstride,
>>> +                                width,
>>>                                  BRW_HORIZONTAL_STRIDE_0,
>>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>>>                                  BRW_REGISTER_TYPE_F,
>>> -                                BRW_VERTICAL_STRIDE_2,
>>> -                                BRW_WIDTH_2,
>>> +                                vstride,
>>> +                                width,
>>>                                  BRW_HORIZONTAL_STRIDE_0,
>>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>>     brw_ADD(p, dst, src0, negate(src1));
>>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c
>>> b/src/mesa/drivers/dri/i965/intel_screen.c
>>> index eb6515e..ee08ffd 100644
>>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>>>     DRI_CONF_SECTION_END
>>>     DRI_CONF_SECTION_QUALITY
>>>        DRI_CONF_FORCE_S3TC_ENABLE("false")
>>> +
>>> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>>> +        DRI_CONF_DESC(en, "Perform more accurate derivative
>>> calculation")
>>> +      DRI_CONF_OPT_END
>>>     DRI_CONF_SECTION_END
>>>     DRI_CONF_SECTION_DEBUG
>>>        DRI_CONF_NO_RAST("false")
>>> --
>>> 1.8.3.1
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>>
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>


-- 
olv at LunarG.com