[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row
Chia-I Wu
olvaffe at gmail.com
Mon Sep 16 01:31:15 PDT 2013
On Mon, Sep 16, 2013 at 4:12 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
> On Mon, Sep 16, 2013 at 3:50 AM, Mark Mueller <markkmueller at gmail.com> wrote:
>>
>>
>>
>> On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype441 at gmail.com> wrote:
>>>
>>> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>>>>
>>>> From: Chia-I Wu <olv at lunarg.com>
>>>>
>>>> Consider only the top-left and top-right pixels to approximate DDX in a
>>>> 2x2
>>>> subspan, unless the application or the user requests a more accurate
>>>> approximation. This results in a less accurate approximation. However,
>>>> it
>>>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
>>>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality
>>>> difference observed.
>>>>
>>>> No piglit gpu.tests regressions (tested with v1)
>>>>
>>>> I failed to come up with an explanation for the performance difference,
>>>> as the
>>>> change does not affect Ivy Bridge. If anyone has the insight, please
>>>> kindly
>>>> enlighten me. Performance differences may also be observed on other
>>>> games
>>>> that call textureGrad and dFdx.
>>>>
>>>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>>> Update
>>>> comments.
>>>
>>>
>>> I'm not entirely comfortable making a change that has a known negative
>>> impact on computational accuracy (even one that leads to such an impressive
>>> performance improvement) when we don't have any theories as to why the
>>> performance improvement happens, or why the improvement doesn't apply to Ivy
>>> Bridge. In my experience, making changes to the codebase without
>>> understanding why they improve things almost always leads to improvements
>>> that are brittle, since it's likely that the true source of the improvement
>>> is a coincidence that will be wiped out by some future change (or won't be
>>> relevant to client programs other than this particular benchmark). Having a
>>> theory as to why the performance improvement happens would help us be
>>> confident that we're applying the right fix under the right circumstances.
>> There's another angle to approach this and that is to develop a simple test
>> case that will show the different results across a range of computational
>> accuracy and run the test on proprietary drivers for the same hardware to
>> determine what settings they are using.
> Yes, I have a little test. On Windows, rendering with texture2D() or
> textureGrad() does not have a noticeable impact on the performance.
> But I am not sure how to change dFdx() accuracy from the shaders. On
> Linux, I did that by modifying the driver.
>
>>
>>>
>>>
>>> For example, here's one theory as to why we might be seeing an
>>> improvement: perhaps Haswell's sample_d processing is smart enough to
>>> realize that when all the gradient values within a sub-span are the same,
>>> that means that all of the sampling for the sub-span will come from the same
>>> LOD, and that allows it to short-cut some expensive step in the LOD
>>> calculation. Perhaps the same improvement isn't seen on Ivy Bridge because
>>> Ivy Bridge's sample_d processing logic is less sophisticated, so it's unable
>>> to perform the optimization. If this is the case, then conditioning the
>>> optimization on brw->is_haswell (as you've done) makes sense.
>>>
>>> Another possible explanation for the Haswell vs Ivy Bridge difference is
>>> that perhaps Ivy Bridge, being a lower-performing chip, has other
>>> bottlenecks that make the optimization irrelevant for this particular
>>> benchmark, but potentially still useful for other benchmarks. For instance,
>>> maybe when this benchmark executes on Ivy Bridge, the texture that's being
>>> sampled from is located in sufficiently distant memory that optimizing the
>>> sample_d's memory accesses makes no difference, since the bottleneck is the
>>> speed with which the texture can be read into cache, rather than the speed
>>> of operation of sample_d. If this explanation is correct, then it might be
>>> worth applying the optimization to both Ivy Bridge and Haswell (and perhaps
>>> Sandy Bridge as well), since it might conceivably benefit those other chips
>>> when running applications that place less cache pressure on the chip.
>>
>>
>> This scenario is where I'd place my bets, especially given that the numbers
>> are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail as
>> is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra and
>> ultimate levels at medium and high resolutions, the results were all
>> essentially the same at comparable resolutions and quality levels.
> Isn't Bay Trail based on Ivy Bridge?
For Bay Trail, this might help you
http://lists.freedesktop.org/archives/mesa-dev/2013-September/044288.html
if you are interested.
>>
>> I don't see any justification to tie this change to just Haswell hardware.
>> There's all sorts of reasons why doing that sounds like a big mistake. In
>> fact, another _explanation_ to add to your list is maybe there's another
>> is_haswell test elsewhere in the driver that is responsible for the
>> performance anomaly.
> I had a look at all 'if (brw->is_haswell)' paths but nothing special
> came out to me. It could as well be some place missing a 'if
> (brw->is_haswell)'.
>
>>
>>>
>>> Another possibile explanation is that Haswell has a bug in its sample_d
>>> logic which causes it to be slow under some conditions, and this
>>> lower-accuracy DDX computation happens to work around it. If that's the
>>> case, we might want to consider not using sample_d at all on Haswell, and
>>> instead calculating the LOD in the shader and using sample_l instead. If
>>> this is the correct explanation, then that might let us have faster
>>> performance without sacrificing DDX accuracy.
>>>
>>> A final possible explanation for the performance improvement is that
>>> perhaps for some reason sample_d performs more optimally when the DDX and
>>> DDY computations have similar accuracies to each other. Before your patch,
>>> our computation of DDX was more accurate than DDY; your patch decreases the
>>> accuracy of DDX to match DDY. If this explanation is correct, then a better
>>> solution would probably be to improve the accuracy of DDY to make it
>>> comparable to DDX, rather than the other way around.
>>>
>>> Before we land this patch, can we do some experiments to try to figure out
>>> which of these explanations (if any) is correct?
>>>
>>>>
>>>>
>>>> Signed-off-by: Chia-I Wu <olv at lunarg.com>
>>>> ---
>>>> src/mesa/drivers/dri/i965/brw_context.c | 1 +
>>>> src/mesa/drivers/dri/i965/brw_context.h | 1 +
>>>> src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40
>>>> ++++++++++++++++++++++++-------
>>>> src/mesa/drivers/dri/i965/intel_screen.c | 4 ++++
>>>> 4 files changed, 38 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c
>>>> b/src/mesa/drivers/dri/i965/brw_context.c
>>>> index 4fcc9fb..1cdfb9d 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>>>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>>>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>>>> brw_draw_init( brw );
>>>>
>>>> brw->precompile = driQueryOptionb(&brw->optionCache,
>>>> "shader_precompile");
>>>> + brw->accurate_derivative = driQueryOptionb(&brw->optionCache,
>>>> "accurate_derivative");
>>>>
>>>> ctx->Const.ContextFlags = 0;
>>>> if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h
>>>> b/src/mesa/drivers/dri/i965/brw_context.h
>>>> index c566bba..8bfc54a 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>>>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>>>> @@ -964,6 +964,7 @@ struct brw_context
>>>> bool always_flush_cache;
>>>> bool disable_throttling;
>>>> bool precompile;
>>>> + bool accurate_derivative;
>>>>
>>>> driOptionCache optionCache;
>>>> /** @} */
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>>> index bfb3d33..69aeab1 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>>>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct
>>>> brw_reg dst, struct brw_reg src
>>>> *
>>>> * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>>>> *
>>>> - * and we're trying to produce:
>>>> + * Ideally, we want to produce:
>>>> *
>>>> * DDX DDY
>>>> * dst: (ss0.tr - ss0.tl) (ss0.tl - ss0.bl)
>>>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct
>>>> brw_reg dst, struct brw_reg src
>>>> *
>>>> * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same
>>>> result
>>>> * for each pair, and vertstride = 2 jumps us 2 elements after
>>>> processing a
>>>> - * pair. But for DDY, it's harder, as we want to produce the pairs
>>>> swizzled
>>>> - * between each other. We could probably do it like ddx and swizzle the
>>>> right
>>>> - * order later, but bail for now and just produce
>>>> + * pair. But the ideal approximation of DDX may impose a huge
>>>> performance
>>>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>>>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>>>> + *
>>>> + * For DDY, it's harder, as we want to produce the pairs swizzled
>>>> between each
>>>> + * other. We could probably do it like ddx and swizzle the right order
>>>> later,
>>>> + * but bail for now and just produce
>>>> * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>>>> */
>>>> void
>>>> fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct
>>>> brw_reg src)
>>>> {
>>>> + unsigned vstride, width;
>>>> +
>>>> + /* Produce accurate result only when requested. We emit only one
>>>> + * instruction for either case, but the problem is the result may
>>>> affect
>>>> + * how fast sample_d executes.
>>>> + *
>>>> + * Since the performance difference is only observed on Haswell,
>>>> ignore the
>>>> + * hints on other GENs for now.
>>>> + */
>>>> + if (!brw->is_haswell ||
>>>> + brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>>>> + brw->accurate_derivative) {
>>>> + vstride = BRW_VERTICAL_STRIDE_2;
>>>> + width = BRW_WIDTH_2;
>>>> + }
>>>> + else {
>>>> + vstride = BRW_VERTICAL_STRIDE_4;
>>>> + width = BRW_WIDTH_4;
>>>> + }
>>>> +
>>>> struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>>>> BRW_REGISTER_TYPE_F,
>>>> - BRW_VERTICAL_STRIDE_2,
>>>> - BRW_WIDTH_2,
>>>> + vstride,
>>>> + width,
>>>> BRW_HORIZONTAL_STRIDE_0,
>>>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>>> struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>>>> BRW_REGISTER_TYPE_F,
>>>> - BRW_VERTICAL_STRIDE_2,
>>>> - BRW_WIDTH_2,
>>>> + vstride,
>>>> + width,
>>>> BRW_HORIZONTAL_STRIDE_0,
>>>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>>> brw_ADD(p, dst, src0, negate(src1));
>>>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c
>>>> b/src/mesa/drivers/dri/i965/intel_screen.c
>>>> index eb6515e..ee08ffd 100644
>>>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>>>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>>>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>>>> DRI_CONF_SECTION_END
>>>> DRI_CONF_SECTION_QUALITY
>>>> DRI_CONF_FORCE_S3TC_ENABLE("false")
>>>> +
>>>> + DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>>>> + DRI_CONF_DESC(en, "Perform more accurate derivative
>>>> calculation")
>>>> + DRI_CONF_OPT_END
>>>> DRI_CONF_SECTION_END
>>>> DRI_CONF_SECTION_DEBUG
>>>> DRI_CONF_NO_RAST("false")
>>>> --
>>>> 1.8.3.1
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>>
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>
>
>
>
> --
> olv at LunarG.com
--
olv at LunarG.com
More information about the mesa-dev
mailing list