[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row
Mark Mueller
markkmueller at gmail.com
Sun Sep 15 12:50:33 PDT 2013
On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>
>> From: Chia-I Wu <olv at lunarg.com>
>>
>> Consider only the top-left and top-right pixels to approximate DDX in a
>> 2x2
>> subspan, unless the application or the user requests a more accurate
>> approximation. This results in a less accurate approximation. However,
>> it
>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality
>> difference observed.
>>
>> No piglit gpu.tests regressions (tested with v1)
>>
>> I failed to come up with an explanation for the performance difference,
>> as the
>> change does not affect Ivy Bridge. If anyone has the insight, please
>> kindly
>> enlighten me. Performance differences may also be observed on other games
>> that call textureGrad and dFdx.
>>
>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>> Update
>> comments.
>>
>
> I'm not entirely comfortable making a change that has a known negative
> impact on computational accuracy (even one that leads to such an impressive
> performance improvement) when we don't have any theories as to why the
> performance improvement happens, or why the improvement doesn't apply to
> Ivy Bridge. In my experience, making changes to the codebase without
> understanding why they improve things almost always leads to improvements
> that are brittle, since it's likely that the true source of the improvement
> is a coincidence that will be wiped out by some future change (or won't be
> relevant to client programs other than this particular benchmark). Having
> a theory as to why the performance improvement happens would help us be
> confident that we're applying the right fix under the right circumstances.
>
There's another angle to approach this and that is to develop a simple test
case that will show the different results across a range of computational
accuracy and run the test on proprietary drivers for the same hardware to
determine what settings they are using.
>
> For example, here's one theory as to why we might be seeing an
> improvement: perhaps Haswell's sample_d processing is smart enough to
> realize that when all the gradient values within a sub-span are the same,
> that means that all of the sampling for the sub-span will come from the
> same LOD, and that allows it to short-cut some expensive step in the LOD
> calculation. Perhaps the same improvement isn't seen on Ivy Bridge because
> Ivy Bridge's sample_d processing logic is less sophisticated, so it's
> unable to perform the optimization. If this is the case, then conditioning
> the optimization on brw->is_haswell (as you've done) makes sense.
>
> Another possible explanation for the Haswell vs Ivy Bridge difference is
> that perhaps Ivy Bridge, being a lower-performing chip, has other
> bottlenecks that make the optimization irrelevant for this particular
> benchmark, but potentially still useful for other benchmarks. For
> instance, maybe when this benchmark executes on Ivy Bridge, the texture
> that's being sampled from is located in sufficiently distant memory that
> optimizing the sample_d's memory accesses makes no difference, since the
> bottleneck is the speed with which the texture can be read into cache,
> rather than the speed of operation of sample_d. If this explanation is
> correct, then it might be worth applying the optimization to both Ivy
> Bridge and Haswell (and perhaps Sandy Bridge as well), since it might
> conceivably benefit those other chips when running applications that place
> less cache pressure on the chip.
>
This scenario is where I'd place my bets, especially given that the numbers
are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail
as is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra
and ultimate levels at medium and high resolutions, the results were all
essentially the same at comparable resolutions and quality levels.
I don't see any justification to tie this change to just Haswell hardware.
There's all sorts of reasons why doing that sounds like a big mistake. In
fact, another _explanation_ to add to your list is maybe there's another
is_haswell test elsewhere in the driver that is responsible for the
performance anomaly.
> Another possibile explanation is that Haswell has a bug in its sample_d
> logic which causes it to be slow under some conditions, and this
> lower-accuracy DDX computation happens to work around it. If that's the
> case, we might want to consider not using sample_d at all on Haswell, and
> instead calculating the LOD in the shader and using sample_l instead. If
> this is the correct explanation, then that might let us have faster
> performance without sacrificing DDX accuracy.
>
> A final possible explanation for the performance improvement is that
> perhaps for some reason sample_d performs more optimally when the DDX and
> DDY computations have similar accuracies to each other. Before your patch,
> our computation of DDX was more accurate than DDY; your patch decreases the
> accuracy of DDX to match DDY. If this explanation is correct, then a
> better solution would probably be to improve the accuracy of DDY to make it
> comparable to DDX, rather than the other way around.
>
> Before we land this patch, can we do some experiments to try to figure out
> which of these explanations (if any) is correct?
>
>
>>
>> Signed-off-by: Chia-I Wu <olv at lunarg.com>
>> ---
>> src/mesa/drivers/dri/i965/brw_context.c | 1 +
>> src/mesa/drivers/dri/i965/brw_context.h | 1 +
>> src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40
>> ++++++++++++++++++++++++-------
>> src/mesa/drivers/dri/i965/intel_screen.c | 4 ++++
>> 4 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c
>> b/src/mesa/drivers/dri/i965/brw_context.c
>> index 4fcc9fb..1cdfb9d 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>> brw_draw_init( brw );
>>
>> brw->precompile = driQueryOptionb(&brw->optionCache,
>> "shader_precompile");
>> + brw->accurate_derivative = driQueryOptionb(&brw->optionCache,
>> "accurate_derivative");
>>
>> ctx->Const.ContextFlags = 0;
>> if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h
>> b/src/mesa/drivers/dri/i965/brw_context.h
>> index c566bba..8bfc54a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -964,6 +964,7 @@ struct brw_context
>> bool always_flush_cache;
>> bool disable_throttling;
>> bool precompile;
>> + bool accurate_derivative;
>>
>> driOptionCache optionCache;
>> /** @} */
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> index bfb3d33..69aeab1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct
>> brw_reg dst, struct brw_reg src
>> *
>> * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>> *
>> - * and we're trying to produce:
>> + * Ideally, we want to produce:
>> *
>> * DDX DDY
>> * dst: (ss0.tr - ss0.tl) (ss0.tl - ss0.bl)
>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct
>> brw_reg dst, struct brw_reg src
>> *
>> * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same
>> result
>> * for each pair, and vertstride = 2 jumps us 2 elements after
>> processing a
>> - * pair. But for DDY, it's harder, as we want to produce the pairs
>> swizzled
>> - * between each other. We could probably do it like ddx and swizzle the
>> right
>> - * order later, but bail for now and just produce
>> + * pair. But the ideal approximation of DDX may impose a huge
>> performance
>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>> + *
>> + * For DDY, it's harder, as we want to produce the pairs swizzled
>> between each
>> + * other. We could probably do it like ddx and swizzle the right order
>> later,
>> + * but bail for now and just produce
>> * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>> */
>> void
>> fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct
>> brw_reg src)
>> {
>> + unsigned vstride, width;
>> +
>> + /* Produce accurate result only when requested. We emit only one
>> + * instruction for either case, but the problem is the result may
>> affect
>> + * how fast sample_d executes.
>> + *
>> + * Since the performance difference is only observed on Haswell,
>> ignore the
>> + * hints on other GENs for now.
>> + */
>> + if (!brw->is_haswell ||
>> + brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>> + brw->accurate_derivative) {
>> + vstride = BRW_VERTICAL_STRIDE_2;
>> + width = BRW_WIDTH_2;
>> + }
>> + else {
>> + vstride = BRW_VERTICAL_STRIDE_4;
>> + width = BRW_WIDTH_4;
>> + }
>> +
>> struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>> BRW_REGISTER_TYPE_F,
>> - BRW_VERTICAL_STRIDE_2,
>> - BRW_WIDTH_2,
>> + vstride,
>> + width,
>> BRW_HORIZONTAL_STRIDE_0,
>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>> struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>> BRW_REGISTER_TYPE_F,
>> - BRW_VERTICAL_STRIDE_2,
>> - BRW_WIDTH_2,
>> + vstride,
>> + width,
>> BRW_HORIZONTAL_STRIDE_0,
>> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>> brw_ADD(p, dst, src0, negate(src1));
>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c
>> b/src/mesa/drivers/dri/i965/intel_screen.c
>> index eb6515e..ee08ffd 100644
>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>> DRI_CONF_SECTION_END
>> DRI_CONF_SECTION_QUALITY
>> DRI_CONF_FORCE_S3TC_ENABLE("false")
>> +
>> + DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>> + DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
>> + DRI_CONF_OPT_END
>> DRI_CONF_SECTION_END
>> DRI_CONF_SECTION_DEBUG
>> DRI_CONF_NO_RAST("false")
>> --
>> 1.8.3.1
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130915/07e3fe11/attachment.html>
More information about the mesa-dev
mailing list