[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Fri Sep 13 14:15:01 PDT 2013

On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:

> From: Chia-I Wu <olv at lunarg.com>
>
> Consider only the top-left and top-right pixels to approximate DDX in a 2x2
> subspan, unless the application or the user requests a more accurate
> approximation.  This results in a less accurate approximation.  However, it
> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
> 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image quality
> difference observed.
>
> No piglit gpu.tests regressions (tested with v1)
>
> I failed to come up with an explanation for the performance difference, as
> the
> change does not affect Ivy Bridge.  If anyone has the insight, please
> kindly
> enlighten me.  Performance differences may also be observed on other games
> that call textureGrad and dFdx.
>
> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>  Update
>     comments.
>

I'm not entirely comfortable making a change that has a known negative
impact on computational accuracy (even one that leads to such an impressive
performance improvement) when we don't have any theories as to why the
performance improvement happens, or why the improvement doesn't apply to
Ivy Bridge.  In my experience, making changes to the codebase without
understanding why they improve things almost always leads to improvements
that are brittle, since it's likely that the true source of the improvement
is a coincidence that will be wiped out by some future change (or won't be
relevant to client programs other than this particular benchmark).  Having
a theory as to why the performance improvement happens would help us be
confident that we're applying the right fix under the right circumstances.

For example, here's one theory as to why we might be seeing an improvement:
perhaps Haswell's sample_d processing is smart enough to realize that when
all the gradient values within a sub-span are the same, that means that all
of the sampling for the sub-span will come from the same LOD, and that
allows it to short-cut some expensive step in the LOD calculation.  Perhaps
the same improvement isn't seen on Ivy Bridge because Ivy Bridge's sample_d
processing logic is less sophisticated, so it's unable to perform the
optimization.  If this is the case, then conditioning the optimization on
brw->is_haswell (as you've done) makes sense.

Another possible explanation for the Haswell vs Ivy Bridge difference is
that perhaps Ivy Bridge, being a lower-performing chip, has other
bottlenecks that make the optimization irrelevant for this particular
benchmark, but potentially still useful for other benchmarks.  For
instance, maybe when this benchmark executes on Ivy Bridge, the texture
that's being sampled from is located in sufficiently distant memory that
optimizing the sample_d's memory accesses makes no difference, since the
bottleneck is the speed with which the texture can be read into cache,
rather than the speed of operation of sample_d.  If this explanation is
correct, then it might be worth applying the optimization to both Ivy
Bridge and Haswell (and perhaps Sandy Bridge as well), since it might
conceivably benefit those other chips when running applications that place
less cache pressure on the chip.

Another possibile explanation is that Haswell has a bug in its sample_d
logic which causes it to be slow under some conditions, and this
lower-accuracy DDX computation happens to work around it.  If that's the
case, we might want to consider not using sample_d at all on Haswell, and
instead calculating the LOD in the shader and using sample_l instead.  If
this is the correct explanation, then that might let us have faster
performance without sacrificing DDX accuracy.

A final possible explanation for the performance improvement is that
perhaps for some reason sample_d performs more optimally when the DDX and
DDY computations have similar accuracies to each other.  Before your patch,
our computation of DDX was more accurate than DDY; your patch decreases the
accuracy of DDX to match DDY.  If this explanation is correct, then a
better solution would probably be to improve the accuracy of DDY to make it
comparable to DDX, rather than the other way around.

Before we land this patch, can we do some experiments to try to figure out
which of these explanations (if any) is correct?

>
> Signed-off-by: Chia-I Wu <olv at lunarg.com>
> ---
>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40
> ++++++++++++++++++++++++-------
>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
>  4 files changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_context.c
> b/src/mesa/drivers/dri/i965/brw_context.c
> index 4fcc9fb..1cdfb9d 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.c
> +++ b/src/mesa/drivers/dri/i965/brw_context.c
> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>     brw_draw_init( brw );
>
>     brw->precompile = driQueryOptionb(&brw->optionCache,
> "shader_precompile");
> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache,
> "accurate_derivative");
>
>     ctx->Const.ContextFlags = 0;
>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
> diff --git a/src/mesa/drivers/dri/i965/brw_context.h
> b/src/mesa/drivers/dri/i965/brw_context.h
> index c566bba..8bfc54a 100644
> --- a/src/mesa/drivers/dri/i965/brw_context.h
> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> @@ -964,6 +964,7 @@ struct brw_context
>     bool always_flush_cache;
>     bool disable_throttling;
>     bool precompile;
> +   bool accurate_derivative;
>
>     driOptionCache optionCache;
>     /** @} */
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> index bfb3d33..69aeab1 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct
> brw_reg dst, struct brw_reg src
>   *
>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>   *
> - * and we're trying to produce:
> + * Ideally, we want to produce:
>   *
>   *           DDX                     DDY
>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct
> brw_reg dst, struct brw_reg src
>   *
>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same
> result
>   * for each pair, and vertstride = 2 jumps us 2 elements after processing
> a
> - * pair. But for DDY, it's harder, as we want to produce the pairs
> swizzled
> - * between each other.  We could probably do it like ddx and swizzle the
> right
> - * order later, but bail for now and just produce
> + * pair.  But the ideal approximation of DDX may impose a huge performance
> + * cost on sample_d.  As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
> + * ss1.tl)x4) unless the app or the user requests otherwise.
> + *
> + * For DDY, it's harder, as we want to produce the pairs swizzled between
> each
> + * other.  We could probably do it like ddx and swizzle the right order
> later,
> + * but bail for now and just produce
>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>   */
>  void
>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct
> brw_reg src)
>  {
> +   unsigned vstride, width;
> +
> +   /* Produce accurate result only when requested.  We emit only one
> +    * instruction for either case, but the problem is the result may
> affect
> +    * how fast sample_d executes.
> +    *
> +    * Since the performance difference is only observed on Haswell,
> ignore the
> +    * hints on other GENs for now.
> +    */
> +   if (!brw->is_haswell ||
> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
> +       brw->accurate_derivative) {
> +      vstride = BRW_VERTICAL_STRIDE_2;
> +      width = BRW_WIDTH_2;
> +   }
> +   else {
> +      vstride = BRW_VERTICAL_STRIDE_4;
> +      width = BRW_WIDTH_4;
> +   }
> +
>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>                                  BRW_REGISTER_TYPE_F,
> -                                BRW_VERTICAL_STRIDE_2,
> -                                BRW_WIDTH_2,
> +                                vstride,
> +                                width,
>                                  BRW_HORIZONTAL_STRIDE_0,
>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>                                  BRW_REGISTER_TYPE_F,
> -                                BRW_VERTICAL_STRIDE_2,
> -                                BRW_WIDTH_2,
> +                                vstride,
> +                                width,
>                                  BRW_HORIZONTAL_STRIDE_0,
>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>     brw_ADD(p, dst, src0, negate(src1));
> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c
> b/src/mesa/drivers/dri/i965/intel_screen.c
> index eb6515e..ee08ffd 100644
> --- a/src/mesa/drivers/dri/i965/intel_screen.c
> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>     DRI_CONF_SECTION_END
>     DRI_CONF_SECTION_QUALITY
>        DRI_CONF_FORCE_S3TC_ENABLE("false")
> +
> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
> +        DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
> +      DRI_CONF_OPT_END
>     DRI_CONF_SECTION_END
>     DRI_CONF_SECTION_DEBUG
>        DRI_CONF_NO_RAST("false")
> --
> 1.8.3.1
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130913/96ceb5a0/attachment.html>