[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Paul Berry stereotype441 at gmail.com
Fri Sep 20 07:50:24 PDT 2013


On 17 September 2013 19:54, Chia-I Wu <olvaffe at gmail.com> wrote:

> Hi Paul,
>
> On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
> > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com>
> wrote:
> >> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
> >>>
> >>> From: Chia-I Wu <olv at lunarg.com>
> >>>
> >>> Consider only the top-left and top-right pixels to approximate DDX in a
> >>> 2x2 subspan, unless the application or the user requests a more accurate
> >>> approximation.  This results in a less accurate approximation.  However,
> >>> it improves the performance of Xonotic with Ultra settings by 24.3879%
> >>> +/- 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image
> >>> quality difference observed.
> >>>
> >>> No piglit gpu.tests regressions (tested with v1)
> >>>
> >>> I failed to come up with an explanation for the performance difference,
> >>> as the change does not affect Ivy Bridge.  If anyone has the insight,
> >>> please kindly enlighten me.  Performance differences may also be
> >>> observed on other games that call textureGrad and dFdx.
> >>>
> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
> >>>     Update comments.
> >>
> >>
> >> I'm not entirely comfortable making a change that has a known negative
> >> impact on computational accuracy (even one that leads to such an
> >> impressive performance improvement) when we don't have any theories as
> >> to why the performance improvement happens, or why the improvement
> >> doesn't apply to Ivy Bridge.  In my experience, making changes to the
> >> codebase without understanding why they improve things almost always
> >> leads to improvements that are brittle, since it's likely that the true
> >> source of the improvement is a coincidence that will be wiped out by
> >> some future change (or won't be relevant to client programs other than
> >> this particular benchmark).  Having a theory as to why the performance
> >> improvement happens would help us be confident that we're applying the
> >> right fix under the right circumstances.
> > That is how I feel, as I've mentioned.  I am really glad to have the
> > discussion.  I have done some experiments, actually.  It is just that
> > those experiments only tell me which theories are likely to be wrong.
> > They cannot tell me whether a theory is right.
> Do the experiments make sense to you?  What other experiments do you
> want to see conducted?
>
> It could be hard to get direct proof without knowing the internal workings.
>

Sorry for the slow reply.  We had some internal discussions with the
hardware architects about this, and it appears that the first theory is
correct: Haswell has an optimization in its sample_d processing which
allows it to assume that all pixels in a 2x2 subspan will resolve to the
same LOD provided that all the gradients in the 2x2 subspan are
sufficiently similar to each other.  There's a register called SAMPLER_MODE
which determines how similar the gradients have to be in order to trigger
the optimization.  It can be set to values between 0 and 0x1f, where 0 (the
default) means "only trigger the optimization if the gradients are exactly
equal" and 0x1f means "trigger the optimization as frequently as
possible".  Obviously triggering the optimization more often reduces the
quality of the rendered output slightly, because it forces all pixels
within a 2x2 subspan to sample from the same LOD.
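
To make the idea concrete, here is a rough C sketch of the per-subspan
decision we believe the sampler is making (purely illustrative; the helper
name and the way SAMPLER_MODE maps to a tolerance are my guesses, not
documented behavior):

#include <math.h>
#include <stdbool.h>

/* Illustrative only: decide whether one LOD can be shared by all four
 * pixels of a 2x2 subspan.  A SAMPLER_MODE of 0 would correspond to
 * tolerance == 0 ("gradients must be exactly equal"); 0x1f to the
 * loosest tolerance the hardware allows.
 */
static bool
subspan_can_share_lod(const float ddx[4], const float ddy[4], float tolerance)
{
   for (int i = 1; i < 4; i++) {
      if (fabsf(ddx[i] - ddx[0]) > tolerance ||
          fabsf(ddy[i] - ddy[0]) > tolerance)
         return false;
   }
   return true;
}

When that test passes, the expensive per-pixel LOD computation presumably
collapses to a single computation for the whole subspan, which would explain
the sample_d speed-up.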

We believe that setting this register to 0x1f should produce an equivalent
speed-up to your patch, without sacrificing the quality of d/dx when it is
used for other (non-sample_d) purposes.  This approach would have the
additional advantage that the benefit would apply to any shader that uses
the sample_d message, regardless of whether or not that shader uses d/dx
and d/dy to compute its gradients.

Would you mind trying this register to see if it produces an equivalent
performance benefit in both your micro-benchmark and Xonotic with Ultra
settings?  The register is located at address 07028h in register space MMIO
0/2/0.  When setting it, the upper 16 bits are a write mask, so to set the
register to 0 you would store 0x001f0000, and to set it to 0x1f you would
store 0x001f001f.
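
In other words, the 32-bit value to store is just the 5-bit field's write
mask in the upper half and the new field value in the lower half, something
like this (the macro and function names below are mine, only for
illustration):

#include <stdint.h>

#define HSW_SAMPLER_MODE_REG   0x7028  /* MMIO 0/2/0, per the description above */
#define SAMPLER_MODE_FIELD     0x1f    /* the field occupies the low 5 bits */

/* Build the value to write: the upper 16 bits are the write mask,
 * the lower 16 bits carry the new SAMPLER_MODE value.
 */
static uint32_t
sampler_mode_write_value(uint32_t mode)
{
   return (SAMPLER_MODE_FIELD << 16) | (mode & SAMPLER_MODE_FIELD);
}

/* sampler_mode_write_value(0x00) == 0x001f0000  -> exact-match only (default)
 * sampler_mode_write_value(0x1f) == 0x001f001f  -> trigger the optimization
 *                                                  as often as possible
 */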

Since the SAMPLER_MODE setting allows us to trade off quality vs
performance, we're also interested to know whether a value less than 0x1f
is sufficient to produce the performance improvement in Xonotic--it would
be nice if we could find a "sweet spot" for this setting that produces the
performance improvement we need without sacrificing too much quality.

Finally, do you have any ability to see whether the Windows driver sets
this register, and if so what it sets it to?  That would provide some nice
confirmation that we aren't barking up the wrong tree here.


As a follow-up task, I'm planning to write a patch that improves the
quality of our d/dy calculation to be comparable to d/dx.  Based on our
current understanding of what's going on, I suspect that my patch may have
a slight effect on the SAMPLER_MODE sweet spot.  I'll try to get that patch
out today, and I'll Cc you so that you can try it out.
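
For reference, in terms of the subspan layout used in the brw_fs_emit.cpp
comment (tl/tr on the top row, bl/br on the bottom row), the difference
between the coarse and the per-pixel derivatives is roughly the following
(plain C just to illustrate the arithmetic, not driver code):

struct subspan { float tl, tr, bl, br; };

/* Coarse DDX (your patch): the top-row difference is replicated across
 * the whole subspan. */
static void
ddx_coarse(const struct subspan *v, struct subspan *out)
{
   out->tl = out->tr = out->bl = out->br = v->tr - v->tl;
}

/* Per-pixel DDX (what Mesa does today): each row gets its own difference. */
static void
ddx_fine(const struct subspan *v, struct subspan *out)
{
   out->tl = out->tr = v->tr - v->tl;
   out->bl = out->br = v->br - v->bl;
}

/* Coarse DDY (what Mesa does today): the left-column difference is
 * replicated across the whole subspan. */
static void
ddy_coarse(const struct subspan *v, struct subspan *out)
{
   out->tl = out->tr = out->bl = out->br = v->tl - v->bl;
}

/* Per-pixel DDY (what the follow-up patch aims for): each column gets
 * its own difference, analogous to ddx_fine(). */
static void
ddy_fine(const struct subspan *v, struct subspan *out)
{
   out->tl = out->bl = v->tl - v->bl;
   out->tr = out->br = v->tr - v->br;
}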

Thanks so much for finding this, Chia-I, and thanks for your patience as
we've been working to track down the true explanation for the performance
improvement.


>
> >
> > So I have a micro benchmark that draws a 256x256 texture to a 512x512
> > window, using texture2D() or textureGrad().  It can also set up the
> > vertex buffer such that the texture is rotated around the z-axis by 15
> > degrees.
> >
> > On Haswell, when the texture is not rotated, rendering with
> > textureGrad() is less than 1% slower than rendering with texture2D().
> > The slowdown could be from the additional instructions to calculate
> > DDX/DDY or the extra message payload.  Whether this patch is applied
> > or not does not make any difference.
> >
> > When the texture is rotated, rendering with textureGrad() is ~3%
> > slower than rendering with texture2D() without the patch.  With the
> > patch, the difference is down to less than 1% again.
> >
> > Computing LOD in the shader results in a ~17% slowdown compared to
> > textureGrad().
> >
> > As a better way to control the result of DDX, I also hacked the driver
> > so that DDX produced the values I specified for each pixel.  When not
> > all pixels in the subspan have the same gradient, rendering is ~6%
> > slower compared to when all pixels in the subspan have the same
> > gradient.
> >
> > The weird thing is, in SIMD8 mode, two subspans are processed at the
> > same time.  When all pixels in one of the subspans have the same
> > gradient, whether the pixels in the other subspan have the same
> > gradient or not does not matter.
> >
> > As for Ivy Bridge, rendering with textureGrad() is always
> > significantly slower than rendering with texture2D().  Computing LOD
> > in the shader results in another ~4% slowdown compared to
> > textureGrad().
> >
> >>
> >> For example, here's one theory as to why we might be seeing an
> >> improvement: perhaps Haswell's sample_d processing is smart enough to
> >> realize that when all the gradient values within a sub-span are the
> >> same, that means that all of the sampling for the sub-span will come
> >> from the same LOD, and that allows it to short-cut some expensive step
> >> in the LOD calculation.  Perhaps the same improvement isn't seen on Ivy
> >> Bridge because Ivy Bridge's sample_d processing logic is less
> >> sophisticated, so it's unable to perform the optimization.  If this is
> >> the case, then conditioning the optimization on brw->is_haswell (as
> >> you've done) makes sense.
> > This is also the theory I have, and my experiments could not rule it
> > out.  The question I have is, if the LODs of all pixels were calculated
> > in parallel, could the shortcut help this much?  I don't have enough
> > hardware knowledge to know the answer, or even to know whether that is
> > the right question to ask.
> >
> >>
> >> Another possible explanation for the Haswell vs Ivy Bridge difference
> >> is that perhaps Ivy Bridge, being a lower-performing chip, has other
> >> bottlenecks that make the optimization irrelevant for this particular
> >> benchmark, but potentially still useful for other benchmarks.  For
> >> instance, maybe when this benchmark executes on Ivy Bridge, the texture
> >> that's being sampled from is located in sufficiently distant memory
> >> that optimizing the sample_d's memory accesses makes no difference,
> >> since the bottleneck is the speed with which the texture can be read
> >> into cache, rather than the speed of operation of sample_d.  If this
> >> explanation is correct, then it might be worth applying the
> >> optimization to both Ivy Bridge and Haswell (and perhaps Sandy Bridge
> >> as well), since it might conceivably benefit those other chips when
> >> running applications that place less cache pressure on the chip.
> > In the micro benchmark, replacing texture2D() by textureGrad() alone
> > on Ivy Bridge makes the performance difference.  The cache should work
> > equally well for either message.
> >>
> >> Another possible explanation is that Haswell has a bug in its sample_d
> >> logic which causes it to be slow under some conditions, and this
> >> lower-accuracy DDX computation happens to work around it.  If that's
> >> the case, we might want to consider not using sample_d at all on
> >> Haswell, and instead calculating the LOD in the shader and using
> >> sample_l.  If this is the correct explanation, then that might let us
> >> have faster performance without sacrificing DDX accuracy.
> > Calculating the LOD in the shader is slower on Haswell (and Ivy Bridge
> > too in the micro benchmark).
> >
> >>
> >> A final possible explanation for the performance improvement is that
> >> perhaps for some reason sample_d performs more optimally when the DDX
> >> and DDY computations have similar accuracies to each other.  Before
> >> your patch, our computation of DDX was more accurate than DDY; your
> >> patch decreases the accuracy of DDX to match DDY.  If this explanation
> >> is correct, then a better solution would probably be to improve the
> >> accuracy of DDY to make it comparable to DDX, rather than the other way
> >> around.
> > I was able to hack the driver to specify the result of DDX
> > arbitrarily.  The only thing that matters is whether all pixels have
> > the same gradient.  Whether the accuracy matches that of DDY does not
> > seem to matter.
> >
> >>
> >> Before we land this patch, can we do some experiments to try to figure
> >> out which of these explanations (if any) is correct?
> > The first theory is what I have.  But I cannot imagine what the
> > hardware does to make the difference: is calculating 8 LODs eight times
> > slower than calculating one?
> >
> >
> > Attached is the micro benchmark I added to mesa/demos.
> >>
> >>>
> >>>
> >>> Signed-off-by: Chia-I Wu <olv at lunarg.com>
> >>> ---
> >>>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
> >>>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
> >>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 ++++++++++++++++++++++++-------
> >>>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
> >>>  4 files changed, 38 insertions(+), 8 deletions(-)
> >>>
> >>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
> >>> index 4fcc9fb..1cdfb9d 100644
> >>> --- a/src/mesa/drivers/dri/i965/brw_context.c
> >>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
> >>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
> >>>     brw_draw_init( brw );
> >>>
> >>>     brw->precompile = driQueryOptionb(&brw->optionCache, "shader_precompile");
> >>> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache, "accurate_derivative");
> >>>
> >>>     ctx->Const.ContextFlags = 0;
> >>>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
> >>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
> >>> index c566bba..8bfc54a 100644
> >>> --- a/src/mesa/drivers/dri/i965/brw_context.h
> >>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
> >>> @@ -964,6 +964,7 @@ struct brw_context
> >>>     bool always_flush_cache;
> >>>     bool disable_throttling;
> >>>     bool precompile;
> >>> +   bool accurate_derivative;
> >>>
> >>>     driOptionCache optionCache;
> >>>     /** @} */
> >>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> >>> index bfb3d33..69aeab1 100644
> >>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> >>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
> >>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
> >>>   *
> >>>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
> >>>   *
> >>> - * and we're trying to produce:
> >>> + * Ideally, we want to produce:
> >>>   *
> >>>   *           DDX                     DDY
> >>>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
> >>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
> >>>   *
> >>>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same result
> >>>   * for each pair, and vertstride = 2 jumps us 2 elements after processing a
> >>> - * pair. But for DDY, it's harder, as we want to produce the pairs swizzled
> >>> - * between each other.  We could probably do it like ddx and swizzle the right
> >>> - * order later, but bail for now and just produce
> >>> + * pair.  But the ideal approximation of DDX may impose a huge performance
> >>> + * cost on sample_d.  As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr-
> >>> + * ss1.tl)x4) unless the app or the user requests otherwise.
> >>> + *
> >>> + * For DDY, it's harder, as we want to produce the pairs swizzled between each
> >>> + * other.  We could probably do it like ddx and swizzle the right order later,
> >>> + * but bail for now and just produce
> >>>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
> >>>   */
> >>>  void
> >>>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct brw_reg src)
> >>>  {
> >>> +   unsigned vstride, width;
> >>> +
> >>> +   /* Produce accurate result only when requested.  We emit only one
> >>> +    * instruction for either case, but the problem is the result may affect
> >>> +    * how fast sample_d executes.
> >>> +    *
> >>> +    * Since the performance difference is only observed on Haswell, ignore the
> >>> +    * hints on other GENs for now.
> >>> +    */
> >>> +   if (!brw->is_haswell ||
> >>> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
> >>> +       brw->accurate_derivative) {
> >>> +      vstride = BRW_VERTICAL_STRIDE_2;
> >>> +      width = BRW_WIDTH_2;
> >>> +   }
> >>> +   else {
> >>> +      vstride = BRW_VERTICAL_STRIDE_4;
> >>> +      width = BRW_WIDTH_4;
> >>> +   }
> >>> +
> >>>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
> >>>                                  BRW_REGISTER_TYPE_F,
> >>> -                                BRW_VERTICAL_STRIDE_2,
> >>> -                                BRW_WIDTH_2,
> >>> +                                vstride,
> >>> +                                width,
> >>>                                  BRW_HORIZONTAL_STRIDE_0,
> >>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
> >>>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
> >>>                                  BRW_REGISTER_TYPE_F,
> >>> -                                BRW_VERTICAL_STRIDE_2,
> >>> -                                BRW_WIDTH_2,
> >>> +                                vstride,
> >>> +                                width,
> >>>                                  BRW_HORIZONTAL_STRIDE_0,
> >>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
> >>>     brw_ADD(p, dst, src0, negate(src1));
> >>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c
> >>> index eb6515e..ee08ffd 100644
> >>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
> >>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
> >>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
> >>>     DRI_CONF_SECTION_END
> >>>     DRI_CONF_SECTION_QUALITY
> >>>        DRI_CONF_FORCE_S3TC_ENABLE("false")
> >>> +
> >>> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
> >>> +        DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
> >>> +      DRI_CONF_OPT_END
> >>>     DRI_CONF_SECTION_END
> >>>     DRI_CONF_SECTION_DEBUG
> >>>        DRI_CONF_NO_RAST("false")
> >>> --
> >>> 1.8.3.1
> >>>
> >>> _______________________________________________
> >>> mesa-dev mailing list
> >>> mesa-dev at lists.freedesktop.org
> >>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>
> >>
> >
> >
> >
> > --
> > olv at LunarG.com
>
>
>
> --
> olv at LunarG.com
>