[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Chia-I Wu olvaffe at gmail.com
Mon Sep 16 00:46:51 PDT 2013


On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>>
>> From: Chia-I Wu <olv at lunarg.com>
>>
>> Consider only the top-left and top-right pixels to approximate DDX in a
>> 2x2 subspan, unless the application or the user requests a more accurate
>> approximation.  The result is less accurate, but it improves the
>> performance of Xonotic with Ultra settings by 24.3879% +/- 0.832202%
>> (at 95.0% confidence) on Haswell.  No noticeable image quality
>> difference was observed.
>>
>> No piglit gpu.tests regressions (tested with v1)
>>
>> I failed to come up with an explanation for the performance difference,
>> as the change does not affect Ivy Bridge.  If anyone has any insight,
>> please kindly enlighten me.  Performance differences may also be
>> observed in other games that call textureGrad and dFdx.
>>
>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>     Update comments.
>
>
> I'm not entirely comfortable making a change that has a known negative
> impact on computational accuracy (even one that leads to such an impressive
> performance improvement) when we don't have any theories as to why the
> performance improvement happens, or why the improvement doesn't apply to Ivy
> Bridge.  In my experience, making changes to the codebase without
> understanding why they improve things almost always leads to improvements
> that are brittle, since it's likely that the true source of the improvement
> is a coincidence that will be wiped out by some future change (or won't be
> relevant to client programs other than this particular benchmark).  Having a
> theory as to why the performance improvement happens would help us be
> confident that we're applying the right fix under the right circumstances.
That matches how I feel, as I've mentioned, and I am really glad to have
this discussion.  I have actually done some experiments.  The problem is
that those experiments can only tell me which theories are likely to be
wrong; they cannot tell me whether a theory is right.

So I have a micro benchmark that draws a 256x256 texture to a 512x512
window, using texture2D() or textureGrad().  It can also set up the
vertex buffer such that the texture is rotated around the z-axis by 15
degrees.
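To make the rotated case concrete, the texcoord gradients it produces can be sketched numerically (a Python sketch; the half-texel-per-pixel step follows from drawing 256x256 into a 512x512 window, the rest is an assumption):

```python
import math

theta = math.radians(15.0)
c, s = math.cos(theta), math.sin(theta)

# Drawing a 256x256 texture into a 512x512 window steps half a
# texel (0.5/256 in normalized coords) per window pixel.  With
# the quad rotated by 15 degrees, that step is rotated too:
dudx, dvdx = 0.5 / 256 * c, 0.5 / 256 * s
dudy, dvdy = -0.5 / 256 * s, 0.5 / 256 * c

# Rotation couples the u and v gradients: dvdx is no longer zero,
# so the sampled footprint is no longer axis-aligned.
print(dvdx != 0.0)             # True
# The footprint size itself is unchanged by the rotation.
print(math.hypot(dudx, dvdx))  # 0.001953125 (= 0.5/256)
```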

On Haswell, when the texture is not rotated, rendering with
textureGrad() is less than 1% slower than rendering with texture2D().
The slowdown could be from the additional instructions to calculate
DDX/DDY or the extra message payload.  Whether this patch is applied
or not does not make any difference.

When the texture is rotated, rendering with textureGrad() is ~3%
slower than rendering with texture2D() without the patch.  With the
patch, the difference is down to less than 1% again.

Computing the LOD in the shader results in a ~17% slowdown compared to textureGrad().

As a better way to control the result of DDX, I also hacked the driver
so that DDX produced the values I specified for each pixel.  When not
all pixels in a subspan have the same gradient, rendering is ~6%
slower compared to when all pixels in the subspan have the same
gradient.
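For reference, the two DDX variants being compared can be written out numerically for a single 2x2 subspan (a Python sketch with made-up pixel values):

```python
# A 2x2 subspan of some quantity, laid out as:
#   tl tr
#   bl br
tl, tr, bl, br = 0.0, 1.0, 0.5, 2.0  # hypothetical values

# Accurate DDX: one difference per row, replicated across that row
# (vstride=2, width=2 in the i965 encoding).
ddx_accurate = [tr - tl, tr - tl, br - bl, br - bl]

# Approximate DDX after this patch: the top-row difference
# replicated to all four pixels, i.e. (ss0.tr - ss0.tl)x4,
# matching how DDY is already computed.
ddx_approx = [tr - tl] * 4

# The two agree exactly only when both rows have the same
# horizontal gradient, i.e. tr - tl == br - bl.
print(ddx_accurate)  # [1.0, 1.0, 1.5, 1.5]
print(ddx_approx)    # [1.0, 1.0, 1.0, 1.0]
```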

The weird thing is that, in SIMD8 mode, two subspans are processed at
the same time.  When all pixels in one of the subspans have the same
gradient, it does not matter whether the pixels in the other subspan
have the same gradient or not.

As for Ivy Bridge, rendering with textureGrad() is always
significantly slower than rendering with texture2D().  Computing the
LOD in the shader results in another ~4% slowdown compared to
textureGrad().

>
> For example, here's one theory as to why we might be seeing an improvement:
> perhaps Haswell's sample_d processing is smart enough to realize that when
> all the gradient values within a sub-span are the same, that means that all
> of the sampling for the sub-span will come from the same LOD, and that
> allows it to short-cut some expensive step in the LOD calculation.  Perhaps
> the same improvement isn't seen on Ivy Bridge because Ivy Bridge's sample_d
> processing logic is less sophisticated, so it's unable to perform the
> optimization.  If this is the case, then conditioning the optimization on
> brw->is_haswell (as you've done) makes sense.
This is also the theory I have, and my experiments could not rule it
out.  The question I have is: if the LODs of all pixels were calculated
in parallel, could the shortcut help this much?  I don't have enough
hardware knowledge to answer that, or even to know whether it is the
right question to ask.

>
> Another possible explanation for the Haswell vs Ivy Bridge difference is
> that perhaps Ivy Bridge, being a lower-performing chip, has other
> bottlenecks that make the optimization irrelevant for this particular
> benchmark, but potentially still useful for other benchmarks.  For instance,
> maybe when this benchmark executes on Ivy Bridge, the texture that's being
> sampled from is located in sufficiently distant memory that optimizing the
> sample_d's memory accesses makes no difference, since the bottleneck is the
> speed with which the texture can be read into cache, rather than the speed
> of operation of sample_d.  If this explanation is correct, then it might be
> worth applying the optimization to both Ivy Bridge and Haswell (and perhaps
> Sandy Bridge as well), since it might conceivably benefit those other chips
> when running applications that place less cache pressure on the chip.
In the micro benchmark, replacing texture2D() with textureGrad() alone
on Ivy Bridge makes the performance difference.  The cache should work
equally well for either message.
>
> Another possible explanation is that Haswell has a bug in its sample_d
> logic which causes it to be slow under some conditions, and this
> lower-accuracy DDX computation happens to work around it.  If that's the
> case, we might want to consider not using sample_d at all on Haswell, and
> instead calculating the LOD in the shader and using sample_l instead.  If
> this is the correct explanation, then that might let us have faster
> performance without sacrificing DDX accuracy.
Calculating the LOD in the shader is slower on Haswell (and on Ivy
Bridge too, in the micro benchmark).

>
> A final possible explanation for the performance improvement is that perhaps
> for some reason sample_d performs more optimally when the DDX and DDY
> computations have similar accuracies to each other.  Before your patch, our
> computation of DDX was more accurate than DDY; your patch decreases the
> accuracy of DDX to match DDY.  If this explanation is correct, then a better
> solution would probably be to improve the accuracy of DDY to make it
> comparable to DDX, rather than the other way around.
I was able to hack the driver to specify the result of DDX
arbitrarily.  The only thing that matters is whether all pixels have
the same gradient; whether the accuracy matches that of DDY does not
seem to matter.

>
> Before we land this patch, can we do some experiments to try to figure out
> which of these explanations (if any) is correct?
The first theory is the one I have.  But I cannot imagine what the
hardware does to make such a difference: is calculating 8 LODs eight
times slower than calculating one?
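For context, the per-pixel LOD that sample_d derives from the gradients follows the usual GL formula, sketched here in Python (an approximation of the spec's rho computation, not Haswell's actual implementation):

```python
import math

def lod_from_gradients(dudx, dvdx, dudy, dvdy, tex_w, tex_h):
    """Per-pixel LOD from texcoord gradients, per the standard
    GL scale-factor (rho) formula."""
    # Footprint extent along screen x and y, in texels.
    rho_x = math.hypot(dudx * tex_w, dvdx * tex_h)
    rho_y = math.hypot(dudy * tex_w, dvdy * tex_h)
    return math.log2(max(rho_x, rho_y, 1e-30))

# A 256x256 texture drawn into a 512x512 window: each window pixel
# steps half a texel, so the LOD is -1 (magnification).
print(lod_from_gradients(0.5/256, 0.0, 0.0, 0.5/256, 256, 256))  # -1.0
```

If the hardware really does share one such computation across a subspan whenever the four gradients match, that would be the shortcut the first theory needs; what I cannot judge is whether skipping three of four of these is worth ~3% of a frame.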


Attached is the micro benchmark I added to mesa/demos.
>
>>
>>
>> Signed-off-by: Chia-I Wu <olv at lunarg.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
>>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 ++++++++++++++++++++++++-------
>>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
>>  4 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c
>> b/src/mesa/drivers/dri/i965/brw_context.c
>> index 4fcc9fb..1cdfb9d 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>>     brw_draw_init( brw );
>>
>>     brw->precompile = driQueryOptionb(&brw->optionCache,
>> "shader_precompile");
>> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache,
>> "accurate_derivative");
>>
>>     ctx->Const.ContextFlags = 0;
>>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h
>> b/src/mesa/drivers/dri/i965/brw_context.h
>> index c566bba..8bfc54a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -964,6 +964,7 @@ struct brw_context
>>     bool always_flush_cache;
>>     bool disable_throttling;
>>     bool precompile;
>> +   bool accurate_derivative;
>>
>>     driOptionCache optionCache;
>>     /** @} */
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> index bfb3d33..69aeab1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct
>> brw_reg dst, struct brw_reg src
>>   *
>>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>>   *
>> - * and we're trying to produce:
>> + * Ideally, we want to produce:
>>   *
>>   *           DDX                     DDY
>>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct
>> brw_reg dst, struct brw_reg src
>>   *
>>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same
>> result
>>   * for each pair, and vertstride = 2 jumps us 2 elements after processing
>> a
>> - * pair. But for DDY, it's harder, as we want to produce the pairs
>> swizzled
>> - * between each other.  We could probably do it like ddx and swizzle the
>> right
>> - * order later, but bail for now and just produce
>> + * pair.  But the ideal approximation of DDX may impose a huge
>> performance
>> + * cost on sample_d.  As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>> + *
>> + * For DDY, it's harder, as we want to produce the pairs swizzled between
>> each
>> + * other.  We could probably do it like ddx and swizzle the right order
>> later,
>> + * but bail for now and just produce
>>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>>   */
>>  void
>>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct
>> brw_reg src)
>>  {
>> +   unsigned vstride, width;
>> +
>> +   /* Produce accurate result only when requested.  We emit only one
>> +    * instruction for either case, but the problem is the result may
>> affect
>> +    * how fast sample_d executes.
>> +    *
>> +    * Since the performance difference is only observed on Haswell,
>> ignore the
>> +    * hints on other GENs for now.
>> +    */
>> +   if (!brw->is_haswell ||
>> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>> +       brw->accurate_derivative) {
>> +      vstride = BRW_VERTICAL_STRIDE_2;
>> +      width = BRW_WIDTH_2;
>> +   }
>> +   else {
>> +      vstride = BRW_VERTICAL_STRIDE_4;
>> +      width = BRW_WIDTH_4;
>> +   }
>> +
>>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>>                                  BRW_REGISTER_TYPE_F,
>> -                                BRW_VERTICAL_STRIDE_2,
>> -                                BRW_WIDTH_2,
>> +                                vstride,
>> +                                width,
>>                                  BRW_HORIZONTAL_STRIDE_0,
>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>>                                  BRW_REGISTER_TYPE_F,
>> -                                BRW_VERTICAL_STRIDE_2,
>> -                                BRW_WIDTH_2,
>> +                                vstride,
>> +                                width,
>>                                  BRW_HORIZONTAL_STRIDE_0,
>>                                  BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     brw_ADD(p, dst, src0, negate(src1));
>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c
>> b/src/mesa/drivers/dri/i965/intel_screen.c
>> index eb6515e..ee08ffd 100644
>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_QUALITY
>>        DRI_CONF_FORCE_S3TC_ENABLE("false")
>> +
>> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>> +        DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
>> +      DRI_CONF_OPT_END
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_DEBUG
>>        DRI_CONF_NO_RAST("false")
>> --
>> 1.8.3.1
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
>



-- 
olv at LunarG.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-i965-add-micro-benchmark-for-sample.patch
Type: application/octet-stream
Size: 7437 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130916/af619e90/attachment-0001.obj>

