[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row
Ian Romanick
idr at freedesktop.org
Fri Sep 20 08:30:57 PDT 2013
On 09/20/2013 09:50 AM, Paul Berry wrote:
> On 17 September 2013 19:54, Chia-I Wu <olvaffe at gmail.com> wrote:
>
> Hi Paul,
>
> > On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
> > > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com> wrote:
> >> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
> >>>
> >>> From: Chia-I Wu <olv at lunarg.com>
> >>>
> >>> Consider only the top-left and top-right pixels to approximate DDX in a
> >>> 2x2 subspan, unless the application or the user requests a more accurate
> >>> approximation. This results in a less accurate approximation. However, it
> >>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
> >>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality
> >>> difference observed.
> >>>
> >>> No piglit gpu.tests regressions (tested with v1).
> >>>
> >>> I failed to come up with an explanation for the performance difference,
> >>> as the change does not affect Ivy Bridge. If anyone has the insight,
> >>> please kindly enlighten me. Performance differences may also be observed
> >>> in other games that call textureGrad and dFdx.
> >>>
> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
> >>> Update comments.
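To make the two approximations concrete, here is a small standalone C
sketch. It is not the actual fs_generator code; the four-float subspan
layout and the helper names are illustrative assumptions. The point is that
the coarse variant gives every pixel in the subspan the same derivative
value, which is what matters for the sample_d discussion further down.

/* Model one 2x2 subspan as { top-left, top-right, bottom-left, bottom-right }. */
#include <stdio.h>

/* "Fine" dFdx: each row gets its own horizontal difference. */
static void ddx_fine(const float v[4], float out[4])
{
    float top    = v[1] - v[0];   /* TR - TL */
    float bottom = v[3] - v[2];   /* BR - BL */
    out[0] = out[1] = top;
    out[2] = out[3] = bottom;
}

/* "Coarse" dFdx as in this patch: the top-row difference is reused for the
 * bottom row, so all four pixels see identical gradients. */
static void ddx_coarse(const float v[4], float out[4])
{
    float top = v[1] - v[0];      /* TR - TL */
    out[0] = out[1] = out[2] = out[3] = top;
}

int main(void)
{
    const float subspan[4] = { 0.0f, 1.0f, 0.25f, 1.5f };
    float fine[4], coarse[4];

    ddx_fine(subspan, fine);
    ddx_coarse(subspan, coarse);
    printf("fine:   %g %g %g %g\n", fine[0], fine[1], fine[2], fine[3]);
    printf("coarse: %g %g %g %g\n", coarse[0], coarse[1], coarse[2], coarse[3]);
    return 0;
}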
> >>
> >>
> >> I'm not entirely comfortable making a change that has a known negative
> >> impact on computational accuracy (even one that leads to such an
> >> impressive performance improvement) when we don't have any theories as to
> >> why the performance improvement happens, or why the improvement doesn't
> >> apply to Ivy Bridge. In my experience, making changes to the codebase
> >> without understanding why they improve things almost always leads to
> >> improvements that are brittle, since it's likely that the true source of
> >> the improvement is a coincidence that will be wiped out by some future
> >> change (or won't be relevant to client programs other than this
> >> particular benchmark). Having a theory as to why the performance
> >> improvement happens would help us be confident that we're applying the
> >> right fix under the right circumstances.
> > That is how I feel as I've mentioned. I am really glad to have the
> > discussion. I have done some experiments actually. It is just that
> > those experiments only tell me what theories are likely to be wrong.
> > They could not tell me if a theory is right.
> Do the experiments make sense to you? What other experiments do you
> want to see conducted?
>
> It could be hard to get direct proof without knowing the internal
> workings.
>
>
> Sorry for the slow reply. We had some internal discussions with the
> hardware architects about this, and it appears that the first theory is
> correct: Haswell has an optimization in its sample_d processing which
> allows it to assume that all pixels in a 2x2 subspan will resolve to the
> same LOD provided that all the gradients in the 2x2 subspan are
> sufficiently similar to each other. There's a register called
> SAMPLER_MODE which determines how similar the gradients have to be in
> order to trigger the optimization. It can be set to values between 0
> and 0x1f, where 0 (the default) means "only trigger the optimization if
> the gradients are exactly equal" and 0x1f means "trigger the
> optimization as frequently as possible". Obviously triggering the
> optimization more often reduces the quality of the rendered output
> slightly, because it forces all pixels within a 2x2 subspan to sample
> from the same LOD.
>
> We believe that setting this register to 0x1f should produce an
> equivalent speed-up to your patch, without sacrificing the quality of
> d/dx when it is used for other (non-sample_d) purposes. This approach
> would have the additional advantage that the benefit would apply to any
> shader that uses the sample_d message, regardless of whether or not that
> shader uses d/dx and d/dy to compute its gradients.
>
> Would you mind trying this register to see if it produces an equivalent
> performance benefit in both your micro-benchmark and Xonotic with Ultra
> settings? The register is located at address 07028h in register space
> MMIO 0/2/0. When setting it, the upper 16 bits are a write mask, so to
> set the register to 0 you would store 0x001f0000, and to set it to 0x1f
> you would store 0x001f001f.
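For the record, a minimal sketch of how that masked write value could be
composed, assuming the layout described above (value in the low 5 bits,
write mask in the upper 16 bits). The macro and helper names are ours for
illustration and are not taken from the hardware documentation or from
existing Mesa code:

#include <stdint.h>

/* Register offset given above: 07028h in MMIO 0/2/0. */
#define HSW_SAMPLER_MODE 0x7028

/* Low 5 bits: gradient-similarity threshold (0 = only identical gradients
 * trigger the optimization, 0x1f = trigger it as often as possible).
 * Upper 16 bits: write mask enabling those low bits. */
static inline uint32_t
sampler_mode_value(uint32_t threshold)
{
    const uint32_t mask = 0x1f;
    return (mask << 16) | (threshold & mask);
}

/* sampler_mode_value(0x00) == 0x001f0000 and
 * sampler_mode_value(0x1f) == 0x001f001f, matching the two example writes
 * described above. */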
>
> Since the SAMPLER_MODE setting allows us to trade off quality vs
> performance, we're also interested to know whether a value less than
> 0x1f is sufficient to produce the performance improvement in Xonotic--it
> would be nice if we could find a "sweet spot" for this setting that
> produces the performance improvement we need without sacrificing too
> much quality.
How about if we just add a driconf option to adjust it? Then gamers
can make their own choice. For applications where we know it makes a
big difference, we can provide a default non-zero value in the system driconf.
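As a sketch of what the driver side of such an option might look like (the
option name "hsw_sampler_mode" is invented here for illustration; it is not
an existing driconf option, and real code would query Mesa's xmlconfig
option cache rather than the stub below):

#include <stdint.h>

/* Stub standing in for the parsed drirc option; in the driver this value
 * would come from the user's ~/.drirc or the system driconf, with a
 * per-application default where we know it helps. */
static int query_sampler_mode_option(void)
{
    return 0x1f;  /* hypothetical value a gamer might pick */
}

static uint32_t
choose_sampler_mode(void)
{
    int threshold = query_sampler_mode_option();

    /* Clamp to the documented 0..0x1f range; 0 keeps today's behaviour. */
    if (threshold < 0)
        threshold = 0;
    if (threshold > 0x1f)
        threshold = 0x1f;
    return (uint32_t) threshold;
}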
> Finally, do you have any ability to see whether the Windows driver sets
> this register, and if so what it sets it to? That would provide some
> nice confirmation that we aren't barking up the wrong tree here.
>
> As a follow-up task, I'm planning to write a patch that improves the
> quality of our d/dy calculation to be comparable to d/dx. Based on our
> current understanding of what's going on, I suspect that my patch may
> have a slight effect on the SAMPLER_MODE sweet spot. I'll try to get
> that patch out today, and I'll Cc you so that you can try it out.
>
> Thanks so much for finding this, Chia-I, and thanks for your patience as
> we've been working to find the true explanation for the performance
> improvement.