[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row
Ian Romanick
idr at freedesktop.org
Fri Sep 20 08:30:57 PDT 2013
On 09/20/2013 09:50 AM, Paul Berry wrote:
> On 17 September 2013 19:54, Chia-I Wu <olvaffe at gmail.com> wrote:
>
> Hi Paul,
>
> > On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
> > > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com> wrote:
> >> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
> >>>
> >>> From: Chia-I Wu <olv at lunarg.com>
> >>>
> >>> Consider only the top-left and top-right pixels to approximate DDX in a
> >>> 2x2 subspan, unless the application or the user requests a more accurate
> >>> approximation. This results in a less accurate approximation. However, it
> >>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
> >>> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality
> >>> difference observed.
> >>>
> >>> No piglit gpu.tests regressions (tested with v1).
> >>>
> >>> I failed to come up with an explanation for the performance difference,
> >>> as the change does not affect Ivy Bridge. If anyone has the insight,
> >>> please kindly enlighten me. Performance differences may also be observed
> >>> in other games that call textureGrad and dFdx.
> >>>
> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
> >>> Update comments.
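To make the two approximations concrete, here is a small standalone C
sketch. It is not the actual fs_generator code; the four-float subspan
layout and the helper names are illustrative assumptions. The point is that
the coarse variant gives every pixel in the subspan the same derivative
value, which is what matters for the sample_d discussion further down.

/* Model one 2x2 subspan as { top-left, top-right, bottom-left, bottom-right }. */
#include <stdio.h>

/* "Fine" dFdx: each row gets its own horizontal difference. */
static void ddx_fine(const float v[4], float out[4])
{
    float top    = v[1] - v[0];   /* TR - TL */
    float bottom = v[3] - v[2];   /* BR - BL */
    out[0] = out[1] = top;
    out[2] = out[3] = bottom;
}

/* "Coarse" dFdx as in this patch: the top-row difference is reused for the
 * bottom row, so all four pixels see identical gradients. */
static void ddx_coarse(const float v[4], float out[4])
{
    float top = v[1] - v[0];      /* TR - TL */
    out[0] = out[1] = out[2] = out[3] = top;
}

int main(void)
{
    const float subspan[4] = { 0.0f, 1.0f, 0.25f, 1.5f };
    float fine[4], coarse[4];

    ddx_fine(subspan, fine);
    ddx_coarse(subspan, coarse);
    printf("fine:   %g %g %g %g\n", fine[0], fine[1], fine[2], fine[3]);
    printf("coarse: %g %g %g %g\n", coarse[0], coarse[1], coarse[2], coarse[3]);
    return 0;
}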
> >>
> >>
> >> I'm not entirely comfortable making a change that has a known negative
> >> impact on computational accuracy (even one that leads to such an
> >> impressive performance improvement) when we don't have any theories as to
> >> why the performance improvement happens, or why the improvement doesn't
> >> apply to Ivy Bridge. In my experience, making changes to the codebase
> >> without understanding why they improve things almost always leads to
> >> improvements that are brittle, since it's likely that the true source of
> >> the improvement is a coincidence that will be wiped out by some future
> >> change (or won't be relevant to client programs other than this
> >> particular benchmark). Having a theory as to why the performance
> >> improvement happens would help us be confident that we're applying the
> >> right fix under the right circumstances.
> > That is how I feel as I've mentioned. I am really glad to have the
> > discussion. I have done some experiments actually. It is just that
> > those experiments only tell me what theories are likely to be wrong.
> > They could not tell me if a theory is right.
> Do the experiments make sense to you? What other experiments do you
> want to see conducted?
>
> It could be hard to get direct proof without knowing the internal
> workings.
>
>
> Sorry for the slow reply. We had some internal discussions with the
> hardware architects about this, and it appears that the first theory is
> correct: Haswell has an optimization in its sample_d processing which
> allows it to assume that all pixels in a 2x2 subspan will resolve to the
> same LOD provided that all the gradients in the 2x2 subspan are
> sufficiently similar to each other. There's a register called
> SAMPLER_MODE which determines how similar the gradients have to be in
> order to trigger the optimization. It can be set to values between 0
> and 0x1f, where 0 (the default) means "only trigger the optimization if
> the gradients are exactly equal" and 0x1f means "trigger the
> optimization as frequently as possible". Obviously triggering the
> optimization more often reduces the quality of the rendered output
> slightly, because it forces all pixels within a 2x2 subspan to sample
> from the same LOD.
>
> We believe that setting this register to 0x1f should produce an
> equivalent speed-up to your patch, without sacrificing the quality of
> d/dx when it is used for other (non-sample_d) purposes. This approach
> would have the additional advantage that the benefit would apply to any
> shader that uses the sample_d message, regardless of whether or not that
> shader uses d/dx and d/dy to compute its gradients.
>
> Would you mind trying this register to see if it produces an equivalent
> performance benefit in both your micro-benchmark and Xonotic with Ultra
> settings? The register is located at address 07028h in register space
> MMIO 0/2/0. When setting it, the upper 16 bits are a write mask, so to
> set the register to 0 you would store 0x001f0000, and to set it to 0x1f
> you would store 0x001f001f.
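For the record, a minimal sketch of how that masked write value could be
composed, assuming the layout described above (value in the low 5 bits,
write mask in the upper 16 bits). The macro and helper names are ours for
illustration and are not taken from the hardware documentation or from
existing Mesa code:

#include <stdint.h>

/* Register offset given above: 07028h in MMIO 0/2/0. */
#define HSW_SAMPLER_MODE 0x7028

/* Low 5 bits: gradient-similarity threshold (0 = only identical gradients
 * trigger the optimization, 0x1f = trigger it as often as possible).
 * Upper 16 bits: write mask enabling those low bits. */
static inline uint32_t
sampler_mode_value(uint32_t threshold)
{
    const uint32_t mask = 0x1f;
    return (mask << 16) | (threshold & mask);
}

/* sampler_mode_value(0x00) == 0x001f0000 and
 * sampler_mode_value(0x1f) == 0x001f001f, matching the two example writes
 * described above. */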
>
> Since the SAMPLER_MODE setting allows us to trade off quality vs
> performance, we're also interested to know whether a value less than
> 0x1f is sufficient to produce the performance improvement in Xonotic--it
> would be nice if we could find a "sweet spot" for this setting that
> produces the performance improvement we need without sacrificing too
> much quality.
How about if we just add a driconf option to adjust it? Then gamers
can make their own choice. For applications where we know it makes a
big difference, we can provide a default non-zero value in the system driconf.
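As a sketch of what the driver side of such an option might look like (the
option name "hsw_sampler_mode" is invented here for illustration; it is not
an existing driconf option, and real code would query Mesa's xmlconfig
option cache rather than the stub below):

#include <stdint.h>

/* Stub standing in for the parsed drirc option; in the driver this value
 * would come from the user's ~/.drirc or the system driconf, with a
 * per-application default where we know it helps. */
static int query_sampler_mode_option(void)
{
    return 0x1f;  /* hypothetical value a gamer might pick */
}

static uint32_t
choose_sampler_mode(void)
{
    int threshold = query_sampler_mode_option();

    /* Clamp to the documented 0..0x1f range; 0 keeps today's behaviour. */
    if (threshold < 0)
        threshold = 0;
    if (threshold > 0x1f)
        threshold = 0x1f;
    return (uint32_t) threshold;
}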
> Finally, do you have any ability to see whether the Windows driver sets
> this register, and if so what it sets it to? That would provide some
> nice confirmation that we aren't barking up the wrong tree here.
>
> As a follow-up task, I'm planning to write a patch that improves the
> quality of our d/dy calculation to be comparable to d/dx. Based on our
> current understanding of what's going on, I suspect that my patch may
> have a slight effect on the SAMPLER_MODE sweet spot. I'll try to get
> that patch out today, and I'll Cc you so that you can try it out.
>
> Thanks so much for finding this, Chia-I, and thanks for your patience as
> we've been working to find the true explanation for the performance
> improvement.