[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Chia-I Wu olvaffe at gmail.com
Sun Sep 22 21:09:48 PDT 2013


On Fri, Sep 20, 2013 at 10:50 PM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 17 September 2013 19:54, Chia-I Wu <olvaffe at gmail.com> wrote:
>>
>> Hi Paul,
>>
>> On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
>> > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com> wrote:
>> >> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>> >>>
>> >>> From: Chia-I Wu <olv at lunarg.com>
>> >>>
>> >>> Consider only the top-left and top-right pixels to approximate DDX in a
>> >>> 2x2 subspan, unless the application or the user requests a more accurate
>> >>> approximation.  This results in a less accurate approximation.  However,
>> >>> it improves the performance of Xonotic with Ultra settings by 24.3879%
>> >>> +/- 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image
>> >>> quality difference observed.
>> >>>
>> >>> No piglit gpu.tests regressions (tested with v1)
>> >>>
>> >>> I failed to come up with an explanation for the performance difference,
>> >>> as the change does not affect Ivy Bridge.  If anyone has the insight,
>> >>> please kindly enlighten me.  Performance differences may also be observed
>> >>> on other games that call textureGrad and dFdx.
>> >>>
>> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>> >>>     Update comments.
>> >>
>> >>
>> >> I'm not entirely comfortable making a change that has a known negative
>> >> impact on computational accuracy (even one that leads to such an
>> >> impressive performance improvement) when we don't have any theories as to
>> >> why the performance improvement happens, or why the improvement doesn't
>> >> apply to Ivy Bridge.  In my experience, making changes to the codebase
>> >> without understanding why they improve things almost always leads to
>> >> improvements that are brittle, since it's likely that the true source of
>> >> the improvement is a coincidence that will be wiped out by some future
>> >> change (or won't be relevant to client programs other than this particular
>> >> benchmark).  Having a theory as to why the performance improvement happens
>> >> would help us be confident that we're applying the right fix under the
>> >> right circumstances.
>> > That is how I feel as I've mentioned.  I am really glad to have the
>> > discussion.  I have done some experiments actually.  It is just that
>> > those experiments only tell me what theories are likely to be wrong.
>> > They could not tell me if a theory is right.
>> Do the experiments make sense to you?  What other experiments do you
>> want to see conducted?
>>
>> It could be hard to get direct proof without knowing the internal
>> workings.
>
>
> Sorry for the slow reply.  We had some internal discussions with the
> hardware architects about this, and it appears that the first theory is
> correct: Haswell has an optimization in its sample_d processing which allows
> it to assume that all pixels in a 2x2 subspan will resolve to the same LOD
> provided that all the gradients in the 2x2 subspan are sufficiently similar
> to each other.  There's a register called SAMPLER_MODE which determines how
> similar the gradients have to be in order to trigger the optimization.  It
> can be set to values between 0 and 0x1f, where 0 (the default) means "only
> trigger the optimization if the gradients are exactly equal" and 0x1f means
> "trigger the optimization as frequently as possible".  Obviously triggering
> the optimization more often reduces the quality of the rendered output
> slightly, because it forces all pixels within a 2x2 subspan to sample from
> the same LOD.
>
> We believe that setting this register to 0x1f should produce an equivalent
> speed-up to your patch, without sacrificing the quality of d/dx when it is
> used for other (non-sample_d) purposes.  This approach would have the
> additional advantage that the benefit would apply to any shader that uses
> the sample_d message, regardless of whether or not that shader uses d/dx and
> d/dy to compute its gradients.
>
> Would you mind trying this register to see if it produces an equivalent
> performance benefit in both your micro-benchmark and Xonotic with Ultra
> settings?  The register is located at address 07028h in register space MMIO
> 0/2/0.  When setting it, the upper 16 bits are a write mask, so to set the
> register to 0 you would store 0x001f0000, and to set it to 0x1f you would
> store 0x001f001f.
>
> Since the SAMPLER_MODE setting allows us to trade off quality vs
> performance, we're also interested to know whether a value less than 0x1f is
> sufficient to produce the performance improvement in Xonotic--it would be
> nice if we could find a "sweet spot" for this setting that produces the
> performance improvement we need without sacrificing too much quality.
Great finding!  I will see if setting the register helps.
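
The masked-write encoding Paul describes above can be sketched as
follows.  This is only an illustration in Python, not driver code: the
helper names are hypothetical, and the model of how the hardware applies
the mask (only bits whose mask bit is set are updated) is my assumption
based on the description above.

```python
# Sketch of the SAMPLER_MODE masked-write encoding described above.
# All names here are hypothetical; only the offset, the 0..0x1f value
# range and the two example dwords come from the quoted mail.

SAMPLER_MODE_OFFSET = 0x7028  # "address 07028h" in MMIO space
FIELD_MASK = 0x1f             # the setting ranges from 0 to 0x1f


def masked_write_value(value, mask=FIELD_MASK):
    """Build the dword to store: write mask in the upper 16 bits,
    new field value in the lower 16 bits."""
    if value & ~mask:
        raise ValueError("value does not fit in the masked field")
    return (mask << 16) | value


def apply_masked_write(old, dword):
    """Assumed hardware behaviour: only bits enabled in the write
    mask are replaced; all other bits keep their old value."""
    mask = dword >> 16
    return (old & ~mask & 0xffff) | (dword & mask)


if __name__ == "__main__":
    for setting in (0x00, 0x1f):
        print(hex(SAMPLER_MODE_OFFSET), "<-",
              "0x%08x" % masked_write_value(setting))
```

Under this model, any setting between 0 and 0x1f can be tried the same
way when looking for the "sweet spot".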

My goal was to figure out why sample_d is slower than sample.  Since
the DDX change worked, I wondered whether sample assumes that all
pixels in a subspan resolve to the same LOD.  To understand that
better, I modified the piglit arb_shader_texture_lod-texgrad test to
scale the triangle and rotate it around the X-axis, so that the top and
bottom rows of a subspan are more likely to have different gradients.

I then ran the test three times with

 a. unmodified driver (texture2D-texture2DGradARB.png)
 b. patched driver (texture2D-texture2DGradARB-coarse-granularity.png)
 c. both triangles rendered with texture2D (texture2D-texture2D.png)

It is hard to tell the differences from the snapshots.  But when you
run ImageMagick compare on

 - c. and a. (before.png)
 - c. and b. (after.png)

you can see that before the DDX change, the rendered triangles have
different colors every other row.  After the DDX change, there is no
difference.  With this observation, I believe sample executes at 2x2
granularity.  Not that texture2D and texture2DGrad have to behave
exactly the same, but it is good for them to be consistent.
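
For reference, the two DDX approximations compared above can be
sketched as below.  This is a minimal Python illustration, not the i965
code: the function names are hypothetical, and I am assuming that the
unmodified driver takes a per-row difference within each 2x2 subspan.

```python
# A 2x2 subspan is written as ((top_left, top_right),
#                              (bottom_left, bottom_right)).
# Each function returns the DDX value seen by each of the four pixels.


def ddx_fine(subspan):
    """Per-row DDX (assumed behaviour of the unmodified driver):
    each row uses its own right-minus-left difference."""
    (tl, tr), (bl, br) = subspan
    top = tr - tl
    bottom = br - bl
    return ((top, top), (bottom, bottom))


def ddx_coarse(subspan):
    """Top-row-only DDX (the patched behaviour): all four pixels
    share the difference computed from the top row."""
    (tl, tr), _bottom_row = subspan
    top = tr - tl
    return ((top, top), (top, top))


if __name__ == "__main__":
    subspan = ((1.0, 3.0), (2.0, 5.0))
    print("fine:  ", ddx_fine(subspan))
    print("coarse:", ddx_coarse(subspan))
```

With a top row of (1.0, 3.0) and a bottom row of (2.0, 5.0), ddx_fine
gives 2.0 for the top pixels and 3.0 for the bottom pixels, while
ddx_coarse gives 2.0 for all four.  Gradients that differ between the
two rows of a subspan are what the every-other-row pattern in
before.png suggests.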


>
> Finally, do you have any ability to see whether the Windows driver sets this
> register, and if so what it sets it to?  That would provide some nice
> confirmation that we aren't barking up the wrong tree here.
>
>
> As a follow-up task, I'm planning to write a patch that improves the quality
> of our d/dy calculation to be comparable to d/dx.  Based on our current
> understanding of what's going on, I suspect that my patch may have a slight
> effect on the SAMPLER_MODE sweet spot.  I'll try to get that patch out
> today, and I'll Cc you so that you can try it out.
>
> Thanks so much for finding this, Chia-I, and thanks for your patience as
> we've been sorting through trying to find the true explanation for the
> performance improvement.
>
[snipped]


-- 
olv at LunarG.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: texture2D-texture2D.png
Type: image/png
Size: 3650 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/d124ebba/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: texture2D-texture2DGradARB.png
Type: image/png
Size: 4307 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/d124ebba/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: texture2D-texture2DGradARB-coarse-granularity.png
Type: image/png
Size: 3650 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/d124ebba/attachment-0007.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: before.png
Type: image/png
Size: 2518 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/d124ebba/attachment-0008.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: after.png
Type: image/png
Size: 2204 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/d124ebba/attachment-0009.png>

