[Mesa-dev] [PATCH] i965/hsw: compute DDX in a subspan based only on top row

Sun Sep 22 23:48:33 PDT 2013

On Mon, Sep 23, 2013 at 12:09 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
> On Fri, Sep 20, 2013 at 10:50 PM, Paul Berry <stereotype441 at gmail.com> wrote:
>> On 17 September 2013 19:54, Chia-I Wu <olvaffe at gmail.com> wrote:
>>>
>>> Hi Paul,
>>>
>>> On Mon, Sep 16, 2013 at 3:46 PM, Chia-I Wu <olvaffe at gmail.com> wrote:
>>> > On Sat, Sep 14, 2013 at 5:15 AM, Paul Berry <stereotype441 at gmail.com> wrote:
>>> >> On 12 September 2013 22:06, Chia-I Wu <olvaffe at gmail.com> wrote:
>>> >>>
>>> >>> From: Chia-I Wu <olv at lunarg.com>
>>> >>>
>>> >>> Consider only the top-left and top-right pixels to approximate DDX in a
>>> >>> 2x2 subspan, unless the application or the user requests a more accurate
>>> >>> approximation.  This results in a less accurate approximation.  However,
>>> >>> it improves the performance of Xonotic with Ultra settings by 24.3879%
>>> >>> +/- 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image
>>> >>> quality difference observed.
>>> >>>
>>> >>> No piglit gpu.tests regressions (tested with v1)
>>> >>>
>>> >>> I failed to come up with an explanation for the performance difference,
>>> >>> as the change does not affect Ivy Bridge.  If anyone has the insight,
>>> >>> please kindly enlighten me.  Performance differences may also be observed
>>> >>> on other games that call textureGrad and dFdx.
>>> >>>
>>> >>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>> >>>     Update comments.
>>> >>
>>> >>
>>> >> I'm not entirely comfortable making a change that has a known negative
>>> >> impact on computational accuracy (even one that leads to such an
>>> >> impressive performance improvement) when we don't have any theories as
>>> >> to why the performance improvement happens, or why the improvement
>>> >> doesn't apply to Ivy Bridge.  In my experience, making changes to the
>>> >> codebase without understanding why they improve things almost always
>>> >> leads to improvements that are brittle, since it's likely that the true
>>> >> source of the improvement is a coincidence that will be wiped out by
>>> >> some future change (or won't be relevant to client programs other than
>>> >> this particular benchmark).  Having a theory as to why the performance
>>> >> improvement happens would help us be confident that we're applying the
>>> >> right fix under the right circumstances.
>>> > That is how I feel as I've mentioned.  I am really glad to have the
>>> > discussion.  I have done some experiments actually.  It is just that
>>> > those experiments only tell me what theories are likely to be wrong.
>>> > They could not tell me if a theory is right.
>>> Do the experiments make sense to you?  What other experiments do you
>>> want to see conducted?
>>>
>>> It could be hard to get direct proof without knowing the internal
>>> workings.
>>
>>
>> Sorry for the slow reply.  We had some internal discussions with the
>> hardware architects about this, and it appears that the first theory is
>> correct: Haswell has an optimization in its sample_d processing which allows
>> it to assume that all pixels in a 2x2 subspan will resolve to the same LOD
>> provided that all the gradients in the 2x2 subspan are sufficiently similar
>> to each other.  There's a register called SAMPLER_MODE which determines how
>> similar the gradients have to be in order to trigger the optimization.  It
>> can be set to values between 0 and 0x1f, where 0 (the default) means "only
>> trigger the optimization if the gradients are exactly equal" and 0x1f means
>> "trigger the optimization as frequently as possible".  Obviously triggering
>> the optimization more often reduces the quality of the rendered output
>> slightly, because it forces all pixels within a 2x2 subspan to sample from
>> the same LOD.
>>
>> We believe that setting this register to 0x1f should produce an equivalent
>> speed-up to your patch, without sacrificing the quality of d/dx when it is
>> used for other (non-sample_d) purposes.  This approach would have the
>> additional advantage that the benefit would apply to any shader that uses
>> the sample_d message, regardless of whether or not that shader uses d/dx and
>> d/dy to compute its gradients.
>>
>> Would you mind trying this register to see if it produces an equivalent
>> performance benefit in both your micro-benchmark and Xonotic with Ultra
>> settings?  The register is located at address 07028h in register space MMIO
>> 0/2/0.  When setting it, the upper 16 bits are a write mask, so to set the
>> register to 0 you would store 0x001f0000, and to set it to 0x1f you would
>> store 0x001f001f.
>>
>> Since the SAMPLER_MODE setting allows us to trade off quality vs
>> performance, we're also interested to know whether a value less than 0x1f is
>> sufficient to produce the performance improvement in Xonotic--it would be
>> nice if we could find a "sweet spot" for this setting that produces the
>> performance improvement we need without sacrificing too much quality.
> Great finding!  I will see if setting the register helps.
Changing the register does not work very well.  Attached is a patch to
the drm-intel-nightly branch of the kernel that exposes SAMPLER_MODE in
debugfs, if you want to play with it.
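
In case it helps anyone poking at the register by hand, here is a
minimal sketch of the masked-write encoding Paul describes in the quoted
text above (upper 16 bits act as a write mask).  The macro and function
names are made up for illustration and are not taken from the attached
patch; I am also assuming the mask only needs to cover the low five bits.

```c
#include <stdint.h>

/* Hypothetical name: the 5-bit field that takes values 0..0x1f. */
#define SAMPLER_MODE_FIELD_MASK 0x1fu

/*
 * Build the 32-bit value to store for a masked write.
 * The upper 16 bits select which of the lower 16 bits the write is
 * allowed to change; the lower bits carry the new field value.
 */
static uint32_t sampler_mode_masked_value(uint32_t value)
{
    return (SAMPLER_MODE_FIELD_MASK << 16) |
           (value & SAMPLER_MODE_FIELD_MASK);
}
```

With this helper, a setting of 0 encodes to 0x001f0000 and a setting of
0x1f encodes to 0x001f001f, matching the two values given in the quoted
text.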

Xonotic first.  The frame rate peaks at ~100fps when SAMPLER_MODE is
set to 0x17.  This is the same frame rate the DDX change achieves.
Setting SAMPLER_MODE to a smaller value results in a lower frame rate.
Setting it to a higher value also results in a lower frame rate, and
there are easily noticeable artifacts (likely because more texels are
sampled from LOD 0).

With the modified piglit arb_shader_texture_lod-texgrad test, you can
see the artifacts when SAMPLER_MODE is 0x12, and they get worse with
larger values.  I took a screenshot with the register set to 0x17.  I
also took a screenshot with the DDX change applied to see the
interaction between them.
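
For reference, the two DDX approximations being compared can be sketched
in plain C.  This is only an illustration of the arithmetic, not the
code the driver emits: the pixel ordering within the subspan and all
names below are assumptions made for the sketch, and the per-row variant
reflects my understanding of the unpatched behaviour.

```c
#include <stddef.h>

/* Assumed pixel order inside a 2x2 subspan (illustrative only). */
enum { TL, TR, BL, BR, SUBSPAN_SIZE };

/* Per-row DDX: each row uses its own right-minus-left difference,
 * so the top and bottom rows may end up with different gradients. */
static void ddx_per_row(const float v[SUBSPAN_SIZE], float out[SUBSPAN_SIZE])
{
    out[TL] = out[TR] = v[TR] - v[TL];
    out[BL] = out[BR] = v[BR] - v[BL];
}

/* Top-row-only DDX: the top-row difference is reused for all four
 * pixels, so every pixel in the subspan gets the same X gradient. */
static void ddx_top_row(const float v[SUBSPAN_SIZE], float out[SUBSPAN_SIZE])
{
    float d = v[TR] - v[TL];

    for (size_t i = 0; i < SUBSPAN_SIZE; i++)
        out[i] = d;
}
```

The second variant is less accurate for the bottom row, but it gives all
four pixels identical X gradients.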

>
> Since my goal was to figure out why sample_d is slower than sample and
> the DDX change worked, that made me wonder if sample assumes all
> pixels in a subspan resolve to the same LOD.  To understand that
> better, I modified piglit arb_shader_texture_lod-texgrad test to scale
> and rotate the triangle around the X-axis.  This was to make the top
> and bottom rows have a better chance to have different gradients.
>
> I then ran the test three times with
>
>  a. unmodified driver (texture2D-texture2DGradARB.png)
>  b. patched driver (texture2D-texture2DGradARB-coarse-granularity.png)
>  c. both triangles rendered with texture2D (texture2D-texture2D.png)
>
> It is hard to tell the differences from the snapshots.  But when you
> run ImageMagick compare on
>
>  - c. and a. (before.png)
>  - c. and b. (after.png)
>
> you can see that before the DDX change, the rendered triangles have
> different colors every other row.  After the DDX change, there is no
> difference.  With this observation, I believe sample executes at 2x2
> granularity.  Not that texture2D and texture2DGrad have to behave
> exactly the same, but it is good for them to be consistent.
>
>
>>
>> Finally, do you have any ability to see whether the Windows driver sets this
>> register, and if so what it sets it to?  That would provide some nice
>> confirmation that we aren't barking up the wrong tree here.
>>
>>
>> As a follow-up task, I'm planning to write a patch that improves the quality
>> of our d/dy calculation to be comparable to d/dx.  Based on our current
>> understanding of what's going on, I suspect that my patch may have a slight
>> effect on the SAMPLER_MODE sweet spot.  I'll try to get that patch out
>> today, and I'll Cc you so that you can try it out.
>>
>> Thanks so much for finding this, Chia-I, and thanks for your patience as
>> we've been sorting through trying to find the true explanation for the
>> performance improvement.
>>
> [snipped]
>
>
> --
> olv at LunarG.com

-- 
olv at LunarG.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-i915-expose-register-SAMPLE_MODE-in-debugfs.patch
Type: application/octet-stream
Size: 2074 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/30403b00/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SAMPLER_MODE_0x17.png
Type: image/png
Size: 5404 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/30403b00/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SAMPLER_MODE_0x17-coarse-granularity.png
Type: image/png
Size: 3643 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20130923/30403b00/attachment-0003.png>