<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <span dir="ltr"><<a href="mailto:stereotype441@gmail.com" target="_blank">stereotype441@gmail.com</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div>On 12 September 2013 22:06, Chia-I Wu <span dir="ltr"><<a href="mailto:olvaffe@gmail.com" target="_blank">olvaffe@gmail.com</a>></span> wrote:<br> </div><div class="gmail_extra"><div class="gmail_quote"><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> From: Chia-I Wu <<a href="mailto:olv@lunarg.com" target="_blank">olv@lunarg.com</a>><br> <br> Consider only the top-left and top-right pixels to approximate DDX in a 2x2<br> subspan, unless the application or the user requests a more accurate<br> approximation. This results in a less accurate approximation. However, it<br> improves the performance of Xonotic with Ultra settings by 24.3879% +/-<br> 0.832202% (at 95.0% confidence) on Haswell. No noticeable image quality<br> difference observed.<br> <br> No piglit gpu.tests regressions (tested with v1)<br> <br> I failed to come up with an explanation for the performance difference, as the<br> change does not affect Ivy Bridge. If anyone has the insight, please kindly<br> enlighten me. Performance differences may also be observed on other games<br> that call textureGrad and dFdx.<br> <br> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option. Update<br> comments.<br></blockquote><div><br></div></div><div>I'm not entirely comfortable making a change that has a known negative impact on computational accuracy (even one that leads to such an impressive performance improvement) when we don't have any theories as to why the performance improvement happens, or why the improvement doesn't apply to Ivy Bridge. In my experience, making changes to the codebase without understanding why they improve things almost always leads to improvements that are brittle, since it's likely that the true source of the improvement is a coincidence that will be wiped out by some future change (or won't be relevant to client programs other than this particular benchmark). Having a theory as to why the performance improvement happens would help us be confident that we're applying the right fix under the right circumstances.<br> </div></div></div></div></blockquote><div><br></div><div>There's another angle to approach this and that is to develop a simple test case that will show the different results across a range of computational accuracy and run the test on proprietary drivers for the same hardware to determine what settings they are using.</div> <div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"> <div> <br></div><div>For example, here's one theory as to why we might be seeing an improvement: perhaps Haswell's sample_d processing is smart enough to realize that when all the gradient values within a sub-span are the same, that means that all of the sampling for the sub-span will come from the same LOD, and that allows it to short-cut some expensive step in the LOD calculation. Perhaps the same improvement isn't seen on Ivy Bridge because Ivy Bridge's sample_d processing logic is less sophisticated, so it's unable to perform the optimization. If this is the case, then conditioning the optimization on brw->is_haswell (as you've done) makes sense.<br> <br></div><div>Another possible explanation for the Haswell vs Ivy Bridge difference is that perhaps Ivy Bridge, being a lower-performing chip, has other bottlenecks that make the optimization irrelevant for this particular benchmark, but potentially still useful for other benchmarks. For instance, maybe when this benchmark executes on Ivy Bridge, the texture that's being sampled from is located in sufficiently distant memory that optimizing the sample_d's memory accesses makes no difference, since the bottleneck is the speed with which the texture can be read into cache, rather than the speed of operation of sample_d. If this explanation is correct, then it might be worth applying the optimization to both Ivy Bridge and Haswell (and perhaps Sandy Bridge as well), since it might conceivably benefit those other chips when running applications that place less cache pressure on the chip.<br> </div></div></div></div></blockquote><div><br></div><div>This scenario is where I'd place my bets, especially given that the numbers are based on Xonotic. I benchmarked this patch using Xonotic on Bay Trail as is and by replacing !brw->is_haswell with !brw->is_baytrail. With ultra and ultimate levels at medium and high resolutions, the results were all essentially the same at comparable resolutions and quality levels.</div> <div><br></div><div>I don't see any justification to tie this change to just Haswell hardware. There's all sorts of reasons why doing that sounds like a big mistake. In fact, another _explanation_ to add to your list is maybe there's another is_haswell test elsewhere in the driver that is responsible for the performance anomaly.</div> <div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> <br></div><div>Another possibile explanation is that Haswell has a bug in its sample_d logic which causes it to be slow under some conditions, and this lower-accuracy DDX computation happens to work around it. If that's the case, we might want to consider not using sample_d at all on Haswell, and instead calculating the LOD in the shader and using sample_l instead. If this is the correct explanation, then that might let us have faster performance without sacrificing DDX accuracy.<br> <br></div><div>A final possible explanation for the performance improvement is that perhaps for some reason sample_d performs more optimally when the DDX and DDY computations have similar accuracies to each other. Before your patch, our computation of DDX was more accurate than DDY; your patch decreases the accuracy of DDX to match DDY. If this explanation is correct, then a better solution would probably be to improve the accuracy of DDY to make it comparable to DDX, rather than the other way around.<br> <br></div><div>Before we land this patch, can we do some experiments to try to figure out which of these explanations (if any) is correct?<br></div><div><div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <br> Signed-off-by: Chia-I Wu <<a href="mailto:olv@lunarg.com" target="_blank">olv@lunarg.com</a>><br> ---<br> src/mesa/drivers/dri/i965/brw_context.c | 1 +<br> src/mesa/drivers/dri/i965/brw_context.h | 1 +<br> src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 ++++++++++++++++++++++++-------<br> src/mesa/drivers/dri/i965/intel_screen.c | 4 ++++<br> 4 files changed, 38 insertions(+), 8 deletions(-)<br> <br> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c<br> index 4fcc9fb..1cdfb9d 100644<br> --- a/src/mesa/drivers/dri/i965/brw_context.c<br> +++ b/src/mesa/drivers/dri/i965/brw_context.c<br> @@ -470,6 +470,7 @@ brwCreateContext(int api,<br> brw_draw_init( brw );<br> <br> brw->precompile = driQueryOptionb(&brw->optionCache, "shader_precompile");<br> + brw->accurate_derivative = driQueryOptionb(&brw->optionCache, "accurate_derivative");<br> <br> ctx->Const.ContextFlags = 0;<br> if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)<br> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h<br> index c566bba..8bfc54a 100644<br> --- a/src/mesa/drivers/dri/i965/brw_context.h<br> +++ b/src/mesa/drivers/dri/i965/brw_context.h<br> @@ -964,6 +964,7 @@ struct brw_context<br> bool always_flush_cache;<br> bool disable_throttling;<br> bool precompile;<br> + bool accurate_derivative;<br> <br> driOptionCache optionCache;<br> /** @} */<br> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp<br> index bfb3d33..69aeab1 100644<br> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp<br> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp<br> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src<br> *<br> * arg0: <a href="http://ss0.tl" target="_blank">ss0.tl</a> <a href="http://ss0.tr" target="_blank">ss0.tr</a> ss0.bl <a href="http://ss0.br" target="_blank">ss0.br</a> <a href="http://ss1.tl" target="_blank">ss1.tl</a> <a href="http://ss1.tr" target="_blank">ss1.tr</a> ss1.bl <a href="http://ss1.br" target="_blank">ss1.br</a><br> *<br> - * and we're trying to produce:<br> + * Ideally, we want to produce:<br> *<br> * DDX DDY<br> * dst: (<a href="http://ss0.tr" target="_blank">ss0.tr</a> - <a href="http://ss0.tl" target="_blank">ss0.tl</a>) (<a href="http://ss0.tl" target="_blank">ss0.tl</a> - ss0.bl)<br> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src<br> *<br> * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same result<br> * for each pair, and vertstride = 2 jumps us 2 elements after processing a<br> - * pair. But for DDY, it's harder, as we want to produce the pairs swizzled<br> - * between each other. We could probably do it like ddx and swizzle the right<br> - * order later, but bail for now and just produce<br> + * pair. But the ideal approximation of DDX may impose a huge performance<br> + * cost on sample_d. As such, we favor ((<a href="http://ss0.tr" target="_blank">ss0.tr</a> - <a href="http://ss0.tl" target="_blank">ss0.tl</a>)x4 (<a href="http://ss1.tr" target="_blank">ss1.tr</a> -<br> + * <a href="http://ss1.tl" target="_blank">ss1.tl</a>)x4) unless the app or the user requests otherwise.<br> + *<br> + * For DDY, it's harder, as we want to produce the pairs swizzled between each<br> + * other. We could probably do it like ddx and swizzle the right order later,<br> + * but bail for now and just produce<br> * ((<a href="http://ss0.tl" target="_blank">ss0.tl</a> - ss0.bl)x4 (<a href="http://ss1.tl" target="_blank">ss1.tl</a> - ss1.bl)x4)<br> */<br> void<br> fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct brw_reg src)<br> {<br> + unsigned vstride, width;<br> +<br> + /* Produce accurate result only when requested. We emit only one<br> + * instruction for either case, but the problem is the result may affect<br> + * how fast sample_d executes.<br> + *<br> + * Since the performance difference is only observed on Haswell, ignore the<br> + * hints on other GENs for now.<br> + */<br> + if (!brw->is_haswell ||<br> + brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||<br> + brw->accurate_derivative) {<br> + vstride = BRW_VERTICAL_STRIDE_2;<br> + width = BRW_WIDTH_2;<br> + }<br> + else {<br> + vstride = BRW_VERTICAL_STRIDE_4;<br> + width = BRW_WIDTH_4;<br> + }<br> +<br> struct brw_reg src0 = brw_reg(src.file, <a href="http://src.nr" target="_blank">src.nr</a>, 1,<br> BRW_REGISTER_TYPE_F,<br> - BRW_VERTICAL_STRIDE_2,<br> - BRW_WIDTH_2,<br> + vstride,<br> + width,<br> BRW_HORIZONTAL_STRIDE_0,<br> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);<br> struct brw_reg src1 = brw_reg(src.file, <a href="http://src.nr" target="_blank">src.nr</a>, 0,<br> BRW_REGISTER_TYPE_F,<br> - BRW_VERTICAL_STRIDE_2,<br> - BRW_WIDTH_2,<br> + vstride,<br> + width,<br> BRW_HORIZONTAL_STRIDE_0,<br> BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);<br> brw_ADD(p, dst, src0, negate(src1));<br> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c<br> index eb6515e..ee08ffd 100644<br> --- a/src/mesa/drivers/dri/i965/intel_screen.c<br> +++ b/src/mesa/drivers/dri/i965/intel_screen.c<br> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =<br> DRI_CONF_SECTION_END<br> DRI_CONF_SECTION_QUALITY<br> DRI_CONF_FORCE_S3TC_ENABLE("false")<br> +<br> + DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")<br> + DRI_CONF_DESC(en, "Perform more accurate derivative calculation")<br> + DRI_CONF_OPT_END<br> DRI_CONF_SECTION_END<br> DRI_CONF_SECTION_DEBUG<br> DRI_CONF_NO_RAST("false")<br> <span><font color="#888888">--<br> 1.8.3.1<br> <br> _______________________________________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a><br> </font></span></blockquote></div></div></div><br></div></div> <br>_______________________________________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org" target="_blank">mesa-dev@lists.freedesktop.org</a><br> <a href="http://lists.freedesktop.org/mailman/listinfo/mesa-dev" target="_blank">http://lists.freedesktop.org/mailman/listinfo/mesa-dev</a><br> <br></blockquote></div><br></div></div>