<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Jul 5, 2018 at 2:18 PM, Jason Ekstrand <span dir="ltr"><<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5">On Thu, Jul 5, 2018 at 11:03 AM, Francisco Jerez <span dir="ltr"><<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="m_-3872441533371173900HOEnZb"><div class="m_-3872441533371173900h5">Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br> <br> > On Wed, Jul 4, 2018 at 1:20 PM, Francisco Jerez <<a href="mailto:currojerez@riseup.net" target="_blank">currojerez@riseup.net</a>><br> > wrote:<br> ><br> >> Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> writes:<br> >><br> >> > Many fragment shaders do a discard using relatively little information<br> >> > but still put the discard fairly far down in the shader for no good<br> >> > reason. If the discard is moved higher up, we can possibly avoid doing<br> >> > some or almost all of the work in the shader. When this lets us skip<br> >> > texturing operations, it's an especially high win.<br> >> ><br> >> > One of the biggest offenders here is DXVK. The D3D APIs have different<br> >> > rules for discards than OpenGL and Vulkan. One effective way (which is<br> >> > what DXVK uses) to implement DX behavior on top of GL or Vulkan is to<br> >> > wait until the very end of the shader to discard. 
This ends up in the<br> >> > pessimal case where we always do all of the work before discarding.<br> >> > This pass helps some DXVK shaders significantly.<br> >> ><br> >><br> >> One thing to keep in mind is that this sort of transformation is trading<br> >> off run-time of fragment shader invocations that don't call discard (or<br> >> do so non-uniformly, which means that the code the discard jump is<br> >> protecting will be executed anyway, so doing this can actually increase<br> >> the critical path of the program) in favour of invocations that call<br> >> discard uniformly (so executing discard early will effectively terminate<br> >> the program early).<br> ><br> ><br> > It's not really a uniform vs. non-uniform thing. Even if a shader only<br> > discards some of the fragments, it still reduces the number of live channels,<br> > which reduces the cost of later non-uniform control-flow.<br> ><br> <br> </div></div>Which only helps if the shader's control flow is sufficiently<br> non-uniform that the additional cost from performing those computations<br> early pays off -- or not at all if the discarded fragments need to be<br> executed (non-compliantly) anyway in order to provide<br> derivatives_safe_after_discard. However, if the discard condition is<br> uniform (across a warp), the thread can be terminated early by the<br> back-end most certainly, which gives you the maximum pay-off. Uniform<br> discard conditions are therefore the best-case scenario for this<br> optimization pass.<span><br></span></blockquote><div><br></div></div></div><div>Yes, that is correct. Fortunately, things that discard tend to discard fairly large chunks of the polygon at one time, so this case is fairly common.<br></div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span> ><br> >> Optimizing for the latter case is an essentially<br> >> heuristic assumption that needs to be verified experimentally. 
Have you<br> >> tested the effect of this pass on non-DX workloads extensively?<br> >><br> ><br> > Yes, it is a trade-off. No, I have not done particularly extensive<br> > testing. We do, however, know of non-DXVK workloads that would benefit<br> > from this. I believe Manhattan is one such example though I have not yet<br> > benchmarked it.<br> ><br> <br> </span>You should grab some numbers then to make sure there are no<br> regressions...</blockquote><div><br></div></span><div>I'm working on that. Unfortunately, the perf system is giving me trouble, so I don't have the numbers yet.<br></div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">But keep in mind that the i965 scheduler is already<br> performing a similar optimization (locally, but with cycle-count<br> information). This will only help over the existing optimization if the<br> shaders that represent a bottleneck in Manhattan have sufficient control<br> flow for the basic block boundaries to represent a problem to the<br> (local) scheduler.<br></blockquote><div><br></div></span><div>I'm not sure about the Manhattan shader, but the Skyrim shader does have control flow which the discard has to get moved above.<br></div></div></div></div> </blockquote></div></div><div class="gmail_extra"><br></div><div class="gmail_extra">I have results from the perf system now, and somehow this pass makes Manhattan noticeably worse. I'll look into that.<br></div></div>