On 11 December 2012 23:49, Aras Pranckevicius <aras@unity3d.com> wrote:
>> For the initial implementation I've chosen a strategy that operates
>> exclusively at the GLSL IR level, so that it doesn't require the
>> cooperation of the driver back-ends.
>
> Wouldn't this negatively affect performance of some GPUs?

I'm glad you asked -- I've actually had quite a bit of in-person
discussion with Eric and Ian about this.

With the i965 back-end, we're expecting a slight performance
improvement, based on the following reasoning:

- Most of the packing/unpacking operations in the shader will be
coalesced with other operations by optimization passes, so they won't
hurt performance. This is especially true in the fragment shader,
where operations are scalarized, so the packing/unpacking should turn
into simple scalar copies that copy propagation can eliminate
entirely (see the sketch after this list). And since most programs
spend most of their time in the fragment shader anyway, any remaining
penalty is confined to shaders that contribute less to execution
time.

- The extra operations we're talking about are register-to-register
moves -- no memory access is involved, and no ALU resources are tied
up. So even in the case where optimization can't eliminate a copy,
there's a pretty small upper limit on the performance penalty.

- Packed varyings mean the vertex shader spends less time writing its
output to the VUE, and the fragment shader spends less time reading
its input from the VUE. We don't know exactly how long these VUE
reads/writes take (they're difficult to measure because they're part
of the process of starting and terminating threads), but it's very
likely that they take longer than register moves. So the
already-small penalty discussed above is probably offset by a larger
improvement from more efficient utilization of the VUE.
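
To make the first point above concrete, here's a source-level sketch
of the transformation. The pass actually operates on GLSL IR, not on
source text, and the name "packed_tc" is made up for illustration;
this is just the shape of the change, not the pass's literal output:

    /* Before packing: two vec2 varyings, each occupying its own
     * varying slot. */
    uniform sampler2D tex;
    varying vec2 texcoord0;
    varying vec2 texcoord1;

    void main()
    {
        gl_FragColor = texture2D(tex, texcoord0)
                     + texture2D(tex, texcoord1);
    }

    /* After packing (conceptually): one vec4 varying carries both
     * vec2s, and unpacking is just component copies.  In a
     * scalarized fragment shader these are plain register moves,
     * which copy propagation should eliminate entirely. */
    uniform sampler2D tex;
    varying vec4 packed_tc;   /* hypothetical packed slot */

    void main()
    {
        vec2 texcoord0 = packed_tc.xy;
        vec2 texcoord1 = packed_tc.zw;
        gl_FragColor = texture2D(tex, texcoord0)
                     + texture2D(tex, texcoord1);
    }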

I can't speak with authority on the inner workings of the other GPUs
supported by Mesa, but it seems like most of the arguments above are
general enough to apply to most GPU architectures, not just i965.

Of course, there could be some important factor I'm missing that
makes all of this analysis completely wrong and causes varying
packing to carry a huge penalty on some architectures. If that's the
case, I think the best way to address the problem is to find an
application that is slowed down by varying packing and run
experiments to understand why.

If worse comes to worst, we could of course modify the varying
packing code so that it only takes effect when there are so many
varyings that there is no alternative. But that would carry two
disadvantages: it would complicate the linker (especially the
handling of transform feedback), which would have to handle both
packed and unpacked varying formats, and it would reduce test
coverage of varying packing to almost nil (since most of our piglit
tests use a small number of varyings). Because of those
disadvantages, and because our current understanding leads us to
expect a performance improvement, I'd like to save this strategy as a
last resort.

> Not sure if relevant for Mesa, but e.g. on PowerVR SGX it's really
> bad to pack two vec2 texture coordinates into a single vec4. That's
> because a var.xy texture read can be "prefetched", whereas a var.zw
> texture read is not prefetched (it's essentially treated as a
> dependent texture read), and often causes stalls in the shader
> execution.

Interesting -- I had not thought of that possibility. On i965 all
texture reads have to be done explicitly by the fragment shader
(there is no prefetching, IIRC), so this penalty doesn't apply. Does
anyone know whether a penalty like this exists in any of Mesa's other
back-ends? If so, that might suggest some good experiments to try.
I'm open to revising my opinion if someone measures a significant
performance degradation, particularly in a real-world app.
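
For anyone who wants to experiment, here's a minimal fragment shader
sketch of the pattern Aras describes. The names are made up, and the
prefetch/stall behavior is his characterization of SGX, which I
haven't verified myself:

    uniform sampler2D tex0;
    uniform sampler2D tex1;
    varying vec4 packed_tc;   /* two vec2 texcoords packed together */

    void main()
    {
        /* On hardware that prefetches texture reads whose coordinates
         * come straight from a varying's .xy, this read is cheap... */
        vec4 a = texture2D(tex0, packed_tc.xy);

        /* ...but per Aras's description, a read through .zw may be
         * treated like a dependent texture read and stall the
         * shader. */
        vec4 b = texture2D(tex1, packed_tc.zw);

        gl_FragColor = a + b;
    }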