[Mesa-dev] [PATCH 00/10] glsl: Implement varying packing.

Wed Dec 12 09:09:32 PST 2012

On Wed, Dec 12, 2012 at 5:06 PM, Paul Berry <stereotype441 at gmail.com> wrote:
> On 11 December 2012 23:49, Aras Pranckevicius <aras at unity3d.com> wrote:
>>
>>
>>> For the initial implementation I've chosen a strategy that operates
>>> exclusively at the GLSL IR level, so that it doesn't require the
>>> cooperation of the driver back-ends.
>>
>>
>> Wouldn't this negatively affect performance of some GPUs?
>
>
> I'm glad you asked--I've actually had quite a bit of in-person discussion
> with Eric and Ian about this.
>
> With the i965 back-end, we're expecting a slight performance improvement,
> based on the following reasoning:
>
> - Most of the packing/unpacking operations in the shader will be coalesced
> with other operations by optimization passes, so they won't negatively
> impact performance.  This is especially true in the fragment shader, where
> operations are scalarized, so the packing/unpacking should just turn into
> simple scalar copies, and those should be completely eliminated by copy
> propagation.  Most programs spend most of their time in the fragment shader
> anyhow, so the performance penalty is already limited to shaders that have a
> smaller contribution to execution time.
>
> - The extra operations we are talking about are register-to-register
> moves--no memory access is involved, and no ALU resources are tied up.  So
> there's a pretty small upper limit to the performance penalty even in the
> case where optimization can't eliminate the copy.
>
> - Having packed varyings will mean that the vertex shader spends less time
> writing its output to the VUE, and the fragment shader spends less time
> reading its input from the VUE.  We don't know exactly how long these VUE
> reads/writes take (it is difficult to measure them because they are part of
> the process of starting and terminating threads), but it's very likely that
> they take longer than register moves.  So the already-small performance
> penalty discussed above is probably offset by a larger performance
> improvement due to more efficient utilization of the VUE.
>
> I can't speak with authority on the inner workings of the other GPUs
> supported by Mesa, but it seems like most of the arguments above are general
> enough to apply to most GPU architectures, not just i965.
>
> Of course, there could be some important factor that I'm missing that makes
> all of this analysis completely wrong and causes varying packing to carry a
> huge penalty on some architectures.  If that's the case, I think the best
> way to address the problem is to find an application that is slowed down by
> varying packing and run experiments to understand why.
>
> If worse comes to worst, we could of course modify the varying packing code
> so that it only takes effect when there are so many varyings that there is
> no alternative.  But that would carry two disadvantages: it would
> complicate the linker (especially the handling of transform feedback) to
> have to handle both packed and unpacked varying formats, and it would reduce
> test coverage of varying packing to almost nil (since most of our piglit
> tests use a small number of varyings).  Because of those disadvantages, and
> the fact that our current understanding leads us to expect a performance
> improvement, I'd like to save this strategy for a last resort.
>
>>
>>
>> Not sure if relevant for Mesa, but e.g. on PowerVR SGX it's really bad to
>> pack two vec2 texture coordinates into a single vec4. That's because var.xy
>> texture read can be "prefetched", whereas var.zw texture read is not
>> prefetched (essentially treated as a dependent texture read), and often
>> causes stalls in the shader execution.
>
>
> Interesting--I had not thought of that possibility.  On i965 all texture
> reads have to be done explicitly by the fragment shader (there is no
> prefetching IIRC), so this penalty doesn't apply.  Does anyone know if a
> penalty like this exists in any of Mesa's other back-ends?  If so that might
> suggest some good experiments to try.  I'm open to revising my opinion if
> someone measures a significant performance degradation, particularly with a
> real-world app.

R300 and R400 support 4 texture indirections (as defined by
ARB_fragment_program). Adding ALU instructions before the first TEX
instruction increases the number of texture indirections by 1, which
might push some shaders over the limit so that they cannot run on the
hardware at all.
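
To make the counting concrete, here is a rough sketch of the
indirection rule in Python. It is a simplified model of how I read
ARB_fragment_program, not Mesa or driver code: the tuple-based
instruction format, the register names (t* temporaries, v* varyings,
o* outputs) and the function name are all made up for illustration,
and swizzles are ignored.

```python
def count_tex_indirections(program):
    """Count texture indirections for a list of (op, dst, srcs) tuples.

    Simplified model (an assumption, not driver code): a TEX starts a new
    node when its coordinate is a temporary already written in the current
    node, or when its result is a temporary already read or written there.
    """
    indirections = 1  # a program with no dependent TEX has a single node
    written, read = set(), set()
    for op, dst, srcs in program:
        temps_in = [s for s in srcs if s.startswith("t")]
        if op == "TEX":
            depends = any(s in written for s in temps_in)
            clobbers = dst in written or dst in read
            if depends or clobbers:
                indirections += 1
                written, read = set(), set()  # start a fresh node
        read.update(temps_in)
        if dst.startswith("t"):
            written.add(dst)
    return indirections


# Two texture lookups fed directly from two separate varyings.
unpacked = [
    ("TEX", "t0", ["v0"]),
    ("TEX", "t1", ["v1"]),
    ("ADD", "o0", ["t0", "t1"]),
]

# Same shader, but the second coordinate is first copied out of a packed
# varying by an ALU move (hypothetical unpacking code).
packed = [
    ("MOV", "t2", ["v0"]),
    ("TEX", "t0", ["v0"]),
    ("TEX", "t1", ["t2"]),
    ("ADD", "o0", ["t0", "t1"]),
]

print(count_tex_indirections(unpacked), count_tex_indirections(packed))
```

In this model the unpacked shader stays at 1 indirection and the
packed one goes to 2, because the second TEX reads a temporary that an
ALU instruction wrote in the same node.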

I think this optimization should be disabled on drivers where the
texture indirection limit is too low.

Marek