[Mesa-dev] [PATCH 1/5] glsl: Add a union location_tree * type to hold more storage information.

Sat Jan 28 09:54:50 PST 2012

On 24 January 2012 13:05, Vincent Lejeune <vljn at ovi.com> wrote:

> Ir_variable::location field is currently represented by a single int.
> Datas of composite type (arrays, records) is assumed to be stored in a
> linear fashion from this base location. In some situation this assumption
> is not optimal : struct fields cannot be stored in a custom order and a
>  single location (which is often used to identify a vec4-like register in
> hardware) cannot be shared by several datas.
> Location_tree data structure is a tree datatype that can store information
> for every fields of a record (allowing to reorder them) and store the
> location of the leaf type (float, int, veci, everything that can be hold
> in a hardware reg) in a byte-aligned way, so that 2 datas/fields can be
> stored in a same reg using different components.
>

Hi Vincent,

I'm glad you're working on this.  We really need varying packing in order
to be strictly OpenGL conformant.  I was planning on working on it myself
sometime in the next few months, but I have so far been caught up in other
tasks.

Unfortunately I have some big concerns about the approach you are taking:

1. The new data structure your patch series introduces (location_tree) is
extremely general.  It tries to allow for every possible way that a data
type could be laid out inside a shader, including layouts used by both UBOs
and varyings (even though we don't support UBOs yet), including constraints
from layout qualifiers (which AFAIK we don't support, at least not for
varyings), and including varying types that contain structures (which
aren't allowed until GLSL 1.50).  This extra generality makes the code
difficult to understand and to test.  For example, I suspect there's a bug
because i965 assumes that all instances of a given structure have the same
layout, so that it can just copy from one to the other in bulk.  That
assumption will no longer be true with this patch series, because local
structs will have the layout chosen by the i965 back-end, whereas varying
structs will have the layout chosen by varying packing.  But I can't test
that because varying structs don't exist yet.  I also believe that the
extra generality is going to come at a performance cost, because many
operations that used to be simple (e.g. figuring out which locations are
covered by a single varying variable) now require traversing large data
structures and allocating/deallocating memory.

I would strongly recommend trying a more incremental approach, where for
now we make the data structures just general enough to be able to pack the
easy cases (for example to pack two vec2's together, or a vec3 with a vec1,
where neither is an array).  Once that is working, we can move on to harder
cases (like matrices, and ultimately, arrays).  As for supporting packing
of UBOs and structs, and handling layout qualifiers, let's wait on those
until Mesa supports the features that require them.

2. I don't understand how the code deals with packing of varying arrays,
and I'm particularly concerned that it doesn't appear to be making
allowances for different packing needs in different architectures.  For
example, if MAX_VARYING_COMPONENTS is 60, each of the following ought to be
possible (not all at once of course):

varying float foo[60];
varying vec2 foo[30];
varying vec3 foo[20];
varying vec4 foo[15];

I can see two possible ways to pack arrays, and it's not clear from your
code which approach you've taken.  Taking vec2 foo[30] as an example,
here's one possible way to pack the array:

location 0: x=foo[0].x y=foo[0].y z=foo[15].x w=foo[15].y
location 1: x=foo[1].x y=foo[1].y z=foo[16].x w=foo[16].y
...
location 14: x=foo[14].x y=foo[14].y z=foo[29].x w=foo[29].y

I call this a "vertical" packing, because if you look at the locations of
consecutive array elements in the table, they go vertically down the table
first, and then horizontally across once they run out of room.

Here's a packing I would call a "horizontal" packing:

location 0: x=foo[0].x y=foo[0].y z=foo[1].x w=foo[1].y
location 1: x=foo[2].x y=foo[2].y z=foo[3].x w=foo[3].y
...
location 14: x=foo[28].x y=foo[28].y z=foo[29].x w=foo[29].y

Another way to think of this is that each element of the array gets
assigned a horizontal coordinate (x=0, y=1, z=2, w=3) and a vertical
coordinate (the location).  In the horizontal packing example, the
coordinates are h=((index%2)*2) and v=(index/2).  In the vertical packing
example, h=(index/15)*2 and v=(index%15).  Both of the above packings might
be said to have a width of 4 and a height of 15.  Note that we might not
always want to use a width of 4 (see below).

Both the horizontal and vertical arrangements have their pros and cons, and
we might even want different back-ends to be able to favor one or the
other.  For example, in i965's vertex shader, the time it takes to access
an element in a packed array will probably be dominated by the time it
takes to compute the horizontal and vertical coordinates.  In this example,
the coordinates for horizontal packing can be computed using bitwise
operations (because all of the multiplies, divisions, and mods are by
powers of two), so it will be probably be faster than vertical packing.
 But other array sizes will have different tradeoffs.  For example, if an
array is small enough that it can fit in a single vertical column, vertical
packing will be faster, because then the coordinate formulas will be h=0
and v=index.  Other architectures are likely to have different tradeoffs
than i965.

The vec3[20] case is particularly tricky because if you pack it
horizontally with a width of 4, you get a complex pattern:

location 0: x=foo[0].x y=foo[0].y z=foo[0].z w=foo[1].x
location 1: x=foo[1].y y=foo[1].z z=foo[2].x w=foo[2].y
location 2: x=foo[2].z y=foo[3].x z=foo[3].y w=foo[3].z
location 3: x=foo[4].x y=foo[4].y z=foo[4].z w=foo[5].x
...
location 15: x=foo[18].z y=foo[19].x z=foo[19].y w=foo[19].z

The formulas for the horizontal and vertical coordinates are now
h=((index*3)%4) and v=((index*3)/4), which are expensive to compute for
i965.  To make matters worse, accessing the array elements that cross
between two locations will require 2 MOV instructions.  This makes the
horizontal layout very slow for i965.  But I can imagine there might be
other architectures where this is layout is fast.

The alternative, packing vertically, looks like this:

location 0: x=foo[0].x y=foo[0].y z=foo[0].z
location 1: x=foo[1].x y=foo[1].y z=foo[0].z
...
location 19: x=foo[19].x y=foo[19].y z=foo[19].z

In other words, no packing at all, unless there happen to be some scalar
floats that can be packed into the unused w column.  On some architectures,
vertically packing vec3s is impractical, because it takes up too many
locations.  But on i965 gen6, we could actually get away with this, because
we have a few more varying locations than we advertise.

Because of these complex tradeoffs, I think we need a solution for packing
arrays that allows architecture-specific decisions about whether to pack
arrays vertically or horizontally.

3. We really need some extensive Piglit tests to be sure this code is ok.
At the moment we have none.  At a minimum, I would recommend writing tests
to cover the following cases:

varying float[MAX_VARYING_COMPONENTS];
varying vec2[MAX_VARYING_COMPONENTS/2];
varying vec3[MAX_VARYING_COMPONENTS/3];
varying vec4[MAX_VARYING_COMPONENTS/4];
varying mat2[MAX_VARYING_COMPONENTS/4];

And so on for every matrix type, including non-square matrices.  The tests
should access the arrays using non-constant indices (e.g. using a uniform)
so that we can verify that relative addressing works correctly.  (It's not
good enough to write a loop in the shader that accesses each component of
the array, because if the loop is unrolled all of the indices will become
constant).  The matrix tests should do actual matrix operations (not just
access components), so that we can make sure the back-end is interpreting a
packed matrix correctly.

For each of the above combinations, we should also have a test that uses N
separate varyings rather than a single array of size N, so that we test the
packing of both arrays and individual variables.

We also should test a number of combinations of various types.  For example
we should do a test where one varying is an array of
MAX_VARYING_COMPONENTS/4 floats, and another is an array of
MAX_VARYING_COMPONENTS/4 vec3s.  This will verify that vec3's and floats
get paired up properly.

We should also have some test cases where some varyings are flat, others
are noperspective, and others are smooth, and the test verifies that each
varying is interpolated in the correct way.  This is important because some
back-ends can't pack together varyings of different interpolation types.
(For example, i965 gen6 can pack together smooth with noperspective, but it
can't pack either of these together with flat).

In all of these tests, we should also verify that the varying data is
correct when transform feedback is used.  Since transform feedback can
request array data either in bulk or component-by-component, we should test
both of those.

There should also be variants of each of the array tests in which the data
is prepared in a local array and then copied in bulk to the varying (and
vice versa for the fragment shader).  We may have to take special care to
make sure these bulk copies don't get optimized out, so that we are truly
testing that bulk copying works correctly.

We should probably also have a few hand-crafted tests that use a large
hodge-podge of different varying types and interpolation qualifiers, to
reduce the risk of a subtle bug being missed by the other tests.

And of course we should test any corner cases that we discover over the
course of implementing this feature.

4. Some of the more complex packings may be a lot of work to implement in
the back-ends (for example the horizontal vec3 case I talked about above).
Before we go too far we should investigate whether it would be better to do
this with lowering passes at the GLSL IR level rather than by writing
varying packing code for every single back-end.  I can see pros and cons of
both approaches.  On the one hand, it's appealing to only have to write the
code once in GLSL land, and all back-ends get the benefit for free.  On the
other hand, rearranging the varyings in GLSL IR complicates transform
feedback (as I discovered when trying to get gl_ClipDistance to work with
transform feedback).  Also, there may be some kinds of packing that can be
implemented more efficiently if they're done directly by the back end.

A possible hybrid approach would be to start off by implementing varying
packing with a GLSL IR lowering pass.  Then, if we discover that some
back-ends can pack and un-pack more efficiently with custom code, we can
switch off the lowering pass for those back-ends and modify the back-end
code instead.  This approach would have the advantage that all drivers
would get at least some of the benefit of varying packing from the start,
and you wouldn't have to get too familiar with the details of each back-end
until you wanted to heavily optimize things.

5. The patches as they exist right now don't look bisectable--it looks like
patch 1 won't even build because it accesses
ir_variable::complete_location, which doesn't exist until a later patch.
Ideally, every patch in a series should build properly and add a small
increment of functionality--it's not ok if the first commit in a series
breaks some drivers (or breaks the build) and then later commits fix it,
because if a developer later tries to do a "git bisect", and they land in
the middle of the patch series, they may be misled.  IMHO, a sign that
you're doing this well is if the first several patches in a series are all
refactorings that prepare for the new feature, with no actual change in
functionality, and then the last few patches add the new feature.  If those
later patches each say "fixes piglit test blah blah blah" in their commit
message, that's even better, because then when we're reviewing the patch
it'll be easy to understand what user-visible effect it has.

One possible way to do this with varying packing would be as follows
(assuming we do the hybrid approach I suggested in point 4):

First patch series: only try to pack floats and vectors.  Don't try to pack
arrays or matrices.
- Patch 1: add a field to ir_variable that indicates the "horizontal
coordinate" of the location it's been packed into.  (The existing
"location" is the vertical coordinate).  Always set the horizontal
coordinate to 0, so there is no functional change yet.
- Patch 2: add a GLSL IR lowering pass that combines together varyings with
the same vertical coordinate, doing appropriate swizzling to account for
the horizontal coordinate.  Again, there's no functional change since we
aren't packing anything yet.
- Patch 3: modify the linker to pack varyings by assigning each variable
both a horizontal and vertical coordinate, instead of just a single
location.  Assuming you've got good piglit test coverage, this should cause
a bunch of tests to start passing.

Second patch series: try to pack arrays and matrices.  Always pack
vertically.
- Patch 1: add a field to ir_variable that indicates the "height" of the
packed array (I think this will be sufficient for now).  Always set the
height to the size of the array, so there is no functional change yet.
- Patch 2: add a GLSL IR lowering pass that rewrites accesses to arrays and
matrices to deal with how they are packed.  Again, there is no functional
change since we're not packing yet.
- Patch 3: modify the linker to pack arrays and matrices by assigning them
appropriate locations and heights.  Again, this should cause a bunch of
tests to start passing.

Third patch series: Allow arrays and matrices to be packed horizontally
too.  In this patch series, maybe we add some heuristics that each back-end
can customize to account for the way different packings are more or less
efficient on different drivers.

...And so on.

This is just my two cents, and I'd be glad to hear other people's opinions.
 Most of what I'm concerned about is making sure we implement this feature
in small, well-tested increments.  I'm hoping to get some time to work on
varying packing after Mesa 8.0 releases, so with luck, maybe Vincent and I
can team up on some of these tasks.

Thanks,

Paul
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.freedesktop.org/archives/mesa-dev/attachments/20120128/9c5c4d8f/attachment-0001.html>