[Mesa-dev] [PATCH 8/8] i965/vs: Simplify fs_visitor's ATTR file.

Tue Aug 18 10:39:00 PDT 2015

On Tue, Aug 18, 2015 at 1:28 AM, Kenneth Graunke <kenneth at whitecape.org> wrote:
> On Monday, August 17, 2015 09:53:20 PM Jason Ekstrand wrote:
>> On Aug 17, 2015 4:08 PM, "Kenneth Graunke" <kenneth at whitecape.org> wrote:
>> >
>> > Previously, ATTR was indexed by VERT_ATTRIB_* slots; at the end of
>> > compilation, assign_vs_urb_setup() translated those into GRF units,
>> > and converted ATTR to HW_REGs.
>> >
>> > This patch moves the transslation earlier, making ATTR work in terms of
>> > GRF units from the beginning.  assign_vs_urb_setup() simply has to add
>> > the number of payload registers and push constants to obtain the final
>> > hardware GRF number.  (We can't do this earlier as those values aren't
>> > known.)
>> >
>> > ATTR still supports reg_offset; however, it's simply added to reg.
>> > It's not clear whether this is valuable or not.
>> >
>> > Signed-off-by: Kenneth Graunke <kenneth at whitecape.org>
>> > ---
>> >  src/mesa/drivers/dri/i965/brw_fs.cpp         | 26 +++------------
>> >  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |  2 +-
>> >  src/mesa/drivers/dri/i965/brw_nir.c          | 48 +++++++++++++++++++++++++---
>> >  3 files changed, 50 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > index efabb52..f556ed6 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > @@ -1511,10 +1511,9 @@ void
>> >  fs_visitor::assign_vs_urb_setup()
>> >  {
>> >     brw_vs_prog_data *vs_prog_data = (brw_vs_prog_data *) prog_data;
>> > -   int grf, count, slot, channel, attr;
>> >
>> >     assert(stage == MESA_SHADER_VERTEX);
>> > -   count = _mesa_bitcount_64(vs_prog_data->inputs_read);
>> > +   int count = _mesa_bitcount_64(vs_prog_data->inputs_read);
>> >     if (vs_prog_data->uses_vertexid || vs_prog_data->uses_instanceid)
>> >        count++;
>> >
>> > @@ -1534,25 +1533,10 @@ fs_visitor::assign_vs_urb_setup()
>> >     foreach_block_and_inst(block, fs_inst, inst, cfg) {
>> >        for (int i = 0; i < inst->sources; i++) {
>> >           if (inst->src[i].file == ATTR) {
>> > -
>> > -            if (inst->src[i].reg == VERT_ATTRIB_MAX) {
>> > -               slot = count - 1;
>> > -            } else {
>> > -               /* Attributes come in in a contiguous block, ordered by their
>> > -                * gl_vert_attrib value.  That means we can compute the slot
>> > -                * number for an attribute by masking out the enabled
>> > -                * attributes before it and counting the bits.
>> > -                */
>> > -               attr = inst->src[i].reg + inst->src[i].reg_offset / 4;
>> > -               slot = _mesa_bitcount_64(vs_prog_data->inputs_read &
>> > -                                        BITFIELD64_MASK(attr));
>> > -            }
>> > -
>> > -            channel = inst->src[i].reg_offset & 3;
>> > -
>> > -            grf = payload.num_regs +
>> > -               prog_data->curb_read_length +
>> > -               slot * 4 + channel;
>> > +            int grf = payload.num_regs +
>> > +                      prog_data->curb_read_length +
>> > +                      inst->src[i].reg +
>> > +                      inst->src[i].reg_offset;
>> >
>> >              inst->src[i].file = HW_REG;
>> >              inst->src[i].fixed_hw_reg =
>> > diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> > index 111db8c..0ec8ef5 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> > +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>> > @@ -53,7 +53,7 @@ fs_reg *
>> >  fs_visitor::emit_vs_system_value(int location)
>> >  {
>> >     fs_reg *reg = new(this->mem_ctx)
>> > -      fs_reg(ATTR, VERT_ATTRIB_MAX, BRW_REGISTER_TYPE_D);
>> > +      fs_reg(ATTR, 4*_mesa_bitcount_64(prog->InputsRead), BRW_REGISTER_TYPE_D);

It would be good to add a comment here about why you're using 4 *
popcount(InputsRead).  It's not immediately obvious.

>> >     brw_vs_prog_data *vs_prog_data = (brw_vs_prog_data *) prog_data;
>> >
>> >     switch (location) {
>> > diff --git a/src/mesa/drivers/dri/i965/brw_nir.c b/src/mesa/drivers/dri/i965/brw_nir.c
>> > index 2e2d02b..e0cf61e 100644
>> > --- a/src/mesa/drivers/dri/i965/brw_nir.c
>> > +++ b/src/mesa/drivers/dri/i965/brw_nir.c
>> > @@ -27,8 +27,36 @@
>> >  #include "glsl/nir/glsl_to_nir.h"
>> >  #include "program/prog_to_nir.h"
>> >
>> > +static bool
>> > +remap_vs_attrs(nir_block *block, void *closure)
>> > +{
>> > +   GLbitfield64 inputs_read = *((GLbitfield64 *) closure);
>> > +
>> > +   nir_foreach_instr(block, instr) {
>> > +      if (instr->type != nir_instr_type_intrinsic)
>> > +         continue;
>> > +
>> > +      nir_intrinsic_instr *intrin = nir_instr_as_intrinsic(instr);
>> > +
>> > +      /* We set EmitNoIndirect for VS inputs, so there are no indirects. */
>> > +      assert(intrin->intrinsic != nir_intrinsic_load_input_indirect);
>> > +
>> > +      if (intrin->intrinsic == nir_intrinsic_load_input) {
>> > +         /* Attributes come in a contiguous block, ordered by their
>> > +          * gl_vert_attrib value.  That means we can compute the slot
>> > +          * number for an attribute by masking out the enabled attributes
>> > +          * before it and counting the bits.
>> > +          */
>> > +         int attr = intrin->const_index[0];
>> > +         int slot = _mesa_bitcount_64(inputs_read & BITFIELD64_MASK(attr));
>>
>> Um... Can't we just set var->data.driver_location to this value? What's the
>> point of it being its own pass?
>
> Nope.  I'm not sure how to do it in one pass.
>
> 1. We initialize var->driver_location to the VERT_ATTRIB_* value for the
>    base of the variable.
>
> 2. We use nir_lower_io() to follow the dereference chains and compute
>    the location of the data we're actually accessing.
>
>    OpenGL specifies that array slots and matrix columns each get their
>    own generic vertex attribute slot (each of which is a vec4).  So,
>    we use type_size_vec4() to make get_io_offset count in vec4 units.
>
> 3. However...vertex attributes are packed - unused slots don't actually
>    take up space, so everything gets compacted down to consecutive
>    slots.  This is the bitcount trick.
>
> So, for example, if we have a shader that declares:
>
>    layout(location = 1) in mat4 m;
>
> then var->data.location will be VERT_ATTRIB_GENERIC1.  A dereference
> of m[3] would compute a slot value of VERT_ATTRIB_GENERIC1 + 3
> (VERT_ATTRIB_GENERIC4).
>
> However, if the shader doesn't access m[0], m[1], or m[2]...then those
> columns won't actually be uploaded.  We can't tell this by looking at
> a single dereference, or even a variable declaration.  It depends on
> what the whole shader does.
>
> So we look at prog->InputsRead.  For a particular VERT_ATTRIB_* slot,
> we mask off the bits before that one, and bitcount them.  This tells
> us how many attributes are actually active before then, which is the
> actual VUE offset/register count.
>
> I'm not a huge fan of the two-pass approach, but the only alternative
> would be to make get_io_offset() look at InputsRead somehow and figure
> out whether array elements actually matter when adding up offsets.
> Which means something a lot fancier than using type_size()...
>
> Alternatively, we could have nir_lower_io take a callback that remaps
> the driver location after computing it...which still gives us the two
> stage computation...but avoids the second walk over the IR.  But I don't
> know if that's generally useful...

Yeah, I can't think of any one-pass way to do it either that isn't
terrible.  I'm fine with the two-pass.  Please add as much of the
above as you think appropriate as a comment (or comments) to
lower_scalar_vs_inputs and renam_vs_attrs.  It might also be good to
call it "remap_vs_inputs" to match the name of the other function.

Also, Have you tried doing this with the vec4 backend?  I think we
should be able to do it without affecting the vec4_visitor code.
Also, how does this interact with geometry shaders?  (That might be
easier to test in vec4).

I think, overall, I like this approach.  It makes the lowering do what
it's actually supposed to do, namely, turn variables into the actual
offsets we care about.  I'd just like to see how it interacts with
other stages.

> Does that make more sense?
>
>> > +         intrin->const_index[0] = 4 * slot;
>> > +      }
>> > +   }
>> > +   return true;
>> > +}
>> > +