[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

Fri Apr 3 08:10:02 PDT 2015

On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <currojerez at riseup.net> wrote:
> Jason Ekstrand <jason at jlekstrand.net> writes:
>
>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <currojerez at riseup.net> wrote:
>>> Jason Ekstrand <jason at jlekstrand.net> writes:
>>>
>>>> Instead of the complicated and broken-by-design pile of heuristics we had
>>>> before, we now have a straightforward lowering:
>>>>
>>>>  1) All header sources are copied directly using force_writemask_all and,
>>>>     since they are guaranteed to be a single register, there are no
>>>>     force_sechalf issues.
>>>>
>>>>  2) All non-header sources are copied using the exact same force_sechalf
>>>>     and saturate modifiers as the LOAD_PAYLOAD operation itself.
>>>>
>>>>  3) In order to accommodate older gens that need interleaved colors,
>>>>     lower_load_payload detects when the destination is a COMPR4 register
>>>>     and automatically interleaves the non-header sources.  The
>>>>     lower_load_payload pass does the right thing here regardless of whether
>>>>     or not the hardware actually supports COMPR4.
>>>
>>> I had a quick glance at the series and it seems to be going in the right
>>> direction.
>>>
>>> One thing I honestly don't like is the ad-hoc and IMHO premature
>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD
>>> instruction has more complex semantics than necessary and the benefit is
>>> unclear to me.
>>>
>>> I suppose that your motivation was to avoid setting force_writemask_all
>>> in LOAD_PAYLOAD instructions with header.  The optimizer should be able
>>> to cope with those flags and get rid of them from the resulting moves
>>> where they are redundant, and if it's not able to it looks like
>>> something that should be fixed anyway.  The explicit handling of headers
>>> is responsible for much of the churn in this series and is likely to
>>> complicate users of LOAD_PAYLOAD and optimization passes that have to
>>> manipulate them.
>>
>> Avoiding force_writemask_all is only half of the motivation and the
>> small half at that.  A header source, more properly defined, is a
>> single physical register that, conceptually, applies to all channels.
>> Effectively, a header source (I should have stated this clearly) has
>> two properties:
>>
>> 1) It has force_writemask_all set
>> 2) It is exactly one physical hardware register.
>>
>> This second property is the more important of the two.  Most of the
>> disaster of the previous LOAD_PAYLOAD implementation was that we did a
>> pile of guesswork and had a ill-conceved "effective width" think in
>> order to figure out how big the register actually was.  Making the
>> user specify which sources are header sources eliminates that
>> guesswork.  It also has the nice side-effect that we can do the right
>> force_writemask_all and we can properly handle COMPR4 for the the
>> user.
>
> Yeah, true, but this seems like the least orthogonal and most annoying
> to use solution for this problem, it forces the caller to provide
> redundant information, it takes into account the saturate flag on some
> arguments and not others, it shuffles sources with respect to the
> specified order when COMPR4 is set, but only for the first four
> non-header sources.  I think any of the following solutions would be
> better-behaved than the current approach:

I don't know that saying which sources are headers is really
redundant.  It's explicit which is what we want.  Yes, the COMPR4
thing is a bit magical but we have to do COMPR4 in lower_load_payload
so we have to have some way of doing it and this method puts the
interleving code in one place instead of two.

> 1/ Use the source width to determine the size of each copy.  This would
>    imply that the source width carries semantic information and hence
>    would have to be left alone by e.g. copy propagation.

That's what do now and it's terrible.  The "effective width" field was
basically a width that gets kept.

> 2/ Use the instruction exec size and flags to determine the properties
>    of *all* copies.  This means that if a header is present the exec
>    size would necessarily have to be 8 and the halves of a 16-wide
>    register would have to be specified separately, which sounds annoying
>    at first but in practice wouldn't necessarily be because it could be
>    handled by the LOAD_PAYLOAD() helper based on the argument widths
>    without running into problems with optimization passes changing the
>    meaning of the instruction.  The semantics of the instruction itself
>    would be as stupid as possible, but the implementation could still
>    trivially recognise 16-wide and COMPR4 copies using the exact same
>    mechanism you are using now.

Yes, that might work.  I'll try and take a swing at it today.  It will
*hopefully* have less code churn than the solution in this series
because the magic will still happen, just in a different place.  Of
course, this solution also requires that we lower everything with
force_writemask_all and *hopefully* we can get rid of those and set
force_sechalf appropriately in optimization passes.

> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
>    simple, say COLLECT and LOAD_HEADER.  "COLLECT dst, src0, ..., srcn"
>    would be equivalent to the LOAD_PAYLOAD instruction described in 2,
>    but it would only be used to load full-width non-header sources of
>    the payload, so you would avoid the need to split 16-wide registers
>    in half.  "LOAD_HEADER dst, header, payload" would handle the
>    asymmetric requirements of prepending a header, like using 8-wide
>    instead of exec_size-wide copies and setting force_writemask_all.
>    You could use mlen to specify the size of payload as is usual for
>    instructions taking a payload source.

I thought about doing something like that but it seems more convoluted
than would be at all worth it.

>> --Jason
>>
>>> Thanks for looking into this BTW.
>>>
>>>> ---
>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp         | 154 ++++++++++++++-------------
>>>>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
>>>>  2 files changed, 80 insertions(+), 77 deletions(-)
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> index f8e26c0..bc45a38 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
>>>>  {
>>>>     bool progress = false;
>>>>
>>>> -   int vgrf_to_reg[alloc.count];
>>>> -   int reg_count = 0;
>>>> -   for (unsigned i = 0; i < alloc.count; ++i) {
>>>> -      vgrf_to_reg[i] = reg_count;
>>>> -      reg_count += alloc.sizes[i];
>>>> -   }
>>>> -
>>>> -   struct {
>>>> -      bool written:1; /* Whether this register has ever been written */
>>>> -      bool force_writemask_all:1;
>>>> -      bool force_sechalf:1;
>>>> -   } metadata[reg_count];
>>>> -   memset(metadata, 0, sizeof(metadata));
>>>> -
>>>>     foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
>>>> -      if (inst->dst.file == GRF) {
>>>> -         const int dst_reg = vgrf_to_reg[inst->dst.reg] + inst->dst.reg_offset;
>>>> -         bool force_sechalf = inst->force_sechalf &&
>>>> -                              !inst->force_writemask_all;
>>>> -         bool toggle_sechalf = inst->dst.width == 16 &&
>>>> -                               type_sz(inst->dst.type) == 4 &&
>>>> -                               !inst->force_writemask_all;
>>>> -         for (int i = 0; i < inst->regs_written; ++i) {
>>>> -            metadata[dst_reg + i].written = true;
>>>> -            metadata[dst_reg + i].force_sechalf = force_sechalf;
>>>> -            metadata[dst_reg + i].force_writemask_all = inst->force_writemask_all;
>>>> -            force_sechalf = (toggle_sechalf != force_sechalf);
>>>> -         }
>>>> -      }
>>>> -
>>>>        if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
>>>>           assert(inst->dst.file == MRF || inst->dst.file == GRF);
>>>> +
>>>>           fs_reg dst = inst->dst;
>>>>
>>>> -         for (int i = 0; i < inst->sources; i++) {
>>>> -            dst.width = inst->src[i].effective_width;
>>>> -            dst.type = inst->src[i].type;
>>>> -
>>>> -            if (inst->src[i].file == BAD_FILE) {
>>>> -               /* Do nothing but otherwise increment as normal */
>>>> -            } else if (dst.file == MRF &&
>>>> -                       dst.width == 8 &&
>>>> -                       brw->has_compr4 &&
>>>> -                       i + 4 < inst->sources &&
>>>> -                       inst->src[i + 4].equals(horiz_offset(inst->src[i], 8))) {
>>>> -               fs_reg compr4_dst = dst;
>>>> -               compr4_dst.reg += BRW_MRF_COMPR4;
>>>> -               compr4_dst.width = 16;
>>>> -               fs_reg compr4_src = inst->src[i];
>>>> -               compr4_src.width = 16;
>>>> -               fs_inst *mov = MOV(compr4_dst, compr4_src);
>>>> +         /* Get rid of COMPR4.  We'll add it back in if we need it */
>>>> +         if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4)
>>>> +            dst.reg = dst.reg & ~BRW_MRF_COMPR4;
>>>> +
>>>> +         dst.width = 8;
>>>> +         for (uint8_t i = 0; i < inst->header_size; i++) {
>>>> +            if (inst->src[i].file != BAD_FILE) {
>>>> +               fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
>>>> +               fs_reg mov_src = retype(inst->src[i], BRW_REGISTER_TYPE_UD);
>>>> +               mov_src.width = 8;
>>>> +               fs_inst *mov = MOV(mov_dst, mov_src);
>>>>                 mov->force_writemask_all = true;
>>>>                 inst->insert_before(block, mov);
>>>> -               /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
>>>> -               inst->src[i + 4].file = BAD_FILE;
>>>> -            } else {
>>>> -               fs_inst *mov = MOV(dst, inst->src[i]);
>>>> -               if (inst->src[i].file == GRF) {
>>>> -                  int src_reg = vgrf_to_reg[inst->src[i].reg] +
>>>> -                                inst->src[i].reg_offset;
>>>> -                  mov->force_sechalf = metadata[src_reg].force_sechalf;
>>>> -                  mov->force_writemask_all = metadata[src_reg].force_writemask_all;
>>>> -               } else {
>>>> -                  /* We don't have any useful metadata for immediates or
>>>> -                   * uniforms.  Assume that any of the channels of the
>>>> -                   * destination may be used.
>>>> -                   */
>>>> -                  assert(inst->src[i].file == IMM ||
>>>> -                         inst->src[i].file == UNIFORM);
>>>> -                  mov->force_writemask_all = true;
>>>> -               }
>>>> +            }
>>>> +            dst = offset(dst, 1);
>>>> +         }
>>>>
>>>> -               if (dst.file == GRF) {
>>>> -                  const int dst_reg = vgrf_to_reg[dst.reg] + dst.reg_offset;
>>>> -                  const bool force_writemask = mov->force_writemask_all;
>>>> -                  metadata[dst_reg].force_writemask_all = force_writemask;
>>>> -                  metadata[dst_reg].force_sechalf = mov->force_sechalf;
>>>> -                  if (dst.width * type_sz(dst.type) > 32) {
>>>> -                     assert(!mov->force_sechalf);
>>>> -                     metadata[dst_reg + 1].force_writemask_all = force_writemask;
>>>> -                     metadata[dst_reg + 1].force_sechalf = !force_writemask;
>>>> +         dst.width = inst->exec_size;
>>>> +         if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) &&
>>>> +             inst->exec_size > 8) {
>>>> +            /* In this case, the payload portion of the LOAD_PAYLOAD isn't
>>>> +             * a straightforward copy.  Instead, the result of the
>>>> +             * LOAD_PAYLOAD is treated as interlaced and the first four
>>>> +             * non-header sources are unpacked as:
>>>> +             *
>>>> +             * m + 0: r0
>>>> +             * m + 1: g0
>>>> +             * m + 2: b0
>>>> +             * m + 3: a0
>>>> +             * m + 4: r1
>>>> +             * m + 5: g1
>>>> +             * m + 6: b1
>>>> +             * m + 7: a1
>>>> +             *
>>>> +             * This is used for gen <= 5 fb writes.
>>>> +             */
>>>> +            assert(inst->exec_size == 16);
>>>> +            assert(inst->header_size + 4 <= inst->sources);
>>>> +            for (uint8_t i = inst->header_size; i < inst->header_size + 4; i++) {
>>>> +               if (inst->src[i].file != BAD_FILE) {
>>>> +                  if (brw->has_compr4) {
>>>> +                     fs_reg compr4_dst = retype(dst, inst->src[i].type);
>>>> +                     compr4_dst.reg |= BRW_MRF_COMPR4;
>>>> +
>>>> +                     fs_inst *mov = MOV(compr4_dst, inst->src[i]);
>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>> +                     mov->saturate = inst->saturate;
>>>> +                     inst->insert_before(block, mov);
>>>> +                  } else {
>>>> +                     /* Platform doesn't have COMPR4.  We have to fake it */
>>>> +                     fs_reg mov_dst = retype(dst, inst->src[i].type);
>>>> +                     mov_dst.width = 8;
>>>> +
>>>> +                     fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0));
>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>> +                     mov->saturate = inst->saturate;
>>>> +                     inst->insert_before(block, mov);
>>>> +
>>>> +                     mov = MOV(offset(mov_dst, 4), half(inst->src[i], 1));
>>>> +                     mov->force_writemask_all = inst->force_writemask_all;
>>>> +                     mov->saturate = inst->saturate;
>>>> +                     mov->force_sechalf = true;
>>>> +                     inst->insert_before(block, mov);
>>>>                    }
>>>>                 }
>>>>
>>>> -               inst->insert_before(block, mov);
>>>> +               dst.reg++;
>>>>              }
>>>>
>>>> +            dst.reg += 4;
>>>> +
>>>> +            /* The COMPR4 code took care of the first 4 sources.  We'll let
>>>> +             * the regular path handle any remaining sources.  Yes, we are
>>>> +             * modifying the instruction but we're about to delete it so
>>>> +             * this really doesn't hurt anything.
>>>> +             */
>>>> +            inst->header_size += 4;
>>>> +         }
>>>> +
>>>> +         for (uint8_t i = inst->header_size; i < inst->sources; i++) {
>>>> +            if (inst->src[i].file != BAD_FILE) {
>>>> +               fs_inst *mov = MOV(retype(dst, inst->src[i].type),
>>>> +                                  inst->src[i]);
>>>> +               mov->force_writemask_all = inst->force_writemask_all;
>>>> +               mov->saturate = inst->saturate;
>>>> +               inst->insert_before(block, mov);
>>>> +            }
>>>>              dst = offset(dst, 1);
>>>>           }
>>>>
>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> index b7eeb47..a0d8b798 100644
>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
>>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg color, unsigned components,
>>>>                                color.type, color.width);
>>>>              inst = emit(MOV(dst[len], offset(color, i)));
>>>>              inst->saturate = key->clamp_fragment_color;
>>>> -         } else if (color.width == 16) {
>>>> -            /* We need two BAD_FILE slots for a 16-wide color */
>>>> -            len++;
>>>>           }
>>>>           len++;
>>>>        }
>>>> --
>>>> 2.3.4
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev