[Mesa-dev] [PATCH 3/6] i965/fs: Do a general SEND dependency workaround for the original 965.

Ian Romanick idr at freedesktop.org
Thu Feb 14 14:12:45 PST 2013


On 02/13/2013 07:07 PM, Kenneth Graunke wrote:
> On 02/06/2013 05:29 PM, Eric Anholt wrote:
>> We'd been ad-hoc inserting instructions in some SEND messages with no
>> knowledge of when it was required (so extra instructions), but not all
>> SENDs
>> (so not often enough).  This should do much better than that, though it's
>> still flow-control-ignorant.
>>
>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=58960
>> NOTE: Candidate for the stable branches.
>> ---
>>   src/mesa/drivers/dri/i965/brw_fs.cpp      |  225
>> +++++++++++++++++++++++++++++
>>   src/mesa/drivers/dri/i965/brw_fs.h        |    4 +
>>   src/mesa/drivers/dri/i965/brw_fs_emit.cpp |   42 ------
>>   3 files changed, 229 insertions(+), 42 deletions(-)
>
> Presumably we should have one of these for the vertex shader as well. :(
>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> index fdccd75..264c8c2 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> @@ -258,6 +258,26 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(fs_reg
>> dst, fs_reg surf_index,
>>      return instructions;
>>   }
>>
>> +/**
>> + * A helper for MOV generation for fixing up broken hardware SEND
>> dependency
>> + * handling.
>> + */
>> +fs_inst *
>> +fs_visitor::DEP_RESOLVE_MOV(int grf)
>> +{
>> +   fs_inst *inst = MOV(brw_null_reg(), fs_reg(GRF, grf,
>> BRW_REGISTER_TYPE_F));
>> +
>> +   inst->ir = NULL;
>> +   inst->annotation = "send dependency resolve";
>> +
>> +   /* The caller always wants uncompressed to emit the minimal extra
>> +    * dependencies, and to avoid having to deal with aligning its
>> regs to 2.
>> +    */
>> +   inst->force_uncompressed = true;
>> +
>> +   return inst;
>> +}
>> +
>>   bool
>>   fs_inst::equals(fs_inst *inst)
>>   {
>> @@ -2228,6 +2248,205 @@ fs_visitor::remove_duplicate_mrf_writes()
>>      return progress;
>>   }
>>
>> +static void
>> +clear_deps_for_inst_src(fs_inst *inst, int dispatch_width, bool *deps,
>> +                        int first_grf, int grf_len)
>> +{
>> +   bool inst_16wide = (dispatch_width > 8 &&
>> +                       !inst->force_uncompressed &&
>> +                       !inst->force_sechalf);
>> +
>> +   /* Clear the flag for registers that actually got read (as
>> expected). */
>> +   for (int i = 0; i < 3; i++) {
>> +      int grf;
>> +      if (inst->src[i].file == GRF) {
>> +         grf = inst->src[i].reg;
>> +      } else if (inst->src[i].file == FIXED_HW_REG &&
>> +                 inst->src[i].fixed_hw_reg.file ==
>> BRW_GENERAL_REGISTER_FILE) {
>> +         grf = inst->src[i].fixed_hw_reg.nr;
>> +      } else {
>> +         continue;
>> +      }
>> +
>> +      if (grf >= first_grf &&
>> +          grf < first_grf + grf_len) {
>> +         deps[grf - first_grf] = false;
>> +         if (inst_16wide)
>> +            deps[grf - first_grf + 1] = false;
>> +      }
>> +   }
>> +}
>> +
>> +/**
>> + * Implements this workaround for the original 965:
>> + *
>> + *     "[DevBW, DevCL] Implementation Restrictions: As the hardware
>> does not
>> + *      check for post destination dependencies on this instruction,
>> software
>> + *      must ensure that there is no destination hazard for the case
>> of ‘write
>> + *      followed by a posted write’ shown in the following example.
>> + *
>> + *      1. mov r3 0
>> + *      2. send r3.xy <rest of send instruction>
>> + *      3. mov r2 r3
>> + *
>> + *      Due to no post-destination dependency check on the ‘send’,
>> the above
>> + *      code sequence could have two instructions (1 and 2) in flight
>> at the
>> + *      same time that both consider ‘r3’ as the target of their
>> final writes.
>> + */
>> +void
>> +fs_visitor::insert_gen4_pre_send_dependency_workarounds(fs_inst *inst)
>> +{
>> +   int write_len = inst->regs_written() * dispatch_width / 8;
>> +   int first_write_grf = inst->dst.reg;
>> +   bool needs_dep[16];
>
> Perhaps:
> bool needs_dep[BRW_MAX_MRF];
>
> I guess it's fine, though...even if someday we expand BRW_MAX_MRF to
> more than 16 (for Sandybridge), this workaround only applies to Gen4
> which will always have exactly 16.
>
>> +   assert(write_len < (int)sizeof(needs_dep) - 1);
>> +
>> +   memset(needs_dep, false, sizeof(needs_dep));
>> +   memset(needs_dep, true, write_len);
>
> The second memset only works if sizeof(bool) == 1.  While that's likely
> true with our toolchain, it's explicitly _not_ guaranteed.

I suspect we've made a *LOT* of assumptions that sizeof(bool) == 1. 
After all, the stdbool.h that we use for C code has:

typedef unsigned char _Bool;

Perhaps a STATIC_ASSERT(sizeof(bool) == 1) somewhere would be better.

>  From the C++98 specification, section 5.3.3 Sizeof [expr.sizeof],
> paragraph 1: "in particular, sizeof(bool) and sizeof(wchar_t) are
> implementation-defined." Footnote 65 reiterates: "sizeof(bool) is not
> required to be 1."
>
> I'd be more comfortable with:
>
>      memset(needs_dep, true, write_len * sizeof(bool));
>
> or the obvious loop.
>
>> +
>> +   clear_deps_for_inst_src(inst, dispatch_width,
>> +                           needs_dep, first_write_grf, write_len);
>> +
>> +   /* Walk backwards looking for writes to registers we're writing which
>> +    * aren't read since being written.  If we hit the start of the
>> program,
>> +    * we assume that there are no outstanding dependencies on entry
>> to the
>> +    * program.
>> +    */
>> +   for (fs_inst *scan_inst = (fs_inst *)inst->prev;
>> +        scan_inst != NULL;
>> +        scan_inst = (fs_inst *)scan_inst->prev) {
>> +
>> +      /* If we hit flow control, assume that there *are* outstanding
>> +       * dependencies, and force their cleanup before our instruction.
>> +       */
>> +      if (scan_inst->is_flow_control()) {
>> +         for (int i = 0; i < write_len; i++) {
>> +            if (needs_dep[i]) {
>> +               inst->insert_before(DEP_RESOLVE_MOV(first_write_grf +
>> i));
>> +            }
>> +         }
>> +      }
>> +
>> +      bool scan_inst_16wide = (dispatch_width > 8 &&
>> +                               !scan_inst->force_uncompressed &&
>> +                               !scan_inst->force_sechalf);
>> +
>> +      /* We insert our reads as late as possible on the assumption
>> that any
>> +       * instruction but a MOV that might have left us an outstanding
>> +       * dependency has more latency than a MOV.
>> +       */
>> +      if (scan_inst->dst.file == GRF &&
>> +          scan_inst->dst.reg >= first_write_grf &&
>> +          scan_inst->dst.reg < first_write_grf + write_len &&
>> +          needs_dep[scan_inst->dst.reg - first_write_grf]) {
>> +         inst->insert_before(DEP_RESOLVE_MOV(scan_inst->dst.reg));
>> +         needs_dep[scan_inst->dst.reg - first_write_grf] = false;
>> +         if (scan_inst_16wide)
>> +            needs_dep[scan_inst->dst.reg - first_write_grf + 1] = false;
>> +      }
>> +
>> +      /* Clear the flag for registers that actually got read (as
>> expected). */
>> +      clear_deps_for_inst_src(scan_inst, dispatch_width,
>> +                              needs_dep, first_write_grf, write_len);
>> +
>> +      /* Continue the loop only if we haven't resolved all the
>> dependencies */
>> +      int i;
>> +      for (i = 0; i < write_len; i++) {
>> +         if (needs_dep[i])
>> +            break;
>> +      }
>> +      if (i == write_len)
>> +         return;
>> +   }
>> +}
>> +
>> +/**
>> + * Implements this workaround for the original 965:
>> + *
>> + *     "[DevBW, DevCL] Errata: A destination register from a send can
>> not be
>> + *      used as a destination register until after it has been
>> sourced by an
>> + *      instruction with a different destination register.
>> + */
>> +void
>> +fs_visitor::insert_gen4_post_send_dependency_workarounds(fs_inst *inst)
>> +{
>> +   int write_len = inst->regs_written() * dispatch_width / 8;
>> +   int first_write_grf = inst->dst.reg;
>> +   bool needs_dep[16];
>
> Perhaps "bool needs_dep[BRW_MAX_MRF];" here too.
>
>> +   assert(write_len < (int)sizeof(needs_dep) - 1);
>> +
>> +   memset(needs_dep, false, sizeof(needs_dep));
>> +   memset(needs_dep, true, write_len);
>
>      memset(needs_dep, true, write_len * sizeof(bool));
>
>> +   /* Walk forwards looking for writes to registers we're writing
>> which aren't
>> +    * read before being written.
>> +    */
>> +   for (fs_inst *scan_inst = (fs_inst *)inst->next;
>> +        !scan_inst->is_tail_sentinel();
>> +        scan_inst = (fs_inst *)scan_inst->next) {
>> +      /* If we hit flow control, force resolve all remaining
>> dependencies. */
>> +      if (scan_inst->is_flow_control()) {
>> +         for (int i = 0; i < write_len; i++) {
>> +            if (needs_dep[i])
>> +
>> scan_inst->insert_before(DEP_RESOLVE_MOV(first_write_grf + i));
>> +         }
>> +      }
>> +
>> +      /* Clear the flag for registers that actually got read (as
>> expected). */
>> +      clear_deps_for_inst_src(scan_inst, dispatch_width,
>> +                              needs_dep, first_write_grf, write_len);
>> +
>> +      /* We insert our reads as late as possible since they're
>> reading the
>> +       * result of a SEND, which has massive latency.
>> +       */
>
> Oh, good idea.
>
>> +      if (scan_inst->dst.file == GRF &&
>> +          scan_inst->dst.reg >= first_write_grf &&
>> +          scan_inst->dst.reg < first_write_grf + write_len &&
>> +          needs_dep[scan_inst->dst.reg - first_write_grf]) {
>> +         scan_inst->insert_before(DEP_RESOLVE_MOV(scan_inst->dst.reg));
>> +         needs_dep[scan_inst->dst.reg - first_write_grf] = false;
>> +      }
>> +
>> +      /* Continue the loop only if we haven't resolved all the
>> dependencies */
>> +      int i;
>> +      for (i = 0; i < write_len; i++) {
>> +         if (needs_dep[i])
>> +            break;
>> +      }
>> +      if (i == write_len)
>> +         return;
>> +   }
>> +
>> +   /* If we hit the end of the program, resolve all remaining
>> dependencies out
>> +    * of paranoia.
>> +    */
>> +   fs_inst *last_inst = (fs_inst *)this->instructions.get_tail();
>> +   assert(last_inst->eot);
>> +   for (int i = 0; i < write_len; i++) {
>> +      if (needs_dep[i])
>> +         last_inst->insert_before(DEP_RESOLVE_MOV(first_write_grf + i));
>> +   }
>> +}
>> +
>> +void
>> +fs_visitor::insert_gen4_send_dependency_workarounds()
>> +{
>> +   if (intel->gen != 4 || intel->is_g4x)
>> +      return;
>
> Is this actually necessary for g4x?
>
> In the documentation, I definitely see [DevBW, DevCL].  This is
> Broadwater (original gen 4 desktop), and Crestline (original gen 4
> mobile).  But I don't see any mention of DevEL (Eaglelake/G45) or DevCTG
> (Cantiga/GM45).
>
> Other than these small comments, this looks good - and much safer than
> the existing hacks!  So assuming the bool thing gets fixed, this is:
>
> Reviewed-by: Kenneth Graunke <kenneth at whitecape.org>
>
>> +   /* Note that we're done with register allocation, so GRF fs_regs
>> always
>> +    * have a .reg_offset of 0.
>> +    */
>> +
>> +   foreach_list_safe(node, &this->instructions) {
>> +      fs_inst *inst = (fs_inst *)node;
>> +
>> +      if (inst->mlen != 0 && inst->dst.file == GRF) {
>> +         insert_gen4_pre_send_dependency_workarounds(inst);
>> +         insert_gen4_post_send_dependency_workarounds(inst);
>> +      }
>> +   }
>> +}
>> +
>>   void
>>   fs_visitor::dump_instruction(fs_inst *inst)
>>   {
>> @@ -2522,6 +2741,12 @@ fs_visitor::run()
>>      assert(force_uncompressed_stack == 0);
>>      assert(force_sechalf_stack == 0);
>>
>> +   /* This must come after all optimization and register allocation,
>> since
>> +    * it inserts dead code that happens to have side effects, and it
>> does
>> +    * so based on the actual physical registers in use.
>> +    */
>> +   insert_gen4_send_dependency_workarounds();
>> +
>>      if (failed)
>>         return false;
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h
>> b/src/mesa/drivers/dri/i965/brw_fs.h
>> index dbf9e05..ee9afa9 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs.h
>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h
>> @@ -285,6 +285,7 @@ public:
>>      fs_inst *IF(fs_reg src0, fs_reg src1, uint32_t condition);
>>      fs_inst *CMP(fs_reg dst, fs_reg src0, fs_reg src1,
>>                   uint32_t condition);
>> +   fs_inst *DEP_RESOLVE_MOV(int grf);
>>
>>      int type_size(const struct glsl_type *type);
>>      fs_inst *get_instruction_generating_reg(fs_inst *start,
>> @@ -329,6 +330,9 @@ public:
>>      bool remove_duplicate_mrf_writes();
>>      bool virtual_grf_interferes(int a, int b);
>>      void schedule_instructions(bool post_reg_alloc);
>> +   void insert_gen4_send_dependency_workarounds();
>> +   void insert_gen4_pre_send_dependency_workarounds(fs_inst *inst);
>> +   void insert_gen4_post_send_dependency_workarounds(fs_inst *inst);
>>      void fail(const char *msg, ...);
>>
>>      void push_force_uncompressed();
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> index 7644652..aa3a616 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> @@ -605,29 +605,8 @@ fs_generator::generate_unspill(fs_inst *inst,
>> struct brw_reg dst)
>>   {
>>      assert(inst->mlen != 0);
>>
>> -   /* Clear any post destination dependencies that would be ignored by
>> -    * the block read.  See the B-Spec for pre-gen5 send instruction.
>> -    *
>> -    * This could use a better solution, since texture sampling and
>> -    * math reads could potentially run into it as well -- anywhere
>> -    * that we have a SEND with a destination that is a register that
>> -    * was written but not read within the last N instructions (what's
>> -    * N?  unsure).  This is rare because of dead code elimination, but
>> -    * not impossible.
>> -    */
>> -   if (intel->gen == 4 && !intel->is_g4x)
>> -      brw_MOV(p, brw_null_reg(), dst);
>> -
>>      brw_oword_block_read_scratch(p, dst,
>> brw_message_reg(inst->base_mrf), 1,
>>                   inst->offset);
>> -
>> -   if (intel->gen == 4 && !intel->is_g4x) {
>> -      /* gen4 errata: destination from a send can't be used as a
>> -       * destination until it's been read.  Just read it so we don't
>> -       * have to worry.
>> -       */
>> -      brw_MOV(p, brw_null_reg(), dst);
>> -   }
>>   }
>>
>>   void
>> @@ -638,19 +617,6 @@
>> fs_generator::generate_uniform_pull_constant_load(fs_inst *inst,
>>   {
>>      assert(inst->mlen != 0);
>>
>> -   /* Clear any post destination dependencies that would be ignored by
>> -    * the block read.  See the B-Spec for pre-gen5 send instruction.
>> -    *
>> -    * This could use a better solution, since texture sampling and
>> -    * math reads could potentially run into it as well -- anywhere
>> -    * that we have a SEND with a destination that is a register that
>> -    * was written but not read within the last N instructions (what's
>> -    * N?  unsure).  This is rare because of dead code elimination, but
>> -    * not impossible.
>> -    */
>> -   if (intel->gen == 4 && !intel->is_g4x)
>> -      brw_MOV(p, brw_null_reg(), dst);
>> -
>>      assert(index.file == BRW_IMMEDIATE_VALUE &&
>>         index.type == BRW_REGISTER_TYPE_UD);
>>      uint32_t surf_index = index.dw1.ud;
>> @@ -661,14 +627,6 @@
>> fs_generator::generate_uniform_pull_constant_load(fs_inst *inst,
>>
>>      brw_oword_block_read(p, dst, brw_message_reg(inst->base_mrf),
>>               read_offset, surf_index);
>> -
>> -   if (intel->gen == 4 && !intel->is_g4x) {
>> -      /* gen4 errata: destination from a send can't be used as a
>> -       * destination until it's been read.  Just read it so we don't
>> -       * have to worry.
>> -       */
>> -      brw_MOV(p, brw_null_reg(), dst);
>> -   }
>>   }
>>
>>   void
>>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
>



More information about the mesa-dev mailing list