[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

Francisco Jerez currojerez at riseup.net
Thu Jul 19 22:34:59 UTC 2018


Chema Casanova <jmcasanova at igalia.com> writes:

> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>> Jose Maria Casanova Crespo <jmcasanova at igalia.com> writes:
>> 
>>> For a register source/destination of an instruction the function returns
>>> the read/write byte pattern of a 32-byte register as an unsigned int.
>>>
>>> The returned pattern takes into account the exec_size of the instruction,
>>> the type bitsize, the stride and whether the register is a source or destination.
>>>
>>> The objective of the function is to help determine the read/written bytes
>>> of the instructions to improve the liveness analysis for partial reads/writes.
>>>
>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending on the bitsize
>>> parameter they have a different read pattern.
>>> ---
>>>  src/intel/compiler/brw_fs.cpp  | 183 +++++++++++++++++++++++++++++++++
>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>  2 files changed, 184 insertions(+)
>>>
>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>> index 2b8363ca362..f3045c4ff6c 100644
>>> --- a/src/intel/compiler/brw_fs.cpp
>>> +++ b/src/intel/compiler/brw_fs.cpp
>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>             this->dst.offset % REG_SIZE != 0);
>>>  }
>>>  
>>> +/**
>>> + * Returns a 32-bit uint whose bits represent if the associated register byte
>>> + * has been read/written by the instruction. The returned pattern takes into
>>> + * account the exec_size of the instruction, the type bitsize, the register
>>> + * stride and whether the register is a source or destination of the
>>> + * instruction.
>>> + *
>>> + * The objective of this function is to identify which parts of the register
>>> + * are read or written for operations that don't read/write a full register.
>>> + * So we can identify in live range variable analysis if a partial write has
>>> + * completely defined the part of the register used by a partial read. So we
>>> + * avoid extending the live range when all the data read was already defined
>>> + * even though the register wasn't completely written.
>>> + */
>>> +unsigned
>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>> +{
>>> +   if (is_dst) {
>
>> Please split into two functions (like fs_inst::src_read and
>> ::src_written) since that would make the call-sites of this method more
>> self-documenting than a boolean parameter.  You should be able to share
>> code by refactoring the common logic into a separate function (see below
>> for some suggestions on how that could be achieved).
>
> Sure, it would improve readability and simplifies the logic, I've chosen
> dst_write_pattern and src_read_pattern.
>
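
A rough sketch of how the split could look; the exact v2 signatures are a
guess based on this thread (the source index and reg_offset parameters are
discussed further below):

   /* Byte-use pattern written to GRF "reg_offset" of the destination. */
   unsigned dst_write_pattern(unsigned reg_offset) const;

   /* Byte-use pattern read from GRF "reg_offset" of source "i". */
   unsigned src_read_pattern(unsigned i, unsigned reg_offset) const;

   /* Call sites then document themselves, e.g.: */
   const unsigned mask = inst->src_read_pattern(i, reg_offset);
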
>> 
>>> +      /* We don't know what is written so we return the worts case */
>> 
>> "worst"
>
> Fixed.
>
>>> +      if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>> +         return 0;
>>> +      /* We assume that send destinations are completely written */
>>> +      if (this->is_send_from_grf())
>>> +         return ~0u;
>> 
>> Some send-like instructions won't be caught by this condition, you
>> should check for this->mlen != 0 in addition.
>
> Would it be enough to check for (this->mlen > 0) and forget about
> is_send_from_grf? I am using this approach in the v2 I am sending.
>

I don't think the mlen > 0 condition would catch all cases either...
E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC.  You probably need both
conditions.  Sucks...
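
Something like the following, i.e. a minimal sketch of the destination
early-outs with both checks combined (a member-function fragment for
illustration, not the actual v2 code):

   /* We don't know which bytes a predicated write (other than SEL)
    * really defines, so return the worst case. */
   if (this->predicate && this->opcode != BRW_OPCODE_SEL)
      return 0;

   /* Neither check alone catches every send-like instruction (see the
    * FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD example above), so use both
    * and assume the destination is completely written. */
   if (this->is_send_from_grf() || this->mlen > 0)
      return ~0u;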

>>> +   } else {
>>> +      /* The src[1] data of byte_scattered_write_logical is 32-bit aligned,
>>> +       * so the read pattern depends on the bitsize stored in src[4].
>>> +       */
>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>> +          this->src[1].nr == r.nr) {
>
>> I feel uncomfortable about attempting to guess the source the caller is
>> referring to by comparing the registers for equality.  E.g.  you could
>> potentially end up with two sources that compare equal but have
>> different semantics (e.g. as a result of CSE) which might cause it to
>> get the wrong answer.  It would probably be better to pass a source
>> index and a byte offset as argument instead of an fs_reg.
>
> I hadn't thought about CSE. I'm now receiving the source number
> and the reg_offset. I'm using reg_offset instead of byte offsets as it
> simplifies the logic. Now we always use the base src register to
> do all the calculations.
>>> +         switch (this->src[4].ud) {
>>> +         case 32:
>>> +            return ~0u;
>>> +         case 16:
>>> +            return 0x33333333;
>>> +         case 8:
>>> +            return 0x11111111;
>>> +         default:
>>> +            unreachable("Unsupported bitsize at byte_scattered_write_logical");
>>> +         }
>> 
>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>> this->src[4].ud / 8)" (see below for the definition).
>
> Ok.
>
>>> +      }
>>> +      /* As for byte_scattered_write_logical, but we need to take into account
>>> +       * that the data written is at payload offset 32 with SIMD8 and offset
>>> +       * 64 with SIMD16.
>>> +       */
>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>> +          this->src[0].nr == r.nr) {
>>> +         fs_reg payload = this->src[0];
>>> +         payload.offset = REG_SIZE * this->exec_size / 8;
>> 
>> byte_offset() is your friend.
>
> I've removed the overlap logic, and I'm just checking whether we are at
> reg_offset 1 in SIMD8 or reg_offset 2-3 in SIMD16.
>
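
For illustration, the v2-style check described above might look roughly
like this inside src_read_pattern(), using the periodic_mask() helper
quoted further below (a hypothetical fragment, the actual v2 code may
differ):

   /* The payload of byte_scattered_write holds the addresses first and
    * the data after them, i.e. the data starts at GRF exec_size / 8 of
    * src[0]: GRF 1 for SIMD8, GRFs 2-3 for SIMD16. */
   if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && i == 0) {
      if (reg_offset >= this->exec_size / 8)
         return periodic_mask(8, 4, this->src[2].ud / 8);
      else
         return ~0u;   /* Address part: assume it is read completely. */
   }
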
>>> +         if (regions_overlap(r, REG_SIZE,
>>> +                             payload, REG_SIZE * this->exec_size / 8)) {
>>> +            switch (this->src[2].ud) {
>>> +            case 32:
>>> +               return ~0u;
>>> +            case 16:
>>> +               return 0x33333333;
>>> +            case 8:
>>> +               return 0x11111111;
>>> +            default:
>>> +               unreachable("Unsupported bitsize at byte_scattered_write");
>>> +            }
>> 
>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>> this->src[2].ud / 8)".
>
> Ok.
>
>>> +         } else {
>>> +            return ~0u;
>>> +         }
>>> +      }
>>> +   }
>>> +
>>> +   /* We define the most conservative value in order to calculate the live
>>> +    * range. If it is a destination nothing is defined and if it is a source
>>> +    * all the bytes of the register could be read. So for release builds
>>> +    * the unreachables always have a safe return value. */
>>> +   unsigned pattern =  is_dst ? 0 : ~0u;
>>> +
>>> +   /* In the general case we calculate the pattern for a specific register
>>> +    * based on the type_size and stride. We calculate the SIMD8 pattern
>>> +    * and then adjust the pattern if needed for different exec_sizes
>>> +    * and offsets.
>>> +    */
>>> +   switch (type_sz(r.type)) {
>>> +   case 1:
>>> +      switch (r.stride) {
>>> +      case 0:
>>> +         pattern = 0x1;
>>> +         break;
>>> +      case 1:
>>> +         pattern = 0xff;
>>> +         break;
>>> +      case 2:
>>> +         pattern = 0x5555;
>>> +         break;
>>> +      case 4:
>>> +         pattern = 0x11111111;
>>> +         break;
>>> +      case 8:
>>> +         pattern = 0x01010101;
>>> +         break;
>>> +      default:
>>> +         unreachable("Unknown pattern unsupported 8-bit stride");
>>> +      }
>>> +      break;
>>> +   case 2:
>>> +      switch (r.stride) {
>>> +      case 0:
>>> +         pattern = 0x3;
>>> +         break;
>>> +      case 1:
>>> +         pattern = 0xffff;
>>> +         break;
>>> +      case 2:
>>> +         pattern = 0x33333333;
>>> +         break;
>>> +      case 4:
>>> +         pattern = 0x03030303;
>>> +         break;
>>> +      case 8:
>>> +         pattern = 0x00030003;
>>> +         break;
>>> +      default:
>>> +         unreachable("Unknown pattern unsupported 16-bit stride");
>>> +      }
>>> +      break;
>>> +   case 4:
>>> +      switch (r.stride) {
>>> +      case 0:
>>> +         pattern = 0xf;
>>> +         break;
>>> +      case 1:
>>> +         pattern = ~0u;
>>> +         break;
>>> +      case 2:
>>> +         pattern = 0x0f0f0f0f;
>>> +         break;
>>> +      case 4:
>>> +         pattern = 0x000f000f;
>>> +         break;
>>> +      default:
>>> +         unreachable("Unknown pattern unsupported 32-bit stride");
>>> +      }
>>> +      break;
>>> +   case 8:
>>> +      switch (r.stride) {
>>> +      case 0:
>>> +         pattern = 0xff;
>>> +         break;
>>> +      case 1:
>>> +         pattern = ~0u;
>>> +         break;
>>> +      case 2:
>>> +         pattern = 0x00ff00ff;
>>> +         break;
>>> +      case 4:
>>> +         pattern = 0xff;
>>> +         break;
>>> +      default:
>>> +         unreachable("Unknown pattern unsupported 64-bit stride");
>>> +      }
>>> +      break;
>>> +   default:
>>> +      unreachable("Unknown pattern for unsupported bitsize ");
>>> +   }
>>> +
>>> +   if (this->exec_size > 8  && r.stride * type_sz(r.type) * 8 < REG_SIZE) {
>>> +      /* For exec_size greater than SIMD8 we repeat the pattern until it
>>> +       * represents a full register. */
>>> +      pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type)));
>>> +      if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < REG_SIZE)
>>> +         pattern = pattern | (pattern << (16 * r.stride * type_sz(r.type)));
>>> +   } else if (this->exec_size < 8 &&
>>> +              r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) {
>>> +      /* For exec_size smaller than SIMD8 we reduce the pattern if its size
>>> +       * is smaller than a full register. */
>>> +      pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * r.stride) -
>>> +                            this->exec_size * type_sz(r.type) * r.stride);
>>> +   }
>>> +
>
>> This seems really mad, no clue whether it's correct.   Why not replace
>> the above ~110 lines with a call to the following (fully
>> untested) 5-LOC function:
>
> Your suggestion seems to work perfectly fine. My original approach was
> trying to avoid the loop when building the read/write pattern, but after
> testing my v2 I wasn't able to notice any performance difference running
> shader-db, and the results were the same. I was originally trying to keep
> the SIMD8 patterns as compile-time constants. Sorry for the added complexity.
>

The loop runs for a logarithmic number of iterations though, so it has
the exact same run-time complexity as your original patch, roughly the
same number of branches at run-time (but possibly fewer indirect
branches!), and it should be compiled into a substantially lower number
of instructions, which may actually cause it to perform better due to
more favourable caching.  It's hard to tell though which one will
perform better in practice without benchmarking them, and as you
probably realized this is so far from being a bottleneck that whatever
the difference was is likely lost in the noise.  So it really doesn't
matter which one performs better...

>> 
>> | uint32_t
>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>> | {
>> |    uint32_t m = (count ? (1 << bits) - 1 : 0);
>> |    const unsigned max = MIN2(count * step, sizeof(m) * CHAR_BIT);
>> | 
>> |    for (unsigned shift = step; shift < max; shift *= 2)
>> |       m |= m << shift;
>> | 
>> |    return m;
>> | }
>
> I've used your function, just changing the sizeof(m) * CHAR_BIT to
> REG_SIZE, so as not to include limits.h.

My intention was to make the function as agnostic to IR details as
possible: the only reason there is a limit of 32 bits is because that's
the size of the type used to hold the return value.  Using sizeof makes
sure that e.g. extending the code to 64 bits is as simple as changing
the datatype to uint64_t.
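
For reference, a standalone sanity check of the function above (with
Mesa's MIN2 spelled out so it compiles on its own); the asserts just
confirm that the suggested calls reproduce the constants from the switch
statements in the patch:

   #include <assert.h>
   #include <limits.h>
   #include <stdint.h>

   #define MIN2(a, b) ((a) < (b) ? (a) : (b))

   static uint32_t
   periodic_mask(unsigned count, unsigned step, unsigned bits)
   {
      uint32_t m = (count ? (1u << bits) - 1 : 0);
      const unsigned max = MIN2(count * step, sizeof(m) * CHAR_BIT);

      for (unsigned shift = step; shift < max; shift *= 2)
         m |= m << shift;

      return m;
   }

   int
   main(void)
   {
      /* byte_scattered_write(_logical): periodic_mask(8, 4, bitsize / 8). */
      assert(periodic_mask(8, 4, 32 / 8) == ~0u);
      assert(periodic_mask(8, 4, 16 / 8) == 0x33333333);
      assert(periodic_mask(8, 4,  8 / 8) == 0x11111111);

      /* General case, assuming the call becomes
       * periodic_mask(exec_size, stride * type_sz, type_sz): a few
       * entries from the original 16-bit and 64-bit switches. */
      assert(periodic_mask(8, 1 * 2, 2) == 0xffff);      /* 16-bit, stride 1 */
      assert(periodic_mask(8, 2 * 2, 2) == 0x33333333);  /* 16-bit, stride 2 */
      assert(periodic_mask(16, 1 * 2, 2) == ~0u);        /* same, SIMD16     */
      assert(periodic_mask(8, 2 * 8, 8) == 0x00ff00ff);  /* 64-bit, stride 2 */
      return 0;
   }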

> And I've also included an offset parameter that allows us to shift the
> bits of the pattern when we have an offset inside the register.
>

That sounds fine.

>>> +   /* We adjust the pattern to the byte_offset of the register */
>>> +   pattern = pattern << (r.offset % REG_SIZE);
>>> +
>> 
>> This doesn't really work except for r equal to the first GRF read by the
>> source.  Regions with non-zero sub-GRF offset that straddle multiple
>> GRFs are not really very common at the IR level, but they're allowed by
>> the hardware with some restrictions, so it would make sense to at least
>> handle them conservatively in order to keep things from breaking silently.
>
>
> I included an assert in the periodic_mask function to detect if the
> combination of offset and mask goes over the REG_SIZE, so we would
> detect a straddle at this level of the IR.
>
> assert(offset + max - (step - bits) <= REG_SIZE);

Uhm...  What's the definition of "offset" here?  You seem to be passing
the offset relative to the start of the VGRF modulo REG_SIZE but that
doesn't really make sense to me whenever "reg_offset" is non-zero.  I
think you want to pass "src[i].offset % REG_SIZE - reg_offset *
REG_SIZE" as a signed integer in order to get the offset of the first
byte actually written by the instruction relative to the first byte of
the GRF window of the pattern.  You don't really need to assert-fail
when the offset is greater or equal to 32 (which shouldn't actually
happen in practice), "return 0" gives you the correct behavior.  For
negative offsets (which means the pattern starts after the first byte
written by the instruction) you can just return ~0u conservatively
whenever the current logic wouldn't work [assuming you don't feel like
implementing the code to handle that case accurately ;)].
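
In code, the handling described above could look something like this (an
illustrative sketch for the source read pattern; for the destination the
conservative fallback would be 0 instead of ~0u, since nothing is known
to be fully written):

   #include <stdint.h>

   #define REG_SIZE 32

   /* Shift "pattern" to the position of the first byte read by the source
    * within the GRF window selected by "reg_offset".  "src_offset" stands
    * for src[i].offset; the names are illustrative only. */
   static uint32_t
   src_pattern_at_offset(uint32_t pattern, unsigned src_offset,
                         unsigned reg_offset)
   {
      const int offset = (int)(src_offset % REG_SIZE) -
                         (int)(reg_offset * REG_SIZE);

      if (offset >= REG_SIZE)
         return 0;     /* Region starts past this GRF (shouldn't happen). */
      else if (offset < 0)
         return ~0u;   /* Region straddles into this GRF: be conservative. */
      else
         return pattern << offset;
   }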

> Thanks for the review, I think that v2 has better shape.
>

No problem.

> Chema
>
>>> +   assert(pattern);
>>> +
>>> +   return pattern;
>>> +}
>>> +
>>> +
>>>  unsigned
>>>  fs_inst::components_read(unsigned i) const
>>>  {
>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>> index 92dad269a34..5ea6294b8ad 100644
>>> --- a/src/intel/compiler/brw_ir_fs.h
>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>> @@ -350,6 +350,7 @@ public:
>>>     bool equals(fs_inst *inst) const;
>>>     bool is_send_from_grf() const;
>>>     bool is_partial_write() const;
>>> +   unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>     bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>     unsigned components_read(unsigned i) const;
>>>     unsigned size_read(int arg) const;
>>> -- 
>>> 2.17.1
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>