[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

Francisco Jerez currojerez at riseup.net
Fri Jul 20 20:10:16 UTC 2018


Chema Casanova <jmcasanova at igalia.com> writes:

> El 20/07/18 a las 00:34, Francisco Jerez escribió:
>> Chema Casanova <jmcasanova at igalia.com> writes:
>> 
>>> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>>>> Jose Maria Casanova Crespo <jmcasanova at igalia.com> writes:
>>>>
>>>>> For a register source/destination of an instruction, the function returns
>>>>> the read/write byte pattern of a 32-byte register as an unsigned int.
>>>>>
>>>>> The returned pattern takes into account the exec_size of the instruction,
>>>>> the type bitsize, the stride and whether the register is a source or destination.
>>>>>
>>>>> The objective of the function is to help determine the bytes read/written
>>>>> by the instructions, to improve the liveness analysis for partial reads/writes.
>>>>>
>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending on the bitsize
>>>>> parameter they have a different read pattern.
>>>>> ---
>>>>>  src/intel/compiler/brw_fs.cpp  | 183 +++++++++++++++++++++++++++++++++
>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>  2 files changed, 184 insertions(+)
>>>>>
>>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>             this->dst.offset % REG_SIZE != 0);
>>>>>  }
>>>>>  
>>>>> +/**
>>>>> + * Returns a 32-bit uint whose bits represent whether the associated register
>>>>> + * byte has been read/written by the instruction. The returned pattern takes
>>>>> + * into account the exec_size of the instruction, the type bitsize, the
>>>>> + * register stride, and whether the register is a source or destination for
>>>>> + * the instruction.
>>>>> + *
>>>>> + * The objective of this function is to identify which parts of the register
>>>>> + * are read or written for operations that don't read/write a full register.
>>>>> + * So in live range variable analysis we can identify if a partial write has
>>>>> + * completely defined the part of the register used by a partial read, and
>>>>> + * avoid extending the liveness range because all the data read was already
>>>>> + * defined even though the register wasn't completely written.
>>>>> + */
>>>>> +unsigned
>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>>>> +{
>>>>> +   if (is_dst) {
>>>
>>>> Please split into two functions (like fs_inst::src_read and
>>>> ::src_written) since that would make the call-sites of this method more
>>>> self-documenting than a boolean parameter.  You should be able to share
>>>> code by refactoring the common logic into a separate function (see below
>>>> for some suggestions on how that could be achieved).
>>>
>>> Sure, it would improve readability and simplify the logic. I've chosen
>>> dst_write_pattern and src_read_pattern.
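
Sketching the shape for reference (illustrative only -- the argument lists
are just a guess based on the rest of this discussion, not the actual v2):

   /* Bytes of GRF "reg_offset" of the destination defined by this
    * instruction, one bit per byte.
    */
   unsigned dst_write_pattern(unsigned reg_offset) const;

   /* Bytes of GRF "reg_offset" of source "i" read by this instruction,
    * one bit per byte.
    */
   unsigned src_read_pattern(unsigned i, unsigned reg_offset) const;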
>>>
>>>>
>>>>> +      /* We don't know what is written so we return the worts case */
>>>>
>>>> "worst"
>>>
>>> Fixed.
>>>
>>>>> +      if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>>> +         return 0;
>>>>> +      /* We assume that send destinations are completely written */
>>>>> +      if (this->is_send_from_grf())
>>>>> +         return ~0u;
>>>>
>>>> Some send-like instructions won't be caught by this condition, you
>>>> should check for this->mlen != 0 in addition.
>>>
>>> Would it be enough to check for (this->mlen > 0) and forget about
>>> is_send_from_grf? I am using this approach in the v2 I am sending.
>>>
>> 
>> I don't think the mlen > 0 condition would catch all cases either...
>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC.  You probably need both
>> conditions.  Sucks...
>
> That is true, so now we have:
>  (this->is_send_from_grf() || this->mlen != 0)
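
With both conditions, the early-outs of dst_write_pattern would look
roughly like this (just a sketch to make sure we are talking about the
same thing):

   /* We don't know what is written, so return the worst case. */
   if (this->predicate && this->opcode != BRW_OPCODE_SEL)
      return 0;

   /* Assume that send-like destinations are completely written. */
   if (this->is_send_from_grf() || this->mlen != 0)
      return ~0u;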
>
>>>>> +   } else {
>>>>> +      /* The byte_scattered_write_logical data in src[1] is 32-bit aligned,
>>>>> +       * so the read pattern depends on the bitsize stored at src[4]
>>>>> +       */
>>>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>>>> +          this->src[1].nr == r.nr) {
>>>
>>>> I feel uncomfortable about attempting to guess the source the caller is
>>>> referring to by comparing the registers for equality.  E.g.  you could
>>>> potentially end up with two sources that compare equal but have
>>>> different semantics (e.g. as a result of CSE) which might cause it to
>>>> get the wrong answer.  It would probably be better to pass a source
>>>> index and a byte offset as argument instead of an fs_reg.
>>>
>>> I hadn't thought about CSE. I'm now receiving the source number
>>> and the reg_offset. I'm using reg_offset instead of byte offsets as it
>>> simplifies the logic. Now we always use the base src register to
>>> do all the calculations.
>>>>> +         switch (this->src[4].ud) {
>>>>> +         case 32:
>>>>> +            return ~0u;
>>>>> +         case 16:
>>>>> +            return 0x33333333;
>>>>> +         case 8:
>>>>> +            return 0x11111111;
>>>>> +         default:
>>>>> +            unreachable("Unsupported bitsize at byte_scattered_write_logical");
>>>>> +         }
>>>>
>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>> this->src[4].ud / 8)" (see below for the definition).
>>>
>>> Ok.
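
So with the source-index interface that whole case reduces to something
like this (sketch only, with "i" being the source index argument discussed
above):

   /* src[1] of byte_scattered_write_logical holds the 32-bit aligned data;
    * how many bytes of each dword are actually read depends on the bitsize
    * stored in src[4].
    */
   if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL && i == 1)
      return periodic_mask(8, 4, this->src[4].ud / 8);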
>>>
>>>>> +      }
>>>>> +      /* As for byte_scattered_write_logical, but we need to take into account
>>>>> +       * that the data written is at payload offset 32 with SIMD8 and offset
>>>>> +       * 64 with SIMD16.
>>>>> +       */
>>>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>>>> +          this->src[0].nr == r.nr) {
>>>>> +         fs_reg payload = this->src[0];
>>>>> +         payload.offset = REG_SIZE * this->exec_size / 8;
>>>>
>>>> byte_offset() is your friend.
>>>
>>> I've removed the overlap logic, and I'm just checking whether we are at
>>> reg_offset 1 in SIMD8 or reg_offset 2-3 in SIMD16.
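
I assume that ends up looking something like this (just guessing at the
v2 code):

   /* The data part of the byte_scattered_write payload (src[0]) starts at
    * register exec_size / 8, i.e. byte offset 32 for SIMD8 and 64 for
    * SIMD16; the leading registers are assumed to be fully read.
    */
   if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && i == 0) {
      if (reg_offset >= this->exec_size / 8)
         return periodic_mask(8, 4, this->src[2].ud / 8);
      else
         return ~0u;
   }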
>>>
>>>>> +         if (regions_overlap(r, REG_SIZE,
>>>>> +                             payload, REG_SIZE * this->exec_size / 8)) {
>>>>> +            switch (this->src[2].ud) {
>>>>> +            case 32:
>>>>> +               return ~0u;
>>>>> +            case 16:
>>>>> +               return 0x33333333;
>>>>> +            case 8:
>>>>> +               return 0x11111111;
>>>>> +            default:
>>>>> +               unreachable("Unsupported bitsize at byte_scattered_write");
>>>>> +            }
>>>>
>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>> this->src[2].ud / 8)".
>>>
>>> Ok.
>>>
>>>>> +         } else {
>>>>> +            return ~0u;
>>>>> +         }
>>>>> +      }
>>>>> +   }
>>>>> +
>>>>> +   /* We define the most conservative value in order to calculate liveness
>>>>> +    * range. If it is a destination, nothing is defined, and if it is a source,
>>>>> +    * all the bytes of the register could be read. So for release builds
>>>>> +    * the unreachables would always have a safe return value. */
>>>>> +   unsigned pattern =  is_dst ? 0 : ~0u;
>>>>> +
>>>>> +   /* In the general case we calculate the pattern for a specific register
>>>>> +    * based on the type_size and stride. We calculate the SIMD8 pattern
>>>>> +    * and then adjust the pattern if needed for different exec_sizes
>>>>> +    * and offsets.
>>>>> +    */
>>>>> +   switch (type_sz(r.type)){
>>>>> +   case 1:
>>>>> +      switch (r.stride) {
>>>>> +      case 0:
>>>>> +         pattern = 0x1;
>>>>> +         break;
>>>>> +      case 1:
>>>>> +         pattern = 0xff;
>>>>> +         break;
>>>>> +      case 2:
>>>>> +         pattern = 0x5555;
>>>>> +         break;
>>>>> +      case 4:
>>>>> +         pattern = 0x11111111;
>>>>> +         break;
>>>>> +      case 8:
>>>>> +         pattern = 0x01010101;
>>>>> +         break;
>>>>> +      default:
>>>>> +         unreachable("Unknown pattern unsupported 8-bit stride");
>>>>> +      }
>>>>> +      break;
>>>>> +   case 2:
>>>>> +      switch (r.stride) {
>>>>> +      case 0:
>>>>> +         pattern = 0x3;
>>>>> +         break;
>>>>> +      case 1:
>>>>> +         pattern = 0xffff;
>>>>> +         break;
>>>>> +      case 2:
>>>>> +         pattern = 0x33333333;
>>>>> +         break;
>>>>> +      case 4:
>>>>> +         pattern = 0x03030303;
>>>>> +         break;
>>>>> +      case 8:
>>>>> +         pattern = 0x00030003;
>>>>> +         break;
>>>>> +      default:
>>>>> +         unreachable("Unknown pattern unsupported 16-bit stride");
>>>>> +      }
>>>>> +      break;
>>>>> +   case 4:
>>>>> +      switch (r.stride) {
>>>>> +      case 0:
>>>>> +         pattern = 0xf;
>>>>> +         break;
>>>>> +      case 1:
>>>>> +         pattern = ~0u;
>>>>> +         break;
>>>>> +      case 2:
>>>>> +         pattern = 0x0f0f0f0f;
>>>>> +         break;
>>>>> +      case 4:
>>>>> +         pattern = 0x000f000f;
>>>>> +         break;
>>>>> +      default:
>>>>> +         unreachable("Unknown pattern unsupported 32-bit stride");
>>>>> +      }
>>>>> +      break;
>>>>> +   case 8:
>>>>> +      switch (r.stride) {
>>>>> +      case 0:
>>>>> +         pattern = 0xff;
>>>>> +         break;
>>>>> +      case 1:
>>>>> +         pattern = ~0u;
>>>>> +         break;
>>>>> +      case 2:
>>>>> +         pattern = 0x00ff00ff;
>>>>> +         break;
>>>>> +      case 4:
>>>>> +         pattern = 0xff;
>>>>> +         break;
>>>>> +      default:
>>>>> +         unreachable("Unknown pattern unsupported 64-bit stride");
>>>>> +      }
>>>>> +      break;
>>>>> +   default:
>>>>> +      unreachable("Unknown pattern for unsupported bitsize ");
>>>>> +   }
>>>>> +
>>>>> +   if (this->exec_size > 8  && r.stride * type_sz(r.type) * 8 < REG_SIZE) {
>>>>> +      /* For exec_size greater than SIMD8 we repeat the pattern until it
>>>>> +       * already represents a full register. */
>>>>> +      pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type)));
>>>>> +      if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < REG_SIZE)
>>>>> +         pattern = pattern | (pattern << (16 * r.stride * type_sz(r.type)));
>>>>> +   } else if (this->exec_size < 8 &&
>>>>> +              r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) {
>>>>> +      /* For exec_size smaller than SIMD8 we reduce the pattern if its size
>>>>> +       * is smaller than a full register. */
>>>>> +      pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * r.stride) -
>>>>> +                            this->exec_size * type_sz(r.type) * r.stride);
>>>>> +   }
>>>>> +
>>>
>>>> This seems really mad, no clue whether it's correct.   Why not replace
>>>> the above ~110 lines with a call to the following (fully
>>>> untested) 5-LOC function:
>>>
>>> Your suggestion seems to work perfectly fine. My original approach was
>>> trying to avoid the loop when creating the read/write pattern, but after
>>> testing my v2 I wasn't able to notice any performance difference running
>>> shader-db, and the results are the same. I was originally trying to have
>>> the SIMD8 patterns already be constants. Sorry for the added complexity.
>>>
>> 
>> The loop runs for a logarithmic number of iterations though, so it has
>> the exact same run-time complexity as your original patch, roughly the
>> same number of branches at run-time (but possibly fewer indirect
>> branches!), and it should be compiled into a substantially lower number
>> of instructions, which may actually cause it to perform better due to
>> more favourable caching.  It's hard to tell though which one will
>> perform better in practice without benchmarking them, and as you
>> probably realized this is so far from being a bottleneck that whatever
>> the difference was is likely lost in the noise.  So it really doesn't
>> matter which one performs better...
>> 
>>>>
>>>> | uint32_t
>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>>> | {
>>>> |    uint32_t m = (count ? (1 << bits) - 1 : 0);
>>>> |    const unsigned max = MIN2(count * step, sizeof(m) * CHAR_BIT);
>>>> | 
>>>> |    for (unsigned shift = step; shift < max; shift *= 2)
>>>> |       m |= m << shift;
>>>> | 
>>>> |    return m;
>>>> | }
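
(For reference, tracing that function for the scattered-write cases above:
periodic_mask(8, 4, 1) expands 0x1 -> 0x11 -> 0x1111 -> 0x11111111,
periodic_mask(8, 4, 2) gives 0x33333333 and periodic_mask(8, 4, 4) gives
0xffffffff, i.e. exactly the masks of the switch statements being replaced.)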
>>>
>>> I've used your function, just changing sizeof(m) * CHAR_BIT to
>>> REG_SIZE, to avoid including limits.h.
>> 
>> My intention was to make the function as agnostic to IR details as
>> possible: the only reason there is a limit of 32 bits is because that's
>> the size of the type used to hold the return value.  Using sizeof makes
>> sure that e.g. extending the code to 64 bits is as simple as changing
>> the datatype to uint64_t.
>> 
>>> And I've also included an offset parameter that allows us to shift the
>>> bits of the pattern when we have an offset inside the register.
>>>
>> 
>> That sounds fine.
>> 
>>>>> +   /* We adjust the pattern to the byte_offset of the register */
>>>>> +   pattern = pattern << (r.offset % REG_SIZE);
>>>>> +
>>>>
>>>> This doesn't really work except for r equal to the first GRF read by the
>>>> source.  Regions with non-zero sub-GRF offset that straddle multiple
>>>> GRFs are not really very common at the IR level, but they're allowed by
>>>> the hardware with some restrictions, so it would make sense to at least
>>>> handle them conservatively in order to keep things from breaking silently.
>>>
>>>
>>> I included an assert in the periodic_mask function to detect if the
>>> combination of offset and mask goes over the REG_SIZE, so we would
>>> detect a straddle at this level of the IR.
>>>
>>> assert(offset + max - (step - bits) <= REG_SIZE);
>
>> Uhm...  What's the definition of "offset" here?  You seem to be passing
>> the offset relative to the start of the VGRF modulo REG_SIZE but that
>> doesn't really make sense to me whenever "reg_offset" is non-zero.
>
> Yes, I agree on removing this assert from periodic_mask as it is a generic
> function. I was too focused on seeing the integer as the register :)
>
> The motivation was the following cases, which happen more often with
> 8/16-bit types:
>
> (1) mov(16)    g12<2>HF    g1<16,8,2>HF                  { align1 1H };
> (2) mov(16)    g12.1<2>HF  g2<16,8,2>HF                  { align1 1H };
>
> In dst_write_pattern, at the point of calling the general periodic_mask
> case we know that the instruction is not expected to be a send message
> destination. So the pattern is the same for reg_offset = 0 and
> reg_offset = 1.
>

In this specific example, yes, but in general the pattern will not be
the same for multiple registers for arbitrary fs_reg::offset values.

> 0x33333333 for case (1) and 0xcccccccc for case (2). But without repeating
> the pattern offset we would get 0xcccccccc for reg_offset=0 but
> 0x33333333 for reg_offset=1, which would be incorrect.
>
>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
>> REG_SIZE" as a signed integer in order to get the offset of the first
>> byte actually written by the instruction relative to the first byte of
>> the GRF window of the pattern.  You don't really need to assert-fail
>> when the offset is greater or equal to 32 (which shouldn't actually
>> happen in practice), "return 0" gives you the correct behavior.  For
>> negative offsets (which means the pattern starts after the first byte
>> written by the instruction) you can just return ~0u conservatively
>> whenever the current logic wouldn't work [assuming you don't feel like
>> implementing the code to handle that case accurately ;)].
>
> Maybe you feel more comfortable with the following approach for
> src_read_pattern.:
>
> For read sources we also assume that SEND sources are completely
> read, so we return ~0u, except for the byte_scattered_write source exceptions.
>
> if (this->is_send_from_grf() || this->mlen != 0)
>       return ~0u;
>
> So after this I think we can assume that the following condition should
> always hold, since a source operand must reside within two adjacent
> 256-bit registers, so the pattern would be periodic for all reg_offsets
> and we can use (this->src[i].offset % REG_SIZE).
>

That's not a particularly futureproof assumption.  I'm okay with
handling non-reg_offset-periodic cases inaccurately for the moment for
the sake of simplicity, but there is no reason to have the code blow up
in such cases if you could just return ~0u.

> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
>     this->src[i].stride * type_sz(this->src[i].type) * this->exec_size :
>     type_sz(this->src[i].type), REG_SIZE));
>

I don't think it's useful to assert-fail on this condition, since it
doesn't catch all cases where the code below will be broken for certain
valid IR, and the condition it does protect against can be handled
trivially in periodic_mask(), by returning zero when the 32-byte
pattern window falls outside the region read or written by the
instruction -- But currently you can't even know whether that's the
case, because periodic_mask() is unaware of where the 32-byte window
starts relative to the source region, which is *all* it cares about;
instead you're passing an offset to it that only equals that in some
special cases -- In all other cases the result of the function will be
bogus.
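
Something along these lines is what I have in mind -- a rough, untested
sketch of an offset-aware variant, assuming CHAR_BIT from limits.h and the
usual MIN2 macro; the parameter names are just illustrative:

   /* "offset" is the position of the first byte accessed by the region
    * relative to the first byte of the 32-byte pattern window, so it may
    * be negative when the window starts past the beginning of the region.
    */
   uint32_t
   periodic_mask(unsigned count, unsigned step, unsigned bits, int offset)
   {
      uint32_t m = (count ? (1 << bits) - 1 : 0);
      const unsigned max = MIN2(count * step, sizeof(m) * CHAR_BIT);

      for (unsigned shift = step; shift < max; shift *= 2)
         m |= m << shift;

      if (offset >= (int)(sizeof(m) * CHAR_BIT))
         return 0;   /* The window lies entirely before the region. */
      else if (offset < 0)
         return ~0u; /* Conservative for the read-pattern case discussed
                      * here; a write pattern would want 0 instead. */
      else
         return m << offset;
   }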

> So in this scenario we could call:
>
> return periodic_mask(this->exec_size,
>                      this->src[i].stride * type_sz(this->src[i].type),
>                      type_sz(this->src[i].type),
>                      this->src[i].offset % REG_SIZE);
>
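
As a quick sanity check of that call against the earlier example (and
assuming the 4-argument variant sketched above): an HF region with stride 2
at SIMD16, i.e. the shape of the destinations in (1)/(2), gives
periodic_mask(16, 4, 2, 0), which expands 0x3 -> 0x33 -> 0x3333 ->
0x33333333; with the 2-byte sub-register offset of case (2) that becomes
0x33333333 << 2 = 0xcccccccc, matching the patterns mentioned above.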
> The only case I could think of that could generate issues, and which I
> didn't find in our current code, is a source defined like r5.0<1;8,2>:w,
> explained in the KBL PRM vol07, page 790: "A 16-element
> register region with interleaved rows (r5.0<1;8,2>:w)".
>

You couldn't really handle that even if you wanted to, because there is
no way to represent such a thing at the IR level, since the horizontal
and vertical strides cannot be controlled independently; only the stride
member is meaningful for VGRFs.

> What do you think?
>
> Chema
>
>
>>> Thanks for the review, I think that v2 has better shape.
>>>
>> 
>> No problem.
>> 
>>> Chema
>>>
>>>>> +   assert(pattern);
>>>>> +
>>>>> +   return pattern;
>>>>> +}
>>>>> +
>>>>> +
>>>>>  unsigned
>>>>>  fs_inst::components_read(unsigned i) const
>>>>>  {
>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>>>> index 92dad269a34..5ea6294b8ad 100644
>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>> @@ -350,6 +350,7 @@ public:
>>>>>     bool equals(fs_inst *inst) const;
>>>>>     bool is_send_from_grf() const;
>>>>>     bool is_partial_write() const;
>>>>> +   unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>>>     bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>>>     unsigned components_read(unsigned i) const;
>>>>>     unsigned size_read(int arg) const;
>>>>> -- 
>>>>> 2.17.1
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>

