[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

Francisco Jerez currojerez at riseup.net
Thu Jul 26 18:02:15 UTC 2018


Chema Casanova <jmcasanova at igalia.com> writes:

> El 20/07/18 a las 22:10, Francisco Jerez escribió:
>> Chema Casanova <jmcasanova at igalia.com> writes:
>> 
>>> El 20/07/18 a las 00:34, Francisco Jerez escribió:
>>>> Chema Casanova <jmcasanova at igalia.com> writes:
>>>>
>>>>> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>>>>>> Jose Maria Casanova Crespo <jmcasanova at igalia.com> writes:
>>>>>>
>>>>>>> For a register source/destination of an instruction the function returns
>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>>>>
>>>>>>> The returned pattern takes into account the exec_size of the instruction,
>>>>>>> the type bitsize, the stride and if the register is source or destination.
>>>>>>>
>>>>>>> The objective of the functions if to help to know the read/written bytes
>>>>>>> of the instructions to improve the liveness analysis for partial read/writes.
>>>>>>>
>>>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>>>>> parameter they have a different read pattern.
>>>>>>> ---
>>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 +++++++++++++++++++++++++++++++++
>>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>>  2 files changed, 184 insertions(+)
>>>>>>>
>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>             this->dst.offset % REG_SIZE != 0);
>>>>>>>  }
>>>>>>>  
>>>>>>> +/**
>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated register byte
>>>>>>> + * has been read/written by the instruction. The returned pattern takes into
>>>>>>> + * account the exec_size of the instruction, the type bitsize and the register
>>>>>>> + * stride and the register is source or destination for the instruction.
>>>>>>> + *
>>>>>>> + * The objective of this function is to identify which parts of the register
>>>>>>> + * are read or written for operations that don't read/write a full register.
>>>>>>> + * So we can identify in live range variable analysis if a partial write has
>>>>>>> + * completelly defined the part of the register used by a partial read. So we
>>>>>>> + * avoid extending the liveness range because all data read was already
>>>>>>> + * defined although the wasn't completely written.
>>>>>>> + */
>>>>>>> +unsigned
>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>>>>>> +{
>>>>>>> +   if (is_dst) {
>>>>>
>>>>>> Please split into two functions (like fs_inst::src_read and
>>>>>> ::src_written) since that would make the call-sites of this method more
>>>>>> self-documenting than a boolean parameter.  You should be able to share
>>>>>> code by refactoring the common logic into a separate function (see below
>>>>>> for some suggestions on how that could be achieved).
>>>>>
>>>>> Sure, it would improve readability and simplifies the logic, I've chosen
>>>>> dst_write_pattern and src_read_pattern.
>>>>>
>>>>>>
>>>>>>> +      /* We don't know what is written so we return the worts case */
>>>>>>
>>>>>> "worst"
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>> +      if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>>>>> +         return 0;
>>>>>>> +      /* We assume that send destinations are completely written */
>>>>>>> +      if (this->is_send_from_grf())
>>>>>>> +         return ~0u;
>>>>>>
>>>>>> Some send-like instructions won't be caught by this condition, you
>>>>>> should check for this->mlen != 0 in addition.
>>>>>
>>>>> Would it be enough to check for (this->mlen > 0) and forget about
>>>>> is_send_from_grf? I am using this approach in v2 I am sending.
>>>>>
>>>>
>>>> I don't think the mlen > 0 condition would catch all cases either...
>>>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC.  You probably need both
>>>> conditions.  Sucks...
>>>
>>> That is true, so now we have the:
>>>  (this->is_send_from_grf() || this->mlen != 0)
>>>
>>>>>>> +   } else {
>>>>>>> +      /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned
>>>>>>> +       * so the read pattern depends on the bitsize stored at src[4]
>>>>>>> +       */
>>>>>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>>>>>> +          this->src[1].nr == r.nr) {
>>>>>
>>>>>> I feel uncomfortable about attempting to guess the source the caller is
>>>>>> referring to by comparing the registers for equality.  E.g.  you could
>>>>>> potentially end up with two sources that compare equal but have
>>>>>> different semantics (e.g. as a result of CSE) which might cause it to
>>>>>> get the wrong answer.  It would probably be better to pass a source
>>>>>> index and a byte offset as argument instead of an fs_reg.
>>>>>
>>>>> I've didn't thought about CSE, I'm now receiving the number of source
>>>>> and the reg_offset. I'm using reg_offset instead of byte offsets as it
>>>>> simplifies the logic. Now we are using always the base src register to
>>>>> do all the calculation
>>>>>>> +         switch (this->src[4].ud) {
>>>>>>> +         case 32:
>>>>>>> +            return ~0u;
>>>>>>> +         case 16:
>>>>>>> +            return 0x33333333;
>>>>>>> +         case 8:
>>>>>>> +            return 0x11111111;
>>>>>>> +         default:
>>>>>>> +            unreachable("Unsupported bitsize at byte_scattered_write_logical");
>>>>>>> +         }
>>>>>>
>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>>>> this->src[4].ud / 8)" (see below for the definition).
>>>>>
>>>>> Ok.
>>>>>
>>>>>>> +      }
>>>>>>> +      /* As for byte_scattered_write_logical but we need to take into account
>>>>>>> +       * that data written are in the payload offset 32 with SIMD8 and offset
>>>>>>> +       * 64 with SIMD16.
>>>>>>> +       */
>>>>>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>>>>>> +          this->src[0].nr == r.nr) {
>>>>>>> +         fs_reg payload = this->src[0];
>>>>>>> +         payload.offset = REG_SIZE * this->exec_size / 8;
>>>>>>
>>>>>> byte_offset() is your friend.
>>>>>
>>>>> I've removed the overlap logic, and I'm just checking if we are in the
>>>>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16.
>>>>>
>>>>>>> +         if (regions_overlap(r, REG_SIZE,
>>>>>>> +                             payload, REG_SIZE * this->exec_size / 8)) {
>>>>>>> +            switch (this->src[2].ud) {
>>>>>>> +            case 32:
>>>>>>> +               return ~0u;
>>>>>>> +            case 16:
>>>>>>> +               return 0x33333333;
>>>>>>> +            case 8:
>>>>>>> +               return 0x11111111;
>>>>>>> +            default:
>>>>>>> +               unreachable("Unsupported bitsize at byte_scattered_write");
>>>>>>> +            }
>>>>>>
>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>>>> this->src[2].ud / 8)".
>>>>>
>>>>> Ok.
>>>>>
>>>>>>> +         } else {
>>>>>>> +            return ~0u;
>>>>>>> +         }
>>>>>>> +      }
>>>>>>> +   }
>>>>>>> +
>>>>>>> +   /* We define the most conservative value in order to calculate liveness
>>>>>>> +    * range. If it is a destination nothing is defined and if is a source
>>>>>>> +    * all the bytes of the register could be read. So for release builds
>>>>>>> +    * the unreachables would have always safe return value. */
>>>>>>> +   unsigned pattern =  is_dst ? 0 : ~0u;
>>>>>>> +
>>>>>>> +   /* In the general case we calculate the pattern for a specific register
>>>>>>> +    * on base of the type_size and stride. We calculate the SIMD8 pattern
>>>>>>> +    * and then we adjust the patter if needed for different exec_sizes
>>>>>>> +    * and offset
>>>>>>> +    */
>>>>>>> +   switch (type_sz(r.type)){
>>>>>>> +   case 1:
>>>>>>> +      switch (r.stride) {
>>>>>>> +      case 0:
>>>>>>> +         pattern = 0X1;
>>>>>>> +         break;
>>>>>>> +      case 1:
>>>>>>> +         pattern = 0xff;
>>>>>>> +         break;
>>>>>>> +      case 2:
>>>>>>> +         pattern = 0x5555;
>>>>>>> +         break;
>>>>>>> +      case 4:
>>>>>>> +         pattern = 0x11111111;
>>>>>>> +         break;
>>>>>>> +      case 8:
>>>>>>> +         pattern = 0x01010101;
>>>>>>> +         break;
>>>>>>> +      default:
>>>>>>> +         unreachable("Unknown pattern unsupported 8-bit stride");
>>>>>>> +      }
>>>>>>> +      break;
>>>>>>> +   case 2:
>>>>>>> +      switch (r.stride) {
>>>>>>> +      case 0:
>>>>>>> +         pattern = 0X3;
>>>>>>> +         break;
>>>>>>> +      case 1:
>>>>>>> +         pattern = 0xffff;
>>>>>>> +         break;
>>>>>>> +      case 2:
>>>>>>> +         pattern = 0x33333333;
>>>>>>> +         break;
>>>>>>> +      case 4:
>>>>>>> +         pattern = 0x03030303;
>>>>>>> +         break;
>>>>>>> +      case 8:
>>>>>>> +         pattern = 0x00030003;
>>>>>>> +         break;
>>>>>>> +      default:
>>>>>>> +         unreachable("Unknown pattern unsupported 16-bit stride");
>>>>>>> +      }
>>>>>>> +      break;
>>>>>>> +   case 4:
>>>>>>> +      switch (r.stride) {
>>>>>>> +      case 0:
>>>>>>> +         pattern = 0Xf;
>>>>>>> +         break;
>>>>>>> +      case 1:
>>>>>>> +         pattern = ~0u;
>>>>>>> +         break;
>>>>>>> +      case 2:
>>>>>>> +         pattern = 0x0f0f0f0f;
>>>>>>> +         break;
>>>>>>> +      case 4:
>>>>>>> +         pattern = 0x000f000f;
>>>>>>> +         break;
>>>>>>> +      default:
>>>>>>> +         unreachable("Unknown pattern unsupported 32-bit stride");
>>>>>>> +      }
>>>>>>> +      break;
>>>>>>> +   case 8:
>>>>>>> +      switch (r.stride) {
>>>>>>> +      case 0:
>>>>>>> +         pattern = 0Xff;
>>>>>>> +         break;
>>>>>>> +      case 1:
>>>>>>> +         pattern = ~0u;
>>>>>>> +         break;
>>>>>>> +      case 2:
>>>>>>> +         pattern = 0x00ff00ff;
>>>>>>> +         break;
>>>>>>> +      case 4:
>>>>>>> +         pattern = 0xff;
>>>>>>> +         break;
>>>>>>> +      default:
>>>>>>> +         unreachable("Unknown pattern unsupported 64-bit stride");
>>>>>>> +      }
>>>>>>> +      break;
>>>>>>> +   default:
>>>>>>> +      unreachable("Unknown pattern for unsupported bitsize ");
>>>>>>> +   }
>>>>>>> +
>>>>>>> +   if (this->exec_size > 8  && r.stride * type_sz(r.type) * 8 < REG_SIZE) {
>>>>>>> +      /* For exec_size greater than SIMD8 we repeat the pattern until it
>>>>>>> +       * represents a full register already represent a full register */
>>>>>>> +      pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type)));
>>>>>>> +      if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < REG_SIZE)
>>>>>>> +         pattern = pattern | (pattern << (16 * r.stride * type_sz(r.type)));
>>>>>>> +   } else if (this->exec_size < 8 &&
>>>>>>> +              r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) {
>>>>>>> +      /* For exec_size smaller than SIMD8 we reduce the pattern if its size
>>>>>>> +       * is smaller than a full register. */
>>>>>>> +      pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * r.stride) -
>>>>>>> +                            this->exec_size * type_sz(r.type) * r.stride);
>>>>>>> +   }
>>>>>>> +
>>>>>
>>>>>> This seems really mad, no clue whether it's correct.   Why not replace
>>>>>> the above ~110 lines with a call to the following (fully
>>>>>> untested) 5-LOC function:
>>>>>
>>>>> Your suggestion seems to work perfectly fine, my original approach was
>>>>> trying to avoid the loop of creating the read/write pattern but after
>>>>> testing my v2 I wasn't able to notice any performance difference running
>>>>> shader-db and having the same results. I was originally trying that
>>>>> SIMD8 patterns were already constants. Sorry for the added complexity.
>>>>>
>>>>
>>>> The loop runs for a logarithmic number of iterations though, so it has
>>>> the exact same run-time complexity as your original patch, roughly the
>>>> same amount of branches at run-time (but possibly less indirect
>>>> branches!), and it should be compiled into a substantially lower number
>>>> of instructions, which may actually cause it to perform better due to
>>>> more favourable caching.  It's hard to tell though which one will
>>>> perform better in practice without benchmarking them, and as you
>>>> probably realized this is so far from being a bottleneck that whatever
>>>> the difference was is likely lost in the noise.  So it really doesn't
>>>> matter which one performs better...
>>>>
>>>>>>
>>>>>> | uint32_t
>>>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>>>>> | {
>>>>>> |    uint32_t m = (count ? (1 << bits) - 1 : 0);
>>>>>> |    const unsigined max = MIN2(count * step, sizeof(m) * CHAR_BITS);
>>>>>> | 
>>>>>> |    for (unsigned shift = step; shift < max; shift *= 2)
>>>>>> |       m |= m << shift;
>>>>>> | 
>>>>>> |    return m;
>>>>>> | }
>>>>>
>>>>> I've used your function just changing the sizeof(m) * CHAR_BITS for the
>>>>> REG_SIZE, to not include the limits.h.
>>>>
>>>> My intention was to make the function as agnostic to IR details as
>>>> possible: the only reason there is a limit of 32 bits is because that's
>>>> the size of the type used to hold the return value.  Using sizeof makes
>>>> sure that e.g. extending the code to 64 bits is as simple as changing
>>>> the datatype to uint64_t.
>>>>
>>>>> And I've also included an offset parameter that allows us to shift the
>>>>> bits of the pattern when we have an offset inside the register.
>>>>>
>>>>
>>>> That sounds fine.
>>>>
>>>>>>> +   /* We adjust the pattern to the byte_offset of the register */
>>>>>>> +   pattern = pattern << (r.offset % REG_SIZE);
>>>>>>> +
>>>>>>
>>>>>> This doesn't really work except for r equal to the first GRF read by the
>>>>>> source.  Regions with non-zero sub-GRF offset that straddle multiple
>>>>>> GRFs are not really very common at the IR level, but they're allowed by
>>>>>> the hardware with some restrictions, so it would make sense to at least
>>>>>> handle them conservatively in order to keep things from breaking silently.
>>>>>
>>>>>
>>>>> I included an assert in the periodic_mask function to detect if the
>>>>> combination of offset and mask goes over the REG_SIZE, so we would
>>>>> detect an straddle at this level of the IR.
>>>>>
>>>>> assert(offset + max - (step - bits) <= REG_SIZE);
>>>
>>>> Uhm...  What's the definition of "offset" here?  You seem to be passing
>>>> the offset relative to the start of the VGRF modulo REG_SIZE but that
>>>> doesn't really make sense to me whenever "reg_offset" is non-zero.
>>>
>>> Yes I agree on removing this assert in the periodic_mask as it a generic
>>> function. I was to focus on seen the integer as the register :)
>>>
>>> The motivation was because of the following cases that happen with
>>> 8/16-bit more usually:
>>>
>>> (1) mov(16)    g12<2>HF    g1<16,8,2>HF                  { align1 1H };
>>> (2) mov(16)    g12.1<2>HF  g2<16,8,2>HF                  { align1 1H };
>>>
>>> In the dst_write_pattern, at point of calling the general periodic_mask
>>> case we know that instruction is not expected to be a send message
>>> destination. So the pattern is the same for reg_offset = 0 and
>>> reg_offset = 1.
>>>
>> 
>> In this specific example, yes, but in general the pattern will not be
>> the same for multiple registers for arbitrary fs_reg::offset values.
>
> I've been running some tests. I've found that the three opcodes reach
> this situation in shader-db (MOV, INDIRECT_MOV and OR).
>
> If we just manage the MOV opcode that we agree that the same pattern is
> repeated for the same source/dst used registers have the same spilling
> reduction that with my original patch.
>
>>> 0x3333333 for case (1) and 0xccccccc for case (2). But without repeating
>>> the pattern offset we would get 0xcccccccc for reg_offset=0 but
>>> 0x333333333 for reg_offset=1 witch would be incorrect.
>>>
>>>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
>>>> REG_SIZE" as a signed integer in order to get the offset of the first
>>>> byte actually written by the instruction relative to the first byte of
>>>> the GRF window of the pattern.  You don't really need to assert-fail
>>>> when the offset is greater or equal to 32 (which shouldn't actually
>>>> happen in practice), "return 0" gives you the correct behavior.  For
>>>> negative offsets (which means the pattern starts after the first byte
>>>> written by the instruction) you can just return ~0u conservatively
>>>> whenever the current logic wouldn't work [assuming you don't feel like
>>>> implementing the code to handle that case accurately ;)].
>>>
>>> Maybe you feel more comfortable with the following approach for
>>> src_read_pattern.:
>>>
>>> For read sources we also assume that SENDs sources are completely
>>> read so return ~0u except the byte_scattered_write source exceptions.
>>>
>>> if (this->is_send_from_grf() || this->mlen != 0)
>>>       return ~0u;
>>>
>>> So after this I think that we can assume that the following condition
>>> should be always correct where source operand must reside within two
>>> adjacent 256-bit registers so the pattern would be periodic for
>>> all reg_offsets and we can use the (this->src[i].offset % REG_SIZE).
>>>
>> 
>> That's not a particularly futureproof assumption.  I'm okay with
>> handling non-reg_offset-periodic cases inaccurately for the moment for
>> the sake of simplicity, but there is no reason to have the code blow up
>> in such cases if you could just return ~0u.
>
> I've added some modifications in the v3, we return ~0u for reads and 0u
> for writes by default, but i added the case of !this->is_partial_write()
> that will return ~0u for writes, that won't we reached with current use,
> but to be coherent with any future uses of this function.
>
>>> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
>>>     this->src[i].stride * type_sz(this->src[i].type) * this->exec_size :
>>>     type_sz(this->src[i].type), REG_SIZE));
>>>
>
>> I don't think it's useful to assert-fail on this condition, since it
>> doesn't handle all cases where the code below will be broken for a
>> certain valid IR, and the condition it does protect against can be
>> handled trivially in periodic_mask(), by returning zero when the 32-byte
>> pattern window falls outside the region read or written by the
>> instruction -- But currently you can't even know whether that's the
>> case, because periodic_mask() is unaware where the 32-byte window starts
>> relative to the source region, which is *all* it cares about, instead
>> you're passing an offset to it that is equal to that in some special
>> cases -- In all other cases the result of the function will be bogus.
>
> I understand your concerns, so in my V3, I am using only the
> periodic_mask function when reg_offset is 0, and for the MOV special
> case that I understand we agree that the internal offset for
> reg_offset=1 is generally reg.offset % REG_SIZE. With that two cases
> covered we solve the register_pressure on current supported 64/16/8 bit
> cases.
>

I don't understand.  The MOV opcode is not substantially different from
any other ALU opcode with regards to regioning.  The assumptions this is
based on are invalid for general Align1 regions, whether the opcode is a
MOV or not.

>>> So we could call in this scenario.
>>>
>>> return periodic_mask(this->exec_size,
>>>                      this->src[i].stride * type_sz(this->src[i].type),
>>>                      type_sz(this->src[i].type),
>>>                      this->src[i].offset % REG_SIZE);
>>>
>>> The only issue I could think about that could generate issues would a
>>> case that I didn't found in our current code where a source is defined
>>> like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element
>>> register region with interleaved rows (r5.0<1;8,2>:w)".
>>>
>> 
>> You couldn't really handle that even if you wanted to, because there is
>> no way to represent such a thing at the IR level, since the horizontal
>> and vertical strides cannot be controlled independently, only the stride
>> member is meaningful for VGRFs.
>
> Good, that simplifies things.
>
> Thanks again.
>
> Chema
>
>>> What do you think?
>>>
>>> Chema
>>>
>>>
>>>>> Thanks for the review, I think that v2 has better shape.
>>>>>
>>>>
>>>> No problem.
>>>>
>>>>> Chema
>>>>>
>>>>>>> +   assert(pattern);
>>>>>>> +
>>>>>>> +   return pattern;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>>  unsigned
>>>>>>>  fs_inst::components_read(unsigned i) const
>>>>>>>  {
>>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>>>>>> index 92dad269a34..5ea6294b8ad 100644
>>>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>>>> @@ -350,6 +350,7 @@ public:
>>>>>>>     bool equals(fs_inst *inst) const;
>>>>>>>     bool is_send_from_grf() const;
>>>>>>>     bool is_partial_write() const;
>>>>>>> +   unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>>>>>     bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>>>>>     unsigned components_read(unsigned i) const;
>>>>>>>     unsigned size_read(int arg) const;
>>>>>>> -- 
>>>>>>> 2.17.1
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180726/818e1523/attachment-0001.sig>


More information about the mesa-dev mailing list