[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst
Francisco Jerez
currojerez at riseup.net
Fri Jul 27 23:45:19 UTC 2018
Chema Casanova <jmcasanova at igalia.com> writes:
> El 27/07/18 a las 02:44, Francisco Jerez escribió:
>> Chema Casanova <jmcasanova at igalia.com> writes:
>>
>>> El 26/07/18 a las 20:02, Francisco Jerez escribió:
>>>> Chema Casanova <jmcasanova at igalia.com> writes:
>>>>
>>>>> El 20/07/18 a las 22:10, Francisco Jerez escribió:
>>>>>> Chema Casanova <jmcasanova at igalia.com> writes:
>>>>>>
>>>>>>> El 20/07/18 a las 00:34, Francisco Jerez escribió:
>>>>>>>> Chema Casanova <jmcasanova at igalia.com> writes:
>>>>>>>>
>>>>>>>>> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>>>>>>>>>> Jose Maria Casanova Crespo <jmcasanova at igalia.com> writes:
>>>>>>>>>>
>>>>>>>>>>> For a register source/destination of an instruction the function returns
>>>>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>>>>>>>>
>>>>>>>>>>> The returned pattern takes into account the exec_size of the instruction,
>>>>>>>>>>> the type bitsize, the stride and if the register is source or destination.
>>>>>>>>>>>
>>>>>>>>>>> The objective of the functions if to help to know the read/written bytes
>>>>>>>>>>> of the instructions to improve the liveness analysis for partial read/writes.
>>>>>>>>>>>
>>>>>>>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>>>>>>>>> parameter they have a different read pattern.
>>>>>>>>>>> ---
>>>>>>>>>>> src/intel/compiler/brw_fs.cpp | 183 +++++++++++++++++++++++++++++++++
>>>>>>>>>>> src/intel/compiler/brw_ir_fs.h | 1 +
>>>>>>>>>>> 2 files changed, 184 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>>>>> this->dst.offset % REG_SIZE != 0);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> +/**
>>>>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated register byte
>>>>>>>>>>> + * has been read/written by the instruction. The returned pattern takes into
>>>>>>>>>>> + * account the exec_size of the instruction, the type bitsize and the register
>>>>>>>>>>> + * stride and the register is source or destination for the instruction.
>>>>>>>>>>> + *
>>>>>>>>>>> + * The objective of this function is to identify which parts of the register
>>>>>>>>>>> + * are read or written for operations that don't read/write a full register.
>>>>>>>>>>> + * So we can identify in live range variable analysis if a partial write has
>>>>>>>>>>> + * completelly defined the part of the register used by a partial read. So we
>>>>>>>>>>> + * avoid extending the liveness range because all data read was already
>>>>>>>>>>> + * defined although the wasn't completely written.
>>>>>>>>>>> + */
>>>>>>>>>>> +unsigned
>>>>>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>>>>>>>>>> +{
>>>>>>>>>>> + if (is_dst) {
>>>>>>>>>
>>>>>>>>>> Please split into two functions (like fs_inst::src_read and
>>>>>>>>>> ::src_written) since that would make the call-sites of this method more
>>>>>>>>>> self-documenting than a boolean parameter. You should be able to share
>>>>>>>>>> code by refactoring the common logic into a separate function (see below
>>>>>>>>>> for some suggestions on how that could be achieved).
>>>>>>>>>
>>>>>>>>> Sure, it would improve readability and simplifies the logic, I've chosen
>>>>>>>>> dst_write_pattern and src_read_pattern.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> + /* We don't know what is written so we return the worts case */
>>>>>>>>>>
>>>>>>>>>> "worst"
>>>>>>>>>
>>>>>>>>> Fixed.
>>>>>>>>>
>>>>>>>>>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>>>>>>>>> + return 0;
>>>>>>>>>>> + /* We assume that send destinations are completely written */
>>>>>>>>>>> + if (this->is_send_from_grf())
>>>>>>>>>>> + return ~0u;
>>>>>>>>>>
>>>>>>>>>> Some send-like instructions won't be caught by this condition, you
>>>>>>>>>> should check for this->mlen != 0 in addition.
>>>>>>>>>
>>>>>>>>> Would it be enough to check for (this->mlen > 0) and forget about
>>>>>>>>> is_send_from_grf? I am using this approach in v2 I am sending.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't think the mlen > 0 condition would catch all cases either...
>>>>>>>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need both
>>>>>>>> conditions. Sucks...
>>>>>>>
>>>>>>> That is true, so now we have the:
>>>>>>> (this->is_send_from_grf() || this->mlen != 0)
>>>>>>>
>>>>>>>>>>> + } else {
>>>>>>>>>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned
>>>>>>>>>>> + * so the read pattern depends on the bitsize stored at src[4]
>>>>>>>>>>> + */
>>>>>>>>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>>>>>>>>>> + this->src[1].nr == r.nr) {
>>>>>>>>>
>>>>>>>>>> I feel uncomfortable about attempting to guess the source the caller is
>>>>>>>>>> referring to by comparing the registers for equality. E.g. you could
>>>>>>>>>> potentially end up with two sources that compare equal but have
>>>>>>>>>> different semantics (e.g. as a result of CSE) which might cause it to
>>>>>>>>>> get the wrong answer. It would probably be better to pass a source
>>>>>>>>>> index and a byte offset as argument instead of an fs_reg.
>>>>>>>>>
>>>>>>>>> I've didn't thought about CSE, I'm now receiving the number of source
>>>>>>>>> and the reg_offset. I'm using reg_offset instead of byte offsets as it
>>>>>>>>> simplifies the logic. Now we are using always the base src register to
>>>>>>>>> do all the calculation
>>>>>>>>>>> + switch (this->src[4].ud) {
>>>>>>>>>>> + case 32:
>>>>>>>>>>> + return ~0u;
>>>>>>>>>>> + case 16:
>>>>>>>>>>> + return 0x33333333;
>>>>>>>>>>> + case 8:
>>>>>>>>>>> + return 0x11111111;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unsupported bitsize at byte_scattered_write_logical");
>>>>>>>>>>> + }
>>>>>>>>>>
>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>>>>>>>> this->src[4].ud / 8)" (see below for the definition).
>>>>>>>>>
>>>>>>>>> Ok.
>>>>>>>>>
>>>>>>>>>>> + }
>>>>>>>>>>> + /* As for byte_scattered_write_logical but we need to take into account
>>>>>>>>>>> + * that data written are in the payload offset 32 with SIMD8 and offset
>>>>>>>>>>> + * 64 with SIMD16.
>>>>>>>>>>> + */
>>>>>>>>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>>>>>>>>>> + this->src[0].nr == r.nr) {
>>>>>>>>>>> + fs_reg payload = this->src[0];
>>>>>>>>>>> + payload.offset = REG_SIZE * this->exec_size / 8;
>>>>>>>>>>
>>>>>>>>>> byte_offset() is your friend.
>>>>>>>>>
>>>>>>>>> I've removed the overlap logic, and I'm just checking if we are in the
>>>>>>>>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16.
>>>>>>>>>
>>>>>>>>>>> + if (regions_overlap(r, REG_SIZE,
>>>>>>>>>>> + payload, REG_SIZE * this->exec_size / 8)) {
>>>>>>>>>>> + switch (this->src[2].ud) {
>>>>>>>>>>> + case 32:
>>>>>>>>>>> + return ~0u;
>>>>>>>>>>> + case 16:
>>>>>>>>>>> + return 0x33333333;
>>>>>>>>>>> + case 8:
>>>>>>>>>>> + return 0x11111111;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unsupported bitsize at byte_scattered_write");
>>>>>>>>>>> + }
>>>>>>>>>>
>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>>>>>>>>> this->src[2].ud / 8)".
>>>>>>>>>
>>>>>>>>> Ok.
>>>>>>>>>
>>>>>>>>>>> + } else {
>>>>>>>>>>> + return ~0u;
>>>>>>>>>>> + }
>>>>>>>>>>> + }
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + /* We define the most conservative value in order to calculate liveness
>>>>>>>>>>> + * range. If it is a destination nothing is defined and if is a source
>>>>>>>>>>> + * all the bytes of the register could be read. So for release builds
>>>>>>>>>>> + * the unreachables would have always safe return value. */
>>>>>>>>>>> + unsigned pattern = is_dst ? 0 : ~0u;
>>>>>>>>>>> +
>>>>>>>>>>> + /* In the general case we calculate the pattern for a specific register
>>>>>>>>>>> + * on base of the type_size and stride. We calculate the SIMD8 pattern
>>>>>>>>>>> + * and then we adjust the patter if needed for different exec_sizes
>>>>>>>>>>> + * and offset
>>>>>>>>>>> + */
>>>>>>>>>>> + switch (type_sz(r.type)){
>>>>>>>>>>> + case 1:
>>>>>>>>>>> + switch (r.stride) {
>>>>>>>>>>> + case 0:
>>>>>>>>>>> + pattern = 0X1;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 1:
>>>>>>>>>>> + pattern = 0xff;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 2:
>>>>>>>>>>> + pattern = 0x5555;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 4:
>>>>>>>>>>> + pattern = 0x11111111;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 8:
>>>>>>>>>>> + pattern = 0x01010101;
>>>>>>>>>>> + break;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unknown pattern unsupported 8-bit stride");
>>>>>>>>>>> + }
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 2:
>>>>>>>>>>> + switch (r.stride) {
>>>>>>>>>>> + case 0:
>>>>>>>>>>> + pattern = 0X3;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 1:
>>>>>>>>>>> + pattern = 0xffff;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 2:
>>>>>>>>>>> + pattern = 0x33333333;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 4:
>>>>>>>>>>> + pattern = 0x03030303;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 8:
>>>>>>>>>>> + pattern = 0x00030003;
>>>>>>>>>>> + break;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unknown pattern unsupported 16-bit stride");
>>>>>>>>>>> + }
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 4:
>>>>>>>>>>> + switch (r.stride) {
>>>>>>>>>>> + case 0:
>>>>>>>>>>> + pattern = 0Xf;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 1:
>>>>>>>>>>> + pattern = ~0u;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 2:
>>>>>>>>>>> + pattern = 0x0f0f0f0f;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 4:
>>>>>>>>>>> + pattern = 0x000f000f;
>>>>>>>>>>> + break;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unknown pattern unsupported 32-bit stride");
>>>>>>>>>>> + }
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 8:
>>>>>>>>>>> + switch (r.stride) {
>>>>>>>>>>> + case 0:
>>>>>>>>>>> + pattern = 0Xff;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 1:
>>>>>>>>>>> + pattern = ~0u;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 2:
>>>>>>>>>>> + pattern = 0x00ff00ff;
>>>>>>>>>>> + break;
>>>>>>>>>>> + case 4:
>>>>>>>>>>> + pattern = 0xff;
>>>>>>>>>>> + break;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unknown pattern unsupported 64-bit stride");
>>>>>>>>>>> + }
>>>>>>>>>>> + break;
>>>>>>>>>>> + default:
>>>>>>>>>>> + unreachable("Unknown pattern for unsupported bitsize ");
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < REG_SIZE) {
>>>>>>>>>>> + /* For exec_size greater than SIMD8 we repeat the pattern until it
>>>>>>>>>>> + * represents a full register already represent a full register */
>>>>>>>>>>> + pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type)));
>>>>>>>>>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < REG_SIZE)
>>>>>>>>>>> + pattern = pattern | (pattern << (16 * r.stride * type_sz(r.type)));
>>>>>>>>>>> + } else if (this->exec_size < 8 &&
>>>>>>>>>>> + r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) {
>>>>>>>>>>> + /* For exec_size smaller than SIMD8 we reduce the pattern if its size
>>>>>>>>>>> + * is smaller than a full register. */
>>>>>>>>>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * r.stride) -
>>>>>>>>>>> + this->exec_size * type_sz(r.type) * r.stride);
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>
>>>>>>>>>> This seems really mad, no clue whether it's correct. Why not replace
>>>>>>>>>> the above ~110 lines with a call to the following (fully
>>>>>>>>>> untested) 5-LOC function:
>>>>>>>>>
>>>>>>>>> Your suggestion seems to work perfectly fine, my original approach was
>>>>>>>>> trying to avoid the loop of creating the read/write pattern but after
>>>>>>>>> testing my v2 I wasn't able to notice any performance difference running
>>>>>>>>> shader-db and having the same results. I was originally trying that
>>>>>>>>> SIMD8 patterns were already constants. Sorry for the added complexity.
>>>>>>>>>
>>>>>>>>
>>>>>>>> The loop runs for a logarithmic number of iterations though, so it has
>>>>>>>> the exact same run-time complexity as your original patch, roughly the
>>>>>>>> same amount of branches at run-time (but possibly less indirect
>>>>>>>> branches!), and it should be compiled into a substantially lower number
>>>>>>>> of instructions, which may actually cause it to perform better due to
>>>>>>>> more favourable caching. It's hard to tell though which one will
>>>>>>>> perform better in practice without benchmarking them, and as you
>>>>>>>> probably realized this is so far from being a bottleneck that whatever
>>>>>>>> the difference was is likely lost in the noise. So it really doesn't
>>>>>>>> matter which one performs better...
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> | uint32_t
>>>>>>>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>>>>>>>>> | {
>>>>>>>>>> | uint32_t m = (count ? (1 << bits) - 1 : 0);
>>>>>>>>>> | const unsigined max = MIN2(count * step, sizeof(m) * CHAR_BITS);
>>>>>>>>>> |
>>>>>>>>>> | for (unsigned shift = step; shift < max; shift *= 2)
>>>>>>>>>> | m |= m << shift;
>>>>>>>>>> |
>>>>>>>>>> | return m;
>>>>>>>>>> | }
>>>>>>>>>
>>>>>>>>> I've used your function just changing the sizeof(m) * CHAR_BITS for the
>>>>>>>>> REG_SIZE, to not include the limits.h.
>>>>>>>>
>>>>>>>> My intention was to make the function as agnostic to IR details as
>>>>>>>> possible: the only reason there is a limit of 32 bits is because that's
>>>>>>>> the size of the type used to hold the return value. Using sizeof makes
>>>>>>>> sure that e.g. extending the code to 64 bits is as simple as changing
>>>>>>>> the datatype to uint64_t.
>>>>>>>>
>>>>>>>>> And I've also included an offset parameter that allows us to shift the
>>>>>>>>> bits of the pattern when we have an offset inside the register.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That sounds fine.
>>>>>>>>
>>>>>>>>>>> + /* We adjust the pattern to the byte_offset of the register */
>>>>>>>>>>> + pattern = pattern << (r.offset % REG_SIZE);
>>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> This doesn't really work except for r equal to the first GRF read by the
>>>>>>>>>> source. Regions with non-zero sub-GRF offset that straddle multiple
>>>>>>>>>> GRFs are not really very common at the IR level, but they're allowed by
>>>>>>>>>> the hardware with some restrictions, so it would make sense to at least
>>>>>>>>>> handle them conservatively in order to keep things from breaking silently.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I included an assert in the periodic_mask function to detect if the
>>>>>>>>> combination of offset and mask goes over the REG_SIZE, so we would
>>>>>>>>> detect an straddle at this level of the IR.
>>>>>>>>>
>>>>>>>>> assert(offset + max - (step - bits) <= REG_SIZE);
>>>>>>>
>>>>>>>> Uhm... What's the definition of "offset" here? You seem to be passing
>>>>>>>> the offset relative to the start of the VGRF modulo REG_SIZE but that
>>>>>>>> doesn't really make sense to me whenever "reg_offset" is non-zero.
>>>>>>>
>>>>>>> Yes I agree on removing this assert in the periodic_mask as it a generic
>>>>>>> function. I was to focus on seen the integer as the register :)
>>>>>>>
>>>>>>> The motivation was because of the following cases that happen with
>>>>>>> 8/16-bit more usually:
>>>>>>>
>>>>>>> (1) mov(16) g12<2>HF g1<16,8,2>HF { align1 1H };
>>>>>>> (2) mov(16) g12.1<2>HF g2<16,8,2>HF { align1 1H };
>>>>>>>
>>>>>>> In the dst_write_pattern, at point of calling the general periodic_mask
>>>>>>> case we know that instruction is not expected to be a send message
>>>>>>> destination. So the pattern is the same for reg_offset = 0 and
>>>>>>> reg_offset = 1.
>>>>>>>
>>>>>>
>>>>>> In this specific example, yes, but in general the pattern will not be
>>>>>> the same for multiple registers for arbitrary fs_reg::offset values.
>>>>>
>>>>> I've been running some tests. I've found that the three opcodes reach
>>>>> this situation in shader-db (MOV, INDIRECT_MOV and OR).
>>>>>
>>>>> If we just manage the MOV opcode that we agree that the same pattern is
>>>>> repeated for the same source/dst used registers have the same spilling
>>>>> reduction that with my original patch.
>>>>>
>>>>>>> 0x3333333 for case (1) and 0xccccccc for case (2). But without repeating
>>>>>>> the pattern offset we would get 0xcccccccc for reg_offset=0 but
>>>>>>> 0x333333333 for reg_offset=1 witch would be incorrect.
>>>>>>>
>>>>>>>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
>>>>>>>> REG_SIZE" as a signed integer in order to get the offset of the first
>>>>>>>> byte actually written by the instruction relative to the first byte of
>>>>>>>> the GRF window of the pattern. You don't really need to assert-fail
>>>>>>>> when the offset is greater or equal to 32 (which shouldn't actually
>>>>>>>> happen in practice), "return 0" gives you the correct behavior. For
>>>>>>>> negative offsets (which means the pattern starts after the first byte
>>>>>>>> written by the instruction) you can just return ~0u conservatively
>>>>>>>> whenever the current logic wouldn't work [assuming you don't feel like
>>>>>>>> implementing the code to handle that case accurately ;)].
>>>>>>>
>>>>>>> Maybe you feel more comfortable with the following approach for
>>>>>>> src_read_pattern.:
>>>>>>>
>>>>>>> For read sources we also assume that SENDs sources are completely
>>>>>>> read so return ~0u except the byte_scattered_write source exceptions.
>>>>>>>
>>>>>>> if (this->is_send_from_grf() || this->mlen != 0)
>>>>>>> return ~0u;
>>>>>>>
>>>>>>> So after this I think that we can assume that the following condition
>>>>>>> should be always correct where source operand must reside within two
>>>>>>> adjacent 256-bit registers so the pattern would be periodic for
>>>>>>> all reg_offsets and we can use the (this->src[i].offset % REG_SIZE).
>>>>>>>
>>>>>>
>>>>>> That's not a particularly futureproof assumption. I'm okay with
>>>>>> handling non-reg_offset-periodic cases inaccurately for the moment for
>>>>>> the sake of simplicity, but there is no reason to have the code blow up
>>>>>> in such cases if you could just return ~0u.
>>>>>
>>>>> I've added some modifications in the v3, we return ~0u for reads and 0u
>>>>> for writes by default, but i added the case of !this->is_partial_write()
>>>>> that will return ~0u for writes, that won't we reached with current use,
>>>>> but to be coherent with any future uses of this function.
>>>>>
>>>>>>> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
>>>>>>> this->src[i].stride * type_sz(this->src[i].type) * this->exec_size :
>>>>>>> type_sz(this->src[i].type), REG_SIZE));
>>>>>>>
>>>>>
>>>>>> I don't think it's useful to assert-fail on this condition, since it
>>>>>> doesn't handle all cases where the code below will be broken for a
>>>>>> certain valid IR, and the condition it does protect against can be
>>>>>> handled trivially in periodic_mask(), by returning zero when the 32-byte
>>>>>> pattern window falls outside the region read or written by the
>>>>>> instruction -- But currently you can't even know whether that's the
>>>>>> case, because periodic_mask() is unaware where the 32-byte window starts
>>>>>> relative to the source region, which is *all* it cares about, instead
>>>>>> you're passing an offset to it that is equal to that in some special
>>>>>> cases -- In all other cases the result of the function will be bogus.
>>>>>
>>>>> I understand your concerns, so in my V3, I am using only the
>>>>> periodic_mask function when reg_offset is 0, and for the MOV special
>>>>> case that I understand we agree that the internal offset for
>>>>> reg_offset=1 is generally reg.offset % REG_SIZE. With that two cases
>>>>> covered we solve the register_pressure on current supported 64/16/8 bit
>>>>> cases.
>>>>>
>>>
>>>> I don't understand. The MOV opcode is not substantially different from
>>>> any other ALU opcode with regards to regioning. The assumptions this is
>>>> based on are invalid for general Align1 regions, whether the opcode is a
>>>> MOV or not.
>>>
>>> Let's try step by step to find my misunderstanding :-)
>>>
>>> My original understanding was that for a general ALU operations to
>>> define witch parts of the VGRF are written/read we need to use 4
>>> different variables in the scalar backend for src/dst registers.
>>>
>>> - The type size.
>>> - The horizontal stride.
>>> - The execution size.
>>> - The initial byte offset of the first register of src[i]/dst
>>>
>>> If stride==1 && type_size * execution_size>=REG_SIZE && byte_offset== 0
>>> we have a full/write of one or two registers. In that case the solution
>>> is easy everything is written ~0u.
>>>
>>
>> Not necessarily, if "reg_offset * REG_SIZE >= byte_offset + type_size *
>> MAX2(1, stride * exec_size)" the correct answer is 0.
>
> Ok, I got what you mean. I was incorrectly assuming as precondition that
> reg_offset was always in range because that is what happens on
> setup_def_used at brw_fs_live_variables.cpp.
>
> I think we can use directly regs_written and regs_read that do the same
> calculation.
>
> if (reg_offset >= regs_written(this))
> return 0;
> --------------------------------------
> if (reg_offset >= regs_read(this, i))
> return 0;
>
Yes, that should work.
>
>>> If we have a partial write/read:
>>>
>>> I understood that you my initial patter proposal would only be ok for
>>> the first GRF of src[i]/dst (reg_offset == 0)
>>>
>>> periodic_mask(this->exec_size, /* count */
>>> this->src[i].stride * type_sz(this->src[i].type), /*step */
>>> type_sz(this->src[i].type), /* bits */
>>> this->src[i].offset % REG_SIZE); /* offset */
>>>
>>> In the case we manage only reg_offset == 0 we get a huge improvement
>>> reducing all problems many of the register_pressure we have now on all
>>> SIMD8 shaders with 8/16bits test cases.
>>>
>>> I understood that you didn't agree that for cases where src/destination
>>> use more than 1 GRF (reg_offset == 1) we can not guarantee that we can
>>> apply the same internal offset (this->src[i].offset % REG_SIZE) as the
>>> base register to calculate a patter. So It would be better to return ~0u
>>> on reads or 0u in writes.
>>>
>
>> Yes, but you could easily determine whether the mask is going to be
>> invariant with respect to reg_offset (where reg_offset is within bounds)
>> and in that case return the periodic_mask() expression above, otherwise
>> return 0/~0u depending on whether reg_offset is within bounds.
>
> Ok, so we are within bounds, we don't have a predicated write, we are
> not a send message. Then we have an ALU opcode and we return the
> periodic_mask.
>
Those are all necessary but not sufficient conditions for the
periodic_mask() expression above to give you the correct answer for any
in-bounds reg_offset > 0, you should check that byte_offset < type_size
* stride in addition.
>>> That is the point were I couldn't found a counter-case with the search I
>>> did to identify the different register region that appear in shaderdb
>>> generated MOV instructions and 16/8 bit CTS tests, that used two
>>> registers with a region pattern that implied a different byte_pattern,
>>> for the first and the second register. But it should exist :)
>>
>> I don't see how special-casing MOV could help you in any way except by
>> luck. If your assumptions were correct you would be able to apply them
>> to all ALU instructions with Align1 regions.
>
> It worked because all the current uses are in range, but future uses of
> this function would have return incorrect values as you pointed.
>
> I've already send v4. Let's see if we are converging :)
>
> Chema
>
>>> If my bad understanding is before this point, I should back to re-review
>>> the Regioning PRM section. Or maybe I messing the IR level and the
>>> emission level that is another option with different semantics :-?
>>>
>>> Thank you for your time.
>>>
>>> Chema
>>>
>>>>>>> So we could call in this scenario.
>>>>>>>
>>>>>>> return periodic_mask(this->exec_size,
>>>>>>> this->src[i].stride * type_sz(this->src[i].type),
>>>>>>> type_sz(this->src[i].type),
>>>>>>> this->src[i].offset % REG_SIZE);
>>>>>>>
>>>>>>> The only issue I could think about that could generate issues would a
>>>>>>> case that I didn't found in our current code where a source is defined
>>>>>>> like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element
>>>>>>> register region with interleaved rows (r5.0<1;8,2>:w)".
>>>>>>>
>>>>>>
>>>>>> You couldn't really handle that even if you wanted to, because there is
>>>>>> no way to represent such a thing at the IR level, since the horizontal
>>>>>> and vertical strides cannot be controlled independently, only the stride
>>>>>> member is meaningful for VGRFs.
>>>>>
>>>>> Good, that simplifies things.
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> Chema
>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Chema
>>>>>>>
>>>>>>>
>>>>>>>>> Thanks for the review, I think that v2 has better shape.
>>>>>>>>>
>>>>>>>>
>>>>>>>> No problem.
>>>>>>>>
>>>>>>>>> Chema
>>>>>>>>>
>>>>>>>>>>> + assert(pattern);
>>>>>>>>>>> +
>>>>>>>>>>> + return pattern;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +
>>>>>>>>>>> unsigned
>>>>>>>>>>> fs_inst::components_read(unsigned i) const
>>>>>>>>>>> {
>>>>>>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>> index 92dad269a34..5ea6294b8ad 100644
>>>>>>>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>> @@ -350,6 +350,7 @@ public:
>>>>>>>>>>> bool equals(fs_inst *inst) const;
>>>>>>>>>>> bool is_send_from_grf() const;
>>>>>>>>>>> bool is_partial_write() const;
>>>>>>>>>>> + unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>>>>>>>>> bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>>>>>>>>> unsigned components_read(unsigned i) const;
>>>>>>>>>>> unsigned size_read(int arg) const;
>>>>>>>>>>> --
>>>>>>>>>>> 2.17.1
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>> mesa-dev at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 227 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20180727/79049cd7/attachment-0001.sig>
More information about the mesa-dev
mailing list