[Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst
Chema Casanova
jmcasanova at igalia.com
Fri Jul 20 17:33:26 UTC 2018
El 20/07/18 a las 00:34, Francisco Jerez escribió:
> Chema Casanova <jmcasanova at igalia.com> writes:
>
>> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>>> Jose Maria Casanova Crespo <jmcasanova at igalia.com> writes:
>>>
>>>> For a register source/destination of an instruction the function returns
>>>> the read/write byte pattern of a 32-byte registers as a unsigned int.
>>>>
>>>> The returned pattern takes into account the exec_size of the instruction,
>>>> the type bitsize, the stride and if the register is source or destination.
>>>>
>>>> The objective of the functions if to help to know the read/written bytes
>>>> of the instructions to improve the liveness analysis for partial read/writes.
>>>>
>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize
>>>> parameter they have a different read pattern.
>>>> ---
>>>> src/intel/compiler/brw_fs.cpp | 183 +++++++++++++++++++++++++++++++++
>>>> src/intel/compiler/brw_ir_fs.h | 1 +
>>>> 2 files changed, 184 insertions(+)
>>>>
>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>> this->dst.offset % REG_SIZE != 0);
>>>> }
>>>>
>>>> +/**
>>>> + * Returns a 32-bit uint whose bits represent if the associated register byte
>>>> + * has been read/written by the instruction. The returned pattern takes into
>>>> + * account the exec_size of the instruction, the type bitsize and the register
>>>> + * stride and the register is source or destination for the instruction.
>>>> + *
>>>> + * The objective of this function is to identify which parts of the register
>>>> + * are read or written for operations that don't read/write a full register.
>>>> + * So we can identify in live range variable analysis if a partial write has
>>>> + * completelly defined the part of the register used by a partial read. So we
>>>> + * avoid extending the liveness range because all data read was already
>>>> + * defined although the wasn't completely written.
>>>> + */
>>>> +unsigned
>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const
>>>> +{
>>>> + if (is_dst) {
>>
>>> Please split into two functions (like fs_inst::src_read and
>>> ::src_written) since that would make the call-sites of this method more
>>> self-documenting than a boolean parameter. You should be able to share
>>> code by refactoring the common logic into a separate function (see below
>>> for some suggestions on how that could be achieved).
>>
>> Sure, it would improve readability and simplifies the logic, I've chosen
>> dst_write_pattern and src_read_pattern.
>>
>>>
>>>> + /* We don't know what is written so we return the worts case */
>>>
>>> "worst"
>>
>> Fixed.
>>
>>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>> + return 0;
>>>> + /* We assume that send destinations are completely written */
>>>> + if (this->is_send_from_grf())
>>>> + return ~0u;
>>>
>>> Some send-like instructions won't be caught by this condition, you
>>> should check for this->mlen != 0 in addition.
>>
>> Would it be enough to check for (this->mlen > 0) and forget about
>> is_send_from_grf? I am using this approach in v2 I am sending.
>>
>
> I don't think the mlen > 0 condition would catch all cases either...
> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need both
> conditions. Sucks...
That is true, so now we have the:
(this->is_send_from_grf() || this->mlen != 0)
>>>> + } else {
>>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned
>>>> + * so the read pattern depends on the bitsize stored at src[4]
>>>> + */
>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>>> + this->src[1].nr == r.nr) {
>>
>>> I feel uncomfortable about attempting to guess the source the caller is
>>> referring to by comparing the registers for equality. E.g. you could
>>> potentially end up with two sources that compare equal but have
>>> different semantics (e.g. as a result of CSE) which might cause it to
>>> get the wrong answer. It would probably be better to pass a source
>>> index and a byte offset as argument instead of an fs_reg.
>>
>> I've didn't thought about CSE, I'm now receiving the number of source
>> and the reg_offset. I'm using reg_offset instead of byte offsets as it
>> simplifies the logic. Now we are using always the base src register to
>> do all the calculation
>>>> + switch (this->src[4].ud) {
>>>> + case 32:
>>>> + return ~0u;
>>>> + case 16:
>>>> + return 0x33333333;
>>>> + case 8:
>>>> + return 0x11111111;
>>>> + default:
>>>> + unreachable("Unsupported bitsize at byte_scattered_write_logical");
>>>> + }
>>>
>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>> this->src[4].ud / 8)" (see below for the definition).
>>
>> Ok.
>>
>>>> + }
>>>> + /* As for byte_scattered_write_logical but we need to take into account
>>>> + * that data written are in the payload offset 32 with SIMD8 and offset
>>>> + * 64 with SIMD16.
>>>> + */
>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>>> + this->src[0].nr == r.nr) {
>>>> + fs_reg payload = this->src[0];
>>>> + payload.offset = REG_SIZE * this->exec_size / 8;
>>>
>>> byte_offset() is your friend.
>>
>> I've removed the overlap logic, and I'm just checking if we are in the
>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16.
>>
>>>> + if (regions_overlap(r, REG_SIZE,
>>>> + payload, REG_SIZE * this->exec_size / 8)) {
>>>> + switch (this->src[2].ud) {
>>>> + case 32:
>>>> + return ~0u;
>>>> + case 16:
>>>> + return 0x33333333;
>>>> + case 8:
>>>> + return 0x11111111;
>>>> + default:
>>>> + unreachable("Unsupported bitsize at byte_scattered_write");
>>>> + }
>>>
>>> Replace the above switch statement with a call to "periodic_mask(8, 4,
>>> this->src[2].ud / 8)".
>>
>> Ok.
>>
>>>> + } else {
>>>> + return ~0u;
>>>> + }
>>>> + }
>>>> + }
>>>> +
>>>> + /* We define the most conservative value in order to calculate liveness
>>>> + * range. If it is a destination nothing is defined and if is a source
>>>> + * all the bytes of the register could be read. So for release builds
>>>> + * the unreachables would have always safe return value. */
>>>> + unsigned pattern = is_dst ? 0 : ~0u;
>>>> +
>>>> + /* In the general case we calculate the pattern for a specific register
>>>> + * on base of the type_size and stride. We calculate the SIMD8 pattern
>>>> + * and then we adjust the patter if needed for different exec_sizes
>>>> + * and offset
>>>> + */
>>>> + switch (type_sz(r.type)){
>>>> + case 1:
>>>> + switch (r.stride) {
>>>> + case 0:
>>>> + pattern = 0X1;
>>>> + break;
>>>> + case 1:
>>>> + pattern = 0xff;
>>>> + break;
>>>> + case 2:
>>>> + pattern = 0x5555;
>>>> + break;
>>>> + case 4:
>>>> + pattern = 0x11111111;
>>>> + break;
>>>> + case 8:
>>>> + pattern = 0x01010101;
>>>> + break;
>>>> + default:
>>>> + unreachable("Unknown pattern unsupported 8-bit stride");
>>>> + }
>>>> + break;
>>>> + case 2:
>>>> + switch (r.stride) {
>>>> + case 0:
>>>> + pattern = 0X3;
>>>> + break;
>>>> + case 1:
>>>> + pattern = 0xffff;
>>>> + break;
>>>> + case 2:
>>>> + pattern = 0x33333333;
>>>> + break;
>>>> + case 4:
>>>> + pattern = 0x03030303;
>>>> + break;
>>>> + case 8:
>>>> + pattern = 0x00030003;
>>>> + break;
>>>> + default:
>>>> + unreachable("Unknown pattern unsupported 16-bit stride");
>>>> + }
>>>> + break;
>>>> + case 4:
>>>> + switch (r.stride) {
>>>> + case 0:
>>>> + pattern = 0Xf;
>>>> + break;
>>>> + case 1:
>>>> + pattern = ~0u;
>>>> + break;
>>>> + case 2:
>>>> + pattern = 0x0f0f0f0f;
>>>> + break;
>>>> + case 4:
>>>> + pattern = 0x000f000f;
>>>> + break;
>>>> + default:
>>>> + unreachable("Unknown pattern unsupported 32-bit stride");
>>>> + }
>>>> + break;
>>>> + case 8:
>>>> + switch (r.stride) {
>>>> + case 0:
>>>> + pattern = 0Xff;
>>>> + break;
>>>> + case 1:
>>>> + pattern = ~0u;
>>>> + break;
>>>> + case 2:
>>>> + pattern = 0x00ff00ff;
>>>> + break;
>>>> + case 4:
>>>> + pattern = 0xff;
>>>> + break;
>>>> + default:
>>>> + unreachable("Unknown pattern unsupported 64-bit stride");
>>>> + }
>>>> + break;
>>>> + default:
>>>> + unreachable("Unknown pattern for unsupported bitsize ");
>>>> + }
>>>> +
>>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < REG_SIZE) {
>>>> + /* For exec_size greater than SIMD8 we repeat the pattern until it
>>>> + * represents a full register already represent a full register */
>>>> + pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type)));
>>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < REG_SIZE)
>>>> + pattern = pattern | (pattern << (16 * r.stride * type_sz(r.type)));
>>>> + } else if (this->exec_size < 8 &&
>>>> + r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) {
>>>> + /* For exec_size smaller than SIMD8 we reduce the pattern if its size
>>>> + * is smaller than a full register. */
>>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * r.stride) -
>>>> + this->exec_size * type_sz(r.type) * r.stride);
>>>> + }
>>>> +
>>
>>> This seems really mad, no clue whether it's correct. Why not replace
>>> the above ~110 lines with a call to the following (fully
>>> untested) 5-LOC function:
>>
>> Your suggestion seems to work perfectly fine, my original approach was
>> trying to avoid the loop of creating the read/write pattern but after
>> testing my v2 I wasn't able to notice any performance difference running
>> shader-db and having the same results. I was originally trying that
>> SIMD8 patterns were already constants. Sorry for the added complexity.
>>
>
> The loop runs for a logarithmic number of iterations though, so it has
> the exact same run-time complexity as your original patch, roughly the
> same amount of branches at run-time (but possibly less indirect
> branches!), and it should be compiled into a substantially lower number
> of instructions, which may actually cause it to perform better due to
> more favourable caching. It's hard to tell though which one will
> perform better in practice without benchmarking them, and as you
> probably realized this is so far from being a bottleneck that whatever
> the difference was is likely lost in the noise. So it really doesn't
> matter which one performs better...
>
>>>
>>> | uint32_t
>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>> | {
>>> | uint32_t m = (count ? (1 << bits) - 1 : 0);
>>> | const unsigined max = MIN2(count * step, sizeof(m) * CHAR_BITS);
>>> |
>>> | for (unsigned shift = step; shift < max; shift *= 2)
>>> | m |= m << shift;
>>> |
>>> | return m;
>>> | }
>>
>> I've used your function just changing the sizeof(m) * CHAR_BITS for the
>> REG_SIZE, to not include the limits.h.
>
> My intention was to make the function as agnostic to IR details as
> possible: the only reason there is a limit of 32 bits is because that's
> the size of the type used to hold the return value. Using sizeof makes
> sure that e.g. extending the code to 64 bits is as simple as changing
> the datatype to uint64_t.
>
>> And I've also included an offset parameter that allows us to shift the
>> bits of the pattern when we have an offset inside the register.
>>
>
> That sounds fine.
>
>>>> + /* We adjust the pattern to the byte_offset of the register */
>>>> + pattern = pattern << (r.offset % REG_SIZE);
>>>> +
>>>
>>> This doesn't really work except for r equal to the first GRF read by the
>>> source. Regions with non-zero sub-GRF offset that straddle multiple
>>> GRFs are not really very common at the IR level, but they're allowed by
>>> the hardware with some restrictions, so it would make sense to at least
>>> handle them conservatively in order to keep things from breaking silently.
>>
>>
>> I included an assert in the periodic_mask function to detect if the
>> combination of offset and mask goes over the REG_SIZE, so we would
>> detect an straddle at this level of the IR.
>>
>> assert(offset + max - (step - bits) <= REG_SIZE);
> Uhm... What's the definition of "offset" here? You seem to be passing
> the offset relative to the start of the VGRF modulo REG_SIZE but that
> doesn't really make sense to me whenever "reg_offset" is non-zero.
Yes I agree on removing this assert in the periodic_mask as it a generic
function. I was to focus on seen the integer as the register :)
The motivation was because of the following cases that happen with
8/16-bit more usually:
(1) mov(16) g12<2>HF g1<16,8,2>HF { align1 1H };
(2) mov(16) g12.1<2>HF g2<16,8,2>HF { align1 1H };
In the dst_write_pattern, at point of calling the general periodic_mask
case we know that instruction is not expected to be a send message
destination. So the pattern is the same for reg_offset = 0 and
reg_offset = 1.
0x3333333 for case (1) and 0xccccccc for case (2). But without repeating
the pattern offset we would get 0xcccccccc for reg_offset=0 but
0x333333333 for reg_offset=1 witch would be incorrect.
> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
> REG_SIZE" as a signed integer in order to get the offset of the first
> byte actually written by the instruction relative to the first byte of
> the GRF window of the pattern. You don't really need to assert-fail
> when the offset is greater or equal to 32 (which shouldn't actually
> happen in practice), "return 0" gives you the correct behavior. For
> negative offsets (which means the pattern starts after the first byte
> written by the instruction) you can just return ~0u conservatively
> whenever the current logic wouldn't work [assuming you don't feel like
> implementing the code to handle that case accurately ;)].
Maybe you feel more comfortable with the following approach for
src_read_pattern.:
For read sources we also assume that SENDs sources are completely
read so return ~0u except the byte_scattered_write source exceptions.
if (this->is_send_from_grf() || this->mlen != 0)
return ~0u;
So after this I think that we can assume that the following condition
should be always correct where source operand must reside within two
adjacent 256-bit registers so the pattern would be periodic for
all reg_offsets and we can use the (this->src[i].offset % REG_SIZE).
assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
this->src[i].stride * type_sz(this->src[i].type) * this->exec_size :
type_sz(this->src[i].type), REG_SIZE));
So we could call in this scenario.
return periodic_mask(this->exec_size,
this->src[i].stride * type_sz(this->src[i].type),
type_sz(this->src[i].type),
this->src[i].offset % REG_SIZE);
The only issue I could think about that could generate issues would a
case that I didn't found in our current code where a source is defined
like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element
register region with interleaved rows (r5.0<1;8,2>:w)".
What do you think?
Chema
>> Thanks for the review, I think that v2 has better shape.
>>
>
> No problem.
>
>> Chema
>>
>>>> + assert(pattern);
>>>> +
>>>> + return pattern;
>>>> +}
>>>> +
>>>> +
>>>> unsigned
>>>> fs_inst::components_read(unsigned i) const
>>>> {
>>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>>> index 92dad269a34..5ea6294b8ad 100644
>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>> @@ -350,6 +350,7 @@ public:
>>>> bool equals(fs_inst *inst) const;
>>>> bool is_send_from_grf() const;
>>>> bool is_partial_write() const;
>>>> + unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>> bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>> unsigned components_read(unsigned i) const;
>>>> unsigned size_read(int arg) const;
>>>> --
>>>> 2.17.1
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>
>>>>
>>>> _______________________________________________
>>>> mesa-dev mailing list
>>>> mesa-dev at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
More information about the mesa-dev
mailing list