[Mesa-dev] [PATCH 37/59] i965/fs: generalize SIMD16 interference workaround
Connor Abbott
cwabbott0 at gmail.com
Tue May 3 19:45:16 UTC 2016
On Tue, May 3, 2016 at 3:13 PM, Connor Abbott <cwabbott0 at gmail.com> wrote:
> On Tue, May 3, 2016 at 2:52 PM, Kenneth Graunke <kenneth at whitecape.org> wrote:
>> On Friday, April 29, 2016 1:29:34 PM PDT Samuel Iglesias Gonsálvez wrote:
>>> From: Connor Abbott <connor.w.abbott at intel.com>
>>>
>>> Work based on registers read/written instead of dispatch_width, so that
>>> the interferences are added for 64-bit sources/destinations as well.
>>> ---
>>> src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp | 67 +++++++++++++++
>> +-------
>>> 1 file changed, 48 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp b/src/mesa/
>> drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> index 2347cd5..f0e96f9 100644
>>> --- a/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_reg_allocate.cpp
>>> @@ -541,6 +541,34 @@ setup_mrf_hack_interference(fs_visitor *v, struct
>> ra_graph *g,
>>> }
>>> }
>>>
>>> +static unsigned
>>> +get_reg_node(const fs_reg& reg, unsigned first_payload_node)
>>> +{
>>> + switch (reg.file) {
>>> + case VGRF:
>>> + return reg.nr;
>>> + case ARF:
>>> + case FIXED_GRF:
>>> + return first_payload_node + reg.nr;
>>
>> This can't possibly be right for ARFs. I think it's OK for FIXED_GRF.
>>
>> I imagine we can just drop ARF.
>>
>>> + case MRF:
>>> + default:
>>> + unreachable("unhandled register file");
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void
>>> +add_reg_interference(struct ra_graph *g, const fs_reg& reg1,
>>> + const fs_reg& reg2, unsigned first_payload_node)
>>> +{
>>> + if ((reg1.file == VGRF || reg1.file == ARF || reg1.file == FIXED_GRF) &&
>>> + (reg2.file == VGRF || reg2.file == ARF || reg2.file == FIXED_GRF)) {
>>> + ra_add_node_interference(g, get_reg_node(reg1, first_payload_node),
>>> + get_reg_node(reg2, first_payload_node));
>>> + }
>>> +}
>>> +
>>> bool
>>> fs_visitor::assign_regs(bool allow_spilling)
>>> {
>>> @@ -643,26 +671,27 @@ fs_visitor::assign_regs(bool allow_spilling)
>>> }
>>> }
>>>
>>> - if (dispatch_width > 8) {
>>> - /* In 16-wide dispatch we have an issue where a compressed
>>> - * instruction is actually two instructions executed simultaneiously.
>>> - * It's actually ok to have the source and destination registers be
>>> - * the same. In this case, each instruction over-writes its own
>>> - * source and there's no problem. The real problem here is if the
>>> - * source and destination registers are off by one. Then you can end
>>> - * up in a scenario where the first instruction over-writes the
>>> - * source of the second instruction. Since the compiler doesn't know
>>> - * about this level of granularity, we simply make the source and
>>> - * destination interfere.
>>> - */
>>> - foreach_block_and_inst(block, fs_inst, inst, cfg) {
>>> - if (inst->dst.file != VGRF)
>>> - continue;
>>> + /* When instructions both read/write more than a single SIMD8 register,
>> we
>>> + * have an issue where an instruction is actually two instructions
>> executed
>>> + * simultaneiously. It's actually ok to have the source and destination
>>> + * registers be the same. In this case, each instruction over-writes
>> its
>>> + * own source and there's no problem. The real problem here is if the
>>> + * source and destination registers are off by one. Then you can end up
>> in
>>> + * a scenario where the first instruction over-writes the source of the
>>> + * second instruction. Since the compiler doesn't know about this level
>> of
>>> + * granularity, we simply make the source and destination interfere.
>>> + */
>>> + foreach_block_and_inst(block, fs_inst, inst, cfg) {
>>> + if (inst->dst.file != VGRF &&
>>> + inst->dst.file != ARF && inst->dst.file != FIXED_GRF)
>>> + continue;
>>>
>>> - for (int i = 0; i < inst->sources; ++i) {
>>> - if (inst->src[i].file == VGRF) {
>>> - ra_add_node_interference(g, inst->dst.nr, inst->src[i].nr);
>>> - }
>>> + if (inst->regs_written <= 1)
>>> + continue;
>>> +
>>> + for (int i = 0; i < inst->sources; ++i) {
>>> + if (inst->regs_read(i) > 1) {
>>> + add_reg_interference(g, inst->dst, inst->src[i],
>> first_payload_node);
>>> }
>>> }
>>> }
>>>
>>
>> This seems wrong to me. The point of the original code is that SIMD16
>> instructions internally decode as two SIMD8 instructions, which operate
>> sequentially. So, the first runs, and clobbers the source for the
>> second.
>>
>> I can't imagine that this happens for double floats - I imagine the
>> hardware reads and writes entire DFs in one instruction.
>>
>> I'm not sure why this patch is needed at all. Can we drop it, or do
>> things break?
>
> AFAICT, that's not how things work. Essentially, the HW can't
> read/write more than one SIMD8 register at a time, so even for SIMD8
> it has to break double reads/writes into two. If you think about it,
> this makes the restriction on horizontal strides not crossing a
> register make a lot more sense -- the HW just uses the horizontal
> stride inside a single read/write, and the vertical stride to figure
> out how to break up an instruction into multiple instructions. To me,
> this theory seemed to be confirmed by the fact that this patch did fix
> things, at least when I wrote it. You could try reverting it and
> seeing if it breaks things now, though.
Also, now that I think about it... if the HW really would atomically
read/write the registers for a single double SIMD8 instruction, why
can't it atomically read/write the same registers for a normal SIMD16
instruction? Also, why would we need to manually split up double
SIMD16 instructions but not SIMD16 instructions with 32-bit types?
It's all circumstantial evidence, but I can't think of anything that
doesn't point towards my interpretation being correct.
>
>>
>> If you have virtual opcodes that are passed all the way through to the
>> generator, and then become multiple instructions writing to the
>> destination, then you can add those opcodes to the
>> fs_inst::has_source_and_destination_hazard() method to prevent this kind
>> of trouble. But I don't think you have any of those...they're all
>> lowered away prior to register allocation...
>>
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
More information about the mesa-dev
mailing list