[Mesa-dev] [PATCH v2 1/3] i965/fs: fix indirect load DF uniforms on BSW/BXT

Tue Feb 21 11:50:10 UTC 2017

On 20/02/17 21:31, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
>> On Mon, 2017-02-20 at 08:58 +0100, Samuel Iglesias Gonsálvez wrote:
>>> On Sat, 2017-02-18 at 18:58 -0800, Francisco Jerez wrote:
>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>
>>>>> The lowered BSW/BXT indirect move instructions had incorrect
>>>>> source types, which luckily wasn't causing incorrect assembly to
>>>>> be
>>>>> generated due to the bug fixed in the next patch, but would have
>>>>> confused the remaining back-end IR infrastructure due to the
>>>>> mismatch
>>>>> between the IR source types and the emitted machine code.
>>>>>
>>>>> v2:
>>>>> - Improve commit log (Curro)
>>>>> - Fix read_size (Curro)
>>>>> - Fix DF uniform array detection in assign_constant_locations()
>>>>> when
>>>>>   it is acceded with 32-bit MOV_INDIRECTs in BSW/BXT.
>>>>>
>>>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>>>> Cc: "17.0" <mesa-stable at lists.freedesktop.org>
>>>>> ---
>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp     | 11 ++++++++-
>>>>>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 41 ++++++++++++++++--
>>>>> --------------
>>>>>  2 files changed, 30 insertions(+), 22 deletions(-)
>>>>>
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> index c348bc7138d..93ab84b5845 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>> @@ -1944,10 +1944,19 @@ fs_visitor::assign_constant_locations()
>>>>>              unsigned last = constant_nr + (inst->src[2].ud / 4)
>>>>> -
>>>>> 1;
>>>>>              assert(last < uniforms);
>>>>>  
>>>>> +            bool supports_64bit_indirects =
>>>>> +               !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>> +            /* Detect if this is as a result 64-bit MOV
>>>>> INDIRECT.
>>>>> In case
>>>>> +             * of BSW/BXT, we substitute each DF MOV INDIRECT by
>>>>> two 32-bit MOV
>>>>> +             * INDIRECT.
>>>>> +             */
>>>>> +            bool mov_indirect_64bit = (type_sz(inst-
>>>>>> src[i].type)
>>>>> == 8) ||
>>>>> +               (!supports_64bit_indirects && inst->dst.type ==
>>>>> BRW_REGISTER_TYPE_UD &&
>>>>> +                inst->src[0].type == BRW_REGISTER_TYPE_UD &&
>>>>> inst-
>>>>>> dst.stride == 2);
>>>>
>>>> This seems kind of fragile, I don't think the optimizer gives you
>>>> any
>>>> guarantees that the stride of a lowered 64-bit indirect move will
>>>> remain
>>>> equal to two, or that the destination stride of an actual 32-bit
>>>> indirect uniform load will never end up being two as well.  That
>>>> said,
>>>> because you access these with 32-bit indirect moves, I don't see
>>>> why
>>>> you'd need to treat them as 64-bit uniforms here, the usual
>>>> alignment
>>>> requirements for 64-bit uniforms no longer apply, so you can treat
>>>> them
>>>> as regular 32-bit uniforms AFAICT.  Why did you add this hunk?
>>>>
>>>
>>> I added it because of this case: if we access to one DF uniform array
>>> element with a normal MOV and the rest with MOV INDIRECT, we will
>>> mark
>>> the former as a live 64bit variable. Then we have the array scattered
>>> as part of it is uploaded as a 64-bits uniform and the other as 32-
>>> bits. Even if we avoid this by uploading everything together as 32-
>>> bits, then the access to that DF could not be aligned to 64-bits.
>>>
>>> So my idea was to find a way to identify somehow those MOV INDIRECT
>>> in
>>> BSW to mark all the array as a 64-bit one.
>>>
>>
>> Mmm, maybe I can fix this case without the hack I did. I can add the
>> following code after marking all live variables accessed by the
>> instructions.
>>
>> It is very similar to the one to detect live variables but it is fixing
>> the case where any MOV INDIRECT in BSW is accessing to an uniform array
>> of DF elements where one of these elements is directly accessed by
>> another instruction.
>>
>> What do you think?
>>
> 
> Looks somewhat better, but I don't think this is correct if you have
> multiple overlapping indirect loads of the same uniform array and only
> one of them overlaps with a direct 64-bit load.

In that case, I think this is correct. The 2 32-bit MOV INDIRECTs where
emitted as a "lowering" of an unsupported 64-bit MOV INDIRECT. They both
keep the 'read_size' of the original one, so they both overlap to any
other direct 64-bit load to that array like with the original
instruction. If none of them overlap to the direct 64-bit access, then I
think they can be handled as non-contiguous to the latter without any issue.

>  Apparently
> assign_constant_locations() already has code to attempt to push
> indirectly accessed uniform sections contiguously, but the logic seems
> broken in multiple ways, even on platforms other than CHV/BXT... The
> first is_live_64bit location assignment loop doesn't consider non-64bit
> uniforms even if they're marked as contiguous with a 64-bit uniform,
> because the contiguous block tracking is done from within
> set_push_pull_constant_loc() which may not be called at all if the
> continue condition evaluates to true at the top of the loop.

The idea for that block was to push all 64-bit uniforms first and
together so they are aligned to 64-bit, there are no gaps between them
and assuming, of course, they are not contiguous to non-64-bit data.

They cannot be contiguous to non-64-bit data due to a declaration in a
GLSL shader because it is not allowed to have two uniforms with the same
location, i.e. we cannot have two "views" of the same uniform variable
(for view, I mean read the same uniform as 32-bits type in one, and as
64-bits type in other). See "4.4.3 Uniform Variable Layout Qualifiers"
section in GLSL 4.50 spec. And if we have two consecutive arrays, we can
safely upload them in different locations of the push constant buffer.

So, this can only happen with instructions accessing the same area, with
a different bit-size and generated by the driver itself. I am going to
assume here than we are talking about a uniform variable defined with a
type of X-bit size and we access to it as Y-bit size, be X > Y. If the
vice-versa is possible (I am not aware of any case though), then the
solution would be more complex as we need to take into account alignments.

Knowing that, we can do one of these solutions:

1) Fix case by case. This is what I have proposed in BSW.
   The idea is to see if any overlapped area is accessed with
   different bit-sizes and mark the whole area with the bigger bit-size
   and upload it in the proper place in the push constant buffer.
   With BSW, we know this is happening due to a lowered DF-MOV INDIRECT
   so we can safely mark the whole area as DF in case of mixed accesses.
2) Generalize 1) to all bit-sizes and generations.

> Also
> AFAICT this could potentially lead to a very large amount of data being
> considered as contiguous if you have a contiguous 64-bit block, then
> random unrelated 32-bit data, and then another contiguous 64-bit block,
> because the very last element of a MOV_INDIRECT block doesn't seem to be
> marked is_live_64bit, which is also the one that would terminate the
> 64-bit contiguous block and actually cause it to be assigned a location.

Right. Marking the last element as is_live_64_bit is a trivial fix, I
will do it too.

> I don't think the current code would even guarantee proper alignment for
> subsequent 64-blocks in that case because it may be pushing unrelated
> 32-bit data in between.
> Worse, if the UNIFORM space ends with a 64-bit
> contiguous uniform block it may be considered contiguous with whatever
> unrelated data is handled first by the next 32-bit uniform assignment
> pass.
> 

I don't see how this can happen. 'last' is marking the end of the area
and its position is not marked as contiguous except with overlapped
accesses, where we know are done in purpose by the driver (e.g BSW's MOV
INDIRECT). The whole contiguous area is marked with the same bit-size,
after the aforementioned solutions.

What do you think about looping on contiguous[], look for contiguous
uniforms and assert if they all have the same bit-size, just before
setting pull/push constant locations? Then we can easily know if we are
wrong before uploading anything.

> Sigh...  Seems like this was only scratching the surface...
> 

:-(

As I might be missing something, I am open to suggestions/opinions...

Sam

>> Sam
>>
>> +   /* Detect if we accessed to an array of DF uniforms in both direct
>> and
>> +    * indirect ways on BSW/BXT. If this is the case, then we need
>> +    * to take into account in order to push all the elements together.
>> +    */
>> +   if (devinfo->is_cherryview || devinfo->is_broxton) {
>> +      foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>> +         for (int i = 0 ; i < inst->sources; i++) {
>> +            if (inst->src[i].file != UNIFORM)
>> +               continue;
>> +
>> +            int constant_nr = inst->src[i].nr + inst->src[i].offset /
>> 4;
>> +
>> +            if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0)
>> {
>> +               assert(inst->src[2].ud % 4 == 0);
>> +               unsigned last = constant_nr + (inst->src[2].ud / 4) -
>> 1;
>> +               assert(last < uniforms);
>> +
>> +               bool is_df_array = false;
>> +               for (unsigned j = constant_nr; j < last; j++) {
>> +                  if (is_live_64bit[j]) {
>> +                     is_df_array = true;
>> +                     break;
>> +                  }
>> +               }
>> +
>> +               if (is_df_array) {
>> +                  /* If part of it is accessed with direct addressing
>> as a DF
>> +                   * then we mark the whole array as DF.
>> +                   */
>> +                  for (unsigned j = constant_nr; j < last; j++)
>> +                     is_live_64bit[j] = true;
>> +               }
>> +            }
>> +         }
>> +      }
>> +   }
>> +
>>
>>>> Other than that patch looks good to me.
>>>>
>>>
>>> OK, thanks.
>>>
>>> Sam
>>>
>>>>>              for (unsigned j = constant_nr; j < last; j++) {
>>>>>                 is_live[j] = true;
>>>>>                 contiguous[j] = true;
>>>>> -               if (type_sz(inst->src[i].type) == 8) {
>>>>> +               if (mov_indirect_64bit) {
>>>>>                    is_live_64bit[j] = true;
>>>>>                 }
>>>>>              }
>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>> b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>> index 991c20fad62..4aaf84964e9 100644
>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>> @@ -3878,31 +3878,30 @@ fs_visitor::nir_emit_intrinsic(const
>>>>> fs_builder &bld, nir_intrinsic_instr *instr
>>>>>           unsigned read_size = instr->const_index[1] -
>>>>>              (instr->num_components - 1) * type_sz(dest.type);
>>>>>  
>>>>> -         fs_reg indirect_chv_high_32bit;
>>>>> -         bool is_chv_bxt_64bit =
>>>>> -            (devinfo->is_cherryview || devinfo->is_broxton) &&
>>>>> -            type_sz(dest.type) == 8;
>>>>> -         if (is_chv_bxt_64bit) {
>>>>> -            indirect_chv_high_32bit =
>>>>> vgrf(glsl_type::uint_type);
>>>>> -            /* Calculate indirect address to read high 32 bits
>>>>> */
>>>>> -            bld.ADD(indirect_chv_high_32bit, indirect,
>>>>> brw_imm_ud(4));
>>>>> -         }
>>>>> +         bool supports_64bit_indirects =
>>>>> +            !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>>  
>>>>> -         for (unsigned j = 0; j < instr->num_components; j++) {
>>>>> -            if (!is_chv_bxt_64bit) {
>>>>> +         if (type_sz(dest.type) != 8 ||
>>>>> supports_64bit_indirects)
>>>>> {
>>>>> +            for (unsigned j = 0; j < instr->num_components; j++)
>>>>> {
>>>>>                 bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>                          offset(dest, bld, j), offset(src, bld,
>>>>> j),
>>>>>                          indirect, brw_imm_ud(read_size));
>>>>> -            } else {
>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>> -                        subscript(offset(dest, bld, j),
>>>>> BRW_REGISTER_TYPE_UD, 0),
>>>>> -                        offset(src, bld, j),
>>>>> -                        indirect, brw_imm_ud(read_size));
>>>>> -
>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>> -                        subscript(offset(dest, bld, j),
>>>>> BRW_REGISTER_TYPE_UD, 1),
>>>>> -                        offset(src, bld, j),
>>>>> -                        indirect_chv_high_32bit,
>>>>> brw_imm_ud(read_size));
>>>>> +            }
>>>>> +         } else {
>>>>> +            const unsigned num_mov_indirects =
>>>>> +               type_sz(dest.type) /
>>>>> type_sz(BRW_REGISTER_TYPE_UD);
>>>>> +            /* We read a little bit less per MOV INDIRECT, as
>>>>> they
>>>>> are now
>>>>> +             * 32-bits ones instead of 64-bit. Fix read_size
>>>>> then.
>>>>> +             */
>>>>> +            const unsigned read_size_32bit = read_size -
>>>>> +                (num_mov_indirects - 1) *
>>>>> type_sz(BRW_REGISTER_TYPE_UD);
>>>>> +            for (unsigned j = 0; j < instr->num_components; j++)
>>>>> {
>>>>> +               for (unsigned i = 0; i < num_mov_indirects; i++)
>>>>> {
>>>>> +                  bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>> +                           subscript(offset(dest, bld, j),
>>>>> BRW_REGISTER_TYPE_UD, i),
>>>>> +                           subscript(offset(src, bld, j),
>>>>> BRW_REGISTER_TYPE_UD, i),
>>>>> +                           indirect,
>>>>> brw_imm_ud(read_size_32bit));
>>>>> +               }
>>>>>              }
>>>>>           }
>>>>>        }
>>>>> -- 
>>>>> 2.11.0
>>>
>>> _______________________________________________
>>> mesa-dev mailing list
>>> mesa-dev at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 862 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170221/9088438f/attachment-0001.sig>