[Mesa-dev] [Mesa-stable] [PATCH v2 1/3] i965/fs: fix indirect load DF uniforms on BSW/BXT

Francisco Jerez currojerez at riseup.net
Thu Feb 23 05:48:11 UTC 2017


Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:

> On 21/02/17 21:07, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>> 
>>> On 20/02/17 21:31, Francisco Jerez wrote:
>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>
>>>>> On Mon, 2017-02-20 at 08:58 +0100, Samuel Iglesias Gonsálvez wrote:
>>>>>> On Sat, 2017-02-18 at 18:58 -0800, Francisco Jerez wrote:
>>>>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>>>>
>>>>>>>> The lowered BSW/BXT indirect move instructions had incorrect
>>>>>>>> source types, which luckily wasn't causing incorrect assembly to
>>>>>>>> be
>>>>>>>> generated due to the bug fixed in the next patch, but would have
>>>>>>>> confused the remaining back-end IR infrastructure due to the
>>>>>>>> mismatch
>>>>>>>> between the IR source types and the emitted machine code.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>> - Improve commit log (Curro)
>>>>>>>> - Fix read_size (Curro)
>>>>>>>> - Fix DF uniform array detection in assign_constant_locations()
>>>>>>>>   when it is accessed with 32-bit MOV_INDIRECTs on BSW/BXT.
>>>>>>>>
>>>>>>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>>>>>>> Cc: "17.0" <mesa-stable at lists.freedesktop.org>
>>>>>>>> ---
>>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp     | 11 ++++++++-
>>>>>>>>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 41 ++++++++++++++++----------------
>>>>>>>>  2 files changed, 30 insertions(+), 22 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> index c348bc7138d..93ab84b5845 100644
>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>>> @@ -1944,10 +1944,19 @@ fs_visitor::assign_constant_locations()
>>>>>>>>              unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
>>>>>>>>              assert(last < uniforms);
>>>>>>>>  
>>>>>>>> +            bool supports_64bit_indirects =
>>>>>>>> +               !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>>>>> +            /* Detect if this is the result of a 64-bit MOV INDIRECT.
>>>>>>>> +             * In case of BSW/BXT, we substitute each DF MOV INDIRECT
>>>>>>>> +             * by two 32-bit MOV INDIRECTs.
>>>>>>>> +             */
>>>>>>>> +            bool mov_indirect_64bit = (type_sz(inst->src[i].type) == 8) ||
>>>>>>>> +               (!supports_64bit_indirects && inst->dst.type == BRW_REGISTER_TYPE_UD &&
>>>>>>>> +                inst->src[0].type == BRW_REGISTER_TYPE_UD && inst->dst.stride == 2);
>>>>>>>
>>>>>>> This seems kind of fragile, I don't think the optimizer gives you
>>>>>>> any
>>>>>>> guarantees that the stride of a lowered 64-bit indirect move will
>>>>>>> remain
>>>>>>> equal to two, or that the destination stride of an actual 32-bit
>>>>>>> indirect uniform load will never end up being two as well.  That
>>>>>>> said,
>>>>>>> because you access these with 32-bit indirect moves, I don't see
>>>>>>> why
>>>>>>> you'd need to treat them as 64-bit uniforms here, the usual
>>>>>>> alignment
>>>>>>> requirements for 64-bit uniforms no longer apply, so you can treat
>>>>>>> them
>>>>>>> as regular 32-bit uniforms AFAICT.  Why did you add this hunk?
>>>>>>>
>>>>>>
>>>>>> I added it because of this case: if we access one DF uniform array
>>>>>> element with a normal MOV and the rest with MOV INDIRECT, we will mark
>>>>>> the former as a live 64-bit variable. Then the array ends up scattered:
>>>>>> part of it is uploaded as a 64-bit uniform and the rest as 32-bit.
>>>>>> Even if we avoid this by uploading everything together as 32-bit, the
>>>>>> access to that DF might not be aligned to 64 bits.
>>>>>>
>>>>>> So my idea was to find a way to somehow identify those MOV INDIRECTs
>>>>>> on BSW in order to mark the whole array as 64-bit.
>>>>>>
>>>>>
>>>>> Mmm, maybe I can fix this case without the hack I did. I can add the
>>>>> following code after marking all live variables accessed by the
>>>>> instructions.
>>>>>
>>>>> It is very similar to the code that detects live variables, but it
>>>>> fixes the case where a MOV INDIRECT on BSW accesses a uniform array of
>>>>> DF elements while one of those elements is directly accessed by
>>>>> another instruction.
>>>>>
>>>>> What do you think?
>>>>>
>>>>
>>>> Looks somewhat better, but I don't think this is correct if you have
>>>> multiple overlapping indirect loads of the same uniform array and only
>>>> one of them overlaps with a direct 64-bit load.
>>>
>>> In that case, I think this is correct. The two 32-bit MOV INDIRECTs were
>>> emitted as a "lowering" of an unsupported 64-bit MOV INDIRECT. They both
>>> keep the 'read_size' of the original one, so they both overlap with any
>>> other direct 64-bit load from that array, just like the original
>>> instruction. If neither of them overlaps with the direct 64-bit access,
>>> then I think they can be handled as non-contiguous with the latter
>>> without any issue.
>>>
>> 
>> What if you have two lowered indirect loads with overlapping but not
>> identical range, and only the second one contains elements accessed with
>> regular 64-bit moves?  Wouldn't your change mark part of the array
>> 64-bit aligned and the rest 32-bit aligned?
>> 
>
> Right. What we can do is similar to the optimization loop: put this
> code in a loop and, if one of the iterations marked some area as
> 64-bit, repeat the process until no more progress is made. Then we are
> sure we mark all overlapping areas with the same block size.
>
>>>>  Apparently
>>>> assign_constant_locations() already has code to attempt to push
>>>> indirectly accessed uniform sections contiguously, but the logic seems
>>>> broken in multiple ways, even on platforms other than CHV/BXT... The
>>>> first is_live_64bit location assignment loop doesn't consider non-64bit
>>>> uniforms even if they're marked as contiguous with a 64-bit uniform,
>>>> because the contiguous block tracking is done from within
>>>> set_push_pull_constant_loc() which may not be called at all if the
>>>> continue condition evaluates to true at the top of the loop.
>>>
>>> The idea for that block was to push all 64-bit uniforms first and
>>> together so that they are 64-bit aligned and there are no gaps between
>>> them, assuming, of course, they are not contiguous with non-64-bit data.
>>>
>>> They cannot be contiguous with non-64-bit data due to a declaration in a
>>> GLSL shader, because it is not allowed to have two uniforms with the same
>>> location,
>> 
>> They definitely can right now, due to mixed 32-bit and 64-bit access to
>> uniform arrays on CHV/BXT and due to the last element of a 64-bit array
>> being incorrectly marked as 32-bit on, AFAICT, any platform.  Also I don't
>> see anything that would prevent the same from happening due to the
>> compiler propagating uniform loads into 32-bit bitwise operations.
>> 
>
> What I was talking about is that this is not due to the GLSL shader
> declarations themselves, but because of the way we are doing things
> inside the driver due to HW limitations.
>
>>> i.e. we cannot have two "views" of the same uniform variable
>>> (by view, I mean reading the same uniform as a 32-bit type in one and as
>>> a 64-bit type in the other). See section "4.4.3 Uniform Variable Layout
>>> Qualifiers" in the GLSL 4.50 spec. And if we have two consecutive arrays,
>>> we can safely upload them to different locations of the push constant buffer.
>>>
>>> So, this can only happen with instructions that access the same area with
>>> a different bit-size, generated by the driver itself. I am going to
>>> assume here that we are talking about a uniform variable defined with a
>>> type of X-bit size that we access as a Y-bit size, with X > Y. If the
>>> reverse is possible (I am not aware of any such case though), then the
>>> solution would be more complex, as we need to take alignments into account.
>>>
>>> Knowing that, we can do one of these solutions:
>>>
>>> 1) Fix case by case. This is what I have proposed for BSW.
>>>    The idea is to see if any overlapped area is accessed with
>>>    different bit-sizes, mark the whole area with the larger bit-size
>>>    and upload it to the proper place in the push constant buffer.
>>>    On BSW, we know this is happening due to a lowered DF MOV INDIRECT,
>>>    so we can safely mark the whole area as DF in case of mixed accesses.
>> 
>> This sounds like a partial and rather fragile solution; the compiler
>> gives you no guarantees that the indirect or direct move instruction
>> loading from the uniform file is going to look the same as when you
>> inserted it into the program by the end of the optimization loop.
>> 
>>> 2) Generalize 1) to all bit-sizes and generations.
>>>
>> I don't think we need to generalize things to other bit sizes for the
>> moment, but none of this problem is generation-dependent, so introducing
>> generation-dependent hacks only complicates the uniform assignment code
>> if you don't consider it crazy enough already...  Really, the uniform
>> allocation code should only guarantee two things, which are fully
>> generation-independent in principle and which we currently fail at:
>> 
>>  - Assign uniforms accessed contiguously to consecutive locations in the
>>    uniform buffer (whether it's a pull or push constant buffer).
>> 
>>  - Align the starting address of any contiguously accessed block (which
>>    may be the union of the ranges of multiple instructions) to the bit
>>    size of the largest access in the block.
>> 
>
> Agreed.
>
>>>> Also
>>>> AFAICT this could potentially lead to a very large amount of data being
>>>> considered as contiguous if you have a contiguous 64-bit block, then
>>>> random unrelated 32-bit data, and then another contiguous 64-bit block,
>>>> because the very last element of a MOV_INDIRECT block doesn't seem to be
>>>> marked is_live_64bit, which is also the one that would terminate the
>>>> 64-bit contiguous block and actually cause it to be assigned a location.
>>>
>>> Right. Marking the last element as is_live_64bit is a trivial fix; I
>>> will do it too.
>>>
>> 
>> Thank you.
>> 
>>>> I don't think the current code would even guarantee proper alignment for
>>>> subsequent 64-bit blocks in that case because it may be pushing unrelated
>>>> 32-bit data in between.
>>>> Worse, if the UNIFORM space ends with a 64-bit
>>>> contiguous uniform block it may be considered contiguous with whatever
>>>> unrelated data is handled first by the next 32-bit uniform assignment
>>>> pass.
>>>>
>>>
>>> I don't see how this can happen. 'last' marks the end of the area,
>>> and its position is not marked as contiguous except for overlapping
>>> accesses, which we know are done on purpose by the driver (e.g. BSW's
>>> MOV INDIRECT). The whole contiguous area is marked with the same
>>> bit-size, after the aforementioned solutions.
>>>
>> It isn't necessarily the same bit-size right now (because of CHV/BXT,
>> because of the is_live_64bit[last] bug and potentially other things),
>> which can cause the location assignment loop to skip over some elements
>> which may include the terminator element of a contiguous block, so we
>> may continue growing the block including random non-64-bit-aligned data.
>> 
>>> What do you think about looping over contiguous[], looking for
>>> contiguous uniforms and asserting that they all have the same bit-size,
>>> just before setting the pull/push constant locations? Then we can easily
>>> know if we are wrong before uploading anything.
>>>
>> 
>> I guess an assertion won't fix the problem when it actually happens. ;)
>> 
>>>> Sigh...  Seems like this was only scratching the surface...
>>>>
>>>
>>> :-(
>>>
>>> As I might be missing something, I am open to suggestions/opinions...
>>>
>> 
>> I think the contiguous block tracking logic needs to be reworked and
>> pulled outside of set_push_pull_constant_loc() so the location
>> assignment loop can figure out what the correct alignment is for the
>> whole block.
>> 
>
> Right.
>
> What do you think about a solution like: fix the is_live_64bit bug with
> the last element; take the contiguous block tracking logic out of
> set_push_pull_constant_loc();

Sounds good.

> and have a loop that looks for contiguous areas, marks them with the
> largest block size accessed in each area, and repeats the process until
> no more progress can be made.
>
I don't think you need to repeat the process in any case; that would
likely increase the runtime complexity of the algorithm.  Instead, you
could just track the maximum alignment requirement seen so far for the
current contiguous block while the set_push_pull_constant_loc() loop
searches for the 'contiguous[u] == false' terminator.  Once you get
there you would either ignore the block if it doesn't have the correct
alignment or give it a location in the uniform buffer.

The two passes could be done in a single outer loop which would allow
you to avoid some code duplication and would be easier to extend to
arbitrary datatype sizes than the current approach.
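In sketch form, the single-pass scheme could look something like the
following.  This is only an illustration, not the actual fs_visitor code:
the function name assign_block_locations, the contiguous/access_size
inputs and the 32-bit slot arithmetic are all made up for the example,
and the two push/pull passes are collapsed into one.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch: while scanning the uniform file, remember the
// largest access size (in bytes) seen in the current contiguous block;
// at the 'contiguous[u] == false' terminator, round the next free slot
// up so the whole block is aligned to that size, then assign locations.
std::vector<int>
assign_block_locations(const std::vector<bool> &contiguous,
                       const std::vector<unsigned> &access_size)
{
   std::vector<int> loc(contiguous.size(), -1);
   unsigned next_slot = 0;       // next free 32-bit slot in the buffer
   std::size_t block_start = 0;  // first uniform of the current block
   unsigned max_align = 4;       // largest access seen in current block

   for (std::size_t u = 0; u < contiguous.size(); u++) {
      max_align = std::max(max_align, access_size[u]);

      if (!contiguous[u]) {
         // 'u' terminates the block: align its start to max_align.
         const unsigned align_slots = max_align / 4;
         next_slot = (next_slot + align_slots - 1) / align_slots * align_slots;

         for (std::size_t j = block_start; j <= u; j++)
            loc[j] = next_slot++;

         block_start = u + 1;
         max_align = 4;
      }
   }
   return loc;
}
```

For instance, with contiguous = {false, true, false} and access sizes
{4, 8, 8}, uniform 0 lands in slot 0 and the mixed block spanning
uniforms 1-2 is pushed starting at slot 2, i.e. 64-bit aligned, without
ever repeating the scan.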

> I am going to develop this solution and see if I am missing something else.
>
> Sam
>
>>> Sam
>>>
>>>>> Sam
>>>>>
>>>>> +   /* Detect if we accessed an array of DF uniforms in both direct and
>>>>> +    * indirect ways on BSW/BXT. If this is the case, then we need to
>>>>> +    * take it into account in order to push all the elements together.
>>>>> +    */
>>>>> +   if (devinfo->is_cherryview || devinfo->is_broxton) {
>>>>> +      foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>>>> +         for (int i = 0; i < inst->sources; i++) {
>>>>> +            if (inst->src[i].file != UNIFORM)
>>>>> +               continue;
>>>>> +
>>>>> +            int constant_nr = inst->src[i].nr + inst->src[i].offset / 4;
>>>>> +
>>>>> +            if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) {
>>>>> +               assert(inst->src[2].ud % 4 == 0);
>>>>> +               unsigned last = constant_nr + (inst->src[2].ud / 4) - 1;
>>>>> +               assert(last < uniforms);
>>>>> +
>>>>> +               bool is_df_array = false;
>>>>> +               for (unsigned j = constant_nr; j < last; j++) {
>>>>> +                  if (is_live_64bit[j]) {
>>>>> +                     is_df_array = true;
>>>>> +                     break;
>>>>> +                  }
>>>>> +               }
>>>>> +
>>>>> +               if (is_df_array) {
>>>>> +                  /* If part of it is accessed with direct addressing
>>>>> +                   * as a DF then we mark the whole array as DF.
>>>>> +                   */
>>>>> +                  for (unsigned j = constant_nr; j < last; j++)
>>>>> +                     is_live_64bit[j] = true;
>>>>> +               }
>>>>> +            }
>>>>> +         }
>>>>> +      }
>>>>> +   }
>>>>> +
>>>>>
>>>>>>> Other than that patch looks good to me.
>>>>>>>
>>>>>>
>>>>>> OK, thanks.
>>>>>>
>>>>>> Sam
>>>>>>
>>>>>>>>              for (unsigned j = constant_nr; j < last; j++) {
>>>>>>>>                 is_live[j] = true;
>>>>>>>>                 contiguous[j] = true;
>>>>>>>> -               if (type_sz(inst->src[i].type) == 8) {
>>>>>>>> +               if (mov_indirect_64bit) {
>>>>>>>>                    is_live_64bit[j] = true;
>>>>>>>>                 }
>>>>>>>>              }
>>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>>> index 991c20fad62..4aaf84964e9 100644
>>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>>> @@ -3878,31 +3878,30 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
>>>>>>>>           unsigned read_size = instr->const_index[1] -
>>>>>>>>              (instr->num_components - 1) * type_sz(dest.type);
>>>>>>>>  
>>>>>>>> -         fs_reg indirect_chv_high_32bit;
>>>>>>>> -         bool is_chv_bxt_64bit =
>>>>>>>> -            (devinfo->is_cherryview || devinfo->is_broxton) &&
>>>>>>>> -            type_sz(dest.type) == 8;
>>>>>>>> -         if (is_chv_bxt_64bit) {
>>>>>>>> -            indirect_chv_high_32bit = vgrf(glsl_type::uint_type);
>>>>>>>> -            /* Calculate indirect address to read high 32 bits */
>>>>>>>> -            bld.ADD(indirect_chv_high_32bit, indirect, brw_imm_ud(4));
>>>>>>>> -         }
>>>>>>>> +         bool supports_64bit_indirects =
>>>>>>>> +            !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>>>>>  
>>>>>>>> -         for (unsigned j = 0; j < instr->num_components; j++) {
>>>>>>>> -            if (!is_chv_bxt_64bit) {
>>>>>>>> +         if (type_sz(dest.type) != 8 || supports_64bit_indirects) {
>>>>>>>> +            for (unsigned j = 0; j < instr->num_components; j++) {
>>>>>>>>                 bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>>>                          offset(dest, bld, j), offset(src, bld, j),
>>>>>>>>                          indirect, brw_imm_ud(read_size));
>>>>>>>> -            } else {
>>>>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>>> -                        subscript(offset(dest, bld, j), BRW_REGISTER_TYPE_UD, 0),
>>>>>>>> -                        offset(src, bld, j),
>>>>>>>> -                        indirect, brw_imm_ud(read_size));
>>>>>>>> -
>>>>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>>> -                        subscript(offset(dest, bld, j), BRW_REGISTER_TYPE_UD, 1),
>>>>>>>> -                        offset(src, bld, j),
>>>>>>>> -                        indirect_chv_high_32bit, brw_imm_ud(read_size));
>>>>>>>> +            }
>>>>>>>> +         } else {
>>>>>>>> +            const unsigned num_mov_indirects =
>>>>>>>> +               type_sz(dest.type) / type_sz(BRW_REGISTER_TYPE_UD);
>>>>>>>> +            /* We read a little bit less per MOV INDIRECT, as they are
>>>>>>>> +             * now 32-bit ones instead of 64-bit. Fix read_size then.
>>>>>>>> +             */
>>>>>>>> +            const unsigned read_size_32bit = read_size -
>>>>>>>> +                (num_mov_indirects - 1) * type_sz(BRW_REGISTER_TYPE_UD);
>>>>>>>> +            for (unsigned j = 0; j < instr->num_components; j++) {
>>>>>>>> +               for (unsigned i = 0; i < num_mov_indirects; i++) {
>>>>>>>> +                  bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>>> +                           subscript(offset(dest, bld, j), BRW_REGISTER_TYPE_UD, i),
>>>>>>>> +                           subscript(offset(src, bld, j), BRW_REGISTER_TYPE_UD, i),
>>>>>>>> +                           indirect, brw_imm_ud(read_size_32bit));
>>>>>>>> +               }
>>>>>>>>              }
>>>>>>>>           }
>>>>>>>>        }
>>>>>>>> -- 
>>>>>>>> 2.11.0
>>>>>>
>>>>>> _______________________________________________
>>>>>> mesa-dev mailing list
>>>>>> mesa-dev at lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>> _______________________________________________
>>> mesa-stable mailing list
>>> mesa-stable at lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/mesa-stable
>