[Mesa-dev] [Mesa-stable] [PATCH v2 1/3] i965/fs: fix indirect load DF uniforms on BSW/BXT

Wed Feb 22 06:38:54 UTC 2017

On 21/02/17 21:07, Francisco Jerez wrote:
> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> 
>> On 20/02/17 21:31, Francisco Jerez wrote:
>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>
>>>> On Mon, 2017-02-20 at 08:58 +0100, Samuel Iglesias Gonsálvez wrote:
>>>>> On Sat, 2017-02-18 at 18:58 -0800, Francisco Jerez wrote:
>>>>>> Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>>>>>>
>>>>>>> The lowered BSW/BXT indirect move instructions had incorrect
>>>>>>> source types, which luckily wasn't causing incorrect assembly to
>>>>>>> be
>>>>>>> generated due to the bug fixed in the next patch, but would have
>>>>>>> confused the remaining back-end IR infrastructure due to the
>>>>>>> mismatch
>>>>>>> between the IR source types and the emitted machine code.
>>>>>>>
>>>>>>> v2:
>>>>>>> - Improve commit log (Curro)
>>>>>>> - Fix read_size (Curro)
>>>>>>> - Fix DF uniform array detection in assign_constant_locations()
>>>>>>> when
>>>>>>>   it is acceded with 32-bit MOV_INDIRECTs in BSW/BXT.
>>>>>>>
>>>>>>> Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>>>>>>> Cc: "17.0" <mesa-stable at lists.freedesktop.org>
>>>>>>> ---
>>>>>>>  src/mesa/drivers/dri/i965/brw_fs.cpp     | 11 ++++++++-
>>>>>>>  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 41 ++++++++++++++++--
>>>>>>> --------------
>>>>>>>  2 files changed, 30 insertions(+), 22 deletions(-)
>>>>>>>
>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> index c348bc7138d..93ab84b5845 100644
>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>>>>>>> @@ -1944,10 +1944,19 @@ fs_visitor::assign_constant_locations()
>>>>>>>              unsigned last = constant_nr + (inst->src[2].ud / 4)
>>>>>>> -
>>>>>>> 1;
>>>>>>>              assert(last < uniforms);
>>>>>>>  
>>>>>>> +            bool supports_64bit_indirects =
>>>>>>> +               !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>>>> +            /* Detect if this is as a result 64-bit MOV
>>>>>>> INDIRECT.
>>>>>>> In case
>>>>>>> +             * of BSW/BXT, we substitute each DF MOV INDIRECT by
>>>>>>> two 32-bit MOV
>>>>>>> +             * INDIRECT.
>>>>>>> +             */
>>>>>>> +            bool mov_indirect_64bit = (type_sz(inst-
>>>>>>>> src[i].type)
>>>>>>> == 8) ||
>>>>>>> +               (!supports_64bit_indirects && inst->dst.type ==
>>>>>>> BRW_REGISTER_TYPE_UD &&
>>>>>>> +                inst->src[0].type == BRW_REGISTER_TYPE_UD &&
>>>>>>> inst-
>>>>>>>> dst.stride == 2);
>>>>>>
>>>>>> This seems kind of fragile, I don't think the optimizer gives you
>>>>>> any
>>>>>> guarantees that the stride of a lowered 64-bit indirect move will
>>>>>> remain
>>>>>> equal to two, or that the destination stride of an actual 32-bit
>>>>>> indirect uniform load will never end up being two as well.  That
>>>>>> said,
>>>>>> because you access these with 32-bit indirect moves, I don't see
>>>>>> why
>>>>>> you'd need to treat them as 64-bit uniforms here, the usual
>>>>>> alignment
>>>>>> requirements for 64-bit uniforms no longer apply, so you can treat
>>>>>> them
>>>>>> as regular 32-bit uniforms AFAICT.  Why did you add this hunk?
>>>>>>
>>>>>
>>>>> I added it because of this case: if we access to one DF uniform array
>>>>> element with a normal MOV and the rest with MOV INDIRECT, we will
>>>>> mark
>>>>> the former as a live 64bit variable. Then we have the array scattered
>>>>> as part of it is uploaded as a 64-bits uniform and the other as 32-
>>>>> bits. Even if we avoid this by uploading everything together as 32-
>>>>> bits, then the access to that DF could not be aligned to 64-bits.
>>>>>
>>>>> So my idea was to find a way to identify somehow those MOV INDIRECT
>>>>> in
>>>>> BSW to mark all the array as a 64-bit one.
>>>>>
>>>>
>>>> Mmm, maybe I can fix this case without the hack I did. I can add the
>>>> following code after marking all live variables accessed by the
>>>> instructions.
>>>>
>>>> It is very similar to the one to detect live variables but it is fixing
>>>> the case where any MOV INDIRECT in BSW is accessing to an uniform array
>>>> of DF elements where one of these elements is directly accessed by
>>>> another instruction.
>>>>
>>>> What do you think?
>>>>
>>>
>>> Looks somewhat better, but I don't think this is correct if you have
>>> multiple overlapping indirect loads of the same uniform array and only
>>> one of them overlaps with a direct 64-bit load.
>>
>> In that case, I think this is correct. The 2 32-bit MOV INDIRECTs where
>> emitted as a "lowering" of an unsupported 64-bit MOV INDIRECT. They both
>> keep the 'read_size' of the original one, so they both overlap to any
>> other direct 64-bit load to that array like with the original
>> instruction. If none of them overlap to the direct 64-bit access, then I
>> think they can be handled as non-contiguous to the latter without any issue.
>>
> 
> What if you have two lowered indirect loads with overlapping but not
> identical range, and only the second one contains elements accessed with
> regular 64-bit moves?  Wouldn't your change mark part of the array
> 64-bit aligned and the rest 32-bit aligned?
> 

Right. What we can do is similar to the loop of optimizations: put this
code in a loop and if one of the iterations marked some area as 64-bit
one, we repeat the process until no more progress is done. Then we are
sure we mark all overlapping areas with the same block size.

>>>  Apparently
>>> assign_constant_locations() already has code to attempt to push
>>> indirectly accessed uniform sections contiguously, but the logic seems
>>> broken in multiple ways, even on platforms other than CHV/BXT... The
>>> first is_live_64bit location assignment loop doesn't consider non-64bit
>>> uniforms even if they're marked as contiguous with a 64-bit uniform,
>>> because the contiguous block tracking is done from within
>>> set_push_pull_constant_loc() which may not be called at all if the
>>> continue condition evaluates to true at the top of the loop.
>>
>> The idea for that block was to push all 64-bit uniforms first and
>> together so they are aligned to 64-bit, there are no gaps between them
>> and assuming, of course, they are not contiguous to non-64-bit data.
>>
>> They cannot be contiguous to non-64-bit data due to a declaration in a
>> GLSL shader because it is not allowed to have two uniforms with the same
>> location,
> 
> They definitely can right now due to mixed 32-bit and 64-bit access to
> uniform arrays on CHV/BXT and due to the last element of a 64-bit array
> being marked incorrectly as 32-bit on AFAICT any platform.  Also I don't
> see anything that would prevent the same from happening due to the
> compiler propagating uniform loads into 32-bit bitwise operations.
> 

What I was talking about is that this is not due to GLSL shader
declarations itself, but because of the way we are doing things inside
the driver due to HW limitations.

>> i.e. we cannot have two "views" of the same uniform variable
>> (for view, I mean read the same uniform as 32-bits type in one, and as
>> 64-bits type in other). See "4.4.3 Uniform Variable Layout Qualifiers"
>> section in GLSL 4.50 spec. And if we have two consecutive arrays, we can
>> safely upload them in different locations of the push constant buffer.
>>
>> So, this can only happen with instructions accessing the same area, with
>> a different bit-size and generated by the driver itself. I am going to
>> assume here than we are talking about a uniform variable defined with a
>> type of X-bit size and we access to it as Y-bit size, be X > Y. If the
>> vice-versa is possible (I am not aware of any case though), then the
>> solution would be more complex as we need to take into account alignments.
>>
>> Knowing that, we can do one of these solutions:
>>
>> 1) Fix case by case. This is what I have proposed in BSW.
>>    The idea is to see if any overlapped area is accessed with
>>    different bit-sizes and mark the whole area with the bigger bit-size
>>    and upload it in the proper place in the push constant buffer.
>>    With BSW, we know this is happening due to a lowered DF-MOV INDIRECT
>>    so we can safely mark the whole area as DF in case of mixed accesses.
> 
> This sounds like a partial and rather fragile solution, the compiler
> gives you no guarantees that the indirect or direct move instruction
> loading from the uniform file is going to look the same as when you
> inserted it into the probgram by the end of the optimization loop.
> 
>> 2) Generalize 1) to all bit-sizes and generations.
>>
> I don't think we need to generalize things to other bit sizes for the
> moment, but none of this problem is generation-dependent, so introducing
> generation-dependent hacks only complicates the uniform assignment code
> if you don't consider it crazy enough already...  Really, the uniform
> allocation code should only guarantee two things which are fully
> generation-independent in principle and we currently fail at:
> 
>  - Assign uniforms accessed contiguously to consecutive locations in the
>    uniform buffer (whether it's a pull or push constant buffer).
> 
>  - Align the starting address of any contiguously accessed block (which
>    may be the union of the ranges of multiple instructions) to the bit
>    size of the largest access in the block.
> 

Agreed.

>>> Also
>>> AFAICT this could potentially lead to a very large amount of data being
>>> considered as contiguous if you have a contiguous 64-bit block, then
>>> random unrelated 32-bit data, and then another contiguous 64-bit block,
>>> because the very last element of a MOV_INDIRECT block doesn't seem to be
>>> marked is_live_64bit, which is also the one that would terminate the
>>> 64-bit contiguous block and actually cause it to be assigned a location.
>>
>> Right. Marking the last element as is_live_64_bit is a trivial fix, I
>> will do it too.
>>
> 
> Thank you.
> 
>>> I don't think the current code would even guarantee proper alignment for
>>> subsequent 64-blocks in that case because it may be pushing unrelated
>>> 32-bit data in between.
>>> Worse, if the UNIFORM space ends with a 64-bit
>>> contiguous uniform block it may be considered contiguous with whatever
>>> unrelated data is handled first by the next 32-bit uniform assignment
>>> pass.
>>>
>>
>> I don't see how this can happen. 'last' is marking the end of the area
>> and its position is not marked as contiguous except with overlapped
>> accesses, where we know are done in purpose by the driver (e.g BSW's MOV
>> INDIRECT). The whole contiguous area is marked with the same bit-size,
>> after the aforementioned solutions.
>>
> It isn't necessarily the same bit-size right now (because of CHV/BXT,
> because of the is_live_64bit[last] bug and potentially other things),
> which can cause the location assignment loop to skip over some elements
> which may include the terminator element of a contiguous block, so we
> may continue growing the block including random non-64-bit-aligned data.
> 
>> What do you think about looping on contiguous[], look for contiguous
>> uniforms and assert if they all have the same bit-size, just before
>> setting pull/push constant locations? Then we can easily know if we are
>> wrong before uploading anything.
>>
> 
> I guess an assertion won't fix the problem when it actually happens. ;)
> 
>>> Sigh...  Seems like this was only scratching the surface...
>>>
>>
>> :-(
>>
>> As I might be missing something, I am open to suggestions/opinions...
>>
> 
> I think the contiguous block tracking logic needs to be reworked and
> pulled outside of set_push_pull_constant_loc() so the location
> assignment loop can figure out what the correct alignment is for the
> whole block.
> 

Right.

What do you think about a solution like: fix the is_live_64_bit bug with
last element; take out the contiguous block tracking logic from
set_push_pull_constant_loc(); have a loop that looks for contiguous
areas, mark them with the largest block size accessed in each area,
repeat the process until no more progress can be done.

I am going to develop this solution and see if I am missing something else.

Sam

>> Sam
>>
>>>> Sam
>>>>
>>>> +   /* Detect if we accessed to an array of DF uniforms in both direct
>>>> and
>>>> +    * indirect ways on BSW/BXT. If this is the case, then we need
>>>> +    * to take into account in order to push all the elements together.
>>>> +    */
>>>> +   if (devinfo->is_cherryview || devinfo->is_broxton) {
>>>> +      foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
>>>> +         for (int i = 0 ; i < inst->sources; i++) {
>>>> +            if (inst->src[i].file != UNIFORM)
>>>> +               continue;
>>>> +
>>>> +            int constant_nr = inst->src[i].nr + inst->src[i].offset /
>>>> 4;
>>>> +
>>>> +            if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0)
>>>> {
>>>> +               assert(inst->src[2].ud % 4 == 0);
>>>> +               unsigned last = constant_nr + (inst->src[2].ud / 4) -
>>>> 1;
>>>> +               assert(last < uniforms);
>>>> +
>>>> +               bool is_df_array = false;
>>>> +               for (unsigned j = constant_nr; j < last; j++) {
>>>> +                  if (is_live_64bit[j]) {
>>>> +                     is_df_array = true;
>>>> +                     break;
>>>> +                  }
>>>> +               }
>>>> +
>>>> +               if (is_df_array) {
>>>> +                  /* If part of it is accessed with direct addressing
>>>> as a DF
>>>> +                   * then we mark the whole array as DF.
>>>> +                   */
>>>> +                  for (unsigned j = constant_nr; j < last; j++)
>>>> +                     is_live_64bit[j] = true;
>>>> +               }
>>>> +            }
>>>> +         }
>>>> +      }
>>>> +   }
>>>> +
>>>>
>>>>>> Other than that patch looks good to me.
>>>>>>
>>>>>
>>>>> OK, thanks.
>>>>>
>>>>> Sam
>>>>>
>>>>>>>              for (unsigned j = constant_nr; j < last; j++) {
>>>>>>>                 is_live[j] = true;
>>>>>>>                 contiguous[j] = true;
>>>>>>> -               if (type_sz(inst->src[i].type) == 8) {
>>>>>>> +               if (mov_indirect_64bit) {
>>>>>>>                    is_live_64bit[j] = true;
>>>>>>>                 }
>>>>>>>              }
>>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>> b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>> index 991c20fad62..4aaf84964e9 100644
>>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>>>>>>> @@ -3878,31 +3878,30 @@ fs_visitor::nir_emit_intrinsic(const
>>>>>>> fs_builder &bld, nir_intrinsic_instr *instr
>>>>>>>           unsigned read_size = instr->const_index[1] -
>>>>>>>              (instr->num_components - 1) * type_sz(dest.type);
>>>>>>>  
>>>>>>> -         fs_reg indirect_chv_high_32bit;
>>>>>>> -         bool is_chv_bxt_64bit =
>>>>>>> -            (devinfo->is_cherryview || devinfo->is_broxton) &&
>>>>>>> -            type_sz(dest.type) == 8;
>>>>>>> -         if (is_chv_bxt_64bit) {
>>>>>>> -            indirect_chv_high_32bit =
>>>>>>> vgrf(glsl_type::uint_type);
>>>>>>> -            /* Calculate indirect address to read high 32 bits
>>>>>>> */
>>>>>>> -            bld.ADD(indirect_chv_high_32bit, indirect,
>>>>>>> brw_imm_ud(4));
>>>>>>> -         }
>>>>>>> +         bool supports_64bit_indirects =
>>>>>>> +            !devinfo->is_cherryview && !devinfo->is_broxton;
>>>>>>>  
>>>>>>> -         for (unsigned j = 0; j < instr->num_components; j++) {
>>>>>>> -            if (!is_chv_bxt_64bit) {
>>>>>>> +         if (type_sz(dest.type) != 8 ||
>>>>>>> supports_64bit_indirects)
>>>>>>> {
>>>>>>> +            for (unsigned j = 0; j < instr->num_components; j++)
>>>>>>> {
>>>>>>>                 bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>>                          offset(dest, bld, j), offset(src, bld,
>>>>>>> j),
>>>>>>>                          indirect, brw_imm_ud(read_size));
>>>>>>> -            } else {
>>>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>> -                        subscript(offset(dest, bld, j),
>>>>>>> BRW_REGISTER_TYPE_UD, 0),
>>>>>>> -                        offset(src, bld, j),
>>>>>>> -                        indirect, brw_imm_ud(read_size));
>>>>>>> -
>>>>>>> -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>> -                        subscript(offset(dest, bld, j),
>>>>>>> BRW_REGISTER_TYPE_UD, 1),
>>>>>>> -                        offset(src, bld, j),
>>>>>>> -                        indirect_chv_high_32bit,
>>>>>>> brw_imm_ud(read_size));
>>>>>>> +            }
>>>>>>> +         } else {
>>>>>>> +            const unsigned num_mov_indirects =
>>>>>>> +               type_sz(dest.type) /
>>>>>>> type_sz(BRW_REGISTER_TYPE_UD);
>>>>>>> +            /* We read a little bit less per MOV INDIRECT, as
>>>>>>> they
>>>>>>> are now
>>>>>>> +             * 32-bits ones instead of 64-bit. Fix read_size
>>>>>>> then.
>>>>>>> +             */
>>>>>>> +            const unsigned read_size_32bit = read_size -
>>>>>>> +                (num_mov_indirects - 1) *
>>>>>>> type_sz(BRW_REGISTER_TYPE_UD);
>>>>>>> +            for (unsigned j = 0; j < instr->num_components; j++)
>>>>>>> {
>>>>>>> +               for (unsigned i = 0; i < num_mov_indirects; i++)
>>>>>>> {
>>>>>>> +                  bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>>>>>>> +                           subscript(offset(dest, bld, j),
>>>>>>> BRW_REGISTER_TYPE_UD, i),
>>>>>>> +                           subscript(offset(src, bld, j),
>>>>>>> BRW_REGISTER_TYPE_UD, i),
>>>>>>> +                           indirect,
>>>>>>> brw_imm_ud(read_size_32bit));
>>>>>>> +               }
>>>>>>>              }
>>>>>>>           }
>>>>>>>        }
>>>>>>> -- 
>>>>>>> 2.11.0
>>>>>
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>
>> _______________________________________________
>> mesa-stable mailing list
>> mesa-stable at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-stable

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 862 bytes
Desc: OpenPGP digital signature
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170222/8b480a5f/attachment-0001.sig>