[Mesa-dev] [PATCH v2 1/3] i965/fs: fix indirect load DF uniforms on BSW/BXT

Mon Feb 20 20:31:18 UTC 2017

Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:

> On Mon, 2017-02-20 at 08:58 +0100, Samuel Iglesias Gonsálvez wrote:
>> On Sat, 2017-02-18 at 18:58 -0800, Francisco Jerez wrote:
>> > Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
>> > 
>> > > The lowered BSW/BXT indirect move instructions had incorrect
>> > > source types, which luckily wasn't causing incorrect assembly to
>> > > be
>> > > generated due to the bug fixed in the next patch, but would have
>> > > confused the remaining back-end IR infrastructure due to the
>> > > mismatch
>> > > between the IR source types and the emitted machine code.
>> > > 
>> > > v2:
>> > > - Improve commit log (Curro)
>> > > - Fix read_size (Curro)
>> > > - Fix DF uniform array detection in assign_constant_locations()
>> > > when
>> > >   it is acceded with 32-bit MOV_INDIRECTs in BSW/BXT.
>> > > 
>> > > Signed-off-by: Samuel Iglesias Gonsálvez <siglesias at igalia.com>
>> > > Cc: "17.0" <mesa-stable at lists.freedesktop.org>
>> > > ---
>> > >  src/mesa/drivers/dri/i965/brw_fs.cpp     | 11 ++++++++-
>> > >  src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 41 ++++++++++++++++--
>> > > --------------
>> > >  2 files changed, 30 insertions(+), 22 deletions(-)
>> > > 
>> > > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > > b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > > index c348bc7138d..93ab84b5845 100644
>> > > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
>> > > @@ -1944,10 +1944,19 @@ fs_visitor::assign_constant_locations()
>> > >              unsigned last = constant_nr + (inst->src[2].ud / 4)
>> > > -
>> > > 1;
>> > >              assert(last < uniforms);
>> > >  
>> > > +            bool supports_64bit_indirects =
>> > > +               !devinfo->is_cherryview && !devinfo->is_broxton;
>> > > +            /* Detect if this is as a result 64-bit MOV
>> > > INDIRECT.
>> > > In case
>> > > +             * of BSW/BXT, we substitute each DF MOV INDIRECT by
>> > > two 32-bit MOV
>> > > +             * INDIRECT.
>> > > +             */
>> > > +            bool mov_indirect_64bit = (type_sz(inst-
>> > > >src[i].type)
>> > > == 8) ||
>> > > +               (!supports_64bit_indirects && inst->dst.type ==
>> > > BRW_REGISTER_TYPE_UD &&
>> > > +                inst->src[0].type == BRW_REGISTER_TYPE_UD &&
>> > > inst-
>> > > > dst.stride == 2);
>> > 
>> > This seems kind of fragile, I don't think the optimizer gives you
>> > any
>> > guarantees that the stride of a lowered 64-bit indirect move will
>> > remain
>> > equal to two, or that the destination stride of an actual 32-bit
>> > indirect uniform load will never end up being two as well.  That
>> > said,
>> > because you access these with 32-bit indirect moves, I don't see
>> > why
>> > you'd need to treat them as 64-bit uniforms here, the usual
>> > alignment
>> > requirements for 64-bit uniforms no longer apply, so you can treat
>> > them
>> > as regular 32-bit uniforms AFAICT.  Why did you add this hunk?
>> > 
>> 
>> I added it because of this case: if we access to one DF uniform array
>> element with a normal MOV and the rest with MOV INDIRECT, we will
>> mark
>> the former as a live 64bit variable. Then we have the array scattered
>> as part of it is uploaded as a 64-bits uniform and the other as 32-
>> bits. Even if we avoid this by uploading everything together as 32-
>> bits, then the access to that DF could not be aligned to 64-bits.
>> 
>> So my idea was to find a way to identify somehow those MOV INDIRECT
>> in
>> BSW to mark all the array as a 64-bit one.
>> 
>
> Mmm, maybe I can fix this case without the hack I did. I can add the
> following code after marking all live variables accessed by the
> instructions.
>
> It is very similar to the one to detect live variables but it is fixing
> the case where any MOV INDIRECT in BSW is accessing to an uniform array
> of DF elements where one of these elements is directly accessed by
> another instruction.
>
> What do you think?
>

Looks somewhat better, but I don't think this is correct if you have
multiple overlapping indirect loads of the same uniform array and only
one of them overlaps with a direct 64-bit load.  Apparently
assign_constant_locations() already has code to attempt to push
indirectly accessed uniform sections contiguously, but the logic seems
broken in multiple ways, even on platforms other than CHV/BXT...  The
first is_live_64bit location assignment loop doesn't consider non-64bit
uniforms even if they're marked as contiguous with a 64-bit uniform,
because the contiguous block tracking is done from within
set_push_pull_constant_loc() which may not be called at all if the
continue condition evaluates to true at the top of the loop.  Also
AFAICT this could potentially lead to a very large amount of data being
considered as contiguous if you have a contiguous 64-bit block, then
random unrelated 32-bit data, and then another contiguous 64-bit block,
because the very last element of a MOV_INDIRECT block doesn't seem to be
marked is_live_64bit, which is also the one that would terminate the
64-bit contiguous block and actually cause it to be assigned a location.
I don't think the current code would even guarantee proper alignment for
subsequent 64-blocks in that case because it may be pushing unrelated
32-bit data in between.  Worse, if the UNIFORM space ends with a 64-bit
contiguous uniform block it may be considered contiguous with whatever
unrelated data is handled first by the next 32-bit uniform assignment
pass.

Sigh...  Seems like this was only scratching the surface...

> Sam
>
> +   /* Detect if we accessed to an array of DF uniforms in both direct
> and
> +    * indirect ways on BSW/BXT. If this is the case, then we need
> +    * to take into account in order to push all the elements together.
> +    */
> +   if (devinfo->is_cherryview || devinfo->is_broxton) {
> +      foreach_block_and_inst_safe(block, fs_inst, inst, cfg) {
> +         for (int i = 0 ; i < inst->sources; i++) {
> +            if (inst->src[i].file != UNIFORM)
> +               continue;
> +
> +            int constant_nr = inst->src[i].nr + inst->src[i].offset /
> 4;
> +
> +            if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0)
> {
> +               assert(inst->src[2].ud % 4 == 0);
> +               unsigned last = constant_nr + (inst->src[2].ud / 4) -
> 1;
> +               assert(last < uniforms);
> +
> +               bool is_df_array = false;
> +               for (unsigned j = constant_nr; j < last; j++) {
> +                  if (is_live_64bit[j]) {
> +                     is_df_array = true;
> +                     break;
> +                  }
> +               }
> +
> +               if (is_df_array) {
> +                  /* If part of it is accessed with direct addressing
> as a DF
> +                   * then we mark the whole array as DF.
> +                   */
> +                  for (unsigned j = constant_nr; j < last; j++)
> +                     is_live_64bit[j] = true;
> +               }
> +            }
> +         }
> +      }
> +   }
> +
>
>> > Other than that patch looks good to me.
>> > 
>> 
>> OK, thanks.
>> 
>> Sam
>> 
>> > >              for (unsigned j = constant_nr; j < last; j++) {
>> > >                 is_live[j] = true;
>> > >                 contiguous[j] = true;
>> > > -               if (type_sz(inst->src[i].type) == 8) {
>> > > +               if (mov_indirect_64bit) {
>> > >                    is_live_64bit[j] = true;
>> > >                 }
>> > >              }
>> > > diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> > > b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> > > index 991c20fad62..4aaf84964e9 100644
>> > > --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> > > +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp
>> > > @@ -3878,31 +3878,30 @@ fs_visitor::nir_emit_intrinsic(const
>> > > fs_builder &bld, nir_intrinsic_instr *instr
>> > >           unsigned read_size = instr->const_index[1] -
>> > >              (instr->num_components - 1) * type_sz(dest.type);
>> > >  
>> > > -         fs_reg indirect_chv_high_32bit;
>> > > -         bool is_chv_bxt_64bit =
>> > > -            (devinfo->is_cherryview || devinfo->is_broxton) &&
>> > > -            type_sz(dest.type) == 8;
>> > > -         if (is_chv_bxt_64bit) {
>> > > -            indirect_chv_high_32bit =
>> > > vgrf(glsl_type::uint_type);
>> > > -            /* Calculate indirect address to read high 32 bits
>> > > */
>> > > -            bld.ADD(indirect_chv_high_32bit, indirect,
>> > > brw_imm_ud(4));
>> > > -         }
>> > > +         bool supports_64bit_indirects =
>> > > +            !devinfo->is_cherryview && !devinfo->is_broxton;
>> > >  
>> > > -         for (unsigned j = 0; j < instr->num_components; j++) {
>> > > -            if (!is_chv_bxt_64bit) {
>> > > +         if (type_sz(dest.type) != 8 ||
>> > > supports_64bit_indirects)
>> > > {
>> > > +            for (unsigned j = 0; j < instr->num_components; j++)
>> > > {
>> > >                 bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>> > >                          offset(dest, bld, j), offset(src, bld,
>> > > j),
>> > >                          indirect, brw_imm_ud(read_size));
>> > > -            } else {
>> > > -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>> > > -                        subscript(offset(dest, bld, j),
>> > > BRW_REGISTER_TYPE_UD, 0),
>> > > -                        offset(src, bld, j),
>> > > -                        indirect, brw_imm_ud(read_size));
>> > > -
>> > > -               bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>> > > -                        subscript(offset(dest, bld, j),
>> > > BRW_REGISTER_TYPE_UD, 1),
>> > > -                        offset(src, bld, j),
>> > > -                        indirect_chv_high_32bit,
>> > > brw_imm_ud(read_size));
>> > > +            }
>> > > +         } else {
>> > > +            const unsigned num_mov_indirects =
>> > > +               type_sz(dest.type) /
>> > > type_sz(BRW_REGISTER_TYPE_UD);
>> > > +            /* We read a little bit less per MOV INDIRECT, as
>> > > they
>> > > are now
>> > > +             * 32-bits ones instead of 64-bit. Fix read_size
>> > > then.
>> > > +             */
>> > > +            const unsigned read_size_32bit = read_size -
>> > > +                (num_mov_indirects - 1) *
>> > > type_sz(BRW_REGISTER_TYPE_UD);
>> > > +            for (unsigned j = 0; j < instr->num_components; j++)
>> > > {
>> > > +               for (unsigned i = 0; i < num_mov_indirects; i++)
>> > > {
>> > > +                  bld.emit(SHADER_OPCODE_MOV_INDIRECT,
>> > > +                           subscript(offset(dest, bld, j),
>> > > BRW_REGISTER_TYPE_UD, i),
>> > > +                           subscript(offset(src, bld, j),
>> > > BRW_REGISTER_TYPE_UD, i),
>> > > +                           indirect,
>> > > brw_imm_ud(read_size_32bit));
>> > > +               }
>> > >              }
>> > >           }
>> > >        }
>> > > -- 
>> > > 2.11.0
>> 
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20170220/fc10d1fd/attachment-0001.sig>