[Mesa-dev] [PATCH v4 19/44] i965/fs: Use byte_scattered_write on 16-bit store_ssbo

Fri Dec 1 10:46:08 UTC 2017

On 01/12/17 11:12, Jason Ekstrand wrote:
> I've left some comments below that I think clean things up and make this
> better, but I believe it is correct as-is.
> 
> Reviewed-by: Jason Ekstrand <jason at jlekstrand.net
> <mailto:jason at jlekstrand.net>>
> 
> On Wed, Nov 29, 2017 at 6:42 PM, Jose Maria Casanova Crespo
> <jmcasanova at igalia.com <mailto:jmcasanova at igalia.com>> wrote:
> 
>     From: Alejandro Piñeiro <apinheiro at igalia.com
>     <mailto:apinheiro at igalia.com>>
> 
>     We need to rely on byte scattered writes as untyped writes are 32-bit
>     size. We could try to keep using 32-bit messages when we have two or
>     four 16-bit elements, but for simplicity sake, we use the same message
>     for any component number. We revisit this aproach in the follwing
>     patches.
> 
>     v2: Removed use of stride = 2 on 16-bit sources (Jason Ekstrand)
> 
>     v3: (Jason Ekstrand)
>         - Include bit_size to scattered write message and remove namespace
>         - specific for scattered messages.
>         - Move comment to proper place.
>         - Squashed with i965/fs: Adjust type_size/type_slots on store_ssbo.
>         (Jose Maria Casanova)
>         - Take into account that get_nir_src returns now WORD types for
>           16-bit sources instead of DWORD.
> 
>     Signed-off-by: Jose Maria Casanova Crespo <jmcasanova at igalia.com
>     <mailto:jmcasanova at igalia.com>>
>     Signed-off-by: Alejandro Piñeiro <apinheiro at igalia.com
>     <mailto:apinheiro at igalia.com>>
>     ---
>      src/intel/compiler/brw_fs_nir.cpp | 51
>     ++++++++++++++++++++++++++++-----------
>      1 file changed, 37 insertions(+), 14 deletions(-)
> 
>     diff --git a/src/intel/compiler/brw_fs_nir.cpp
>     b/src/intel/compiler/brw_fs_nir.cpp
>     index d6ab286147..ff04e2468b 100644
>     --- a/src/intel/compiler/brw_fs_nir.cpp
>     +++ b/src/intel/compiler/brw_fs_nir.cpp
>     @@ -4075,14 +4075,15 @@ fs_visitor::nir_emit_intrinsic(const
>     fs_builder &bld, nir_intrinsic_instr *instr
>             * Also, we have to suffle 64-bit data to be in the
>     appropriate layout
>             * expected by our 32-bit write messages.
>             */
>     -      unsigned type_size = 4;
>     -      if (nir_src_bit_size(instr->src[0]) == 64) {
>     -         type_size = 8;
>     +      unsigned bit_size = nir_src_bit_size(instr->src[0]);
>     +      unsigned type_size = bit_size / 8;
>     +      if (bit_size == 64) {
>               val_reg = shuffle_64bit_data_for_32bit_write(bld,
>                  val_reg, instr->num_components);
>            }
> 
>     -      unsigned type_slots = type_size / 4;
>     +      /* 16-bit types would use a minimum of 1 slot */
>     +      unsigned type_slots = MAX2(type_size / 4, 1);
> 
> 
> Given that this is only used for emit_typed_write, maybe we should just
> move it next to the emit_typed_write call and just get rid of the
> MAX2().  More on that later.

It makes sanes, i follow partially this approach at "[PATCH v4 26/44]
i965/fs: Optimize 16-bit SSBO stores by packing two into a 32-bit reg"
using an slots_per_component that is just 2 for 64-bits and 1 for the
other bitsizes. But i like your approach.

>            /* Combine groups of consecutive enabled channels in one write
>             * message. We use ffs to find the first enabled channel and
>     then ffs on
>     @@ -4093,12 +4094,19 @@ fs_visitor::nir_emit_intrinsic(const
>     fs_builder &bld, nir_intrinsic_instr *instr
>               unsigned first_component = ffs(writemask) - 1;
>               unsigned length = ffs(~(writemask >> first_component)) - 1;
> 
> 
> If the one above is first_component, num_components would be a better
> name for this one.  It's very confusing go have something generically
> named "length" in a piece of code with so many different possible units.

It was also confussing to me. What about a rename to
num_consecutive_components as that what is really calculating? so we
don't confuse it with the num_components of instr.

>     -         /* We can't write more than 2 64-bit components at once.
>     Limit the
>     -          * length of the write to what we can do and let the next
>     iteration
>     -          * handle the rest
>     -          */
>     -         if (type_size > 4)
>     +         if (type_size > 4) {
>     +            /* We can't write more than 2 64-bit components at
>     once. Limit
>     +             * the length of the write to what we can do and let
>     the next
>     +             * iteration handle the rest.
>     +             */
>                  length = MIN2(2, length);
>     +         } else if (type_size == 2) {
> 
> 
> Maybe type_size < 4?

I should have advanced this change to this patch, you commented it
already for current [PATCH v4 26/44]

>     +            /* For 16-bit types we are using byte scattered writes,
>     that can
>     +             * only write one component per call. So we limit the
>     length, and
>     +             * let the write happening in several iterations.
>     +             */
>     +            length = 1;
>     +         }
> 
>               fs_reg offset_reg;
>               nir_const_value *const_offset =
>     nir_src_as_const_value(instr->src[2]);
>     @@ -4112,11 +4120,26 @@ fs_visitor::nir_emit_intrinsic(const
>     fs_builder &bld, nir_intrinsic_instr *instr
>                          brw_imm_ud(type_size * first_component));
>               }
> 
>     -
>     -         emit_untyped_write(bld, surf_index, offset_reg,
>     -                            offset(val_reg, bld, first_component *
>     type_slots),
>     -                            1 /* dims */, length * type_slots,
>     -                            BRW_PREDICATE_NONE);
>     +         if (type_size == 2) {
> 
> 
> maybe type_size < 4?

Agree

>     +            /* Untyped Surface messages have a fixed 32-bit size,
>     so we need
>     +             * to rely on byte scattered in order to write 16-bit
>     elements.
>     +             * The byte_scattered_write message needs that every
>     written 16-bit
>     +             * type to be aligned 32-bits (stride=2).
>     +             */
>     +            fs_reg tmp = bld.vgrf(BRW_REGISTER_TYPE_D);
>     +            bld.MOV(subscript(tmp, BRW_REGISTER_TYPE_W, 0),
>     +                     offset(val_reg, bld, first_component));
>     +            emit_byte_scattered_write(bld, surf_index, offset_reg,
>     +                                      tmp,
>     +                                      1 /* dims */, 1,
>     +                                      bit_size,
>     +                                      BRW_PREDICATE_NONE);
>     +         } else {
> 
> 
> If we moved type_slots here, I think we could very nicely future-proof
> things as follows:

I think that we can not move it here because it is really used when
calculating the offset for the components of 64-bits,

> assert(num_components * type_size < 16);

Shouldn't it be <=

> assert((num_components * type_size) % 4 == 0);
> assert((first_component * type_size) % 4 == 0);
> unsigned first_slot = (first_component * type_size) / 4;
> unsigned num_slots = (num_components * type_size) / 4;
> emit_untyped_write(bld, surf_index, reg_offset,
>                    offset(val_reg, bld, first_slot),
>                    1 /* dims */, num_slots,
>                    BRW_PREDICATE_NONE);

> That said, let's not get ahead of ourselves.  This can all be done as a
> later clean-up on top of the optimization patch if that's easier. :)

We were thinking this with the following optimization in the past. But
at that moment we thought it was easier to understand doing it in two
steps. I'll wait for you to review [PATCH v4 26/44] and try to advance
as much as I can to this first one to avoid re-writing lines.

>     +            emit_untyped_write(bld, surf_index, offset_reg,
>     +                               offset(val_reg, bld, first_component
>     * type_slots),
>     +                               1 /* dims */, length * type_slots,
>     +                               BRW_PREDICATE_NONE);
>     +         }
> 
>               /* Clear the bits in the writemask that we just wrote,
>     then try
>                * again to see if more channels are left.
>     --
>     2.14.3
> 
>     _______________________________________________
>     mesa-dev mailing list
>     mesa-dev at lists.freedesktop.org <mailto:mesa-dev at lists.freedesktop.org>
>     https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>     <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
> 
>