[Mesa-dev] [PATCH v4 23/44] i965/fs: Enables 16-bit load_ubo with sampler

Chema Casanova jmcasanova at igalia.com
Tue Dec 5 23:46:48 UTC 2017



On 05/12/17 22:25, Chema Casanova wrote:
> On 05/12/17 19:53, Jason Ekstrand wrote:
>> On Tue, Dec 5, 2017 at 9:08 AM, Chema Casanova <jmcasanova at igalia.com> wrote:
>>
>>     On 30/11/17 at 23:58, Jason Ekstrand wrote:
>>     > On Wed, Nov 29, 2017 at 6:50 PM, Jose Maria Casanova Crespo
>>     > <jmcasanova at igalia.com> wrote:
>>     >
>>     >     load_ubo is using 32-bit loads as uniforms surfaces have a 32-bit
>>     >     surface format defined. So when reading 16-bit components with the
>>     >     sampler we need to unshuffle two 16-bit components from each
>>     32-bit
>>     >     component.
>>     >
>>     >     Using the sampler avoids the use of the byte_scattered_read
>>     message
>>     >     that needs one message for each component and is supposed to be
>>     >     slower.
>>     >
>>     >     In the case of SKL+ we take advantage of a hardware feature that
>>     >     automatically defines a channel mask based on the rlen value,
>>     so on
>>     >     SKL+ we only use half of the registers without using a header
>>     in the
>>     >     payload.
>>     >     ---
>>     >      src/intel/compiler/brw_fs.cpp           | 31
>>     >     +++++++++++++++++++++++++++----
>>     >      src/intel/compiler/brw_fs_generator.cpp | 10 ++++++++--
>>     >      2 files changed, 35 insertions(+), 6 deletions(-)
>>     >
>>     >     diff --git a/src/intel/compiler/brw_fs.cpp
>>     >     b/src/intel/compiler/brw_fs.cpp
>>     >     index 1ca4d416b2..9c543496ba 100644
>>     >     --- a/src/intel/compiler/brw_fs.cpp
>>     >     +++ b/src/intel/compiler/brw_fs.cpp
>>     >     @@ -184,9 +184,17 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
>>     >     fs_builder &bld,
>>     >          * a double this means we are only loading 2 elements worth of
>>     >     data.
>>     >          * We also want to use a 32-bit data type for the dst of the
>>     >     load operation
>>     >          * so other parts of the driver don't get confused about the
>>     >     size of the
>>     >     -    * result.
>>     >     +    * result. In the case of 16-bit data we only need half of the
>>     >     32-bit
>>     >     +    * components on SKL+ as we take advantage of using message
>>     >     return size to
>>     >     +    * define an xy channel mask.
>>     >          */
>>     >     -   fs_reg vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
>>     >     +   fs_reg vec4_result;
>>     >     +   if (type_sz(dst.type) == 2 && (devinfo->gen >= 9)) {
>>     >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 2);
>>     >     +      vec4_result = retype(vec4_result, BRW_REGISTER_TYPE_HF);
>>     >     +   } else {
>>     >     +      vec4_result = bld.vgrf(BRW_REGISTER_TYPE_F, 4);
>>     >     +   }
>>     >
>>     >         fs_inst *inst =
>>     >     bld.emit(FS_OPCODE_VARYING_PULL_CONSTANT_LOAD_LOGICAL,
>>     >                                  vec4_result, surf_index,
>>     vec4_offset);
>>     >         inst->size_written = 4 *
>>     >     vec4_result.component_size(inst->exec_size);
>>     >     @@ -197,8 +205,23 @@ fs_visitor::VARYING_PULL_CONSTANT_LOAD(const
>>     >     fs_builder &bld,
>>     >         }
>>     >
>>     >         vec4_result.type = dst.type;
>>     >     -   bld.MOV(dst, offset(vec4_result, bld,
>>     >     -                       (const_offset & 0xf) /
>>     >     type_sz(vec4_result.type)));
>>     >     +
>>     >     +   if (type_sz(dst.type) == 2) {
>>     >     +      /* 16-bit types need to be unshuffled as each pair of
>>     >     16-bit components
>>     >     +       * is packed in a 32-bit component because we are using a
>>     >     32-bit format
>>     >     +       * in the uniform surface that is read by the sampler.
>>     >     +       * TODO: On BDW+ mark when a uniform has a 16-bit type so we
>>     >     could setup a
>>     >     +       * surface format of 16-bit and use the 16-bit return
>>     >     format at the
>>     >     +       * sampler.
>>     >     +       */
>>     >     +      vec4_result.stride = 2;
>>     >     +      bld.MOV(dst, byte_offset(offset(vec4_result, bld,
>>     >     +                                      (const_offset & 0x7) / 4),
>>     >     +                               (const_offset & 0x7) / 2 % 2 *
>>     2));
>>     >     +   } else {
>>     >     +      bld.MOV(dst, offset(vec4_result, bld,
>>     >     +                          (const_offset & 0xf) /
>>     >     type_sz(vec4_result.type)));
>>     >     +   }
>>     >
>>     >
>>     > This seems overly complicated.  How about something like
>>
>>     > fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);
>>     > switch (type_sz(dst.type)) {
>>     > case 2:
>>     >    shuffle_32bit_load_result_to_16bit_data(bld, dst, dw, 1);
>>     >    bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));
>>     >    break;
>>     > case 4:
>>     >    bld.MOV(dst, dw);

Finally I needed to change this line to:

bld.MOV(dst, retype(dw, dst.type));

As dst isn't guaranteed to be F, we need the retype so the MOV doesn't
apply format conversions, as caught by some piglit tests. I'm sending the
final patch.
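
To make the change concrete, the final hunk should end up looking roughly
like this. This is a sketch on top of Jason's snippet above rather than the
literal patch I'm sending: it keeps the subscript variant for the 16-bit
case, assumes vec4_result is still typed as F at this point, and gives
unreachable() a message string as the Mesa macro expects.

   fs_reg dw = offset(vec4_result, bld, (const_offset & 0xf) / 4);
   switch (type_sz(dst.type)) {
   case 2:
      /* Each 32-bit dword returned by the sampler packs two 16-bit
       * components, so select the right half of the dword.
       */
      bld.MOV(dst, subscript(dw, dst.type, (const_offset / 2) & 1));
      break;
   case 4:
      /* dst isn't guaranteed to be F, so retype to avoid an implicit
       * format conversion in the MOV.
       */
      bld.MOV(dst, retype(dw, dst.type));
      break;
   case 8:
      shuffle_32bit_load_result_to_64bit_data(bld, dst, dw, 1);
      break;
   default:
      unreachable("Unsupported type size");
   }

Here (const_offset & 0xf) / 4 selects the dword within the vec4 returned by
the sampler, and (const_offset / 2) & 1 selects which 16-bit half of that
dword holds the component.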


>>     >    break;
>>     > case 8:
>>     >    shuffle_32bit_load_result_to_64bit_data(bld, dst, dw, 1);
>>     >    break;
>>     > default:
>>     >    unreachable();
>>     > }
>>
>>     This implementation is much clearer. Tested, and it works perfectly
>>     fine.
>>
>>     >  
>>     >
>>     >      }
>>     >
>>     >      /**
>>     >     diff --git a/src/intel/compiler/brw_fs_generator.cpp
>>     >     b/src/intel/compiler/brw_fs_generator.cpp
>>     >     index a3861cd68e..00a4e29147 100644
>>     >     --- a/src/intel/compiler/brw_fs_generator.cpp
>>     >     +++ b/src/intel/compiler/brw_fs_generator.cpp
>>     >     @@ -1381,12 +1381,18 @@
>>     >     fs_generator::generate_varying_pull_constant_load_gen7(fs_inst *inst,
>>     >         uint32_t simd_mode, rlen, mlen;
>>     >         if (inst->exec_size == 16) {
>>     >            mlen = 2;
>>     >     -      rlen = 8;
>>     >     +      if (type_sz(dst.type) == 2 && (devinfo->gen >= 9))
>>     >     +         rlen = 4;
>>     >     +      else
>>     >     +         rlen = 8;
>>     >
>>     >
>>     > I'm not sure what I think of this.  We intentionally use a vec4 today
>>     > instead of a single float because it lets us potentially batch up a
>>     > bunch of loads in a single texture operation.  Are we sure that we
>>     > never need more than 2 of those result registers?
>>
>>     I can drop this supposed "optimization" in this hunk and also in the
>>     following lines. Maybe we could use the unread 3rd and 4th 32-bit
>>     components to read, for example, a consecutive pair of 16-bit vec4s in a
>>     matrix context.
>>
>>     Removing these last hunks and using your previous code proposal, can I
>>     count on the R-b? In the end the patch would just be reduced to your
>>     code ...
>>
>>
>> Yup, fine with me.  Or, if it just reduces to what I wrote above, you
>> can re-author the patch and review it yourself.  I don't care which.
> 
> So let it be my first review in mesa-dev :)
> 
> Chema
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> 

