[Mesa-dev] [PATCH 14/47] i965/fs: Set stride 2 when dealing with 16-bit floats/ints
Jason Ekstrand
jason at jlekstrand.net
Thu Sep 7 23:33:55 UTC 2017
On Thu, Aug 24, 2017 at 6:54 AM, Alejandro Piñeiro <apinheiro at igalia.com>
wrote:
> From: Jose Maria Casanova Crespo <jmcasanova at igalia.com>
>
> From Skylake PRM, Volume 07 3D Media GPGPU, Section Register Region
> Restrictions, SubSection 5. Special Cases for Word Operations (page
> 781):
>
> "There is a relaxed alignment rule for word destinations. When the
> destination type is word (UW, W, HF), destination data types can
> be aligned to either the lowest word or the second lowest word of
> the execution channel. This means the destination data words can
> be either all in the even word locations or all in the odd word
> locations.
>
> // Example:
> add (8) r10.0<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
> add (8) r10.1<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf
> // Note: The destination offset may be .0 or .1 although the destination
> // subregister is required to be aligned to execution datatype."
>
> In practice, that means that for alu operations like float to half
> float conversions, we need to align the 16-bit to 32-bit (so setting
> the stride to 2).
>
I believe your application of this rule is a bit imprecise. My
understanding of the hardware is that the rule stated above is a
modification to the following rule from the previous page:
> When the Execution Data Type is wider than the destination data type, the
> destination must be aligned as required by the wider execution data type
> and specify a HorzStride equal to the ratio in sizes of the two data types.
> For example, a mov with a D source and B destination must use a 4-byte
> aligned destination and a Dst.HorzStride of 4.
In other words, the rule you stated above only applies in the case of ALU
operations with a source type wider than the destination. If all operands
to an ALU operation are word types, the destination can be tightly packed.
We have a decent bit of code in the driver today which does ALU operations
that write to word destinations with a stride of 1; all of it is done with
W or DW and not HF, but I believe the same principle applies to HF.
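The distinction can be restated as a small helper (purely illustrative; the
name `required_dst_stride` and this standalone form are mine, not driver code):

```cpp
/* Hypothetical helper restating the PRM rule: when the execution data
 * type is wider than the destination data type, the destination must
 * use a HorzStride equal to the ratio of the two sizes; otherwise a
 * tightly packed destination (stride 1) is legal, even for word types. */
static unsigned
required_dst_stride(unsigned exec_type_sz, unsigned dst_type_sz)
{
   if (exec_type_sz <= dst_type_sz)
      return 1; /* e.g. an HF op with all-HF operands: packed is fine */
   return exec_type_sz / dst_type_sz; /* e.g. F -> HF mov: stride 2 */
}
```

So a 4-byte execution type writing a 2-byte destination needs stride 2, while
an all-word operation (2-byte exec, 2-byte dest) can stay at stride 1.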
In particular, this means that basically the only time we care about this
rule is for nir_op_f2f16 and f2f32 (with a 64-bit source). The rest of the
ALU operations should be fine as-is. (For nir_op_f2f32 with a 16-bit
source, the destination is wider than the source so the restriction does
not apply.) I'd prefer we let 16-bit values be tightly packed by default
and implement f32->f16 conversions as
mov (8) r10.0<2>:hf r11.0<8;8,1>:f
mov (8) r12.0<1>:hf r10.0<16;8,2>:hf
The optimizer can probably sort things out for us and get rid of the second
mov.
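As a plain-C++ sketch of the data movement involved (hypothetical, modeling a
register as an array of words just to illustrate the regions): the first mov
scatters the halves into the even word slots of r10, and the second reads
every other word (the <16;8,2> region) back into a packed destination.

```cpp
#include <array>
#include <cstdint>

/* Sketch of the two-mov sequence above. 'src' stands for the 8
 * already-converted HF results; r10 models a 16-word register. */
std::array<uint16_t, 8>
convert_then_pack(const std::array<uint16_t, 8> &src)
{
   std::array<uint16_t, 16> r10 = {}; /* strided temporary */
   /* mov (8) r10.0<2>:hf ... : dst stride 2 -> even word slots */
   for (unsigned i = 0; i < 8; i++)
      r10[i * 2] = src[i];

   std::array<uint16_t, 8> r12 = {}; /* packed destination */
   /* mov (8) r12.0<1>:hf r10.0<16;8,2>:hf : read every other word */
   for (unsigned i = 0; i < 8; i++)
      r12[i] = r10[i * 2];
   return r12;
}
```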
--Jason
> As a first approach, we are setting the stride to 2 in general, even
> if it is not an alu operation, as it simplifies handling of 16-bit data
> types. We can revisit this in the future.
>
> Signed-off-by: Jose Maria Casanova Crespo <jmcasanova at igalia.com>
> Signed-off-by: Alejandro Piñeiro <apinheiro at igalia.com>
> ---
> src/intel/compiler/brw_fs_nir.cpp | 35 ++++++++++++++++++++++++++++++++++-
> 1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/src/intel/compiler/brw_fs_nir.cpp b/src/intel/compiler/brw_fs_nir.cpp
> index e4eba1401f8..c469baf42a1 100644
> --- a/src/intel/compiler/brw_fs_nir.cpp
> +++ b/src/intel/compiler/brw_fs_nir.cpp
> @@ -585,6 +585,21 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)
> result.type = brw_type_for_nir_type(devinfo,
> (nir_alu_type)(nir_op_infos[instr->op].output_type |
> nir_dest_bit_size(instr->dest.dest)));
> + /* According to PRM, Volume 07 3D Media GPGPU, Section Register Region
> + * Restrictions, SubSection 5. Special Cases for Word Operations,
> + * GEN >= 8 can only write 16-bit (UW, HF, W) registers on the lower
> + * word or on the higher word of each double word. The offset needs to
> + * be the same for all 16-bit elements of the register. As a
> + * consequence, all destinations of 16-bit registers should use a
> + * stride equal to 2.
> + *
> + * We also need 32-bit alignment for conversion operations from
> + * 32-bit to 16-bit, which are the only alu operations supported
> + * right now.
> + */
> +
> + if (nir_dest_bit_size(instr->dest.dest) == 16)
> + result.stride = 2;
>
> fs_reg op[4];
> for (unsigned i = 0; i < nir_op_infos[instr->op].num_inputs; i++) {
> @@ -594,6 +609,9 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)
> nir_src_bit_size(instr->src[i].src)));
> op[i].abs = instr->src[i].abs;
> op[i].negate = instr->src[i].negate;
> + /* The previous comment also applies to source elements */
> + if (nir_src_bit_size(instr->src[i].src) == 16)
> + op[i].stride = 2;
> }
>
> /* We get a bunch of mov's out of the from_ssa pass and they may still
> @@ -2083,6 +2101,9 @@ fs_visitor::emit_gs_input_load(const fs_reg &dst,
> first_component = first_component / 2;
> }
>
> + if (type_sz(tmp_dst.type) == 2)
> + tmp_dst.stride = 2;
> +
> for (unsigned iter = 0; iter < num_iterations; iter++) {
> if (offset_const) {
> /* Constant indexing - use global offset. */
> @@ -2414,6 +2435,9 @@ fs_visitor::nir_emit_tcs_intrinsic(const fs_builder &bld,
> dst = tmp;
> }
>
> + if (type_sz(dst.type) == 2)
> + dst.stride = 2;
> +
> for (unsigned iter = 0; iter < num_iterations; iter++) {
> if (indirect_offset.file == BAD_FILE) {
> /* Constant indexing - use global offset. */
> @@ -2725,6 +2749,9 @@ fs_visitor::nir_emit_tes_intrinsic(const fs_builder &bld,
> first_component = first_component / 2;
> }
>
> + if (type_sz(dest.type) == 2)
> + dest.stride = 2;
> +
> fs_inst *inst;
> if (indirect_offset.file == BAD_FILE) {
> /* Arbitrarily only push up to 32 vec4 slots worth of data,
> @@ -2735,7 +2762,7 @@ fs_visitor::nir_emit_tes_intrinsic(const fs_builder &bld,
> fs_reg src = fs_reg(ATTR, imm_offset / 2, dest.type);
> for (int i = 0; i < instr->num_components; i++) {
> unsigned comp = 16 / type_sz(dest.type) * (imm_offset % 2) +
> - i + first_component;
> + i * dest.stride + first_component;
> bld.MOV(offset(dest, bld, i), component(src, comp));
> }
> tes_prog_data->base.urb_read_length =
> @@ -3182,6 +3209,9 @@ fs_visitor::nir_emit_fs_intrinsic(const fs_builder &bld,
> num_components *= 2;
> }
>
> + if (type_sz(dest.type) == 2)
> + dest.stride = 2;
> +
> for (unsigned int i = 0; i < num_components; i++) {
> struct brw_reg interp = interp_reg(base, component + i);
> interp = suboffset(interp, 3);
> @@ -3576,6 +3606,9 @@ fs_visitor::nir_emit_intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr
> if (nir_intrinsic_infos[instr->intrinsic].has_dest)
> dest = get_nir_dest(instr->dest);
>
> + if (instr->dest.is_ssa && nir_dest_bit_size(instr->dest) == 16)
> + dest.stride = 2;
> +
> switch (instr->intrinsic) {
> case nir_intrinsic_atomic_counter_inc:
> case nir_intrinsic_atomic_counter_dec:
> --
> 2.11.0
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>