<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Thu, Aug 24, 2017 at 6:54 AM, Alejandro Piñeiro <span dir="ltr"><<a href="mailto:apinheiro@igalia.com" target="_blank">apinheiro@igalia.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">From: Jose Maria Casanova Crespo <<a href="mailto:jmcasanova@igalia.com">jmcasanova@igalia.com</a>><br> <br> >From Skylake PRM, Volume 07 3D Media GPGPU, Section Register Region<br> Restrictions, SubSection 5. Special Cases for Word Operations (page<br> 781):<br> <br> "There is a relaxed alignment rule for word destinations. When the<br> destination type is word (UW, W, HF), destination data types can<br> be aligned to either the lowest word or the second lowest word of<br> the execution channel. This means the destination data words can<br> be either all in the even word locations or all in the odd word<br> locations.<br> <br> // Example:<br> add (8) r10.0<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf<br> add (8) r10.1<2>:hf r11.0<8;8,1>:f r12.0<8;8,1>:hf<br> // Note: The destination offset may be .0 or .1 although the destination<br> subregister<br> // is required to be aligned to execution datatype."<br> <br> In practice, that means that for alu operations like float to half<br> float conversions, we need to align the 16-bit to 32-bit (so setting<br> the stride to 2).<br></blockquote><div><br></div><div>I believe your application of this rule is a bit imprecise. My understanding of the hardware is that the rule stated above is a modification to the following rule from the previous page:</div><div><br></div><div>> When the Execution Data Type is wider than the destination data type, the destination must<br>> be aligned as required by the wider execution data type and specify a HorzStride equal to<br>> the ratio in sizes of the two data types. For example, a mov with a D source and B destination<br>> must use a 4-byte aligned destination and a Dst.HorzStride of 4.</div><div><br></div><div>In other words, the rule you stated above only applies in the case of ALU operations with a source type wider than the destination. If all operands to an ALU operation are word types, the destination can be tightly packed. We have a decent bit of code in the driver today which does ALU operations that write to word destinations with a stride of 1; all of it is done with W or DW and not HF but I believe the same principal applies to HF.<br></div><div><br></div><div>In particular, this means that basically the only time we care about this rule is for nir_op_f2f16 and f2f32 (with a 64-bit source). The rest of the ALU operations should be fine as-is. (For nir_op_f2f32 with a 16-bit source, the destination is wider than the source so the restriction does not apply.) I'd prefer we let 16-bit values be tightly packed by default and implement f32->f16 conversions as</div><div><br></div><div>mov (8) r10.0<2>:hf r11.0<8;8,1>f</div><div>mov (8) r12.0<1>:hf r10.0<16;8,2>hf</div><div><br></div><div>The optimizer can probably sort things out for us and get rid of the second mov.</div><div><br></div><div>--Jason<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> As a first approach, we are setting the stride to 2 in general, even<br> if it is not an alu operation, as simplifies handling with 16-bit data<br> types. We can revisit this in the future.<br> <br> Signed-off-by: Jose Maria Casanova Crespo <<a href="mailto:jmcasanova@igalia.com">jmcasanova@igalia.com</a>><br> Signed-off-by: Alejandro Piñeiro <<a href="mailto:apinheiro@igalia.com">apinheiro@igalia.com</a>><br> ---<br> src/intel/compiler/brw_fs_nir.<wbr>cpp | 35 ++++++++++++++++++++++++++++++<wbr>++++-<br> 1 file changed, 34 insertions(+), 1 deletion(-)<br> <br> diff --git a/src/intel/compiler/brw_fs_<wbr>nir.cpp b/src/intel/compiler/brw_fs_<wbr>nir.cpp<br> index e4eba1401f8..c469baf42a1 100644<br> --- a/src/intel/compiler/brw_fs_<wbr>nir.cpp<br> +++ b/src/intel/compiler/brw_fs_<wbr>nir.cpp<br> @@ -585,6 +585,21 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)<br> result.type = brw_type_for_nir_type(devinfo,<br> (nir_alu_type)(nir_op_infos[<wbr>instr->op].output_type |<br> nir_dest_bit_size(instr->dest.<wbr>dest)));<br> + /* According to PRM, Volume 07 3D Media GPGPU, Section Register Region<br> + * Restrictions, SubSection 5. Special Cases for Word Operations, GEN >= 8<br> + * can only handle writting on 16-bit (UW,HF,W) registers on the lower<br> + * word or in the higher word for each double word. It needs to be the<br> + * same offset for for all 16-bit elements of the register. So as<br> + * consequence all destinations of 16-bit registers should use a stride<br> + * equal to 2.<br> + *<br> + * We also need alignment to 32-bit for conversions operations<br> + * form 32-bit to 16-bit. So this is also useful for this operations<br> + * that are the only alu operation supported right now.<br> + */<br> +<br> + if (nir_dest_bit_size(instr-><wbr>dest.dest) == 16)<br> + result.stride = 2;<br> <br> fs_reg op[4];<br> for (unsigned i = 0; i < nir_op_infos[instr->op].num_<wbr>inputs; i++) {<br> @@ -594,6 +609,9 @@ fs_visitor::nir_emit_alu(const fs_builder &bld, nir_alu_instr *instr)<br> nir_src_bit_size(instr->src[i]<wbr>.src)));<br> op[i].abs = instr->src[i].abs;<br> op[i].negate = instr->src[i].negate;<br> + /* Previous comment still aplies here for source elements */<br> + if (nir_src_bit_size(instr->src[<wbr>i].src) == 16)<br> + op[i].stride = 2;<br> }<br> <br> /* We get a bunch of mov's out of the from_ssa pass and they may still<br> @@ -2083,6 +2101,9 @@ fs_visitor::emit_gs_input_<wbr>load(const fs_reg &dst,<br> first_component = first_component / 2;<br> }<br> <br> + if (type_sz(tmp_dst.type) == 2)<br> + tmp_dst.stride = 2;<br> +<br> for (unsigned iter = 0; iter < num_iterations; iter++) {<br> if (offset_const) {<br> /* Constant indexing - use global offset. */<br> @@ -2414,6 +2435,9 @@ fs_visitor::nir_emit_tcs_<wbr>intrinsic(const fs_builder &bld,<br> dst = tmp;<br> }<br> <br> + if (type_sz(dst.type) == 2)<br> + dst.stride = 2;<br> +<br> for (unsigned iter = 0; iter < num_iterations; iter++) {<br> if (indirect_offset.file == BAD_FILE) {<br> /* Constant indexing - use global offset. */<br> @@ -2725,6 +2749,9 @@ fs_visitor::nir_emit_tes_<wbr>intrinsic(const fs_builder &bld,<br> first_component = first_component / 2;<br> }<br> <br> + if (type_sz(dest.type) == 2)<br> + dest.stride = 2;<br> +<br> fs_inst *inst;<br> if (indirect_offset.file == BAD_FILE) {<br> /* Arbitrarily only push up to 32 vec4 slots worth of data,<br> @@ -2735,7 +2762,7 @@ fs_visitor::nir_emit_tes_<wbr>intrinsic(const fs_builder &bld,<br> fs_reg src = fs_reg(ATTR, imm_offset / 2, dest.type);<br> for (int i = 0; i < instr->num_components; i++) {<br> unsigned comp = 16 / type_sz(dest.type) * (imm_offset % 2) +<br> - i + first_component;<br> + i * dest.stride + first_component;<br> bld.MOV(offset(dest, bld, i), component(src, comp));<br> }<br> tes_prog_data->base.urb_read_<wbr>length =<br> @@ -3182,6 +3209,9 @@ fs_visitor::nir_emit_fs_<wbr>intrinsic(const fs_builder &bld,<br> num_components *= 2;<br> }<br> <br> + if (type_sz(dest.type) == 2)<br> + dest.stride = 2;<br> +<br> for (unsigned int i = 0; i < num_components; i++) {<br> struct brw_reg interp = interp_reg(base, component + i);<br> interp = suboffset(interp, 3);<br> @@ -3576,6 +3606,9 @@ fs_visitor::nir_emit_<wbr>intrinsic(const fs_builder &bld, nir_intrinsic_instr *instr<br> if (nir_intrinsic_infos[instr-><wbr>intrinsic].has_dest)<br> dest = get_nir_dest(instr->dest);<br> <br> + if (instr->dest.is_ssa && nir_dest_bit_size(instr->dest) == 16)<br> + dest.stride = 2;<br> +<br> switch (instr->intrinsic) {<br> case nir_intrinsic_atomic_counter_<wbr>inc:<br> case nir_intrinsic_atomic_counter_<wbr>dec:<br> <span class="gmail-HOEnZb"><font color="#888888">--<br> 2.11.0<br> <br> ______________________________<wbr>_________________<br> mesa-dev mailing list<br> <a href="mailto:mesa-dev@lists.freedesktop.org">mesa-dev@lists.freedesktop.org</a><br> <a href="https://lists.freedesktop.org/mailman/listinfo/mesa-dev" rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/mesa-dev</a><br> </font></span></blockquote></div><br></div></div>