[Mesa-dev] [PATCH v2 3/6] i965/fs/gen7: split instructions that run into exec masking bugs
Francisco Jerez
currojerez at riseup.net
Mon Jul 11 19:19:18 UTC 2016
Samuel Iglesias Gonsálvez <siglesias at igalia.com> writes:
> From: Iago Toral Quiroga <itoral at igalia.com>
>
> In fp64 we can produce code like this:
>
> mov(16) vgrf2<2>:UD, vgrf3<2>:UD
>
> That our simd lowering pass would typically split in instructions with a
> width of 8, writing to two consecutive registers each. Unfortunately, gen7
> hardware has a bug affecting execution masking and as a result, the
> second GRF register write won't work properly. Curro verified this:
>
> "The problem is that pre-Gen8 EUs are hardwired to use the QtrCtrl+1
> (where QtrCtrl is the 8-bit quarter of the execution mask signals
> specified in the instruction control fields) for the second
> compressed half of any single-precision instruction (for
> double-precision instructions it's hardwired to use NibCtrl+1,
> at least on HSW), which means that the EU will apply the wrong
> execution controls for the second sequential GRF write if the number
> of channels per GRF is not exactly eight in single-precision mode (or
> four in double-float mode)."
>
> In practice, this means that we cannot write more than one
> consecutive GRF in a single instruction if the number of channels
> per GRF is not exactly eight in single-precision mode (or four
> in double-float mode).
>
> This patch makes our SIMD lowering pass split this kind of instructions
> so that the split versions only write to a single register. In the
> example above this means that we split the write in 4 instructions, each
> one writing 4 UD elements (width = 4) to a single register.
>
> v2 (Curro):
> - Make explicit that the thing about hardwiring NibCtrl+1 for the second
> compressed half is known to happen in Haswell and the issue with IVB
> might not be exactly the same.
> - Assign max_width instead of returning early so that we can handle
> multiple restrictions affecting to the same instruction.
> - Avoid division by 0 if the instruction does not write any registers.
> - Ignore instructions what have WE_all set.
> - Use the instruction execution type size instead of the dst type size.
> ---
> src/mesa/drivers/dri/i965/brw_fs.cpp | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> index 2f473cc..4d57412 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> @@ -4691,6 +4691,34 @@ get_fpu_lowered_simd_width(const struct brw_device_info *devinfo,
> */
> unsigned reg_count = inst->regs_written;
>
You've put this right in the middle of one of my previous workarounds
;), can you move it down a bit more next to the "According to the IVB
PRMs" line, or close to the end of the function?
> + /* Pre-Gen8 EUs are hardwired to use the QtrCtrl+1 (where QtrCtrl is
> + * the 8-bit quarter of the execution mask signals specified in the
> + * instruction control fields) for the second compressed half of any
> + * single-precision instruction (for double-precision instructions
> + * it's hardwired to use NibCtrl+1, at least on HSW), which means that
> + * the EU will apply the wrong execution controls for the second
> + * sequential GRF write if the number of channels per GRF is not exactly
> + * eight in single-precision mode (or four in double-float mode).
> + *
> + * In this situation we calculate the maximum size of the split
> + * instructions so they only ever write to a single register.
> + */
> + if (devinfo->gen < 8 && inst->regs_written > 1 &&
> + !inst->force_writemask_all) {
> + unsigned channels_per_grf = inst->exec_size / inst->regs_written;
Could be declared const.
> + unsigned exec_type_size = 0;
> + for (int i = 0; i < inst->sources; i++) {
> + if (inst->src[i].file == BAD_FILE)
> + break;
It wouldn't be right to break early if the instruction has any valid
sources after a non-present one. This should probably be:
| if (inst->src[i].file != BAD_FILE)
| exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type));
instead.
> + exec_type_size = MAX2(exec_type_size, type_sz(inst->src[i].type));
> + }
> + assert(exec_type_size);
> +
> + if (channels_per_grf != REG_SIZE / exec_type_size) {
I think you really need to use (exec_type_size == 8 ? 4 : 8) instead of
the RHS of this expression. The hardware shifts exactly 8 channels per
compressed half of the instruction regardless of the execution type, so
this formula would give you an incorrect answer for exec_type_size < 4.
> + max_width = MIN2(max_width, channels_per_grf);
> + }
Redundant braces.
> + }
> +
> for (unsigned i = 0; i < inst->sources; i++)
> reg_count = MAX2(reg_count, (unsigned)inst->regs_read(i));
>
> --
> 2.7.4
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 212 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/mesa-dev/attachments/20160711/ab871bee/attachment.sig>
More information about the mesa-dev
mailing list