[Mesa-dev] [PATCH] i965/fs: Fix gen<6 LRP opcode

Tue May 20 12:10:01 PDT 2014

On Fri, May 9, 2014 at 6:21 AM, Juha-Pekka Heikkila
<juhapekka.heikkila at gmail.com> wrote:
> On ILK implicit accumulator write from MUL opcode seem to
> behave sometime unexpected. This patch change implicit
> accumulator write to explicit on emitting LRP for gen<6.
>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77707
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=77703
> Signed-off-by: Juha-Pekka Heikkila <juhapekka.heikkila at gmail.com>
> ---
>  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> index d2dc5fa..c09f0a7 100644
> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp
> @@ -226,8 +226,10 @@ fs_visitor::emit_lrp(const fs_reg &dst, const fs_reg &x, const fs_reg &y,
>        fs_reg negative_a = a;
>        negative_a.negate = !a.negate;
>
> +      struct brw_reg acc = retype(brw_acc_reg(), this->result.type);
> +
>        emit(ADD(one_minus_a, negative_a, fs_reg(1.0f)));
> -      fs_inst *mul = emit(MUL(reg_null_f, y, a));
> +      fs_inst *mul = emit(MUL(acc, y, a));
>        mul->writes_accumulator = true;

If we're changing the accumulator destination to be explicit, we
shouldn't make it implicit as well.

>        emit(MAC(dst, x, one_minus_a));
>     } else {
> --
> 1.8.1.2

All in all, I'm fine with this patch as-is. If fixes a couple of
regressions, even though I have no idea why.

According to the G45 docs:

> Performance Note: GENX hardware cannot support write followed by read on the
> same accumulator register back to back. A thread stall (equivalent of having a ‘switch’
> instruction control) may occur before an instruction that uses an accumulator register
> as an (implicit or explicit) source and the previous instruction has the same
> accumulator register as the (implicit or explicit) destination. This commonly occurs in
> signal processing algorithms. For example, a multi-tap FIR filter can be implemented
> by a sequence of mac instructions; a matrix computation in DCT transform also
> consists of a sequence of mac instructions. A program with a non-compressed
> instruction stream may choose to interleave the use of acc0 and acc1 to achieve
> better performance, if accumulators are used explicitly. This is not true if the
> accumulator is addressed implicitly based on the SecHalf compression control field as
> flags and masks are also affected by SecHalf.

A snapshot of the Bspec before they trashed the ILK docs says "GEN4
hardware cannot...", so it doesn't seem like this even applies to
Ironlake, and it doesn't even claim to cause hangs. Unfortunately
that's the best explanation I can come up with at this point.

Running some code in the simulator that writes and reads the
accumulator at least doesn't show any complaints on Haswell. I don't
think any of us have an Ironlake simulator?

I really have no idea what's wrong. At the same time, we also have no
idea about the latency of operations writing the accumulator. There's
some text that suggests that it's lower latency that standard
registers, but we have no ability to measure this before Sandybridge.

While it's likely that using the accumulator+MAC is faster to perform
a LRP for one float, for multiple we wind up with code like

mac(8)          g113<1>F        g2<0,1,0>F      g4<8,8,1>F      {
align1 WE_normal 1Q compacted };
mul(8)          null            g2.5<0,1,0>F    g3.1<0,1,0>F    {
align1 WE_normal 1Q AccWrEnable };
mac(8)          g114<1>F        g2.1<0,1,0>F    g5<8,8,1>F      {
align1 WE_normal 1Q compacted };
mul(8)          null            g2.6<0,1,0>F    g3.2<0,1,0>F    {
align1 WE_normal 1Q AccWrEnable };
mac(8)          g115<1>F        g2.2<0,1,0>F    g6<8,8,1>F      {
align1 WE_normal 1Q compacted };
mul(8)          null            g2.7<0,1,0>F    g3.3<0,1,0>F    {
align1 WE_normal 1Q AccWrEnable };
mac(8)          g116<1>F        g2.3<0,1,0>F    g7<8,8,1>F      {
align1 WE_normal 1Q compacted };

where we presumably stall the EU between successive calculations. It's
unclear to me whether this was an improvement over the old code, which
used one more instruction per component, but could hide instruction
latency more easily.