[Mesa-dev] [PATCH 2/5] i965/vec4: adding vec4_cmod_propagation optimization

Wed Oct 14 00:51:52 PDT 2015

On 13/10/15 23:36, Matt Turner wrote:
> On Tue, Oct 13, 2015 at 1:49 AM, Alejandro Piñeiro <apinheiro at igalia.com> wrote:
>> On 13/10/15 03:10, Matt Turner wrote:
>>> Looks like this is causing an intermittent failure on HSW in our
>>> Jenkins system (but I'm not able to reproduce locally) and a
>>> consistent failure on G45. I'll investigate that.
>> Ok, will hold on, and meanwhile I will try to reproduce the problem on
>> HSW. Unfortunately I don't have any G45 available, so I will not be able
>> to help on that. Thanks for taking a look there.
> Okay, I think I see what's going on:
>
> --- good        2015-10-13 13:34:40.753357253 -0700
> +++ bad 2015-10-13 13:36:06.114290094 -0700
> -and(8)          g6<1>.xD        g6<4,4,1>.xD    1D              { align16 };
> -cmp.nz.f0(8)    null            -g6<4,4,1>.xD   0D              { align16 };
> +and.nz.f0(8)    g6<1>.xD        g6<4,4,1>.xD    1D              { align16 };
>
> We're propagating a .nz conditional mod from a CMP with a null dest
> (that has a writemask of .xyzw) to an AND that has a writemask of only
> .x, so only the .x channel of the flag is now being updated.

That is really common. And that is the idea behind the first patch of
the series that changes how if's are emitted, and allowing to optimize
if scan_inst->writemask is WRITEMASK_X. I will use the piglit test
glsl-1.30/execution/switch/vs-pervertex.shader_test to explain this.

If we use current master, we get the following assembly:
1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
align16 1Q compacted };
2: mov.nz.f0(8)    null            g20<4,4,1>.xD                   {
align16 1Q };
3: (+f0) if(8)     JIP: 6          UIP: 6                          {
align16 1Q };

If we just simplify out the mov.nz at #2, the test would fail, because
as in the example you mentioned, null has a XYZW writemask, but g19 a
writemask X. So this optimized version, f0 only have x updated, and #3
is doing a normal if, that uses all the channels.

But right now the if just needs to check against one component, the x.
So that is the reason (among others) that my patch 1 changes how if's
are emitted. So, using both my patches 1 and 2, the assembly outcome
would be the following:
1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
align16 1Q compacted };
2: (+f0.x) if(8)   JIP: 6          UIP: 6                          {
align16 1Q };

It is true that now #1 only updates the x channel. But the if at #2 only
use that channel now, so the test still passes.

>
> I think for now the thing to do is add
>
>    (inst->dst.writemask & ~scan_inst->dst.writemask) != 0)

This would bail out basically all if's.

So guessing about what is happening here, as I mentioned, we are
allowing to pass that condition if scan_inst->dst.writemask equal to
WRITEMASK_X assuming that they come from an if, so we don't need to
update all f0 channels. But as Jason mentioned in our off-list chat,
that would not be the case if we are dealing with a sel. So something
like this:
1: cmp.z.f0(8)     g20<1>.xD       g12<4,4,1>.xD   g19<4,4,1>.xD   {
align16 1Q compacted };
2: mov.nz.f0(8)    null            g20<4,4,1>.xD                   {
align16 1Q };
3: (+f0.x) if(8)     JIP: 6          UIP: 6                          {
align16 1Q };

#2 can be optimized out. But if we replace that if for an sel at #3, we
can't.

If that is the case, in addition to check scan_inst and inst, we would
need to check in which kind of instruction the flag register is used later.

But as I said, Im just guessing until I see the full failing test.

> to the list of conditions under which we bail out. That is, if the
> instruction we want to propagate the cmod onto writes fewer channels,
> we can't do the optimization.
>
> With that, the block should look like:
>
>      if ((scan_inst->predicate && scan_inst->opcode != BRW_OPCODE_SEL) ||
>          scan_inst->dst.reg_offset != inst->src[0].reg_offset ||
>          (scan_inst->dst.writemask != WRITEMASK_X &&
>           scan_inst->dst.writemask != WRITEMASK_XYZW) ||
>          (scan_inst->dst.writemask == WRITEMASK_XYZW &&
>           inst->src[0].swizzle != BRW_SWIZZLE_XYZW) ||
>          (inst->dst.writemask & ~scan_inst->dst.writemask) != 0)
>         break;
>
> The good news is that, unless I've done something wrong, this doesn't
> affect any shaders in shader-db on ILK or HSW (I only tested those
> two, but I expect the results are the same everywhere). Apparently
> this is a pretty rare case.

Are you sure? I have made a run adding your condition, and now comparing
master vs having the optimization I get this:
total instructions in shared programs: 6240631 -> 6240471 (-0.00%)
instructions in affected programs:     18965 -> 18805 (-0.84%)
helped:                                160
HURT:                                  0

That is a really small gain. Or put in other way, if we compare the
conditions I have on the original patches vs adding the condition you
are proposing, I get this:

total instructions in shared programs: 6223900 -> 6240471 (0.27%)
instructions in affected programs:     477537 -> 494108 (3.47%)
helped:                                0
HURT:                                  3047

>
> With that change, my R-b still stands, though we should have a unit
> test for this case as well in 3/5.

I would like to discuss this a little more before pushing.

>
> Thanks!
> Matt
>

BR

-- 
Alejandro Piñeiro (apinheiro at igalia.com)