[Mesa-dev] [PATCH 0/2] Nir: Allow CSE of SSBO loads

Thu Oct 22 08:18:29 PDT 2015

On Thu, 2015-10-22 at 16:38 +0200, Iago Toral wrote:
> On Thu, 2015-10-22 at 09:39 -0400, Connor Abbott wrote:
> > On Thu, Oct 22, 2015 at 7:21 AM, Iago Toral Quiroga <itoral at igalia.com> wrote:
> > > I implemented this first as a separate optimization pass in GLSL IR [1], but
> > > Curro pointed out that this being pretty much a restricted form of a CSE pass
> > > it would probably make more sense to do it inside CSE (and we no longer have
> > > a CSE pass in GLSL IR).
> > >
> > > Unlike other things we CSE in NIR, in the case of SSBO loads we need to make
> > > sure that we invalidate previous entries in the set in the presence of
> > > conflicting instructions (i.e. SSBO writes to the same block and offset) or
> > > in the presence of memory barriers.
> > >
> > > If this is accepted I intend to extend this to also cover image reads, which
> > > follow similar behavior.
> > >
> > > No regressions observed in piglit or dEQP's SSBO functional tests.
> > >
> > > [1] http://lists.freedesktop.org/archives/mesa-dev/2015-October/097718.html
> > >
> > > Iago Toral Quiroga (2):
> > >   nir/cse: invalidate SSBO loads in presence of ssbo writes or memory
> > >     barriers
> > >   nir/instr_set: allow rewrite of SSBO loads
> > >
> > >  src/glsl/nir/nir_instr_set.c |  24 ++++++--
> > >  src/glsl/nir/nir_opt_cse.c   | 142 +++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 162 insertions(+), 4 deletions(-)
> > >
> > > --
> > > 1.9.1
> > >
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev at lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > 
> > NAK, this isn't going to work. NIR CSE is designed for operations
> > which can be moved around freely as long they're still dominated by
> > the SSA values they use. It makes heavy advantage of this to avoid
> > looking at the entire CFG and instead only at the current block and
> > its parents in the dominance tree. For example, imagine you have
> > something like:
> > 
> > A = load_ssbo 0
> > if  (cond) {
> >    store_ssbo 0
> > }
> > B = load_ssbo 0
> > 
> > Then A and B can't be combined, but CSE will combine them anyways when
> > it reaches B because it keeps a hash table of values dominating B and
> > finds A as a match. It doesn't look at the if conditional at all
> > because it doesn't dominate the load to B. This is great when you want
> > to CSE pure things that don't depend on other side effects -- after
> > all, this is the sort of efficiency that SSA is supposed to give us --
> > but it means that as-is, it can't be used for e.g. SSBO's and images
> > without completely changing how the pass works and making it less
> > efficient.
> 
> Ugh! One would think that at least one of the 2000+ SSBO tests in dEQP
> would catch something like this... I guess not :(.

However, I have just tested this and it works just fine. See:

buffer Fragments {
   vec4 v;
};
out vec4 color;

void main()
{
   vec4 tmp = v;
   if (tmp.x > 0) {
      v = vec4(0, 1, 0, 1);
   }
   color = v;
}

And the final NIR SSA form for this is:

impl main {
        block block_0:
        /* preds: */
        vec1 ssa_0 = load_const (0x00000000 /* 0.000000 */)
        vec4 ssa_1 = load_const (0x00000000 /* 0.000000 */,
0x3f800000 /* 1.000000 */, 0x00000000 /* 0.000000 */, 0x3f800000 /*
1.000000 */)
        vec4 ssa_2 = intrinsic load_ssbo (ssa_0) () (0)
        vec1 ssa_3 = flt ssa_0, ssa_2
        /* succs: block_1 block_2 */
        if ssa_3 {
                block block_1:
                /* preds: block_0 */
                intrinsic store_ssbo (ssa_1, ssa_0) () (0, 15)
                /* succs: block_3 */
        } else {
                block block_2:
                /* preds: block_0 */
                /* succs: block_3 */
        }
        block block_3:
        /* preds: block_1 block_2 */
        vec4 ssa_4 = intrinsic load_ssbo (ssa_0) () (0)
        intrinsic store_output (ssa_4) () (0)   /* color */
        /* succs: block_4 */
        block block_4:
}

What is going on here is that block 1 is in block0->dom_children, so the
CSE pass looks into that, sees the store and invalidates the first SSBO
load as I was initially hoping that it would.

I guess this behavior is not expected then?

Iago

> > Now, that being said, I still think that we should definitely be doing
> > this sort of thing in NIR now that we've finally added support for
> > SSBO's and images. We've been trying to avoid adding new optimizations
> > to GLSL, since we've been trying to move away from it. In addition,
> > with SPIR-V on the way, anything added to GLSL IR now is something
> > that we won't be able to use with SPIR-V shaders. Only doing it in FS
> > doesn't sound so great either; we should be doing as much as possible
> > at the midlevel, and combining SSBO loads is something that isn't
> > FS-specific at all.
> 
> Yeah, agreed.
> 
> > There are two ways I can see support for this being added to NIR:
> > 
> > 1. Add an extra fake source/destination to intrinsics with side
> > effects, and add a pass to do essentially a conversion to SSA that
> > wires up these "token" sources/destinations, or perhaps extend the
> > existing to-SSA pass.
> > 
> > 2. Add a special "load-combining" pass that does some dataflow
> > analysis or similar (or, for now, only looks at things within a single
> > block).
> > 
> > The advantage of #1 is that we get to use existing NIR passes, like
> > CSE, DCE, and GCM "for free" on SSBO loads and stores, without having
> > to do the equivalent thing using dataflow analysis. Also, doing store
> > forwarding (i.e. replacing the result of an SSBO load with the value
> > corresponding to a store, if we can figure out which store affects it)
> > is going to much easier. However, #1 is going to be much more of a
> > research project. I've thought about how we could do it, but I'm still
> > not sure how it could be done feasibly and still be correct.
> 
> Thanks for sharing these ideas. #1 looks like the best way to go in
> terms of benefits (although it looks rather artificial!), however I am
> not sure that my understanding of NIR at this moment is good enough to
> pursue something like that. Also, I would really like to see some sort
> of support for this landing soon, even if limited, since having 16 SSBO
> loads for a single matrix multiplication like the one I mention in the
> commit log for my second patch is really bad, and even the simplest
> version of #2 would address that, so I think I'll give #2 a try.
> 
> Iago
>