[Intel-gfx] [PATCH] drm/i915: Update write_domains on active list after flush.

Daniel Vetter daniel at ffwll.ch
Tue Mar 2 17:22:39 CET 2010


On Mon, Mar 01, 2010 at 09:51:37AM -0800, Eric Anholt wrote:
> On Mon, 1 Feb 2010 13:26:14 +0100, Daniel Vetter <daniel at ffwll.ch> wrote:
> > btw, I've thought a bit about why you want to do such crazy tricks with
> > flushes in the middle of batchbuffers. I've come up with the following.
> >
> > 1) Per execbuf-ioctl overhead on the cpu. I haven't seen anything obvious
> > there. But even when profiling points this way, IMHO it's better to fix it
> > than to paper over it by issuing fewer ioctls.
> 
> As a simple test I did
> s/intel_batchbuffer_emit_mi_flush/intel_batchbuffer_flush/ on Mesa and
> ran firefox-talos-gfx.  It went from 33 seconds to 50 seconds, entirely
> CPU bound either way.  execbuf jumped from just over 5% in the kernel to
> 16%.  That broke down to about:
> 
> 3.4% refcounting (about half for the ref, half for the idr spinlock)
> 1.3% kmalloc/free (hey, we could remove that)
> 1% move_to_active (half refcounting)
> 1% copy_from_user
> 1% object_pin
> .8% copy_to_user (we copy more than needed, perhaps restructure args?)
> .45% add_request
> .15% gem_flush
> 
> and a bit more that I forget because my system crashed closing sysprof
> and I lost my draft :(

Thanks for doing some profiling. I've just started benchmarking and
profiling for my pipelined fencing stuff and it's nice to see data from
people who actually know what they're doing.

Looks like there's some low-hanging fruit here, like batching up the
refcounting and getting rid of the idr spinlock in the profile. Ditto
for move_to_active.
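
Roughly what I have in mind for the lookup part, purely as a sketch (the
function name is made up and real code would have to fit into how execbuf
is structured today):

#include "drmP.h"

static int
eb_lookup_objects(struct drm_file *file_priv, u32 *handles,
		  struct drm_gem_object **objs, int count)
{
	int i;

	/* one lock round-trip for the whole execbuf instead of a
	 * lock/unlock + kref dance per object */
	spin_lock(&file_priv->table_lock);
	for (i = 0; i < count; i++) {
		objs[i] = idr_find(&file_priv->object_idr, handles[i]);
		if (objs[i] == NULL)
			goto err;
		drm_gem_object_reference(objs[i]);
	}
	spin_unlock(&file_priv->table_lock);
	return 0;

err:
	/* drop the references taken so far */
	spin_unlock(&file_priv->table_lock);
	while (i--)
		drm_gem_object_unreference(objs[i]);
	return -ENOENT;
}

That kills the per-object spinlock traffic; the atomic refs themselves still
show up, but they could be batched along the same lines.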
> 
> > 2) Per execbuf gpu<->cpu synchronization (e.g. too many interrupts). Code
> > as-is is quite suboptimal: when we need some flushes, we already emit
> > more than one interrupt per execbuf. There's also an XXX already there that
> > no one has bothered to fix yet. I have some ideas to mitigate this (as
> > prep work for something else).
> 
> My assumption is that these don't really matter either.

Well, I've found out that they matter. Omitting them grossly regresses
xrender performance in cairo. Looks like the current uxa implementation
has tightly coupled gpu/cpu interactions. So the XXX is actually wrong:
currently we _need_ all these interrupts, and having fewer of them makes
things worse.

> > 3) Per execbuf overhead for the gpu due to the need to re-emit all the
> > necessary state (shaders, tex-units, ...). My idea to fix this:
> > 
> >  - execbuf-ioctl returns a cookie for every batchbuffer.
> >  - userspace emits commands as if it owns the hw (like dri1).
> >  - when doing the next execbuffer ioctl, it also sets the cookie from the
> >    last execbuffer ioctl.
> >  - kernel checks whether any other batchbuffer was executed in between (by
> >    checking the cookie from userspace). If the cookies don't match, it
> >    errors out with a new error-code.
> >  - when userspace sees this special error-code, it creates a new
> >    batchbuffer with a complete re-emit of the full gpu state.
> > 
> > Of course, this needs some fancy new userspace code to take advantage of.
> > But this way a user can be more fair towards other users (by issuing
> > smaller execbufs) without a performance loss in the normal case of one
> > process hogging the gpu (e.g. giant vbo's with compute-intense shaders
> >    could be split up).
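
To make that a bit more concrete, the userspace side would look roughly like
the following. Everything in it is made up -- the ioctl wrapper, the cookie
fields and the error code don't exist anywhere, it's only meant to show the
flow:

#include <stdint.h>

#define ERR_STATE_LOST	2048	/* hypothetical new execbuf error code */

struct fake_execbuf {
	uint32_t batch_handle;
	uint64_t prev_cookie;	/* cookie from our previous submission */
	uint64_t cookie;	/* filled in by the kernel on success */
};

/* invented wrappers standing in for the real ioctl path and for a
 * complete re-emit of the gpu state at the start of the batch */
extern int fake_execbuf_ioctl(int fd, struct fake_execbuf *exec);
extern void emit_full_state(uint32_t batch_handle);

static uint64_t last_cookie;

static int submit(int fd, uint32_t batch_handle)
{
	struct fake_execbuf exec = {
		.batch_handle = batch_handle,
		.prev_cookie  = last_cookie,
	};
	int ret;

	ret = fake_execbuf_ioctl(fd, &exec);
	if (ret == -ERR_STATE_LOST) {
		/* some other batch ran in between: re-emit the complete
		 * gpu state and try again */
		emit_full_state(batch_handle);
		ret = fake_execbuf_ioctl(fd, &exec);
	}
	if (ret == 0)
		last_cookie = exec.cookie;
	return ret;
}

The point being that the full re-emit only ever happens when another process
actually ran on the gpu, so the common case of one process hogging the gpu
stays as cheap as dri1-style submission.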
> 
> IIRC from a previous test, the GPU didn't care much about extra state
> emission, but in tests like cairo-perf the extra CPU cost definitely
> counts.  If we do HW contexts, much of this CPU cost for the setup would
> go away (though there's still pinning to be done on set_context, so it
> won't be entirely free).
> 
> We'd thought about doing basically what you explained before.  The
> downside to that is that you have to do all that complete re-emit work
> up front, because at the point of exec you don't know what the state was
> at the start that you needed to have set up.  So basically what I'd been
> thinking for a non-HW-context hack like that was, at first state emit to
> batchbuffer, make a second batch and emit all the other setup state into
> it, then continue into the primary batch like normal.  Hand both to the
> kernel.  If you weren't the last context running, the kernel runs the
> setup batch before your batch.  The tricks here off the top of my head:
> 
> 1) On 965, most of the context state is pointers to objects, so you need
> to get those objects pinned and relocated (and emit state pointing at
> them if relocated) even if you were the last context.
> 2) Aperture accounting for both setup+main batch
> 3) Pinning the buffers across setup+main in the kernel if you did change
> contexts.
> 
> Because of those hard parts (particularly #1), I never pursued this
> path.
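
(For what it's worth, the kernel half of that two-batch scheme is the easy
bit -- something like the sketch below, with invented names and none of the
pinning/aperture handling from the list above. Points 1-3 are exactly where
it stops being simple.)

#include "drmP.h"

/* invented helper standing in for the normal batch submission path */
extern int submit_batch(struct drm_device *dev, struct drm_gem_object *batch);

/* invented bookkeeping: which client submitted the last batch */
static struct drm_file *last_submitter;

static int
run_with_setup_batch(struct drm_device *dev, struct drm_file *file,
		     struct drm_gem_object *setup_batch,
		     struct drm_gem_object *main_batch)
{
	int ret;

	/* only replay the full state setup if another client has used
	 * the gpu since this one's last batch */
	if (last_submitter != file) {
		ret = submit_batch(dev, setup_batch);
		if (ret)
			return ret;
	}
	last_submitter = file;

	return submit_batch(dev, main_batch);
}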

And 1) can't be fixed with hw support, either, because the hw stores
pointers to these objects ... So it looks like optimizing execbuf is
called for.

btw, the last time I strolled through the gpu docs I noticed that there are
some clock counter registers. We could store them with
MI_STORE_REGISTER_MEM into the hw status page. Placing one of these stores
before and after a batchbuffer should give pretty accurate measurements of
the time the gpu is actually busy processing a batchbuffer. Has anyone
looked into this yet, or do you think it's not really worth it as a
performance tuning tool?
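
Something like this, whether the kernel wraps it around the
MI_BATCH_BUFFER_START or userspace sticks it at both ends of a batch -- the
register offset and the status page slot below are just placeholders, the
real values would have to come out of the docs:

#include <stdint.h>

#define MI_STORE_REGISTER_MEM	((0x24 << 23) | (3 - 2))  /* 3-dword form */
#define CLOCK_COUNTER_REG	0x2358	/* placeholder, check the docs */

/* append an MI_STORE_REGISTER_MEM that dumps the clock counter into the
 * hw status page; cs points into the ring/batch, hws_gfx_addr is the gpu
 * address of the status page, slot is a free dword index in it */
static uint32_t *
emit_store_clock(uint32_t *cs, uint32_t hws_gfx_addr, uint32_t slot)
{
	*cs++ = MI_STORE_REGISTER_MEM;
	*cs++ = CLOCK_COUNTER_REG;		/* source register */
	*cs++ = hws_gfx_addr + slot * 4;	/* destination dword */
	return cs;
}

One store before and one after the batch, and the difference between the two
dwords in the status page tells you how long the gpu actually chewed on that
batch, without any extra interrupts or cpu involvement.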

-Daniel
-- 
Daniel Vetter
Mail: daniel at ffwll.ch
Mobile: +41 (0)79 365 57 48


