[Intel-gfx] [PATCH] drm/i915: Update write_domains on active list after flush.

Mon Mar 1 18:51:37 CET 2010

On Mon, 1 Feb 2010 13:26:14 +0100, Daniel Vetter <daniel at ffwll.ch> wrote:
> On Sun, Jan 31, 2010 at 04:40:17PM +0000, Chris Wilson wrote:
> > Daniel, can you respin your earlier patches as I think there were a couple
> > of good bug fixes that were lost in the noise. And then we can take a
> > fresh look at the others, and see if I said anything very silly.  :-)
> > -ickle
> 
> Will do somewhen later today.
> 
> btw, I've thought a bit about why you want to do such crazy tricks with
> flushes in the middle of batchbuffers. I've come up with the follwing.
>
> 1) Per execbuf-ioctl overhead on the cpu. I haven't seen anything obvious
> there. But even when profiling points this way, IMHO it's better to fix it
> than to paper over it by issuing fewer ioctls.

As a simple test I did
s/intel_batchbuffer_emit_mi_flush/intel_batchbuffer_flush/ on Mesa and
ran firefox-talos-gfx.  It went from 33 seconds to 50 seconds, entirely
CPU bound either way.  execbuf jumped from just over 5% in the kernel to
16%.  That broke down to about:

3.4% refcounting (about half for the ref, half for the idr spinlock)
1.3% kmalloc/free (hey, we could remove that)
1% move_to_active (half refcounting)
1% copy_from_user
1% object_pin
.8% copy_to_user (we copy more than needed, perhaps restructure args?)
.45% add_request
.15% gem_flush

and a bit more that I forget because my system crashed closing sysprof
and I lost my draft :(

> 2) Per execbuf gpu<->cpu synchronization (e.g. too many interrupts). Code
> as-is is quite suboptimal, for when we need some flushes, we already emit
> more than one interrupt per execbuf. There's also an XXX already there
> no one yet bothered to fix. I have some ideas there to mitigate this (as
> prep work for something else).

My assumption is that these don't really matter either.

> 3) Per execbuf overhead for the gpu due to the need to re-emit all the
> necessary state (shaders, tex-units, ...). My idea to fix this:
> 
>  - exebuf-ioctls returns a cookie for every batchbuffer.
>  - userspaces emits commands as if it owns the hw (like dri1).
>  - when doing the next execbuffer ioctl, it also sets the cookie from the
>    last execbuffer ioctl.
>  - kernel checks wheter any other batchbuffer was executed in between (by
>    checking the cookie from userspace). If the cookies don't match, it
>    errors out with a new error-code.
>  - when userspace sees this special error-code, it creates a new
>    batchbuffer with a complete re-emit of the full gpu state.
> 
> Of course, this needs some fancy new userspace code to take advantage of.
> But this way a user can be more fair towards other users (by issuing
> smaller execbufs) without a performance loss in the normal case of one
> process hogging the gpu (e.g. giant vbo's with compute-intense shaders
> could be split-up).

IIRC from a previous test, the GPU didn't care much about extra state
emission, but in tests like cairo-perf the extra CPU cost definitely
counts.  If we do HW contexts much of this CPU cost for the setup would
go away (though there's still pinning to be done on set_context, so it
won't be entirely free).

We'd thought about doing basically what you explained before.  The
downside to that is that you have to do all that complete re-emit work
up front, because at the point of exec you don't know what the state was
at the start that you needed to have set up.  So basically what I'd been
thinking for a non-HW-context hack like that was, at first state emit to
batchbuffer, make a second batch and emit all the other setup state into
it, then continue into the primary batch like normal.  Hand both to the
kernel.  If you weren't the last context running, the kernel runs the
setup batch before your batch.  The tricks here off the top of my head:

1) On 965, most of the context state is pointers to objects, so you need
to get those objects pinned and relocated (and emit state pointing at
them if relocated) even if you were the last context.
2) Aperture accounting for both setup+main batch
3) Pinning the buffers across setup+main in the kernel if you did change
contexts.

Because of those hard parts (particularly #1), I never pursued this
path.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/intel-gfx/attachments/20100301/10ee92b4/attachment.sig>