[Intel-gfx] Corruption in glxgears with Compiz

Sat Oct 23 19:48:47 CEST 2010

On Sat, 23 Oct 2010 12:42:05 +0100, Peter Clifton <pcjc2 at cam.ac.uk> wrote:
> Your patch works a treat.. I knew mine was really only a band-aid which
> forced a flush on the pending indiscriminately, and was glad to see the
> proper fix. 
> 
> Really difficult to get your head round all this flush / invalidate
> stuff. I get the idea, but in practice it is very confusing due to the
> fact it is all deferred / scheduled work, and the two subtly different
> concepts (flush / invalidate) are handled by the same action on
> the GPU, and very similar code! Very easy to muddle current / pending
> ring in my head, for example.
> 
> You replied to Alexey that the patch is only a stop gap, and inter-ring
> synchronisation is the real challenge. I guess that is something you'll
> be forced to look at with the new Sandybridge chipset having a separate
> ring for BLT operations?

Exactly. We already have the issue on i965 with the Bitstream Decoder ring
which handles video separately from the render ring. Fortunately no one has
fallen over the lack of synchronisation there since the API design makes
interoperating GL/RENDER/Video so difficult. Even worse is that it is only
with Sandybridge that we have the ability to insert semaphores onto the
ring to handle inter-ring synchronisation on the GPU; otherwise we will
simply have to wait on retirement when transferring ownership from one
ring to another.  Is it worth the additional complexity to have buffers
reside on multiple rings at the same time? Possibly if we do start mixing
video + GL.  Anyway with the BLT split, handling synchronisation will
become an issue.
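
To make the two options above concrete, here is a minimal sketch in C of
handing a buffer from one ring to another. It is a toy model, not the
i915 code: struct ring, struct buffer, move_to_ring() and
wait_for_retirement() are all hypothetical names, and seqno wraparound is
ignored. With semaphores the wait is queued on the destination ring;
without them the CPU waits until the old ring has retired the request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: hypothetical types and names, not the i915 API. */
struct ring {
    uint32_t last_emitted;  /* seqno of the most recently queued request */
    uint32_t last_retired;  /* seqno the GPU has finished executing */
    uint32_t wait_for;      /* seqno on another ring to wait for, 0 = none */
};

struct buffer {
    struct ring *owner;     /* ring that last used the buffer, or NULL */
    uint32_t seqno;         /* request on 'owner' that last touched it */
};

enum sync_method { SYNC_NONE, SYNC_SEMAPHORE, SYNC_CPU_WAIT };

/* Stand-in for blocking on the CPU until the GPU passes 'seqno'. */
static void wait_for_retirement(struct ring *ring, uint32_t seqno)
{
    if (ring->last_retired < seqno)
        ring->last_retired = seqno;
}

/* Transfer ownership of 'buf' to ring 'to' and report how we synchronised. */
enum sync_method move_to_ring(struct buffer *buf, struct ring *to,
                              bool has_semaphores)
{
    struct ring *from = buf->owner;
    enum sync_method method = SYNC_NONE;

    /* Only synchronise if another ring still has unretired work on buf. */
    if (from && from != to && from->last_retired < buf->seqno) {
        if (has_semaphores) {
            /* GPU-side wait: 'to' stalls itself, the CPU carries on. */
            to->wait_for = buf->seqno;
            method = SYNC_SEMAPHORE;
        } else {
            /* No semaphores: the CPU blocks until 'from' has retired. */
            wait_for_retirement(from, buf->seqno);
            method = SYNC_CPU_WAIT;
        }
    }

    buf->owner = to;
    buf->seqno = ++to->last_emitted;
    return method;
}
```

Note that this model keeps a single owner per buffer, so it does not
address whether a buffer should be allowed to reside on several rings at
once.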

> I'm just looking for fps with my circuit board rendering GL code at the
> moment.. that's why I'm following git HEAD stuff, to see if the drivers
> can unlock some performance in the code I'm writing. I'm struggling to
> profile just what the bottleneck is!

Aye, profiling GPU code at the moment is a hard problem. If you do find
some CPU bottlenecks, they're usually the easiest to fix. What may help is
to sync after every operation and compare the relative times and relative
frequencies to work out the rate-limiting step, then see if you can
break it down further and repeat. (Even if we had a GPU callgrind, given
the disconnect between what is executed on the GPU and GL, it may not be
obvious how to improve the code.) uprof may help here given the
annotations Robert Bragg has made for Mesa profiling.
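
The sync-and-time approach could look roughly like the sketch below. It
is only an illustration: struct profile, prof_record(), prof_run() and
prof_slowest() are made-up names, and the sync callback is assumed to be
a thin wrapper around something like glFinish() in a GL program. Each
operation is charged the wall-clock time until the sync returns, and the
operation with the largest accumulated time is taken as the rate-limiting
step.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_OPS 16

struct op_stat {
    const char *name;
    double total_s;       /* accumulated time, in seconds */
    unsigned long calls;  /* how often the operation ran */
};

struct profile {
    struct op_stat ops[MAX_OPS];
    size_t count;
};

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
static double now_s(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Add one sample for 'name', creating its entry on first use. */
void prof_record(struct profile *p, const char *name, double seconds)
{
    size_t i;

    for (i = 0; i < p->count; i++)
        if (strcmp(p->ops[i].name, name) == 0)
            break;
    if (i == p->count) {
        if (p->count == MAX_OPS)
            return;  /* table full: drop the sample */
        p->ops[p->count++] = (struct op_stat){ name, 0.0, 0 };
    }
    p->ops[i].total_s += seconds;
    p->ops[i].calls++;
}

/* Run one operation, force it to complete, and charge it the elapsed time. */
void prof_run(struct profile *p, const char *name,
              void (*op)(void *), void (*sync)(void), void *ctx)
{
    double start = now_s();

    op(ctx);
    sync();  /* assumed to block until the GPU is idle, e.g. glFinish() */
    prof_record(p, name, now_s() - start);
}

/* The rate-limiting step: the operation with the largest total time. */
const struct op_stat *prof_slowest(const struct profile *p)
{
    const struct op_stat *worst = NULL;

    for (size_t i = 0; i < p->count; i++)
        if (!worst || p->ops[i].total_s > worst->total_s)
            worst = &p->ops[i];
    return worst;
}
```

Dividing total_s by calls gives the mean cost per call, so a cheap
operation that runs very often can be told apart from an expensive one
that runs rarely. Syncing after every operation serialises the CPU and
GPU, so the numbers are only meaningful relative to each other.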

We're always eager to improve our code to get the most out of our
admittedly lacklustre GPUs. Even suggestions on what tools would be
useful, or improvements we could make to profiling/development, are most
welcome.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre