Moving Etnaviv forward (Was: Re: Etnaviv DRM driver - oops when unloading)

Tue Jun 9 03:47:59 PDT 2015

On Tue, Jun 09, 2015 at 12:13:05PM +0200, Lucas Stach wrote:
> To make this a bit easier for me I would like to ask you to please keep
> the relevant discussions on E-Mail. Following another communication
> channel like IRC just makes things harder for me. I'm prepared to take
> critique on public mail, no need to keep things hidden. ;)

Unfortunately, a lot of discussion does happen on IRC - that's why there
exists a #etnaviv channel.  It's not that it's hidden, it's just that
you're only there very rarely.

> Am Donnerstag, den 28.05.2015, 00:03 +0100 schrieb Russell King - ARM
> Linux:
> > Lucas,
> > 
> > We definitely need to talk about this... please can you take a look at
> > my patch stack which I mentioned in my reply to Alexander.  I know that
> > you already based your patch set off one of my out-dated patch stacks,
> > but unless you grab a more recent one, you're going to be solving a lot
> > of bugs that I've fixed.
>
> Right, thanks for the pointer. I took a look at this and it doesn't seem
> to clash with things I did until now, so it should be pretty easy for me
> to resync your changes into a common tree.

That's good.

> > Also, I've heard via IRC that you've updated my DDX - which is something
> > I've also done (I think I did mention that I would be doing this... and
> > I also have a bunch of work on the Xrender backend.)  I suspect that
> > makes your work there redundant.
>
> I've minimally beat it into shape to work on top of the new kernel
> driver and fixed some XV accel bugs when used with GStreamer.

What bugs are they?  It would be nice to have at least bug fixes sent
in a timely manner.

> I'm also
> looking at accelerating rotated blits to speed up the rotated display
> case. I consider none of this to be stable and tested enough to push it
> out right now. I'll keep watching the things you do the driver and ping
> you if I have anything worthwhile to integrate.

Rotated blits for general pixmaps don't make sense.

The only place where this becomes useful is in Xrender's composite method,
where we have picture transforms to deal with.

However, I forsee problems there, as miComputeCompositeRegion() doesn't
take account of any rotations.  (The source pict->clientClip is in
unrotated coordinates looking at miValidatePicture(), and
miClipPictureSrc() will apply that clip region in an untranslated manner
to the destination.)  Even considering flips, I'm not sure it's valid to
use the region computed by miComputeCompositeRegion() as the computed
region would appear (at least to me) to be wrong.  Maybe I'm
misunderstanding something in the Xrender extension though.

I have some experimental code I'm running on iMX6 which implements the
Xrandr H and V flips, as well as 180° rotation (iow, H and V flip) but
the conclusion I've come to is it will probably be much better (and
more efficient) to use any abilities of the KMS driver to do the final
rotation/flipping rather than having the GPU do this.  Couple this with
my point above, and it's the reason why I haven't put much effort into
that bit.

> > Another issue is that we incur quite a heavy overhead if we allocate an
> > etnadrm buffer, and then map it into userspace.  We end up allocating
> > all pages for the buffer as soon as any page is faulted in (consider if
> > it's a 1080p buffer...) and then setup the scatterlists.  That's useless
> > overhead if we later decide that we're not going to pass it to the GPU
> > (eg, because Xorg allocated a pixmap which it then only performed CPU
> > operations on.)  I have "changes" for this which aren't prepared as
> > proper patches yet (in other words, they're currently as a playground of
> > changes.)
>
> I think this is a userspace problem. Userspace should not allocate any
> GEM objects until it is clear that the buffer is going to be used by the
> GPU. Experience with the Intel DDX shows that a crucial part for good X
> accel performance is to have a proper strategy for migrating pixmaps
> between the CPU and GPU in place, even with UMA.

I don't think so.  There are certain cases where we know that a pixmap is
never going to be supported by the GPU (eg, unsupported format) but there
are a lot of cases where we just don't know.

The problem is that bouncing a 1080p pixmap between the CPU and GPU is
expensive.  If we're the situation where we have multiple different
buffers for a single pixmap (eg, one for the fb code, one for the GPU)
then we'll be forever copying pixels between the two, which just adds to
the overhead.  Yes, I'm aware that i915 does damage tracking to reduce
the amount of copying, but that still doesn't get away from reading from
an uncached pixmap.

(btw, when this code runs on Armada DRM, it's running with fully cached
pixmaps allocated by Armada DRM, it doesn't use the Etnaviv allocator,
and it doesn't suffer from these overheads as a result.)

> > I have other improvements pending which needs proper perf analysis
> > before I can sort them out properly.  These change the flow for a
> > submitted command set - reading all data from userspace before taking
> > dev->struct_mutex, since holding this lock over a page fault isn't
> > particularly graceful.  Do we really want to stall graphics on the
> > entire GPU while some process' page happens to be swapped out to disk?
> > I suspect not.
> 
> As long as those changes don't incur any userspace visible changes it
> should be safe to keep them in the playground until the basics are
> mainline. Nonetheless I would be grateful if you could point me to those
> patches, so I can get a picture of how those changes would look like.

They don't, but the one we do need to solve before we can think about
mainlining anything is the cache status of the command buffers, and
whether we still use bos in userspace for those.  If we're going to
switch to reading via copy_from_user() (which has the advantage that
it'll work on virtual caches - having userspace write via the userspace
alias, and then trying to read from the kernel space alias is fraught
with problems for systems with virtual caches) then there's no point
in having the commands stored in a bo.

-- 
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.