Moving Etnaviv forward (Was: Re: Etnaviv DRM driver - oops when unloading)

Lucas Stach l.stach at pengutronix.de
Tue Jun 9 03:13:05 PDT 2015


Hi Russell, et al.

First let me say I'm sorry that I have been less responsive than some of
you would have hoped. I'm trying to get better at that, but juggling a
large pile of components, none of which are currently mainline, isn't
an easy task, especially with downstream priorities not quite lining up
with the mainlining effort. I'm trying to put better time management in
place, with slots reserved for the mainline work.

To make this a bit easier for me, I would like to ask you to please keep
the relevant discussions on email. Following another communication
channel like IRC just makes things harder for me. I'm prepared to take
criticism on the public list, no need to keep things hidden. ;)

On Thursday, 28.05.2015, at 00:03 +0100, Russell King - ARM Linux wrote:
> On Wed, May 27, 2015 at 02:49:17PM +0200, Lucas Stach wrote:
> > Hi Alexander,
> > 
> > On Wednesday, 27.05.2015, at 14:45 +0200, Alexander Holler wrote:
> > > Hello,
> > > 
> > > I've just built and booted the Etnaviv driver as a module with kernel 4.0.4.
> > > 
> > > When I unload the driver with rmmod an oops happens:
> > > 
> > Thanks for the report.
> > 
> > I'm currently working on the patchstack to get it into shape for another
> > submission. I'll take a look at this problem.
> 
> Lucas,
> 
> We definitely need to talk about this... please can you take a look at
> my patch stack which I mentioned in my reply to Alexander.  I know that
> you already based your patch set off one of my out-dated patch stacks,
> but unless you grab a more recent one, you're going to be solving a lot
> of bugs that I've fixed.
> 
Right, thanks for the pointer. I took a look at this and it doesn't seem
to clash with the things I have done so far, so it should be pretty easy
for me to resync your changes into a common tree.

> Also, I've heard via IRC that you've updated my DDX - which is something
> I've also done (I think I did mention that I would be doing this... and
> I also have a bunch of work on the Xrender backend.)  I suspect that
> makes your work there redundant.
> 
I've minimally beaten it into shape to work on top of the new kernel
driver and fixed some XV accel bugs when it is used with GStreamer. I'm
also looking at accelerating rotated blits to speed up the rotated
display case. None of this is stable and tested enough to push out right
now. I'll keep watching the things you do to the driver and ping you if
I have anything worthwhile to integrate.

> There are at least two more issues that need to be discussed, concerning
> the command buffers and their DMA-coherent nature, which makes the
> kernel command parser expensive.  In my perf measurements, it's right at
> the top of the list as the most expensive function.
> 
> I've made some improvements to the parser which reduce its perf
> figure a little, but it's still up there as the number one hot
> function, in spite of the code being as tight as the compiler can
> manage.
> 
> This is probably because we're reading from uncached memory: the CPU
> can't speculatively prefetch into the cache any of the data from memory.
> 
Yes, I've already seen this coming up in the traces.

I was thinking about exposing cached memory to userspace for the command
streams and then copying the commands into a write-combined buffer while
validating them. This isn't far from your idea and I think it should let
us optimize the CPU access for validation and buffer reloc patching,
while avoiding the need to do additional cache synchronization and
plugging the submit-then-change attack vector.
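
Very roughly, I'm thinking of something along these lines for the submit
path (a sketch only, all names invented for the example, not actual
etnaviv code): the parser reads from the cached copy, and only validated
dwords ever end up in the buffer the GPU fetches from.

#include <linux/types.h>
#include <linux/errno.h>

struct etnaviv_gpu;	/* stand-in for the real GPU context */

/* hypothetical per-dword check done by the command parser */
bool etnaviv_cmd_is_valid(struct etnaviv_gpu *gpu, u32 cmd);

static int etnaviv_copy_and_validate(struct etnaviv_gpu *gpu,
				     const u32 *cached_cmds, /* CPU-cached */
				     u32 *wc_cmds,	     /* write-combined */
				     unsigned int num_dwords)
{
	unsigned int i;

	for (i = 0; i < num_dwords; i++) {
		u32 cmd = cached_cmds[i];	/* cacheable read, can prefetch */

		if (!etnaviv_cmd_is_valid(gpu, cmd))
			return -EINVAL;

		wc_cmds[i] = cmd;	/* one WC write per validated dword */
	}

	return 0;
}

That way the expensive reads hit cacheable memory, and userspace never
has a mapping of the buffer the GPU actually executes from.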

> It's _probably_ (I haven't benchmarked it) going to be faster to
> copy_from_user() the command buffers into the kernel, either directly
> into DMA coherent memory, or into cacheable memory, and then run them
> through the command parser, followed by either a copy to DMA coherent
> memory or using the DMA streaming API to push the cache lines out.
> 
> This has other advantages: by not directly exposing the memory which the
> GPU executes its command stream into userspace, we prevent submit-then-
> change "attacks" bypassing the kernel command parser.
> 
> Another issue is that we incur quite a heavy overhead if we allocate an
> etnadrm buffer, and then map it into userspace.  We end up allocating
> all pages for the buffer as soon as any page is faulted in (consider if
> it's a 1080p buffer...) and then setup the scatterlists.  That's useless
> overhead if we later decide that we're not going to pass it to the GPU
> (eg, because Xorg allocated a pixmap which it then only performed CPU
> operations on.)  I have "changes" for this which aren't prepared as
> proper patches yet (in other words, they're currently a playground of
> changes.)
> 
I think this is a userspace problem. Userspace should not allocate any
GEM objects until it is clear that the buffer is going to be used by the
GPU. Experience with the Intel DDX shows that a crucial part of good X
accel performance is having a proper strategy in place for migrating
pixmaps between the CPU and GPU, even with UMA.
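
Roughly what I have in mind for the DDX (again just a sketch, all names
made up, not actual code): a pixmap starts out as plain malloc()ed
memory, and a GEM object only comes into existence the first time the
GPU actually needs to touch it.

#include <stdlib.h>
#include <string.h>

/* hypothetical GEM wrappers, stand-ins for whatever the DDX ends up using */
struct etna_bo;
struct etna_bo *etna_bo_new(size_t size);
void *etna_bo_map(struct etna_bo *bo);

struct etna_pixmap {
	void *cpu_pixels;	/* plain malloc()ed storage, always valid */
	struct etna_bo *bo;	/* GEM object, NULL until first GPU use */
	int width, height, pitch;
};

/* called from the accel paths only, right before a GPU operation */
static struct etna_bo *etna_pixmap_get_bo(struct etna_pixmap *pix)
{
	if (!pix->bo) {
		pix->bo = etna_bo_new((size_t)pix->pitch * pix->height);
		if (pix->bo)
			memcpy(etna_bo_map(pix->bo), pix->cpu_pixels,
			       (size_t)pix->pitch * pix->height);
	}
	return pix->bo;
}

A pixmap that only ever sees CPU rendering then never pays the cost of
page allocation and scatterlist setup in the kernel.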

> I have other improvements pending which need proper perf analysis
> before I can sort them out properly.  These change the flow for a
> submitted command set - reading all data from userspace before taking
> dev->struct_mutex, since holding this lock over a page fault isn't
> particularly graceful.  Do we really want to stall graphics on the
> entire GPU while some process' page happens to be swapped out to disk?
> I suspect not.
> 

As long as those changes aren't visible to userspace, it should be safe
to keep them in the playground until the basics are mainline.
Nonetheless I would be grateful if you could point me to those patches,
so I can get a picture of what those changes would look like.
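
For reference, the flow I understand you to be describing looks roughly
like this (simplified sketch, etnaviv_do_submit() is a made-up
placeholder): all userspace data is pulled in first, so any page fault
happens before dev->struct_mutex is taken.

#include <linux/slab.h>
#include <linux/uaccess.h>
#include <drm/drmP.h>

int etnaviv_do_submit(struct drm_device *dev, void *data);	/* hypothetical */

static int etnaviv_submit(struct drm_device *dev,
			  void __user *user_data, size_t size)
{
	void *data;
	int ret;

	data = kmalloc(size, GFP_KERNEL);
	if (!data)
		return -ENOMEM;

	/* may fault and sleep, so do it before taking the lock */
	if (copy_from_user(data, user_data, size)) {
		kfree(data);
		return -EFAULT;
	}

	mutex_lock(&dev->struct_mutex);
	ret = etnaviv_do_submit(dev, data);
	mutex_unlock(&dev->struct_mutex);

	kfree(data);
	return ret;
}

That keeps the GPU from stalling behind a swapped-out page of an
unrelated process, as you describe above.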

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |
