Discussion starters for ION session at Linux Plumbers Android+Graphics microconf

Fri Sep 6 02:16:57 PDT 2013

On Thu, Sep 05, 2013 at 10:06:52PM -0700, John Stultz wrote:
> On 09/05/2013 08:26 PM, Rob Clark wrote:
> > On Thu, Sep 5, 2013 at 8:49 PM, John Stultz <john.stultz at linaro.org> wrote:
> >> Hey everyone,
> >>    In preparation for the Plumbers Android+Graphics microconf, I wanted to
> >> send out some background documentation to try to get all the context we can
> >> out there prior to the discussion, as time will be limited and it would be
> >> best to spend it discussing solutions rather than re-hashing problems and
> >> requirements.
> >>
> >> I'm sure many folks on this list could probably do a better job summarizing
> >> the issues, but I wanted to get this out there to try to enumerate the
> >> problems and the different perspectives on the issues that I'm aware of.
> >>
> >> The document is on LWN here:
> >> http://lwn.net/SubscriberLink/565469/9d88daa2282ef6c2/
> > oh, I had missed that article.. fwiw
> 
> It was published just moments before I sent out this thread, so I
> wouldn't have expected anyone to have seen it yet. :)
> 
> 
> > "Another possible solution is to allow dma-buf exporters to not
> > allocate the backing buffers immediately. This would allow multiple
> > drivers to attach to a dma-buf before the allocation occurs. Then,
> > when the buffer is first used, the allocation is done; at that time,
> > the allocator could scan the list of attached drivers and be able to
> > determine the constraints of the attached devices and allocate memory
> > accordingly. This would allow user space to not have to deal with any
> > constraint solving. "
> >
> > That is actually how dma-buf works today.  And at least with GEM
> > buffers exported as dma-buf's, the allocation is deferred.  It does
> > require attaching the buffers in all the devices that will be sharing
> > the buffer up front (but I suppose you need to know the involved
> > devices one way or another with any solution, so this approach seems
> > as good as any).  We *do* still need to spiff up dev->dma_parms a bit
> > more, and there might be some room for some helpers to figure out the
> > union of all attached devices' constraints, and allocate suitable
> > backing pages... so perhaps this is one thing we should be talking
> > about.
> 
> Ok. I had gone looking for an example of the deferred allocation, but
> didn't find it.  I'll go look again, but if you have a pointer, that
> could be useful.
> 
> So yea, I do think this is the most promising approach, but sorting out
> the next steps for doing a proof of concept is one thing I'd like to
> discuss (as mentioned in the article, via a ion-like generic allocator,
> or trying to wire in the constraint solving to some limited set of
> drivers via generic helper functions). As well as getting a better
> understanding of the Android developers' concern about any non-deterministic
> cost of allocating at mmap time.
> 
> 
> Thanks for the feedback and thoughts! I'm hopeful some approach to
> resolving the various issues can be found, but I suspect it will have a
> few different parts.

My main gripe with ION is that it creates a parallel infrastructure for
figuring out allocation constraints of devices. Upstream already has all
the knowledge (or at least most of it) for cache flushing, mapping into
iommus and allocating from special pools stored in association with the
device structure. So imo an upstream ION thing should reuse the
information each device and its driver already has available.

Now I also see that a central allocator has upsides since reinventing this
wheel for every device driver is not a great idea. One idea to get there
and keep the benefits of ION with up-front allocations would be:
1) Allocate the dma-buf handle at the central allocator. No backing
storage gets allocated.
2) Import that dma-buf everywhere you want it to be used. That way
userspace doesn't need to deal with whatever hw madness is actually used
to implement the drm/v4l/whatever device nodes internally.
3) Ask the central allocator to finalize the buffer allocation placement
and grab backing storage.

If any further attaching happens after that and doesn't work out, it would
simply fail, and userspace gets to keep the pieces. Which is already the case
in today's upstream when userspace is unlucky and doesn't pick the most
constrained device.
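
To make the three steps above a bit more concrete, here is a toy userspace
model in Python. It is only a sketch, not kernel code: the Device, Placement
and Buffer names are made up for illustration, and the two constraints
(address width, contiguity) are assumptions standing in for whatever
dev->dma_parms and friends would really describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Device:
    name: str
    addr_bits: int          # widest DMA address the device can reach (assumed constraint)
    needs_contiguous: bool  # True if it cannot do scatter-gather (assumed constraint)


@dataclass
class Placement:
    addr_bits: int
    contiguous: bool

    def suits(self, dev: Device) -> bool:
        # A placement works for a device if the memory is reachable and
        # contiguous whenever the device requires contiguous memory.
        return (self.addr_bits <= dev.addr_bits
                and (self.contiguous or not dev.needs_contiguous))


@dataclass
class Buffer:
    size: int
    attached: List[Device] = field(default_factory=list)
    # Step 1: only the handle exists, no backing storage yet.
    placement: Optional[Placement] = None

    def attach(self, dev: Device) -> None:
        # Step 2: before finalize() this only records the device.
        # After finalize() it fails if the chosen placement does not fit.
        if self.placement is not None and not self.placement.suits(dev):
            raise ValueError(f"{dev.name} cannot use the finalized placement")
        self.attached.append(dev)

    def finalize(self) -> Placement:
        # Step 3: pick a placement that satisfies every attached device.
        if self.placement is None:
            self.placement = Placement(
                addr_bits=min((d.addr_bits for d in self.attached), default=64),
                contiguous=any(d.needs_contiguous for d in self.attached),
            )
        return self.placement
```

Attaching a 40-bit scatter-gather "gpu" and a 32-bit contiguous-only
"display" before finalize() yields a 32-bit contiguous placement. Finalizing
with only the gpu attached and then attaching the display raises, which is the
"userspace gets to keep the pieces" case.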

This only tackles the "make memory allocation predictable" issue ION
solves, which leaves the optimized cache flushing. We can add caches for
pre-flushed objects for that (not rocket science, most of the drm drivers
have that wheel reinvented, too). That leaves us with optimizing cache
flushes (i.e. leaving them out when switching between devices without cpu
access in-between). The current linux dma api doesn't really support
this, so we need to add a few interfaces there to be able to do
device-to-device cache flushing (which, save for maybe iommu flushes, I
expect to be noops). And the central allocator obviously needs to keep
track of where the current cache domain is.
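
The cache domain tracking could look roughly like the following toy model
(Python, userspace, purely illustrative). The TrackedBuffer class and the
"flush"/"invalidate" strings are hypothetical names, and the assumption that a
fresh buffer starts in the cpu domain is mine. It only shows the bookkeeping:
cache maintenance happens when ownership crosses the cpu/device boundary and
is skipped for device-to-device handoffs.

```python
CPU = "cpu"


class TrackedBuffer:
    def __init__(self):
        # Assumption: freshly allocated pages start out in the cpu domain.
        self.owner = CPU
        # Log of cache maintenance operations the allocator would issue.
        self.cache_ops = []

    def hand_to(self, new_owner: str) -> None:
        was_cpu = self.owner == CPU
        now_cpu = new_owner == CPU
        if was_cpu and not now_cpu:
            # cpu -> device: write dirty cpu cache lines back to memory.
            self.cache_ops.append("flush")
        elif now_cpu and not was_cpu:
            # device -> cpu: drop stale cpu cache lines before reading.
            self.cache_ops.append("invalidate")
        # device -> device (or cpu -> cpu): no cpu cache maintenance.
        # Any iommu flushing would be a separate concern, not modelled here.
        self.owner = new_owner
```

Passing a buffer cpu -> camera -> gpu -> display -> cpu records one flush and
one invalidate in this model, instead of one pair per device hop.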

Aside: Intel Atom SoCs have the same cache flushing challenges since all
the gfx blocks (gpu, display, camera, ...) prefer direct main memory
access that bypasses cpu caches. Big core stuff is obviously different and
fully coherent. So we need a solution for this, too, but unfortunately the
camera driver guys haven't yet managed to upstream their driver, so it's not
possible for us to demonstrate anything on upstream :( Same story as
everywhere else in SoC-land I guess ...

Now one thing I've missed from your article on the GEM vs. ION topic is
that gem allows buffers to be swapped out. That works by allocating
shmemfs nodes, but that doesn't really work together nicely with the
current linux dma apis. Which means that drivers have a bunch of hacks to
work around this (and ttm has an entire page cache as a 2nd allocation
step to get at the right dma api allocated pages).

There's been the occasional talk about a gemfs to rectify these allocation
issues. If we'd merge this with the central allocator and optionally allow
it to swap out/move backing storage pages (and also back them with a fs
node ofc) then we could rip out a bit of code from drm drivers. I also think
that this way would be the only approach to actually make PRIME work
together with IOMMUs. There are some really old patches from Chris Wilson to
teach i915-gem to directly manage the backing storage swapping, so
patching this into the central allocator shouldn't be too nefarious.

So that's my rough sketch of the brave new world I have in mind. Please
poke holes ;-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch