[PATCH 1/2] drm: add cache support for arm64

Wed Aug 7 16:09:53 UTC 2019

On Tue, Aug 6, 2019 at 11:25 PM Christoph Hellwig <hch at lst.de> wrote:
>
> On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote:
> > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch at lst.de> wrote:
> > >
> > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > out to memory later, and so that I don't get a cache hit on
> > > > uncached/wc mmap'ing.
> > >
> > > So what is the use case here?  Allocate pages using the page allocator
> > > (or CMA for that matter), and then mmaping them to userspace and never
> > > touching them again from the kernel?
> >
> > Currently, it is pages coming from tmpfs.  Ideally we want pages that
> > are swappable when unpinned.
>
> tmpfs is basically a (complicated) frontend for alloc pages as far
> as page allocation is concerned.
>
> > CPU mappings are *mostly* just mapping to userspace.  There are a few
> > exceptions that are vmap'd (fbcon, and ringbuffer).
>
> And those use the same backend?

yes

> > (Eventually I'd like to support pages passed in from userspace.. but
> > that is down the road.)
>
> Eww.  Please talk to the iommu list before starting on that.

This is more of a long term goal, we can't do it until we have
per-context/process pagetables, ofc.

Getting a bit off topic, but I'm curious about what problems you are
concerned about.  Userspace can shoot it's own foot, but if it is not
sharing GPU pagetables with other processes, it can't shoot other's
feet.  (I'm guessing you are concerned about non-page-aligned
mappings?)

> > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just
> > > > me, I'm certainly willing to consider proposals or to try things and
> > > > see how they work out.
> > >
> > > This was just my through as the fit seems easy.  But maybe you'll
> > > need to explain your use case(s) a bit more so that we can figure out
> > > what a good high level API is.
> >
> > Tying it to iommu_map/unmap would be awkward, as we could need to
> > setup cpu mmap before it ends up mapped to iommu.  And the plan to
> > support per-process pagetables involved creating an iommu_domain per
> > userspace gl context.. some buffers would end up mapped into multiple
> > contexts/iommu_domains.
> >
> > If the cache operation was detached from iommu_map/unmap, then it
> > would seem weird to be part of the iommu API.
> >
> > I guess I'm not entirely sure what you had in mind, but this is why
> > iommu seemed to me like a bad fit.
>
> So back to the question, I'd like to understand your use case (and
> maybe hear from the other drm folks if that is common):
>
>  - you allocate pages from shmem (why shmem, btw?  if this is done by
>    other drm drivers how do they guarantee addressability without an
>    iommu?)

shmem for swappable pages.  I don't unpin and let things get swapped
out yet, but I'm told it starts to become important when you have 50
browser tabs open ;-)

There are some display-only drm drivers with no IOMMU, which use CMA
rather than shmem.  Basically every real GPU has some form of MMU or
IOMMU for memory protection.  (The couple exceptions do expensive
kernel side cmdstream validation.)

>  - then the memory is either mapped to userspace or vmapped (or even
>    both, althrough the lack of aliasing you mentioned would speak
>    against it) as writecombine (aka arm v6+ normal uncached).  Does
>    the mapping live on until the memory is freed?

(side note, *most* of the drm/msm supported devices are armv8, the
exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm
will ever have to deal w/ armv6)

Userspace keeps the userspace mmap around opportunistically (once it
is mmap'd, not all buffers will be accessed from CPU).  In fact there
is a userspace buffer cache, where we try to re-use buffers that are
already allocated and possibly mmap'd, since allocation and setting up
mmap is expensive.

(There is an MADVISE ioctl so userspace can tell kernel about buffers
in the cache, which are available to be purged by shrinker..  if a
buffer is purged, it's userspace mmap is torn down... along with it's
IOMMU map, ofc)

>  - as you mention swapping - how do you guarantee there are no
>    aliases in the kernel direct mapping after the page has been swapped
>    in?

Before unpinning and allowing pages to be swapped out, CPU and IOMMU
maps would be torn down.  (I don't think we could unpin buffers w/ a
vmap, but those are just a drop in the bucket.)

Currently, the kernel always knows buffers associated w/ a submit
(job) queued to the GPU, so it could bring pages back in and re-store
iommu map.. the fault handler can be used to bring things back in for
CPU access.  (For more free-form HMM style access to userspace memory,
we'd need to be able to sleep in IOMMU fault handler before the IOMMU
resumes translation.)

As far as cache is concerned, it would be basically the same as newly
allocated pages, ie. need to wb+inv the new pages.

>  - then the memory is potentially mapped to the iommu.  Is it using
>    a long-living mapping, or does get unmapped/remapped repeatedly?

Similar to CPU maps, we keep the IOMMU map around as long as possible.

(Side note, I was thinking batching unmaps could be useful to reduce
IOMMU TLB inv overhead.. usually when we are freeing a buffer, we are
freeing multiple, so we could unmap them all, then TLB inv, then free
pages.)

BR,
-R