[PATCH 0/2] Lima DRM driver

Wed Feb 13 09:38:24 UTC 2019

On Wed, Feb 13, 2019 at 09:35:30AM +0100, Christian König wrote:
> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
> > On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh at kernel.org> wrote:
> > > On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric at anholt.net> wrote:
> > > > Rob Herring <robh at kernel.org> writes:
> > > > 
> > > > > On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > > > > > On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
> > > > > > > On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel at ffwll.ch> wrote:
> > > > > > > > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > > > > > > > > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > > > > > > > > 
> > > > > > > > > Since the last RFC, all feedback has been addressed. Most Mali DTS
> > > > > > > > > changes are already upstreamed by SoC maintainers. The kernel
> > > > > > > > > driver and user-kernel interface have been quite stable for several
> > > > > > > > > months, so I think it's ready to be upstreamed.
> > > > > > > > > 
> > > > > > > > > This implementation mainly takes the amdgpu DRM driver as reference.
> > > > > > > > > 
> > > > > > > > > - Mali 4xx GPUs have two kinds of processors, GP and PP. GP is for
> > > > > > > > >    OpenGL vertex shader processing and PP is for fragment shader
> > > > > > > > >    processing. Each processor has its own MMU, so processors work in
> > > > > > > > >    a virtual address space.
> > > > > > > > > - There's only one GP but multiple PPs (max 4 for Mali 400 and 8
> > > > > > > > >    for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
> > > > > > > > >    together to handle a single fragment shader task divided by
> > > > > > > > >    FB output tiled pixels. The Mali 400 user space driver is
> > > > > > > > >    responsible for assigning target tiled pixels to each PP, but Mali
> > > > > > > > >    450 has a HW module called DLBU to dynamically balance each
> > > > > > > > >    PP's load.
> > > > > > > > > - The user space driver allocates a buffer object and maps it into GPU
> > > > > > > > >    virtual address space, uploads the command stream and draw data with
> > > > > > > > >    a CPU mmap of the buffer object, then submits a task to GP/PP with
> > > > > > > > >    a register frame indicating where the command stream is and misc
> > > > > > > > >    settings.
> > > > > > > > > - There's no command stream validation/relocation because each user
> > > > > > > > >    process has its own GPU virtual address space. The GP/PP MMU switches
> > > > > > > > >    virtual address space between tasks from different user processes.
> > > > > > > > >    Buggy or malicious user space code just gets an MMU fault or GP/PP
> > > > > > > > >    error IRQ, then the HW/SW is recovered.
> > > > > > > > > - Use TTM as MM. TTM_PL_TT type memory is used as the content of
> > > > > > > > >    a lima buffer object, which is allocated from the TTM page pool. All
> > > > > > > > >    lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> > > > > > > > >    allocation, so there's no buffer eviction or swap for now.
> > > > > > > > All other render gpu drivers that have unified memory (i.e. the GPU is on the SoC)
> > > > > > > > use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
> > > > > > > > (and i915 is kinda the same too really). TTM makes sense if you have some
> > > > > > > > discrete memory to manage, but imo not in any other place really.
> > > > > > > > 
> > > > > > > > What's the design choice behind this?
> > > > > > > To be honest, it's just because TTM offers more helpers. I did implement
> > > > > > > a GEM way with cma alloc at the beginning. But when implementing paged mem,
> > > > > > > I found TTM has mem pool alloc, sync and mmap related helpers which cover
> > > > > > > much of my existing code. It's totally possible with GEM, but not as easy as
> > > > > > > TTM to me. And virtio-gpu seems to be an example of using TTM without discrete
> > > > > > > mem. Shouldn't TTM be a superset of both unified mem and discrete mem?
> > > > > > virtio does have fake vram and migration afaiui. And sure, you can use TTM
> > > > > > without the vram migration, it's just that most of the complexity of TTM
> > > > > > is due to buffer placement and migration and all that stuff. If you never
> > > > > > need to move buffers, then you don't need that ever.
> > > > > > 
> > > > > > Wrt lack of helpers, what exactly are you looking for? A big part of these
> > > > > > for TTM is that TTM is a bit of a midlayer, so reinvents a bunch of things
> > > > > > provided by e.g. dma-api. It's cleaner to use the dma-api directly. Basing
> > > > > > the lima kernel driver on vc4, freedreno or etnaviv (last one is probably
> > > > > > closest, since it doesn't have a display block either) would be better I
> > > > > > think.
> > > > > FWIW, I'm working on the panfrost driver and am using the shmem
> > > > > helpers from Noralf. It's the early stages though. I started a patch
> > > > > for etnaviv to use it too, but found I need to rework it to sub-class
> > > > > the shmem GEM object.
> > > > Did you just convert the shmem helpers over to doing alloc_coherent?  If
> > > > so, I'd be interested in picking them up for v3d, and that might help
> > > > get another patch out of your stack.
> > > I haven't really fully addressed that yet, but yeah, my plan is just
> > > to switch to WC alloc and mappings. I was going to try to make it
> > > configurable, but there is a comment in the ARM dma mapping code which
> > > makes me wonder if tinydrm using streaming DMA for SPI is
> > > fundamentally broken (and maybe CMA is less broken?). Even if it's not
> > > broken, it's not guaranteed to work.
> > > 
> > > /*
> > >   * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
> > >   * that the intention is to allow exporting memory allocated via the
> > >   * coherent DMA APIs through the dma_buf API, which only accepts a
> > >   * scattertable.  This presents a couple of problems:
> > >   * 1. Not all memory allocated via the coherent DMA APIs is backed by
> > >   *    a struct page
> > >   * 2. Passing coherent DMA memory into the streaming APIs is not allowed
> > >   *    as we will try to flush the memory through a different alias to that
> > >   *    actually being used (and the flushes are redundant.)
> > >   */
> > The sg table is only for device access, which avoids both of these
> > issues. That's the idea at least, except all ttm-based drivers don't
> > care, instead they expect a struct page and then use that to build a
> > ttm_bo. And then use all the ttm cpu side access functions, instead of
> > using the dma-buf interfaces (which need to exist to avoid the above
> > issues).
> 
> Actually that is not correct any more. I've fixed this while working on
> directly sharing BOs between amdgpu devices.
> 
> TTM now uses the DMA addresses from the sg table and I actually wanted to
> remove the pages for imported DMA-buf BOs for a while now.

Nice! And yeah it's been a while since I looked at this ... So just a bit
of cleanup work left to do, fundamentals are in place. Shouldn't be too
hard to get rid of the pages, since the dma-buf cpu accessor functions
have been modelled after the ttm_bo interfaces.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch