[PATCH 0/2] Lima DRM driver

Christian König ckoenig.leichtzumerken at gmail.com
Tue Feb 26 16:23:11 UTC 2019


Am 26.02.19 um 16:58 schrieb Daniel Vetter:
> On Wed, Feb 13, 2019 at 9:35 AM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Am 13.02.19 um 08:59 schrieb Daniel Vetter:
>>> On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh at kernel.org> wrote:
>>>> On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric at anholt.net> wrote:
>>>>> Rob Herring <robh at kernel.org> writes:
>>>>>
>>>>>> On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>>>>>>> On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
>>>>>>>> On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>>>>>>>>> On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
>>>>>>>>>> Kernel DRM driver for ARM Mali 400/450 GPUs.
>>>>>>>>>>
>>>>>>>>>> Since the last RFC, all feedback has been addressed. Most Mali DTS
>>>>>>>>>> changes have already been upstreamed by the SoC maintainers. The
>>>>>>>>>> kernel driver and user-kernel interface have been quite stable for
>>>>>>>>>> several months, so I think it's ready to be upstreamed.
>>>>>>>>>>
>>>>>>>>>> This implementation mainly takes the amdgpu DRM driver as a reference.
>>>>>>>>>>
>>>>>>>>>> - Mali 4xx GPUs have two kinds of processors, GP and PP. GP is for
>>>>>>>>>>     OpenGL vertex shader processing and PP is for fragment shader
>>>>>>>>>>     processing. Each processor has its own MMU, so processors work
>>>>>>>>>>     in virtual address spaces.
>>>>>>>>>> - There's only one GP but multiple PPs (max 4 for Mali 400 and 8
>>>>>>>>>>     for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
>>>>>>>>>>     together to handle a single fragment shader task, divided
>>>>>>>>>>     across the tiled pixels of the FB output. The Mali 400 user
>>>>>>>>>>     space driver is responsible for assigning target tiled pixels
>>>>>>>>>>     to each PP, but Mali 450 has a HW module called DLBU to
>>>>>>>>>>     dynamically balance each PP's load.
>>>>>>>>>> - The user space driver allocates buffer objects and maps them into
>>>>>>>>>>     the GPU virtual address space, uploads the command stream and
>>>>>>>>>>     draw data through a CPU mmap of the buffer object, then submits
>>>>>>>>>>     the task to GP/PP with a register frame indicating where the
>>>>>>>>>>     command stream is, plus misc settings.
>>>>>>>>>> - There's no command stream validation/relocation because each user
>>>>>>>>>>     process has its own GPU virtual address space. The GP/PP MMUs
>>>>>>>>>>     switch virtual address spaces before running tasks from
>>>>>>>>>>     different user processes. Buggy or malicious user space code
>>>>>>>>>>     just gets an MMU fault or a GP/PP error IRQ, after which the
>>>>>>>>>>     HW/SW is recovered.
>>>>>>>>>> - TTM is used as the MM. TTM_PL_TT type memory backs lima buffer
>>>>>>>>>>     objects and is allocated from the TTM page pool. All lima
>>>>>>>>>>     buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
>>>>>>>>>>     allocation time, so there's no buffer eviction or swapping for
>>>>>>>>>>     now (rough placement sketch below).
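>>>>>>>>>>
>>>>>>>>>> As a rough illustration of that last point, the pinned TT placement
>>>>>>>>>> boils down to something like the sketch below (made-up names, not the
>>>>>>>>>> actual lima code; a single TTM_PL_TT placement carrying
>>>>>>>>>> TTM_PL_FLAG_NO_EVICT so the BO stays pinned from allocation onwards):
>>>>>>>>>>
>>>>>>>>>>     static const struct ttm_place pinned_tt_place = {
>>>>>>>>>>         .fpfn  = 0,
>>>>>>>>>>         .lpfn  = 0,
>>>>>>>>>>         /* system (TT) pages, cached, never evicted once allocated */
>>>>>>>>>>         .flags = TTM_PL_FLAG_TT | TTM_PL_FLAG_CACHED |
>>>>>>>>>>                  TTM_PL_FLAG_NO_EVICT,
>>>>>>>>>>     };
>>>>>>>>>>
>>>>>>>>>>     static const struct ttm_placement pinned_tt_placement = {
>>>>>>>>>>         .num_placement      = 1,
>>>>>>>>>>         .placement          = &pinned_tt_place,
>>>>>>>>>>         .num_busy_placement = 1,
>>>>>>>>>>         .busy_placement     = &pinned_tt_place,
>>>>>>>>>>     };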
>>>>>>>>> All other render gpu drivers that have unified memory (aka the memory is
>>>>>>>>> on the SoC) use GEM directly, with some of the helpers we have. So msm,
>>>>>>>>> etnaviv, vc4 (and i915 is kinda the same too really). TTM makes sense if
>>>>>>>>> you have some discrete memory to manage, but imo not anywhere else really.
>>>>>>>>>
>>>>>>>>> What's the design choice behind this?
>>>>>>>> To be honest, it's just because TTM offers more helpers. I did implement
>>>>>>>> a GEM way with cma alloc at the beginning. But when implementing paged
>>>>>>>> mem, I found TTM has mem pool alloc, sync and mmap related helpers which
>>>>>>>> cover much of my existing code. It's totally possible with GEM, just not
>>>>>>>> as easy as TTM for me. And virtio-gpu seems to be an example of using TTM
>>>>>>>> without discrete mem. Shouldn't TTM be a superset of both unified mem and
>>>>>>>> discrete mem?
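>>>>>>>>
>>>>>>>> For context, the GEM-with-CMA path mentioned above looks roughly like
>>>>>>>> this (a sketch of the generic CMA helper usage only, not the actual
>>>>>>>> earlier lima code):
>>>>>>>>
>>>>>>>>     struct drm_gem_cma_object *bo;
>>>>>>>>
>>>>>>>>     /* contiguous, device-visible allocation via the CMA GEM helper */
>>>>>>>>     bo = drm_gem_cma_create(drm, size);
>>>>>>>>     if (IS_ERR(bo))
>>>>>>>>         return PTR_ERR(bo);
>>>>>>>>
>>>>>>>>     /* bo->vaddr is the kernel mapping, bo->paddr the device address */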
>>>>>>> virtio does have fake vram and migration afaiui. And sure, you can use TTM
>>>>>>> without the vram migration, it's just that most of the complexity of TTM
>>>>>>> is due to buffer placement and migration and all that stuff. If you never
>>>>>>> need to move buffers, then you don't need that ever.
>>>>>>>
>>>>>>> Wrt lack of helpers, what exactly are you looking for? A big part of these
>>>>>>> for TTM is that TTM is a bit of a midlayer, so it reinvents a bunch of
>>>>>>> things provided by e.g. the dma-api. It's cleaner to use the dma-api
>>>>>>> directly. Basing the lima kernel driver on vc4, freedreno or etnaviv (the
>>>>>>> last one is probably closest, since it doesn't have a display block
>>>>>>> either) would be better I think.
>>>>>> FWIW, I'm working on the panfrost driver and am using the shmem
>>>>>> helpers from Noralf. It's the early stages though. I started a patch
>>>>>> for etnaviv to use it too, but found I need to rework it to sub-class
>>>>>> the shmem GEM object.
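>>>>>>
>>>>>> Sub-classing would look something like the sketch below (assuming the
>>>>>> drm_gem_shmem_object layout from Noralf's proposed helpers, with an
>>>>>> embedded drm_gem_object base; the driver struct and field names here
>>>>>> are only illustrative):
>>>>>>
>>>>>>     struct panfrost_gem_object {
>>>>>>         struct drm_gem_shmem_object base;
>>>>>>         /* driver-private state, e.g. GPU VA / MMU mapping info */
>>>>>>         u64 gpu_va;
>>>>>>     };
>>>>>>
>>>>>>     static inline struct panfrost_gem_object *
>>>>>>     to_panfrost_bo(struct drm_gem_object *obj)
>>>>>>     {
>>>>>>         return container_of(to_drm_gem_shmem_obj(obj),
>>>>>>                             struct panfrost_gem_object, base);
>>>>>>     }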
>>>>> Did you just convert the shmem helpers over to doing alloc_coherent?  If
>>>>> so, I'd be interested in picking them up for v3d, and that might help
>>>>> get another patch out of your stack.
>>>> I haven't really fully addressed that yet, but yeah, my plan is just
>>>> to switch to WC alloc and mappings. I was going to try to make it
>>>> configurable, but there is a comment in the ARM dma mapping code which
>>>> makes me wonder if tinydrm using streaming DMA for SPI is
>>>> fundamentally broken (and maybe CMA is less broken?). If it's not
>>>> broken, it's at least not guaranteed to work.
>>>>
>>>> /*
>>>>    * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
>>>>    * that the intention is to allow exporting memory allocated via the
>>>>    * coherent DMA APIs through the dma_buf API, which only accepts a
>>>>    * scattertable.  This presents a couple of problems:
>>>>    * 1. Not all memory allocated via the coherent DMA APIs is backed by
>>>>    *    a struct page
>>>>    * 2. Passing coherent DMA memory into the streaming APIs is not allowed
>>>>    *    as we will try to flush the memory through a different alias to that
>>>>    *    actually being used (and the flushes are redundant.)
>>>>    */
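>>>>
>>>> On the kernel side, the WC allocation itself is just the generic DMA API
>>>> wrapper, roughly (a sketch only, not implying this is how the shmem
>>>> helpers would end up doing it):
>>>>
>>>>     dma_addr_t dma_addr;
>>>>     void *vaddr;
>>>>
>>>>     /* write-combined, device-visible backing instead of cached pages */
>>>>     vaddr = dma_alloc_wc(dev, size, &dma_addr, GFP_KERNEL);
>>>>     if (!vaddr)
>>>>         return -ENOMEM;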
>>> The sg table is only for device access, which avoids both of these
>>> issues. That's the idea at least, except that ttm-based drivers don't
>>> care: instead they expect a struct page and then use that to build a
>>> ttm_bo. And then they use all the ttm cpu side access functions instead
>>> of the dma-buf interfaces (which need to exist to avoid the above
>>> issues).
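>>>
>>> Concretely, "only for device access" means an importer walks the table
>>> for DMA addresses and never dereferences the pages, along these lines
>>> (a generic sketch; gpu_mmu_map() is a made-up stand-in for whatever
>>> programs the GPU MMU, and sgt is an already dma-mapped sg_table):
>>>
>>>     struct scatterlist *sg;
>>>     unsigned int i;
>>>
>>>     for_each_sg(sgt->sgl, sg, sgt->nents, i) {
>>>         dma_addr_t addr = sg_dma_address(sg);
>>>         unsigned int len = sg_dma_len(sg);
>>>
>>>         /* program addr/len into the GPU MMU; never touch sg_page(sg) */
>>>         gpu_mmu_map(mmu, addr, len);
>>>     }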
>> Actually that is not correct any more. I've fixed this while working on
>> directly sharing BOs between amdgpu devices.
>>
>> TTM now uses the DMA addresses from the sg table, and I've actually wanted
>> to remove the pages for imported DMA-buf BOs for a while now.
> Finally gotten around to reading the ttm code to update my understanding,
> and I think I realized why I never noticed this changed:
> TTM_PAGE_FLAG_SG and related code seems to be the fancy new code you
> added to go sg-table native in ttm, and from a quick look it's rolled out
> everywhere. But drm_prime_sg_to_page_addr_arrays is still called. Is
> that the missing cleanup you're referring to?

Yes, exactly. The last thing I pushed upstream was making the pages 
optional in drm_prime_sg_to_page_addr_arrays.

I just never got around to actually not filling ttm->pages any more, but in
theory it should be possible to just comment that out and be happy about it.
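
With the pages array now optional, an importer that only needs device
addresses can call it roughly like this (a sketch; allocation of the
address array and error handling omitted):

    /* fill only the DMA address array, no struct page array needed */
    ret = drm_prime_sg_to_page_addr_arrays(sgt, NULL, dma_addrs, npages);
    if (ret)
        return ret;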

Christian.

>   Would be nice if we
> could nuke it to stop the copypasta from spreading (and spread it does
> seem to :-/). Maybe as a todo.rst entry?
>
> Cheers, Daniel
>
>> Regards,
>> Christian.
>>
>>> So unless you want to fix ttm dma-buf import (which is going to be
>>> a pile of work), add this to the list of reasons why ttm is probably not
>>> the best choice for something mostly running on ARM SoCs. x86 gets away
>>> with it because dma is easy on x86.
>>> -Daniel
>>>
>>>>> I'm particularly interested in the shmem helpers because I should start
>>>>> doing dynamic binding in and out of the GPU's page table, to avoid
>>>>> pinning so much memory all the time.
>>>> I'll try to post something in the next couple of days.
>>>>
>>>> Rob
>>>
>


