[PATCH 0/2] Lima DRM driver

Wed Feb 13 08:35:30 UTC 2019

On 13.02.19 at 08:59, Daniel Vetter wrote:
> On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh at kernel.org> wrote:
>> On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric at anholt.net> wrote:
>>> Rob Herring <robh at kernel.org> writes:
>>>
>>>> On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel at ffwll.ch> wrote:
>>>>> On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
>>>>>> On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel at ffwll.ch> wrote:
>>>>>>> On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
>>>>>>>> Kernel DRM driver for ARM Mali 400/450 GPUs.
>>>>>>>>
>>>>>>>> Since last RFC, all feedback has been addressed. Most Mali DTS
>>>>>>>> changes are already upstreamed by SoC maintainers. The kernel
>>>>>>>> driver and user-kernel interface have been quite stable for several
>>>>>>>> months, so I think it's ready to be upstreamed.
>>>>>>>>
>>>>>>>> This implementation mainly takes the amdgpu DRM driver as reference.
>>>>>>>>
>>>>>>>> - Mali 4xx GPUs have two kinds of processors GP and PP. GP is for
>>>>>>>>    OpenGL vertex shader processing and PP is for fragment shader
>>>>>>>>    processing. Each processor has its own MMU so processors work in
>>>>>>>>    virtual address space.
>>>>>>>> - There's only one GP but multiple PP (max 4 for mali 400 and 8
>>>>>>>>    for mali 450) in the same mali 4xx GPU. All PPs are grouped
>>>>>>>>    together to handle a single fragment shader task divided by
>>>>>>>>    FB output tiled pixels. Mali 400 user space driver is
>>>>>>>>    responsible for assigning target tiled pixels to each PP, but mali
>>>>>>>>    450 has a HW module called DLBU to dynamically balance each
>>>>>>>>    PP's load.
>>>>>>>> - User space driver allocates buffer objects and maps them into GPU
>>>>>>>>    virtual address space, uploads command stream and draw data with
>>>>>>>>    CPU mmap of the buffer object, then submits the task to GP/PP with
>>>>>>>>    a register frame indicating where the command stream is and misc
>>>>>>>>    settings.
>>>>>>>> - There's no command stream validation/relocation because each user
>>>>>>>>    process has its own GPU virtual address space. GP/PP's MMU switches
>>>>>>>>    virtual address space between two tasks from different
>>>>>>>>    user processes. Erroneous or evil user space code just gets an MMU
>>>>>>>>    fault or GP/PP error IRQ, then the HW/SW will be recovered.
>>>>>>>> - Use TTM as MM. TTM_PL_TT type memory is used as the content of
>>>>>>>>    lima buffer objects, which are allocated from the TTM page pool. All
>>>>>>>>    lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
>>>>>>>>    allocation, so there's no buffer eviction and swap for now.
>>>>>>> All other render gpu drivers that have unified memory (aka is on the SoC)
>>>>>>> use GEM directly, with some of the helpers we have. So msm, etnaviv, vc4
>>>>>>> (and i915 is kinda the same too really). TTM makes sense if you have some
>>>>>>> discrete memory to manage, but imo not in any other place really.
>>>>>>>
>>>>>>> What's the design choice behind this?
>>>>>> To be honest, it's just because TTM offers more helpers. I did implement
>>>>>> a GEM way with cma alloc at the beginning. But when implementing paged mem,
>>>>>> I found TTM has mem pool alloc, sync and mmap related helpers which cover
>>>>>> much of my existing code. It's totally possible with GEM, but not as easy as
>>>>>> TTM to me. And virtio-gpu seems an example of using TTM without discrete
>>>>>> mem. Shouldn't TTM be a superset of both unified mem and discrete mem?
>>>>> virtio does have fake vram and migration afaiui. And sure, you can use TTM
>>>>> without the vram migration, it's just that most of the complexity of TTM
>>>>> is due to buffer placement and migration and all that stuff. If you never
>>>>> need to move buffers, then you don't need that ever.
>>>>>
>>>>> Wrt lack of helpers, what exactly are you looking for? A big part of these
>>>>> for TTM is that TTM is a bit of a midlayer, so reinvents a bunch of things
>>>>> provided by e.g. dma-api. It's cleaner to use the dma-api directly. Basing
>>>>> the lima kernel driver on vc4, freedreno or etnaviv (last one is probably
>>>>> closest, since it doesn't have a display block either) would be better I
>>>>> think.
>>>> FWIW, I'm working on the panfrost driver and am using the shmem
>>>> helpers from Noralf. It's the early stages though. I started a patch
>>>> for etnaviv to use it too, but found I need to rework it to sub-class
>>>> the shmem GEM object.
>>> Did you just convert the shmem helpers over to doing alloc_coherent?  If
>>> so, I'd be interested in picking them up for v3d, and that might help
>>> get another patch out of your stack.
>> I haven't really fully addressed that yet, but yeah, my plan is just
>> to switch to WC alloc and mappings. I was going to try to make it
>> configurable, but there is a comment in the ARM dma mapping code which
>> makes me wonder if tinydrm using streaming DMA for SPI is
>> fundamentally broken (and maybe CMA is less broken?). If not broken,
>> not guaranteed to work.
>>
>> /*
>>   * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
>>   * that the intention is to allow exporting memory allocated via the
>>   * coherent DMA APIs through the dma_buf API, which only accepts a
>>   * scattertable.  This presents a couple of problems:
>>   * 1. Not all memory allocated via the coherent DMA APIs is backed by
>>   *    a struct page
>>   * 2. Passing coherent DMA memory into the streaming APIs is not allowed
>>   *    as we will try to flush the memory through a different alias to that
>>   *    actually being used (and the flushes are redundant.)
>>   */
> The sg table is only for device access, which avoids both of these
> issues. That's the idea at least, except all ttm-based drivers don't
> care, instead they expect a struct page and then use that to build a
> ttm_bo. And then use all the ttm cpu side access functions, instead of
> using the dma-buf interfaces (which need to exist to avoid the above
> issues).

Actually that is not correct any more. I've fixed this while working on 
directly sharing BOs between amdgpu devices.

TTM now uses the DMA addresses from the sg table and I actually wanted 
to remove the pages for imported DMA-buf BOs for a while now.
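
To illustrate the idea only, here is a minimal userspace sketch of taking
per-page device addresses from sg table segments instead of from struct
page. This is not TTM or kernel code: struct model_sg_entry,
MODEL_PAGE_SIZE and model_sg_to_dma_array() are hypothetical names made up
for this model, and the assumption that every segment length is a multiple
of the page size is mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this model; real kernels use PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for one DMA-mapped segment of an sg table. */
struct model_sg_entry {
	uint64_t dma_addr; /* device-visible start address of the segment */
	uint32_t dma_len;  /* segment length in bytes */
};

/*
 * Expand DMA-mapped segments into one device address per page.
 * No CPU-side page pointer is consulted, only the DMA addresses.
 *
 * Returns the number of addresses written, or -1 if a segment length
 * is not a multiple of the page size or the output array is too small.
 */
long model_sg_to_dma_array(const struct model_sg_entry *sg, size_t nents,
			   uint64_t *out, size_t max_pages)
{
	size_t n = 0;

	for (size_t i = 0; i < nents; i++) {
		if (sg[i].dma_len % MODEL_PAGE_SIZE)
			return -1;

		for (uint32_t off = 0; off < sg[i].dma_len;
		     off += MODEL_PAGE_SIZE) {
			if (n == max_pages)
				return -1;
			out[n++] = sg[i].dma_addr + off;
		}
	}

	return (long)n;
}
```

With two segments, say 8 KiB at 0x10000 and 4 KiB at 0x80000, this yields
three device addresses. The point of the sketch is that the importer never
needs a struct page for the buffer.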

Regards,
Christian.

>
> So except if you want to fix ttm dma-buf import (which is going to be
> a pile of work), add this to the list of why ttm is probably not the
> best choice for something mostly running on arm soc. x86 gets away
> because dma is easy on x86.
> -Daniel
>
>>> I'm particularly interested in the shmem helpers because I should start
>>> doing dynamic binding in and out of the GPU's page table, to avoid
>>> pinning so much memory all the time.
>> I'll try to post something in the next couple of days.
>>
>> Rob
>
>