[linux.git drm/ttm]: NULL pointer dereference upon driver probe

Christian König christian.koenig at amd.com
Mon Aug 10 19:24:01 UTC 2020


On 10.08.20 at 20:51, Dave Airlie wrote:
> On Mon, 10 Aug 2020 at 22:20, Christian König <christian.koenig at amd.com> wrote:
>> On 07.08.20 at 09:02, Christian König wrote:
>>> On 06.08.20 at 20:50, Roland Scheidegger wrote:
>>>> On 06.08.20 at 17:28, Christian König wrote:
>>>>> My best guess is that you are facing two separate bugs here.
>>>>>
>>>>> Crash #1 is somehow related to CRTCs and might even be caused by the
>>>>> atomic-helper change you noted below.
>>>>>
>>>>> Crash #2 is caused because vmw_bo_create_and_populate() tries to
>>>>> manually populate a BO object instead of relying on TTM to do it when
>>>>> necessary. This indeed doesn't work any more because of "drm/ttm: make
>>>>> TT creation purely optional v3".
>>>>>
>>>>> Question is why vmwgfx is doing this?
>>>> Not really sure unfortunately, it's possible vmwgfx is doing it because
>>>> ttm lacked some capabilities at some point?
>>> I think so as well, yes.
>>>
>>>>    Trying to figure this one out...
>>> Problem is that what vmwgfx is doing here is questionable at best.
>>>
>>> By definition BOs in the SYSTEM domain are not accessible by the GPU,
>>> even if it is a virtual one.
>>>
>>> And what vmwgfx does is allocating one in the SYSTEM domain as not
>>> evictable and then bypassing TTM in filling and mapping it to the GPU.
>>>
>>> That doesn't really make sense to me. Why shouldn't that BO be put in
>>> the GTT domain in the first place?
>> Well I think I figured out what VMWGFX is doing here, but you won't like it.
>>
>> See, VMWGFX doesn't support TTM's GTT domain. So to implement the mob and
>> otable BOs it is allocating system domain BOs, pinning them and manually
>> filling them with pages.
>>
>> The correct fix would be to audit VMWGFX and fix this handling so that
>> it no longer messes with TTM internal object state.
>>
>> Till that happens we can only revert the patch for now.
> Probably good to do, at least we know the problem now.
>
> However I found myself in the same place yesterday so we should
> discuss how to fix it going forward.
>
> At least on Intel IGPs you have GTT and PPGTT (per-process table). GTT
> on later hw is only needed for certain objects, like scanout etc. Not
> every object needs to be in the GTT domain.

We have the same situation on amdgpu. GART objects are only allocated
for scanout and VMID0 access.

See our amdgpu_gtt_mgr.c.

> But when you get an execbuffer and you want to bind the PPGTT objects,
> you need to move the object to the GTT domain pointlessly and
> suboptimally, since the GTT domain could fill up and start needing
> evictions.

That is intentional behavior. The GTT domain is all of the memory
which is currently GPU accessible.

The GART can be much smaller than the GTT domain.

> So the option is to get SYSTEM domain objects, only move them to
> TTM_PL_TT when pinning for scanout etc, but otherwise generate the
> page lists from the objects. In my playing around I've hacked up a TT
> create/populate path, with no bind.

We already tried this and it turned out to be a bad idea.

See amdgpu_ttm_alloc_gart() for how to do this easily with the GTT domain.
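
To illustrate the distinction between the GTT domain and the GART, here is a
rough, self-contained toy model. It is not the amdgpu or TTM code: every name
in it (toy_bo, toy_gart, toy_bo_move_to_gtt(), toy_bo_alloc_gart(),
toy_demo()) is hypothetical, and the slot count is arbitrary. The assumption
it encodes is only what is said above: a BO in the GTT domain is GPU
accessible without occupying GART space, and a GART address is allocated
lazily for the few users that need one, such as scanout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in TTM or amdgpu. */
#define TOY_GART_SLOTS 4          /* assumed: a deliberately small GART */
#define TOY_NO_GART    UINT32_MAX /* marker: BO has no GART address */

enum toy_domain { TOY_DOMAIN_SYSTEM, TOY_DOMAIN_GTT };

struct toy_bo {
	enum toy_domain domain;
	uint32_t gart_slot; /* TOY_NO_GART until a user needs one */
};

struct toy_gart {
	bool used[TOY_GART_SLOTS];
};

/* Placing a BO in the GTT domain makes it GPU accessible in this model,
 * but does not consume any GART space. */
static void toy_bo_move_to_gtt(struct toy_bo *bo)
{
	bo->domain = TOY_DOMAIN_GTT;
	bo->gart_slot = TOY_NO_GART;
}

/* Lazily allocate a GART slot for a BO that is already in the GTT domain.
 * Returns 0 on success (or if already bound), -1 if the BO is not in the
 * GTT domain or the GART is full. */
static int toy_bo_alloc_gart(struct toy_gart *gart, struct toy_bo *bo)
{
	if (bo->domain != TOY_DOMAIN_GTT)
		return -1;
	if (bo->gart_slot != TOY_NO_GART)
		return 0;

	for (uint32_t i = 0; i < TOY_GART_SLOTS; i++) {
		if (!gart->used[i]) {
			gart->used[i] = true;
			bo->gart_slot = i;
			return 0;
		}
	}
	return -1;
}

/* Put 16 BOs into the GTT domain, bind only one (the "scanout" BO) into
 * the GART, and return how many GART slots ended up in use. */
int toy_demo(void)
{
	struct toy_gart gart = { { false } };
	struct toy_bo bos[16];
	int in_use = 0;

	for (int i = 0; i < 16; i++) {
		bos[i].domain = TOY_DOMAIN_SYSTEM;
		bos[i].gart_slot = TOY_NO_GART;
		toy_bo_move_to_gtt(&bos[i]);
	}

	/* Only the scanout BO needs a GART address. */
	assert(toy_bo_alloc_gart(&gart, &bos[0]) == 0);

	for (int i = 0; i < TOY_GART_SLOTS; i++)
		in_use += gart.used[i] ? 1 : 0;
	return in_use;
}
```

In this sketch the GTT domain holds four times more BOs than the GART has
slots, yet nothing has to be evicted, because being in the domain and having
a GART address are tracked separately.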

Regards,
Christian.

>
> Dave.
> I have hardware that has no requirement for all objects to be in the
> TT domain, but still has a TT domain.
