[PATCH 1/1] drm/amdgpu: Take a reference to an exported BO

Tue May 5 14:58:02 UTC 2020

Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
> Just to reply here once more, this patch is a clear NAK.

Agreed. But see below. I don't think all is well.

>
> The references are grabbed in the call path of drm_gem_prime_export()
> and dropped again in drm_gem_dmabuf_release().
>
> So they are perfectly balanced as far as I can see.

That is true for the GEM object references. But I believe there is still
a problem with the TTM BO references.

As far as I can tell amdgpu_bo_unref can free the TTM BO while there are
still references to the GEM object from DMA buf exports. I think that's
a fundamental problem with how we have two reference counts for the same
physical object (TTM BO and the embedded GEM BO).

I think the correct solution is for amdgpu_bo_ref/unref to delegate its
reference counting to drm_gem_object_get/put instead of ttm_bo_get/put.
The amdgpu BO would hold one token reference to the TTM BO, which it can
drop when the GEM BO refcount drops to 0. Finally, the amdgpu BO should
only be freed once the TTM BO refcount also becomes 0.

Regards,
  Felix

>
> Regards,
> Christian.
>
> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>
>> [dropping my gmail address]
>>
>> We saw this backtrace showing the call chain while investigating a
>> kernel oops caused by this issue on the DKMS branch with the KFD IPC
>> API. It happens after a dma-buf file is released with fput:
>>
>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>> [ 1255.049727] #PF: supervisor read access in kernel mode
>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>> [ 1255.050416] PGD 0 P4D 0
>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>> [ 1255.051752] Workqueue: events delayed_fput
>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1255.057763] PKRU: 55555554
>> [ 1255.058179] Call Trace:
>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>> [ 1255.059025]  dma_buf_release+0x56/0x130
>> [ 1255.059443]  __fput+0xc6/0x260
>> [ 1255.059856]  delayed_fput+0x20/0x30
>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>> [ 1255.060686]  worker_thread+0x34/0x410
>> [ 1255.061099]  kthread+0x121/0x140
>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>
>> drm_gem_object_put_unlocked calls drm_gem_object_free when the
>> obj->refcount reaches 0. From there it calls
>> dev->driver->gem_free_object_unlocked, which is
>> amdgpu_gem_object_free in amdgpu.
>>
>> Regards,
>>   Felix
>>
>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>> From: Felix Kuehling <felix.kuehling at gmail.com>
>>>>
>>>> That reference gets dropped when the the dma-buf is freed. Not
>>>> incrementing
>>>> the refcount can lead to use-after-free errors.
>>>>
>>>> Signed-off-by: Felix Kuehling <felix.kuehling at gmail.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> @@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct
>>>> drm_gem_object *gobj,
>>>>           return ERR_PTR(-EPERM);
>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>> -    if (!IS_ERR(buf))
>>>> +    if (!IS_ERR(buf)) {
>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>> +        /* GEM needs a reference to the underlying object
>>>> +         * that gets dropped when the dma-buf is released,
>>>> +         * through the amdgpu_gem_object_free callback
>>>> +         * from drm_gem_object_put_unlocked.
>>>> +         */
>>>> +        amdgpu_bo_ref(bo);
>>>> +    }
>>>
>>> Of hand that doesn't sounds correct to me. Why should the exported
>>> bo be closed through amdgpu_gem_object_free()?
>>>
>>> Regards,
>>> Christian.
>>>
>>>>         return buf;
>>>>   }
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20200505/568ef85c/attachment.htm>