[PATCH] drm/ttm: Drain workqueue before sys manager release

Karolina Stolarek karolina.stolarek at intel.com
Mon Oct 16 11:31:51 UTC 2023


On 16.10.2023 13:03, Christian König wrote:
> Am 13.10.23 um 16:34 schrieb Karolina Stolarek:
>> In rare cases, a delayed destruction of a BO with a system resource
>> could stay in the workqueue until drain_workqueue() is called in
>> ttm_device_fini(). An attempt to free a resource from an already
>> released manager results in NULL pointer dereference. Move the step
>> of draining and destroying the workqueue so it happens before the
>> ttm_sys_manager cleanup.
> 
> Good catch! But it's not only the drainage of the workqueue which is 
> at the wrong place here.
> 
> Removing the device from the device list should come before 
> destroying the system domain as well.

I wonder when this would cause problems. But of course, it makes sense
to fix both with one patch.

> So I think you should probably rather move the chunk:
> 
> man = ttm_manager_type(bdev, TTM_PL_SYSTEM); 
> ttm_resource_manager_set_used(man, false); 
> ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL);
> 
> after the destroy_workqueue() instead of the other way around.>
> Apart from that looks good to me.

Thanks for taking a look and the suggestion, I'll send a v2 once I
finish testing the change. Also, I'll drop the Fixes tag, as it seems
the order issue has been there since the very beginning.

All the best,
Karolina

> 
> Thanks, Christian.
> 
>> 
>> Fixes: 9bff18d13473 ("drm/ttm: use per BO cleanup workers") 
>> Signed-off-by: Karolina Stolarek <karolina.stolarek at intel.com> ---
>>  Some background: I stumbled upon this issue when testing 
>> ttm_bo_pipeline_gutting() with BO with an active dma_resv fence.
>> In ~2% of the runs, the delayed destruction of the ghost wouldn't 
>> happen until the drain_queue() step. man->func->free(man, *res)
>> got called via ttm_bo_cleanup_memtype_use(), the manager and its 
>> functions were nowhere to be seen, resulting in a nulptr deref.
>> 
>> drivers/gpu/drm/ttm/ttm_device.c | 6 +++--- 1 file changed, 3 
>> insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>> b/drivers/gpu/drm/ttm/ttm_device.c index
>> 7726a72befc5..753126581620 100644 ---
>> a/drivers/gpu/drm/ttm/ttm_device.c +++ 
>> b/drivers/gpu/drm/ttm/ttm_device.c @@ -232,6 +232,9 @@ void 
>> ttm_device_fini(struct ttm_device *bdev) struct 
>> ttm_resource_manager *man; unsigned i; + drain_workqueue(bdev->wq);
>> +    destroy_workqueue(bdev->wq); + man = ttm_manager_type(bdev,
>> TTM_PL_SYSTEM); ttm_resource_manager_set_used(man, false); 
>> ttm_set_driver_manager(bdev, TTM_PL_SYSTEM, NULL); @@ -240,9
>> +243,6 @@ void ttm_device_fini(struct ttm_device *bdev) 
>> list_del(&bdev->device_list); mutex_unlock(&ttm_global_mutex); - 
>> drain_workqueue(bdev->wq); -    destroy_workqueue(bdev->wq); - 
>> spin_lock(&bdev->lru_lock); for (i = 0; i < TTM_MAX_BO_PRIORITY; 
>> ++i) if (list_empty(&man->lru[0]))
> 


More information about the dri-devel mailing list