[PATCH] drm/amdkfd: Fix circular lock in nocpsch path
Pan, Xinhui
Xinhui.Pan at amd.com
Wed Jun 16 04:01:00 UTC 2021
> On June 16, 2021, at 02:22, Kuehling, Felix <Felix.Kuehling at amd.com> wrote:
>
> [+Xinhui]
>
>
> On 2021-06-15 at 1:50 p.m., Amber Lin wrote:
>> Calling free_mqd inside of destroy_queue_nocpsch_locked can cause a
>> circular lock. destroy_queue_nocpsch_locked is called under a DQM lock,
>> which is taken in MMU notifiers, potentially in FS reclaim context.
>> Taking another lock, which is BO reservation lock from free_mqd, while
>> causing an FS reclaim inside the DQM lock creates a problematic circular
>> lock dependency. Therefore move free_mqd out of
>> destroy_queue_nocpsch_locked and call it after unlocking DQM.
>>
>> Signed-off-by: Amber Lin <Amber.Lin at amd.com>
>> Reviewed-by: Felix Kuehling <Felix.Kuehling at amd.com>
>
> Let's submit this patch as is. I'm making some comments inline for
> things that Xinhui can address in his race condition patch.
>
>
>> ---
>> .../drm/amd/amdkfd/kfd_device_queue_manager.c | 18 +++++++++++++-----
>> 1 file changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> index 72bea5278add..c069fa259b30 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
>> @@ -486,9 +486,6 @@ static int destroy_queue_nocpsch_locked(struct device_queue_manager *dqm,
>> if (retval == -ETIME)
>> qpd->reset_wavefronts = true;
>>
>> -
>> - mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>> -
>> list_del(&q->list);
>> if (list_empty(&qpd->queues_list)) {
>> if (qpd->reset_wavefronts) {
>> @@ -523,6 +520,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>> int retval;
>> uint64_t sdma_val = 0;
>> struct kfd_process_device *pdd = qpd_to_pdd(qpd);
>> + struct mqd_manager *mqd_mgr =
>> + dqm->mqd_mgrs[get_mqd_type_from_queue_type(q->properties.type)];
>>
>> /* Get the SDMA queue stats */
>> if ((q->properties.type == KFD_QUEUE_TYPE_SDMA) ||
>> @@ -540,6 +539,8 @@ static int destroy_queue_nocpsch(struct device_queue_manager *dqm,
>> pdd->sdma_past_activity_counter += sdma_val;
>> dqm_unlock(dqm);
>>
>> + mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>> +
>> return retval;
>> }
>>
>> @@ -1629,7 +1630,7 @@ static bool set_cache_memory_policy(struct device_queue_manager *dqm,
>> static int process_termination_nocpsch(struct device_queue_manager *dqm,
>> struct qcm_process_device *qpd)
>> {
>> - struct queue *q, *next;
>> + struct queue *q;
>> struct device_process_node *cur, *next_dpn;
>> int retval = 0;
>> bool found = false;
>> @@ -1637,12 +1638,19 @@ static int process_termination_nocpsch(struct device_queue_manager *dqm,
>> dqm_lock(dqm);
>>
>> /* Clear all user mode queues */
>> - list_for_each_entry_safe(q, next, &qpd->queues_list, list) {
>> + while (!list_empty(&qpd->queues_list)) {
>> + struct mqd_manager *mqd_mgr;
>> int ret;
>>
>> + q = list_first_entry(&qpd->queues_list, struct queue, list);
>> + mqd_mgr = dqm->mqd_mgrs[get_mqd_type_from_queue_type(
>> + q->properties.type)];
>> ret = destroy_queue_nocpsch_locked(dqm, qpd, q);
>> if (ret)
>> retval = ret;
>> + dqm_unlock(dqm);
>> + mqd_mgr->free_mqd(mqd_mgr, q->mqd, q->mqd_mem_obj);
>> + dqm_lock(dqm);
>
> This is the correct way to clean up the list when dropping the dqm-lock
> in the middle. Xinhui, you can use the same method in
> process_termination_cpsch.
>
Yes, that is the right way to walk through the list. Thanks.
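
For reference, the drain pattern above can be sketched in userspace. This is only a minimal illustration, not the kernel code: struct node, struct node_list, node_list_push and node_list_drain are hypothetical names, a pthread mutex stands in for the dqm lock, and free() stands in for free_mqd. Unlike the patch, the sketch also frees the list node itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct queue; payload plays the role of q->mqd. */
struct node {
	struct node *next;
	void *payload;
};

/* Hypothetical stand-in for qpd->queues_list guarded by the dqm lock. */
struct node_list {
	pthread_mutex_t lock;
	struct node *head;
};

/* Add a node at the head of the list. Returns 0 on success, -1 on failure. */
static int node_list_push(struct node_list *l, void *payload)
{
	struct node *n = malloc(sizeof(*n));

	if (n == NULL)
		return -1;
	n->payload = payload;
	pthread_mutex_lock(&l->lock);
	n->next = l->head;
	l->head = n;
	pthread_mutex_unlock(&l->lock);
	return 0;
}

/*
 * Drain the list while dropping the lock around every free.
 * The head is re-read after each re-lock instead of trusting a cached
 * "next" pointer, which could be stale once the lock has been released.
 * Returns the number of nodes drained.
 */
static int node_list_drain(struct node_list *l)
{
	int drained = 0;

	assert(l != NULL);
	pthread_mutex_lock(&l->lock);
	while (l->head != NULL) {
		struct node *n = l->head;

		l->head = n->next;		/* unlink while holding the lock */
		pthread_mutex_unlock(&l->lock);

		free(n->payload);		/* the free happens outside the lock */
		free(n);
		drained++;

		pthread_mutex_lock(&l->lock);
	}
	pthread_mutex_unlock(&l->lock);
	return drained;
}
```

The point of the sketch is that every iteration starts from the current head, so nothing read before the unlock is reused after the lock is taken again.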
> I believe the swapping of the q->mqd with a temporary variable is not
> needed. When free_mqd is called, the queue is no longer on the
> qpd->queues_list, so destroy_queue cannot race with it. If we ensure
> that queues are always removed from the list before calling free_mqd,
> and that list-removal happens under the dqm_lock, then there should be
> no risk of a race condition that causes a double-free.
>
No, the double free exists because pqm_destroy_queue fetches the queue by qid through get_queue_by_qid(), not from qpd->queues_list.
The race looks like this:
pqm_destroy_queue
  get_queue_by_qid            process_termination_cpsch
  destroy_queue_cpsch
                                lock
                                list_for_each_entry_safe
                                  list_del(q)
                                unlock
                                free_mqd
    lock
    list_del(q)
    unlock
    free_mqd
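
The temporary-variable swap mentioned in the quoted comment can be sketched as follows. This is a hedged userspace illustration under stated assumptions, not the actual race-condition patch: struct fake_queue, fake_dqm_lock and fake_destroy_queue are hypothetical names, a pthread mutex stands in for the dqm lock, and free() stands in for free_mqd. It only covers the double free_mqd, not the second list_del(q).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct queue; only the mqd pointer matters here. */
struct fake_queue {
	void *mqd;
};

/* Hypothetical stand-in for the dqm lock. */
static pthread_mutex_t fake_dqm_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Take ownership of q->mqd under the lock, then free it after unlocking.
 * Whichever path swaps first sees the real pointer; any later path sees
 * NULL and skips the free.
 * Returns 1 if this caller freed the buffer, 0 if it was already taken.
 */
static int fake_destroy_queue(struct fake_queue *q)
{
	void *mqd;

	assert(q != NULL);
	pthread_mutex_lock(&fake_dqm_lock);
	mqd = q->mqd;		/* swap into a temporary under the lock */
	q->mqd = NULL;
	pthread_mutex_unlock(&fake_dqm_lock);

	if (mqd == NULL)
		return 0;	/* the other path already owns the buffer */

	free(mqd);		/* freed outside the lock, exactly once */
	return 1;
}
```

With this shape, two callers holding the same queue pointer can both reach the free path, but only one of them ends up with a non-NULL mqd.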
> Regards,
> Felix
>
>
>> }
>>
>> /* Unregister process */