[RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs

Christian König christian.koenig at amd.com
Mon Oct 30 17:48:37 UTC 2023



Am 30.10.23 um 18:38 schrieb Felix Kuehling:
> On 2023-10-30 12:16, Christian König wrote:
>>>>> @@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct 
>>>>> kfd_process *p, struct kfd_node *node,
>>>>>       return -EINVAL;
>>>>>   }
>>>>>   +static void signal_eviction_fence(struct kfd_process *p)
>>>>> +{
>>>>> +    spin_lock(&p->ef_lock);
>>>>> +    if (!p->ef)
>>>>> +        goto unlock_out;
>>>>> +    dma_fence_signal(p->ef);
>>>>
>>>> This needs to grab the internal lock of the eviction fence, I'm not 
>>>> sure that has correct ordering with the newly added ef_lock.
>>>
>>> Hmm, the only thing we could get a circular lock dependency would 
>>> be, if we took the p->ef_lock in a fence callback. I don't think 
>>> that ever happens, because even the eviction work runs on a worker 
>>> thread (exactly to avoid such lock dependency issues).
>>>
>>> Anyway, I could try to move the fence_signal out of the critical 
>>> section. The lock is only there to ensure that exactly one process 
>>> signals and frees the fence.
>>>
>>
>> So basically either the eviction worker or the GPU reset could signal 
>> this fence.
>>
>> In that case I think it would be simpler to grab the reset lock in 
>> the eviction worker to protect against a concurrent reset.
>
> Which reset lock? adev->reset_cntl->reset_lock? I only see that lock 
> taken in aldebaran_mode2_perform_reset. I don't understand why this is 
> in ASIC-specific code. But even so, it's only taken during the actual 
> reset (in aldebaran_mode2_perform_reset). I don't think it covers the 
> pre-reset code path that signals the eviction fence.

No, what I mean is adev->reset_domain->sem. It's hold in write lock 
during the reset and you can grab the read side if you need to delay a 
reset.

But thinking about that a bit more, you actually don't need any of this. 
Something like this here should do fine:

tmp = dma_fence_get_rcu_safe(&p->ef);
dma_fence_signal(tmp);
dma_fence_put(tmp);

dma_fences are always RCU protected and can be signaled from multiple 
sources by design.

Regards,
Christian.

>
> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.



More information about the amd-gfx mailing list