[RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs

Mon Oct 30 17:38:16 UTC 2023

On 2023-10-30 12:16, Christian König wrote:
>>>> @@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct 
>>>> kfd_process *p, struct kfd_node *node,
>>>>       return -EINVAL;
>>>>   }
>>>>   +static void signal_eviction_fence(struct kfd_process *p)
>>>> +{
>>>> +    spin_lock(&p->ef_lock);
>>>> +    if (!p->ef)
>>>> +        goto unlock_out;
>>>> +    dma_fence_signal(p->ef);
>>>
>>> This needs to grab the internal lock of the eviction fence, I'm not 
>>> sure that has correct ordering with the newly added ef_lock.
>>
>> Hmm, the only thing we could get a circular lock dependency would be, 
>> if we took the p->ef_lock in a fence callback. I don't think that 
>> ever happens, because even the eviction work runs on a worker thread 
>> (exactly to avoid such lock dependency issues).
>>
>> Anyway, I could try to move the fence_signal out of the critical 
>> section. The lock is only there to ensure that exactly one process 
>> signals and frees the fence.
>>
>
> So basically either the eviction worker or the GPU reset could signal 
> this fence.
>
> In that case I think it would be simpler to grab the reset lock in the 
> eviction worker to protect against a concurrent reset.

Which reset lock? adev->reset_cntl->reset_lock? I only see that lock 
taken in aldebaran_mode2_perform_reset. I don't understand why this is 
in ASIC-specific code. But even so, it's only taken during the actual 
reset (in aldebaran_mode2_perform_reset). I don't think it covers the 
pre-reset code path that signals the eviction fence.

Regards,
   Felix

>
> Regards,
> Christian.