[RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs
Felix Kuehling
felix.kuehling at amd.com
Mon Oct 30 17:38:16 UTC 2023
On 2023-10-30 12:16, Christian König wrote:
>>>> @@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct
>>>> kfd_process *p, struct kfd_node *node,
>>>> return -EINVAL;
>>>> }
>>>> +static void signal_eviction_fence(struct kfd_process *p)
>>>> +{
>>>> + spin_lock(&p->ef_lock);
>>>> + if (!p->ef)
>>>> + goto unlock_out;
>>>> + dma_fence_signal(p->ef);
>>>
>>> This needs to grab the internal lock of the eviction fence, I'm not
>>> sure that has correct ordering with the newly added ef_lock.
>>
>> Hmm, the only thing we could get a circular lock dependency would be,
>> if we took the p->ef_lock in a fence callback. I don't think that
>> ever happens, because even the eviction work runs on a worker thread
>> (exactly to avoid such lock dependency issues).
>>
>> Anyway, I could try to move the fence_signal out of the critical
>> section. The lock is only there to ensure that exactly one process
>> signals and frees the fence.
>>
>
> So basically either the eviction worker or the GPU reset could signal
> this fence.
>
> In that case I think it would be simpler to grab the reset lock in the
> eviction worker to protect against a concurrent reset.
Which reset lock? adev->reset_cntl->reset_lock? I only see that lock
taken in aldebaran_mode2_perform_reset. I don't understand why this is
in ASIC-specific code. But even so, it's only taken during the actual
reset (in aldebaran_mode2_perform_reset). I don't think it covers the
pre-reset code path that signals the eviction fence.
Regards,
Felix
>
> Regards,
> Christian.
More information about the amd-gfx
mailing list