[RFC PATCH] drm/amdkfd: Run restore_workers on freezable WQs
Felix Kuehling
felix.kuehling at amd.com
Mon Oct 30 17:56:17 UTC 2023
On 2023-10-30 13:48, Christian König wrote:
>
>
> Am 30.10.23 um 18:38 schrieb Felix Kuehling:
>> On 2023-10-30 12:16, Christian König wrote:
>>>>>> @@ -1904,6 +1906,19 @@ kfd_process_gpuid_from_node(struct
>>>>>> kfd_process *p, struct kfd_node *node,
>>>>>> return -EINVAL;
>>>>>> }
>>>>>> +static void signal_eviction_fence(struct kfd_process *p)
>>>>>> +{
>>>>>> + spin_lock(&p->ef_lock);
>>>>>> + if (!p->ef)
>>>>>> + goto unlock_out;
>>>>>> + dma_fence_signal(p->ef);
>>>>>
>>>>> This needs to grab the internal lock of the eviction fence, I'm
>>>>> not sure that has correct ordering with the newly added ef_lock.
>>>>
>>>> Hmm, the only thing we could get a circular lock dependency would
>>>> be, if we took the p->ef_lock in a fence callback. I don't think
>>>> that ever happens, because even the eviction work runs on a worker
>>>> thread (exactly to avoid such lock dependency issues).
>>>>
>>>> Anyway, I could try to move the fence_signal out of the critical
>>>> section. The lock is only there to ensure that exactly one process
>>>> signals and frees the fence.
>>>>
>>>
>>> So basically either the eviction worker or the GPU reset could
>>> signal this fence.
>>>
>>> In that case I think it would be simpler to grab the reset lock in
>>> the eviction worker to protect against a concurrent reset.
>>
>> Which reset lock? adev->reset_cntl->reset_lock? I only see that lock
>> taken in aldebaran_mode2_perform_reset. I don't understand why this
>> is in ASIC-specific code. But even so, it's only taken during the
>> actual reset (in aldebaran_mode2_perform_reset). I don't think it
>> covers the pre-reset code path that signals the eviction fence.
>
> No, what I mean is adev->reset_domain->sem. It's hold in write lock
> during the reset and you can grab the read side if you need to delay a
> reset.
>
> But thinking about that a bit more, you actually don't need any of
> this. Something like this here should do fine:
>
> tmp = dma_fence_get_rcu_safe(&p->ef);
> dma_fence_signal(tmp);
> dma_fence_put(tmp);
dma_fence_get_rcu_safe gets a new reference that dma_fence_put drops. It
doesn't drop the original reference in p->ef.
I need a way to ensure that exactly one thread frees the original
reference in p->ef. Even with RCU, concurrent writers still need a lock
or a mutex. So I don't think I can avoid using another lock here because
I have potentially two concurrent writers.
Regards,
Felix
>
> dma_fences are always RCU protected and can be signaled from multiple
> sources by design.
>
> Regards,
> Christian.
>
>>
>> Regards,
>> Felix
>>
>>
>>>
>>> Regards,
>>> Christian.
>
More information about the amd-gfx
mailing list