Lockdep splat on killing a process

Andrey Grodzovsky andrey.grodzovsky at amd.com
Thu Oct 28 17:26:26 UTC 2021


On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:
>
> On 2021-10-27 10:50 a.m., Christian König wrote:
>> Am 27.10.21 um 16:47 schrieb Andrey Grodzovsky:
>>>
>>> On 2021-10-27 10:34 a.m., Christian König wrote:
>>>> Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky:
>>>>> [SNIP]
>>>>>>
>>>>>>> Let me please know if I am still missing some point of yours.
>>>>>>
>>>>>> Well, I mean we need to be able to handle this for all drivers.
>>>>>
>>>>>
>>>>> For sure, but as I said above, in my opinion we need to change it 
>>>>> only for those drivers that don't use the _locked version.
>>>>
>>>> And that absolutely won't work.
>>>>
>>>> See the dma_fence is a contract between drivers, so you need the 
>>>> same calling convention between all drivers.
>>>>
>>>> Either we always call the callback with the lock held or we always 
>>>> call it without the lock; doing it sometimes one way and sometimes 
>>>> the other won't work.
>>>>
>>>> Christian.
>>>
>>>
>>> I am not sure I fully understand what problems this will cause, but 
>>> anyway, then we are back to irq_work. We cannot embed irq_work as a 
>>> union within dma_fence's cb_list, because that member is already 
>>> reused as a timestamp and as an rcu head after the fence is 
>>> signaled. So I will do it within drm_scheduler, with a single 
>>> irq_work per drm_sched_entity, as we discussed before.
>>
>> That won't work either. We free up the entity after the cleanup 
>> function. That's the reason we use the callback on the job in the 
>> first place.
>
>
> Yep, missed it.
>
>
>>
>> We could overload the cb structure in the job though.
>
>
> I guess so, since no one else is using this member after the cb has executed.
>
> Andrey


Attached is a patch. Please give it a try. I tested it on my side and 
tried to generate the right conditions to trigger this code path by 
repeatedly submitting commands while issuing a GPU reset to stop the 
scheduler, and then killing the command submission process in the 
middle. But for some reason it looks like the job_queue was always 
already empty at the time of the entity kill.
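
For anyone following the thread without opening the attachment, here is a 
rough user-space sketch of the "overload the cb structure in the job" idea 
discussed above. It is not the patch itself. All the types and names in it 
(fence_cb, deferred_work, job, run_pending and so on) are hypothetical 
stand-ins, not the real dma_fence_cb, irq_work or drm_sched_job definitions. 
It only illustrates the assumption that the callback storage is dead once 
the callback has run, so the same bytes can carry the deferred work item.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real ones. */
struct fence_cb {
    void (*func)(struct fence_cb *cb);
};

struct deferred_work {
    void (*func)(struct deferred_work *work);
};

struct job {
    union {
        struct fence_cb cb;        /* in use until the fence callback runs */
        struct deferred_work work; /* reuses the same storage afterwards */
    };
    int killed;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Single-slot stand-in for a deferred work queue. */
static struct deferred_work *pending;

/* Runs later, outside the context that invoked the fence callback. */
static void job_kill_work(struct deferred_work *work)
{
    struct job *job = container_of(work, struct job, work);

    job->killed = 1;
}

/* Fence callback: does no real work itself, only queues the deferred part. */
static void job_fence_cb(struct fence_cb *cb)
{
    struct job *job = container_of(cb, struct job, cb);

    /* cb is not needed any more, so its storage is overloaded as work. */
    job->work.func = job_kill_work;
    pending = &job->work;
}

/* Drain the single-slot queue. */
static void run_pending(void)
{
    struct deferred_work *work = pending;

    pending = NULL;
    if (work)
        work->func(work);
}

/* Small self-check: the job is only killed once the deferred work runs. */
static int demo_kill_job(void)
{
    struct job j = { .killed = 0 };

    j.cb.func = job_fence_cb;
    j.cb.func(&j.cb);      /* the fence signals and calls the callback */
    assert(j.killed == 0); /* nothing has happened yet, only queued */
    run_pending();         /* the deferred context does the actual work */
    return j.killed;
}
```

The point of the union is that the work item lives in the job rather than 
in the entity, so it does not matter that the entity is freed after the 
cleanup function.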

Andrey


>
>
>>
>> Christian.
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>> Andrey
>>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-sched-Avoid-lockdep-spalt-on-killing-a-processes.patch
Type: text/x-patch
Size: 4398 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20211028/47bc663a/attachment-0001.bin>

