[PATCH v5 1/2] drm/sched: Refactor ring mirror list handling.

Tue Jan 15 22:01:58 UTC 2019

On 01/11/2019 05:03 PM, Andrey Grodzovsky wrote:
>
>
> On 01/11/2019 02:11 PM, Koenig, Christian wrote:
>> Am 11.01.19 um 16:37 schrieb Grodzovsky, Andrey:
>>> On 01/11/2019 04:42 AM, Koenig, Christian wrote:
>>>> Am 10.01.19 um 16:56 schrieb Grodzovsky, Andrey:
>>>>> [SNIP]
>>>>>>>> But we will not be adding the cb back in drm_sched_stop 
>>>>>>>> anymore, now we
>>>>>>>> are only going to add back the cb in drm_sched_startr after 
>>>>>>>> rerunning
>>>>>>>> those jobs in drm_sched_resubmit_jobs and assign them a new parent
>>>>>>>> there
>>>>>>>> anyway.
>>>>>>> Yeah, but when we find that we don't need to reset anything anymore
>>>>>>> then adding the callbacks again won't be possible any more.
>>>>>>>
>>>>>>> Christian.
>>>>>> I am not sure I understand it, can u point me to example of how this
>>>>>> will happen ? I am attaching my latest patches with waiting only for
>>>>>> the last job's fence here just so we are on same page regarding 
>>>>>> the code.
>>>> Well the whole idea is to prepare all schedulers, then check once more
>>>> if the offending job hasn't completed in the meantime.
>>>>
>>>> If the job completed we need to be able to rollback everything and
>>>> continue as if nothing had happened.
>>>>
>>>> Christian.
>>> Oh, but this piece of functionality - skipping HW ASIC reset in case 
>>> the
>>> guilty job done is totally missing form this patch series and so needs
>>> to be added. So what you say actually is that for the case were we skip
>>> HW asic reset because the guilty job did complete we also need to skip
>>> resubmitting the jobs in drm_sched_resubmit_jobs and hence preserve the
>>> old parent fence pointer for reuse ? If so I would like to add all this
>>> functionality as a third patch since the first 2 patches are more about
>>> resolving race condition with jobs in flight while doing reset - 
>>> what do
>>> you think ?
>> Yeah, sounds perfectly fine to me.
>>
>> Christian.
>
> I realized there is another complication now for XGMI hive use case, 
> we currently skip gpu recover for adev in case another gpu recover for 
> different adev in same hive is running, under the assumption that we 
> are going to reset all devices in hive anyway because that should 
> cover our own dev too. But if we chose to skip now HW asic reset if 
> our guilty job did finish we will aslo not HW reset any other devices 
> in the hive even if one of them might actually had a bad job, wanted 
> to do gpu recover but skipped it because our recover was in progress 
> in that time.
> My general idea on that is to keep a list of guilty jobs per hive, 
> when you start gpu recover you first add you guilty job to the hive 
> and trylock hive->reset_lock. Any owner of hive->reset_lock (gpu 
> recovery in progress) once he finished his recovery and released 
> hive->reset_lock should go over hive->bad_jobs_list and if at least 
> one of them is still not signaled (not done) trigger another gpu 
> recovery and so on. If you do manage to trylock you also go over the 
> list, clean it and perform recovery. This list probably needs to be 
> protected with per hive lock.
> I also think we can for now at least finish reviewing the first 2 
> patches and submit them since as I said before they are not dealing 
> with this topic and fixing existing race conditions. If you are OK 
> with that I can send for review the last iteration of the first 2 
> patches where I wait for the last fence in ring mirror list.
>
> Andrey

I implemented HW reset avoidance including XGMI use case according to 
the plan i specified. Patch is attached but I can't test it yet due to 
XGMI regression in PSP which is supposed to be fixed soon. Please take a 
look.

Andrey

>
>>
>>> Andrey
>>>>>> Andrey
>>>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-sched-Skip-HW-reset-if-guilty-job-is-signaled.patch
Type: text/x-patch
Size: 15172 bytes
Desc: 0001-drm-sched-Skip-HW-reset-if-guilty-job-is-signaled.patch
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20190115/f4145cba/attachment-0001.bin>