[PATCH v5 1/2] drm/sched: Refactor ring mirror list handling.

Thu Jan 17 07:45:50 UTC 2019

Am 16.01.19 um 18:17 schrieb Grodzovsky, Andrey:
>
>
>
> On 01/16/2019 11:02 AM, Koenig, Christian wrote:
>> Am 16.01.19 um 16:45 schrieb Grodzovsky, Andrey:
>>>
>>>
>>>
>>> On 01/16/2019 02:46 AM, Christian König wrote:
>>>> Am 15.01.19 um 23:01 schrieb Grodzovsky, Andrey:
>>>>> On 01/11/2019 05:03 PM, Andrey Grodzovsky wrote:
>>>>>> On 01/11/2019 02:11 PM, Koenig, Christian wrote:
>>>>>>> Am 11.01.19 um 16:37 schrieb Grodzovsky, Andrey:
>>>>>>>> On 01/11/2019 04:42 AM, Koenig, Christian wrote:
>>>>>>>>> Am 10.01.19 um 16:56 schrieb Grodzovsky, Andrey:
>>>>>>>>>> [SNIP]
>>>>>>>>>>>>> But we will not be adding the cb back in drm_sched_stop
>>>>>>>>>>>>> anymore, now we
>>>>>>>>>>>>> are only going to add back the cb in drm_sched_startr after
>>>>>>>>>>>>> rerunning
>>>>>>>>>>>>> those jobs in drm_sched_resubmit_jobs and assign them a new parent
>>>>>>>>>>>>> there
>>>>>>>>>>>>> anyway.
>>>>>>>>>>>> Yeah, but when we find that we don't need to reset anything anymore
>>>>>>>>>>>> then adding the callbacks again won't be possible any more.
>>>>>>>>>>>>
>>>>>>>>>>>> Christian.
>>>>>>>>>>> I am not sure I understand it, can u point me to example of how this
>>>>>>>>>>> will happen ? I am attaching my latest patches with waiting only for
>>>>>>>>>>> the last job's fence here just so we are on same page regarding
>>>>>>>>>>> the code.
>>>>>>>>> Well the whole idea is to prepare all schedulers, then check once more
>>>>>>>>> if the offending job hasn't completed in the meantime.
>>>>>>>>>
>>>>>>>>> If the job completed we need to be able to rollback everything and
>>>>>>>>> continue as if nothing had happened.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>> Oh, but this piece of functionality - skipping HW ASIC reset in case
>>>>>>>> the
>>>>>>>> guilty job done is totally missing form this patch series and so needs
>>>>>>>> to be added. So what you say actually is that for the case were we skip
>>>>>>>> HW asic reset because the guilty job did complete we also need to skip
>>>>>>>> resubmitting the jobs in drm_sched_resubmit_jobs and hence preserve the
>>>>>>>> old parent fence pointer for reuse ? If so I would like to add all this
>>>>>>>> functionality as a third patch since the first 2 patches are more about
>>>>>>>> resolving race condition with jobs in flight while doing reset -
>>>>>>>> what do
>>>>>>>> you think ?
>>>>>>> Yeah, sounds perfectly fine to me.
>>>>>>>
>>>>>>> Christian.
>>>>>> I realized there is another complication now for XGMI hive use case,
>>>>>> we currently skip gpu recover for adev in case another gpu recover for
>>>>>> different adev in same hive is running, under the assumption that we
>>>>>> are going to reset all devices in hive anyway because that should
>>>>>> cover our own dev too. But if we chose to skip now HW asic reset if
>>>>>> our guilty job did finish we will aslo not HW reset any other devices
>>>>>> in the hive even if one of them might actually had a bad job, wanted
>>>>>> to do gpu recover but skipped it because our recover was in progress
>>>>>> in that time.
>>>>>> My general idea on that is to keep a list of guilty jobs per hive,
>>>>>> when you start gpu recover you first add you guilty job to the hive
>>>>>> and trylock hive->reset_lock. Any owner of hive->reset_lock (gpu
>>>>>> recovery in progress) once he finished his recovery and released
>>>>>> hive->reset_lock should go over hive->bad_jobs_list and if at least
>>>>>> one of them is still not signaled (not done) trigger another gpu
>>>>>> recovery and so on. If you do manage to trylock you also go over the
>>>>>> list, clean it and perform recovery. This list probably needs to be
>>>>>> protected with per hive lock.
>>>>>> I also think we can for now at least finish reviewing the first 2
>>>>>> patches and submit them since as I said before they are not dealing
>>>>>> with this topic and fixing existing race conditions. If you are OK
>>>>>> with that I can send for review the last iteration of the first 2
>>>>>> patches where I wait for the last fence in ring mirror list.
>>>>>>
>>>>>> Andrey
>>>>> I implemented HW reset avoidance including XGMI use case according to
>>>>> the plan i specified. Patch is attached but I can't test it yet due to
>>>>> XGMI regression in PSP which is supposed to be fixed soon. Please take a
>>>>> look.
>>>>
>>>> Looks a bit too complicated on first glance. In general we should 
>>>> probably get away from handling a hive in any special way.
>>>
>>> Yes, I guess i can do it the same way as the generic handling in 
>>> amdgpu_device_gpu_recover - there is a list of devices to process 
>>> which is of size 1 for non xgmi use case or more then 1 for XGMI.
>>>
>>>>
>>>> Multiple timeout jobs in a hive are identical to multiple timeout 
>>>> jobs on different engines on a single device.
>>>>
>>>> How about the following handling:
>>>> 1. Add the timeout job to a list.
>>>> 2. Try to grab a lock to handle the reset, if that doesn't work 
>>>> because there is already a reset underway return immediately.
>>>> 3. Stop all schedulers on all affected devices including stopping 
>>>> the timeouts and detaching all callbacks.
>>>> 4. Double check the list of timed out jobs, if all hw fences of all 
>>>> jobs are completed now we actually don't need to do anything.
>>>
>>> What if all the jobs on the timed out list did complete but other 
>>> job (or jobs) for which we removed the time out timer became hanged 
>>> ? Wouldn't we  miss a required reset in this case and wouldn't even 
>>> have any indication of their hang ?
>>
>> If the timeout triggers before we disable it we will have the job on 
>> the list of jobs which are hanging.
>>
>> If we found that we don't reset and re-enable the timeout it will 
>> trigger a bit later and we can do the check again.
>>
>> Thinking about this a bit more we actually don't need to change the 
>> handling in any way. The idea with the list is just overkill.
>>
>> See when the we re-start the scheduler we will also re-arm the 
>> timeout, so it will trigger again. It is just a matter of clever 
>> timeout re-arming.
>>
>> Christian.
>
> This implies that hanged job might take in worst case twice (or more 
> if there are recurrent resets) then the TO parameter value to trigger 
> - is this OK ?

Re-arming the timeout should probably have a much reduced value when the 
job hasn't changed. E.g. something like a few ms.

> About flushing tdr jobs in progress from .free_job cb - looks like 
> drm_sched_job_finish->cancel_delayed_work_sync is not enough, we still 
> need to take care of flushing all sced->work_tdr for a device and for 
> all devices in hive for XGMI.
> What do you think ?

Why should that be necessary? We only wait for the delayed work to make 
sure that the job is not destroyed while dealing with it.

Christian.

>
> Andrey
>
>>
>>>
>>>> 5. Do the reset on all affected devices.
>>>> 6. Drop the lock.
>>>> 7. Add callbacks again and restart the schedulers.
>>>
>>> I see your steps don't include flushing any tdr in progress from 
>>> drm_sched_job_finish (or as I did it from amdgpu_job_free_cb) does 
>>> it mean you don't think we need to flush in order to avoid freeing 
>>> job while it still might be accessed from time out  list ?
>>>
>>> Andrey
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>>>> Andrey
>>>>>>>>>>> Andrey
>>>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> amd-gfx mailing list
>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx at lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>
>>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20190117/3947a9bf/attachment.html>