[PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.

Christian König ckoenig.leichtzumerken at gmail.com
Mon Nov 26 19:34:12 UTC 2018


> What I mean is - should we get rid of dma_fence_add/remove_callback
> logic in drm_sched_job_timedout and do it for each driver in between
> scheduler deactivation and activation back?
>

Yes, exactly. That's why I already have a revert of that patch, which
removes the dance from drm_sched_job_timedout again.
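
To make the ordering concrete, here is a minimal sketch in plain C. It is
only a toy model: every toy_* name is hypothetical and none of this is the
real drm_sched or dma_fence API. It assumes the driver detaches the
callbacks only after its schedulers are stopped, skips the ASIC reset when
the guilty job has already signaled, and reattaches before restarting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the real kernel structures. */
struct toy_job {
	bool hw_fence_signaled;	/* hardware fence has already signaled */
	bool cb_attached;	/* scheduler fence callback hooked to HW fence */
};

struct toy_sched {
	bool running;
	struct toy_job *jobs;	/* stand-in for ring_mirror_list */
	size_t num_jobs;
};

/* Pre handling: stop the scheduler first, then detach all callbacks. */
static void toy_sched_stop(struct toy_sched *s)
{
	s->running = false;
	for (size_t i = 0; i < s->num_jobs; i++)
		s->jobs[i].cb_attached = false;
}

/* Post handling: reattach callbacks for unfinished jobs, then restart. */
static void toy_sched_start(struct toy_sched *s)
{
	for (size_t i = 0; i < s->num_jobs; i++)
		if (!s->jobs[i].hw_fence_signaled)
			s->jobs[i].cb_attached = true;
	s->running = true;
}

/*
 * Driver-side recovery across all schedulers (e.g. all devices in a hive).
 * Returns true if a reset was performed; *reset_count stands in for the
 * actual ASIC reset.
 */
static bool toy_gpu_recover(struct toy_sched *scheds, size_t n,
			    const struct toy_job *guilty, int *reset_count)
{
	bool need_reset;

	for (size_t i = 0; i < n; i++)
		toy_sched_stop(&scheds[i]);

	/* Middle handling: the guilty job may have completed in the meantime. */
	need_reset = !guilty->hw_fence_signaled;
	if (need_reset)
		(*reset_count)++;

	for (size_t i = 0; i < n; i++)
		toy_sched_start(&scheds[i]);

	return need_reset;
}
```

Because nothing is detached while a scheduler is still running, no new job
can slip into the list between the detach and the reset, and each list is
walked in exactly one place.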

Christian.


On 26.11.18 at 20:28, Grodzovsky, Andrey wrote:
>
>
> Actually, after looking again at drm_sched_job_timedout, from which
> amdgpu_device_gpu_recover will be called, I see that we already
> disconnect all the pending scheduler fences from the HW fence, including
> the guilty job. I also see that in drm_sched_job_timedout
> job_list_lock is released before calling sched->ops->timedout_job and
> then reacquired after, so new jobs can slip into ring_mirror_list in
> between.
>
> Also, I will end up going over the ring_mirror_list twice, once
> from amdgpu_device_post_asic_reset and later from
> drm_sched_job_timedout - this might cause double fence processing.
>
> Isn't it more correct to do the disconnect from the HW fence only after
> the schedulers have been stopped, and to connect back before we restart
> the schedulers (as you pointed out here before)?
>
> What I mean is - should we get rid of dma_fence_add/remove_callback
> logic in drm_sched_job_timedout and do it for each driver in between
> scheduler deactivation and activation back?
>
> Andrey
>
>
> On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:
>>>>> In addition to that I would try to improve the pre, middle, post
>>>>> handling towards checking if we made some progress in between.
>>>>>
>>>>> In other words we stop all schedulers in the pre handling and
>>>>> disconnect the scheduler fences from the hardware fence like I did in
>>>>> patch "drm/sched: fix timeout handling v2".
>>>>>
>>>>> Then before we do the actual reset in the middle handling we check if
>>>>> the offending job has completed or at least made some progress in the
>>>>> meantime.
>>>> I understand how to check if the job completed - if its fence has
>>>> already signaled - but how do I test if the job made 'at least some
>>>> progress'?
>>> Good question. Maybe we can somehow query from the hardware the number
>>> of primitives or pixels processed so far and then compare after a moment?
>> I will check on this later. In the meantime I will update the code
>> with the proposed per hive locking and I will add a check for whether
>> the guilty job completed before the ASIC reset, skipping the reset if
>> it did.
>>
>> Andrey
>>
>
>
