[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread

Wed Sep 1 04:49:32 UTC 2021

On 2021-09-01 12:40 a.m., Jingwen Chen wrote:
> On Wed Sep 01, 2021 at 12:28:59AM -0400, Andrey Grodzovsky wrote:
>> On 2021-09-01 12:25 a.m., Jingwen Chen wrote:
>>> On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
>>>> I will answer everything here -
>>>>
>>>> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
>>>>
>>>>
>>>>       [AMD Official Use Only]
>>>>
>>>>
>>>>       In the previous discussion, you guys stated that we should drop the
>>>>       “kthread_should_park” in cleanup_job.
>>>>
>>>>
>>>>       @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
>>>>       *sched)
>>>>
>>>>       {
>>>>
>>>>               struct drm_sched_job *job, *next;
>>>>
>>>>
>>>>       -       /*
>>>>
>>>>       -        * Don't destroy jobs while the timeout worker is running  OR
>>>>       thread
>>>>
>>>>       -        * is being parked and hence assumed to not touch pending_list
>>>>
>>>>       -        */
>>>>
>>>>       -       if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>
>>>>       -           !cancel_delayed_work(&sched->work_tdr)) ||
>>>>
>>>>       -           kthread_should_park())
>>>>
>>>>       -               return NULL;
>>>>
>>>>
>>>>       But I suddenly have a question here: if return the timedout job no matter
>>>>       kthread_should_park() or not, then we are backing to the original problem
>>>>       again: that the timedout_job is suddenly signaling and cleanup_job still
>>>>       returns it to sched_main and job is freed while it is still handling by
>>>>       vendor’s timeout callback
>>>>
>>>>
>>>>       If we return NULL when kthread_should_park() in cleanup_job, we can prevent
>>>>       above scenario from happening: once a job is processed by job_timedout we
>>>>       can stop its scheduler, and after that even this job suddenly signaled the
>>>>       cleanup_job won’t return it so sched_main won’t free it in parallel …
>>>>
>>>>
>>>>       What do you think ?
>>>>
>>>>
>>>> Is your analysis above takes into account that you also submit
>>>> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
>>>> problem -
>>> Hi Andrey,
>>> Monk has talked to me and we agreed that as there're multiple opinions about the
>>> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch
>>> 1 is an independent patch to fix some error. So we should not take the patch 2 into
>>> analysis.
>>>
>>>> I think that as long as you put kthread_park(sched->thread) BEFORE
>>>> fetching next bad job from pending list (under spinlock) there is no
>>>> such issue as in the case you describe because this potential bad job
>>>> that became signaled will be removed from pending list before you
>>>> even fetch the next job and by the time you fetch it the scheduler
>>>> thread is already stopped anyway
>>>>
>>>> If you don't submit and we keep the removal hack for now then also no problem
>>>> because
>>>> we temporary remove the job we fetch for TDR from pending list under spinlock
>>>> exactly to avoid this race
>>>>
>>> So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
>>> patch v3 keeps this kthread_should_park check.
>>
>> But since in both cases looks like there is no danger of use after free
>> then I see no reason to keep kthread_should_park check.
>>
>> Andrey
> OK, I get it. So patch v4 has removed this check, can you help review
> [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)?


Sure

Andrey


>>
>>> Best Regards,
>>> JingWen
>>>>       Thanks
>>>>
>>>>
>>>>       ------------------------------------------
>>>>
>>>>       Monk Liu | Cloud-GPU Core team
>>>>
>>>>       ------------------------------------------
>>>>
>>>>
>>>>       From: Liu, Monk
>>>>       Sent: Wednesday, September 1, 2021 9:23 AM
>>>>       To: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky, Andrey
>>>>       <Andrey.Grodzovsky at amd.com>; Daniel Vetter <daniel at ffwll.ch>; Chen, JingWen
>>>>       <JingWen.Chen2 at amd.com>
>>>>       Cc: DRI Development <dri-devel at lists.freedesktop.org>;
>>>>       amd-gfx at lists.freedesktop.org
>>>>       Subject: [diagnostic TDR mode patches] unify our solution opinions/
>>>>       suggestions in one thread
>>>>
>>>>
>>>>       [AMD Official Use Only]
>>>>
>>>>
>>>>       Hi Daniel/Christian/Andrey
>>>>
>>>>
>>>>       It looks the voice from you three are spread over those email floods to me,
>>>>       the feature we are working on (diagnostic TDR scheme) is pending there for
>>>>       more than 6 month (we started it from feb 2021).
>>>>
>>>>
>>>>       Honestly speaking the email ways that we are using now is not friendly and
>>>>       quite painful to me ….
>>>>
>>>>       Can we try to put all our opinions, suggestions, or even objects here
>>>>       together, let’s go through them one by one, it’s too hard for us to reply
>>>>       each email on different questions .
>>>>
>>>>
>>>>       For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>>>>
>>>>
>>>>       This is a fixing patch on the timeout timer in scheduler, can we complete
>>>>       this one first ? it should already resolved all the questions and
>>>>       suggestions.
>>>>
>>>>
>>>> I have no objections for this one besides getting rid of the
>>>> kthread_should_park()) return null part,
>>>> if my answer above is not wrong then it seems superfluous to me
>>>>
>>>>
>>>>
>>>>       For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>>>>
>>>>
>>>>       I think I already explained the questions raised by Daniel in other thread
>>>>       , regarding why I use __kthread_should_park()
>>>>
>>>>
>>>> Is this race free ? Can't the other thread execute kthread_park after the check
>>>> ?
>>>>
>>>>
>>>>       For other aspects, can we put all our opinion synthesized here ?
>>>>
>>>>
>>>> So to summarize from previous threads I think that the best solution
>>>> to the problem being solved in this patch is if we do HW fence embedding
>>>> at the drm_sched_job level instead of doing it only for amdgpu, and modifying
>>>> all
>>>> the drivers to support this we can both remove this hack and solve the race
>>>> against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
>>>> to the hw fence of the job at the beginning of drm_sched_job_timedout
>>>>
>>>> If doing this refactoring for all the drivers is not an option now and you need
>>>> a quick
>>>> solution such as the serialization you do here then take into account other
>>>> concurrent
>>>> TDR handlers that might run, as mentioned before, without serializing against
>>>> other TDR handlers from other
>>>> schedulers you just race here against them, e.g. you parked it now but another
>>>> one in progress will unpark it as part of calling  drm_sched_start for other
>>>> rings.
>>>> So you either need a global lock or dedicated single threaded queue as Daniel
>>>> suggested.
>>>> At minimum we should change cancel_delayed_work in drm_sched_stop to
>>>> cancel_delayed_work_sync
>>>> to cancel and flush all concurrent TDRs and probably move it to the begining of
>>>> the function, after kthread_park
>>>> and before we start playing with the pending list.
>>>>
>>>> P.S One comment I had regarding single threaded queue is that in case we have
>>>> multiple TDR
>>>> arising from same event we will proceed to resetting multiple times - something
>>>> that with trylock
>>>> we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
>>>> amdgpu_device_lock_hive_adev)
>>>> Daniel mentioned it's not a problem but I still haven't understood why not.
>>>>
>>>> Andrey
>>>>
>>>>
>>>>
>>>>       Thanks !
>>>>
>>>>
>>>>       ------------------------------------------
>>>>
>>>>       Monk Liu | Cloud-GPU Core team
>>>>
>>>>       ------------------------------------------
>>>>
>>>>