[diagnostic TDR mode patches] unify our solution opinions/suggestions in one thread
Jingwen Chen
Jingwen.Chen2 at amd.com
Wed Sep 1 04:25:13 UTC 2021
On Wed Sep 01, 2021 at 12:04:47AM -0400, Andrey Grodzovsky wrote:
> I will answer everything here -
>
> On 2021-08-31 9:58 p.m., Liu, Monk wrote:
>
>
> [AMD Official Use Only]
>
>
>
> In the previous discussion, you guys stated that we should drop the
> “kthread_should_park” in cleanup_job.
>
>
>
> @@ -676,15 +676,6 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler
> *sched)
>
> {
>
> struct drm_sched_job *job, *next;
>
>
>
> - /*
>
> - * Don't destroy jobs while the timeout worker is running OR
> thread
>
> - * is being parked and hence assumed to not touch pending_list
>
> - */
>
> - if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>
> - !cancel_delayed_work(&sched->work_tdr)) ||
>
> - kthread_should_park())
>
> - return NULL;
>
>
>
> But I suddenly have a question here: if return the timedout job no matter
> kthread_should_park() or not, then we are backing to the original problem
> again: that the timedout_job is suddenly signaling and cleanup_job still
> returns it to sched_main and job is freed while it is still handling by
> vendor’s timeout callback
>
>
>
> If we return NULL when kthread_should_park() in cleanup_job, we can prevent
> above scenario from happening: once a job is processed by job_timedout we
> can stop its scheduler, and after that even this job suddenly signaled the
> cleanup_job won’t return it so sched_main won’t free it in parallel …
>
>
>
> What do you think ?
>
>
> Is your analysis above takes into account that you also submit
> '[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' then I don't see a
> problem -
Hi Andrey,
Monk has talked to me and we agreed that as there're multiple opinions about the
'[PATCH 2/2] drm/sched: serialize job_timeout and scheduler' and patch
1 is an independent patch to fix some error. So we should not take the patch 2 into
analysis.
> I think that as long as you put kthread_park(sched->thread) BEFORE
> fetching next bad job from pending list (under spinlock) there is no
> such issue as in the case you describe because this potential bad job
> that became signaled will be removed from pending list before you
> even fetch the next job and by the time you fetch it the scheduler
> thread is already stopped anyway
>
> If you don't submit and we keep the removal hack for now then also no problem
> because
> we temporary remove the job we fetch for TDR from pending list under spinlock
> exactly to avoid this race
>
So can you help review [PATCH 1/2] drm/sched: fix the bug of time out calculation(v3)?
patch v3 keeps this kthread_should_park check.
Best Regards,
JingWen
>
>
> Thanks
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>
>
> From: Liu, Monk
> Sent: Wednesday, September 1, 2021 9:23 AM
> To: Koenig, Christian <Christian.Koenig at amd.com>; Grodzovsky, Andrey
> <Andrey.Grodzovsky at amd.com>; Daniel Vetter <daniel at ffwll.ch>; Chen, JingWen
> <JingWen.Chen2 at amd.com>
> Cc: DRI Development <dri-devel at lists.freedesktop.org>;
> amd-gfx at lists.freedesktop.org
> Subject: [diagnostic TDR mode patches] unify our solution opinions/
> suggestions in one thread
>
>
>
> [AMD Official Use Only]
>
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me,
> the feature we are working on (diagnostic TDR scheme) is pending there for
> more than 6 month (we started it from feb 2021).
>
>
>
> Honestly speaking the email ways that we are using now is not friendly and
> quite painful to me ….
>
> Can we try to put all our opinions, suggestions, or even objects here
> together, let’s go through them one by one, it’s too hard for us to reply
> each email on different questions .
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This is a fixing patch on the timeout timer in scheduler, can we complete
> this one first ? it should already resolved all the questions and
> suggestions.
>
>
> I have no objections for this one besides getting rid of the
> kthread_should_park()) return null part,
> if my answer above is not wrong then it seems superfluous to me
>
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in other thread
> , regarding why I use __kthread_should_park()
>
>
> Is this race free ? Can't the other thread execute kthread_park after the check
> ?
>
>
> For other aspects, can we put all our opinion synthesized here ?
>
>
> So to summarize from previous threads I think that the best solution
> to the problem being solved in this patch is if we do HW fence embedding
> at the drm_sched_job level instead of doing it only for amdgpu, and modifying
> all
> the drivers to support this we can both remove this hack and solve the race
> against concurrent drm_sched_cleanup_jobs job freeing just by taking reference
> to the hw fence of the job at the beginning of drm_sched_job_timedout
>
> If doing this refactoring for all the drivers is not an option now and you need
> a quick
> solution such as the serialization you do here then take into account other
> concurrent
> TDR handlers that might run, as mentioned before, without serializing against
> other TDR handlers from other
> schedulers you just race here against them, e.g. you parked it now but another
> one in progress will unpark it as part of calling drm_sched_start for other
> rings.
> So you either need a global lock or dedicated single threaded queue as Daniel
> suggested.
> At minimum we should change cancel_delayed_work in drm_sched_stop to
> cancel_delayed_work_sync
> to cancel and flush all concurrent TDRs and probably move it to the begining of
> the function, after kthread_park
> and before we start playing with the pending list.
>
> P.S One comment I had regarding single threaded queue is that in case we have
> multiple TDR
> arising from same event we will proceed to resetting multiple times - something
> that with trylock
> we mostly avoid today within amdgpu (see amdgpu_device_lock_adev and
> amdgpu_device_lock_hive_adev)
> Daniel mentioned it's not a problem but I still haven't understood why not.
>
> Andrey
>
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>
>
More information about the amd-gfx
mailing list