[PATCH v2] Revert "drm/scheduler: Avoid accessing freed bad job."

Tue Aug 31 18:24:52 UTC 2021

On 2021-08-31 9:11 a.m., Daniel Vetter wrote:
> On Thu, Aug 26, 2021 at 11:04:14AM +0200, Daniel Vetter wrote:
>> On Thu, Aug 19, 2021 at 11:25:09AM -0400, Andrey Grodzovsky wrote:
>>> On 2021-08-19 5:30 a.m., Daniel Vetter wrote:
>>>> On Wed, Aug 18, 2021 at 10:51:00AM -0400, Andrey Grodzovsky wrote:
>>>>> On 2021-08-18 10:42 a.m., Daniel Vetter wrote:
>>>>>> On Wed, Aug 18, 2021 at 10:36:32AM -0400, Andrey Grodzovsky wrote:
>>>>>>> On 2021-08-18 10:32 a.m., Daniel Vetter wrote:
>>>>>>>> On Wed, Aug 18, 2021 at 10:26:25AM -0400, Andrey Grodzovsky wrote:
>>>>>>>>> On 2021-08-18 10:02 a.m., Alex Deucher wrote:
>>>>>>>>>
>>>>>>>>>> + dri-devel
>>>>>>>>>>
>>>>>>>>>> Since scheduler is a shared component, please add dri-devel on all
>>>>>>>>>> scheduler patches.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 18, 2021 at 7:21 AM Jingwen Chen <Jingwen.Chen2 at amd.com> wrote:
>>>>>>>>>>> [Why]
>>>>>>>>>>> for bailing job, this commit will delete it from pending list thus the
>>>>>>>>>>> bailing job will never have a chance to be resubmitted even in advance
>>>>>>>>>>> tdr mode.
>>>>>>>>>>>
>>>>>>>>>>> [How]
>>>>>>>>>>> after embeded hw_fence into amdgpu_job is done, the race condition that
>>>>>>>>>>> this commit tries to work around is completely solved.So revert this
>>>>>>>>>>> commit.
>>>>>>>>>>> This reverts commit 135517d3565b48f4def3b1b82008bc17eb5d1c90.
>>>>>>>>>>> v2:
>>>>>>>>>>> add dma_fence_get/put() around timedout_job to avoid concurrent delete
>>>>>>>>>>> during processing timedout_job
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2 at amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>       drivers/gpu/drm/scheduler/sched_main.c | 23 +++++------------------
>>>>>>>>>>>       1 file changed, 5 insertions(+), 18 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index a2a953693b45..f9b9b3aefc4a 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -314,6 +314,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>       {
>>>>>>>>>>>              struct drm_gpu_scheduler *sched;
>>>>>>>>>>>              struct drm_sched_job *job;
>>>>>>>>>>> +       struct dma_fence *fence;
>>>>>>>>>>>              enum drm_gpu_sched_stat status = DRM_GPU_SCHED_STAT_NOMINAL;
>>>>>>>>>>>
>>>>>>>>>>>              sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
>>>>>>>>>>> @@ -325,11 +326,10 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>
>>>>>>>>>>>              if (job) {
>>>>>>>>>>>                      /*
>>>>>>>>>>> -                * Remove the bad job so it cannot be freed by concurrent
>>>>>>>>>>> -                * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
>>>>>>>>>>> -                * is parked at which point it's safe.
>>>>>>>>>>> +                * Get job->s_fence->parent here to avoid concurrent delete during
>>>>>>>>>>> +                * processing timedout_job
>>>>>>>>>>>                       */
>>>>>>>>>>> -               list_del_init(&job->list);
>>>>>>>>>>> +               fence = dma_fence_get(job->s_fence->parent);
>>>>>>>>> While this is true for amdgpu, it has no meaning for other drivers for whom
>>>>>>>>> we haven't
>>>>>>>>> done the refactoring of embedding HW fence (parent) into the job structure.
>>>>>>>>> In fact thinking
>>>>>>>>> about it, unless you do the HW fence embedding for all the drivers using the
>>>>>>>>> scheduler you cannot
>>>>>>>>> revert this patch or you will just break them.
>>>>>>>> btw, why did you do that embedding? I do still have my patches with
>>>>>>>> dma_fence annotations floating around, but my idea at least was to fix
>>>>>>>> that issue with a mempool, not with embeddeding. What was the motivation
>>>>>>>> for embedding the wh fence?
>>>>>>>> -Daniel
>>>>>>> The motivation was 2 fold, avoid memory allocation during jobs submissions
>>>>>>> (HW fence allocation) because as Christian explained this leads to deadlock
>>>>>>> with
>>>>>>> mm code during evictions due to memory pressure (Christian can clarify if I
>>>>>>> messed
>>>>>> Yeah that's the exact same thing I've chased with my dma_fence
>>>>>> annotations, but thus far zero to none interested in getting it sorted. I
>>>>>> think it'd be good to have some cross-driver agreement on how this should
>>>>>> be solved before someone just charges ahead ...
>>>>>>
>>>>>>> this explanation). Second is to exactly revert this patch because while it
>>>>>>> solved the issue
>>>>>>> described in the patch it created another with drivers who baildc out early
>>>>>>> during TDR handling
>>>>>>> for various reason and the job would just leak because it was already
>>>>>>> removed form pending list.
>>>>>> Can't we reinsert it before we restart the scheduler thread? It might need
>>>>>> a separate list for that due to the lockless queue tricks. Or am I
>>>>>> thinking about the wrong kind of "we lost the job"?
>>>>>> -Danile
>>>>> If you look at the original patch it would reinsert it even earlier - right
>>>>> after stopping the  SW scheduler thread, and even then it was to late for
>>>>> some drivers as they would decide to return back from their TDR handler even
>>>>> before that. It is solvable but in an ugly way as far as I see, you need to
>>>>> require each driver in his code to put the job back in the list if they do
>>>>> it before reaching the place where scheduler framework does it. Kind of
>>>>> spaghetti code seems to me.
>>>> Hm yeah I didn't realize this all happens before we stop the scheduler
>>>> thread.
>>>>
>>>> Why can't we stop the scheduler thread first, so that there's guaranteed
>>>> no race? I've recently had a lot of discussions with panfrost folks about
>>>> their reset that spawns across engines, and without stopping the scheduler
>>>> thread first before you touch anything it's just plain impossible.
>>>
>>> Talked with Christian on that, for each TDR we actually stop all the
>>> schedulers for all the rings and not only the hanged ring since
>>> ASIC reset will impact all the rings anyway. So we cannot allow
>>> other timeout handlers for other rings run in parallel to ours
>>> as they will stop/restart the threads we just stopped and rely
>>> on them being stopped. So it's all done with device wide lock
>>> inside the amdgpu tTDR handler. Only inside the locked
>>> section then we may stop/restart the scheduler threads.
>>> Christian also mentioned that you proposed at some point
>>> to serialize all TDR handling into single threading for all rings - this
>>> seems
>>> like something that could be used - we then don't need any
>>> locking against TDR handlers from other rings and then we may
>>> stop the scheduler thread as first step
>>>
>>>
>>>> I'm also still not understanding what exactly you guys have done,
>>>> can someone please dig out the the amdgpu patches that motivate all this
>>>> maybe that's clearer? A full explanation would still be good since I've
>>>> only started in scheduler stuff.
>>>
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fagd5f%2Flinux%2F-%2Fcommit%2Fde7515d43659f852590645a688f8d493e4a18141&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C94e4badd78c04cb74ad208d96c80debd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660123033001546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=itjKBNUdOAyze1%2FOWJtBD7ed%2B8PBbB28QbJEddkc98w%3D&reserved=0
>> Uh, it would have been really good if this was discussed a bit wider
>> beforehand. Now we have rather diverging approaches to this. Also would be
>> really good to resurrect the dma_fence annotations too.
>>
>> Can you guys pls spend a bit of time on this? Shouldn't be to hard to type
>> up rfc conversion patches for the other drivers.
> Ping for this. Currently the hw fence is returned from the ->run_job
> callback, and that's not great design.

What's the problem you see there ?

>
> If we embed it, then I think it should start existing latest from
> drm_sched_job_arm. Maybe not yet initialized, but at least allocated. So
> the right thing to do here is to have the hw fence as a pointer in
> struct drm_sched_job. And check in drm_sched_job_arm() that it's at least
> allocated.

Why we need to allocate the HW fence if it's embedded within a job struct ?

>
> Otherwise we're just diverging across drivers and tempting them to do the
> wrong thing with the current ->run_job callback interface.

Maybe we should switch from embedding in driver level job struct as it's now
to drm_sched_job and just leave the fence initialization to driver 
specific code ?

Andrey

>
> Can you guys look into this?
> -Daniel
>
>>>> Another thing I recently pondered for tdr races looking at i915 code is
>>>> whether the tdr should first block the completion fence for that job. My
>>>> motivation is to have a race-free error capture (if the completion races
>>>> then we might start evicting memory and everything goes boom), but maybe
>>>> that helps here too. Some kind of atomic "block this fence from
>>>> completing thing.
>>>>
>>>> Or I'm I completely guessing in the wrong direction?
>>>
>>> I think we already do it here - https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Felixir.bootlin.com%2Flinux%2Fv5.14-rc1%2Fsource%2Fdrivers%2Fgpu%2Fdrm%2Fscheduler%2Fsched_main.c%23L410&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C94e4badd78c04cb74ad208d96c80debd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660123033001546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Maya0Mk1sAliheOv7fCM8bTC7qTOp74Agt1u67kYCJw%3D&reserved=0
>> Ah yes this works becase drm/sched has separate hw fence from the logical
>> job fence.
>> -Daniel
>>
>>> Andrey
>>>
>>>
>>>> -Daniel
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>> Andrey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>
>>>>>>>>>>>                      status = job->sched->ops->timedout_job(job);
>>>>>>>>>>> @@ -342,6 +342,7 @@ static void drm_sched_job_timedout(struct work_struct *work)
>>>>>>>>>>>                              job->sched->ops->free_job(job);
>>>>>>>>>>>                              sched->free_guilty = false;
>>>>>>>>>>>                      }
>>>>>>>>>>> +               dma_fence_put(fence);
>>>>>>>>>>>              } else {
>>>>>>>>>>>                      spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>              }
>>>>>>>>>>> @@ -392,20 +393,6 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>>>>>>>>>
>>>>>>>>>>>              kthread_park(sched->thread);
>>>>>>>>>>>
>>>>>>>>>>> -       /*
>>>>>>>>>>> -        * Reinsert back the bad job here - now it's safe as
>>>>>>>>>>> -        * drm_sched_get_cleanup_job cannot race against us and release the
>>>>>>>>>>> -        * bad job at this point - we parked (waited for) any in progress
>>>>>>>>>>> -        * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
>>>>>>>>>>> -        * now until the scheduler thread is unparked.
>>>>>>>>>>> -        */
>>>>>>>>>>> -       if (bad && bad->sched == sched)
>>>>>>>>>>> -               /*
>>>>>>>>>>> -                * Add at the head of the queue to reflect it was the earliest
>>>>>>>>>>> -                * job extracted.
>>>>>>>>>>> -                */
>>>>>>>>>>> -               list_add(&bad->list, &sched->pending_list);
>>>>>>>>>>> -
>>>>>>>>>>>              /*
>>>>>>>>>>>               * Iterate the job list from later to  earlier one and either deactive
>>>>>>>>>>>               * their HW callbacks or remove them from pending list if they already
>>>>>>>>>>> --
>>>>>>>>>>> 2.25.1
>>>>>>>>>>>
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fblog.ffwll.ch%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7C94e4badd78c04cb74ad208d96c80debd%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637660123033001546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=r7EGQcWGcRinVxmD%2F%2FIFA8WgRpYNnt7feQseD92U6kc%3D&reserved=0