Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout

Andrey Grodzovsky Andrey.Grodzovsky at amd.com
Tue Jan 19 16:58:28 UTC 2021


On 1/19/21 11:39 AM, Chen, Horace wrote:
>
>
> Hi Andrey,
>
> I think the list in the XGMI hive won't be broken in the middle if we lock the 
> device before we change the list. If 2 devices in 1 hive enter the function, 
> they will follow the same sequence when locking the devices, so one of them 
> will definitely fail at the first device. I added the device iteration here 
> just to lock all devices in the hive, since we will change the device sequence 
> in the hive soon after.


I didn't mean 'break' in the sense of breaking the list itself, I meant the
literal 'break' statement that terminates the iteration once you fail to lock
a particular device.
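
I.e. with your patch the flow is, schematically:

	list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
		get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
		if (!get_dev_lock)
			break;	/* devices locked so far stay locked */
	}

and my question is what happens to the devices that were already locked when
that break hits.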


>
> The reason to break the iteration in the middle is that the list would be 
> changed during the iteration without taking any lock, which is quite bad since 
> that is exactly the kind of issue I'm fixing here. For the XGMI hive there are 
> 2 locks protecting the list: one is the device lock I changed here, and the 
> other, just in front of my change, is the hive->lock that protects the hive.
>
> Even if the bad thing really happened, I think moving back through the list is 
> also very dangerous since we don't know what the list will finally be, unless 
> we stack the devices we have iterated over in a mirrored list. That would be a 
> big change.


Not sure we are on the same page. My concern is: let's say your XGMI hive
consists of 2 devices, and you manage to successfully call
amdgpu_device_lock_adev for dev1 but then fail for dev2. In this case you will
bail out without releasing dev1, no?
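
Something along these lines on the failure path would avoid that leak (rough,
untested sketch -- list_for_each_entry_continue_reverse walks backwards from
the entry we stopped at, and amdgpu_device_unlock_adev is the existing unlock
counterpart in amdgpu_device.c):

	list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
			/* walk back from the failing device and drop the
			 * locks we already took */
			list_for_each_entry_continue_reverse(tmp_adev,
					&hive->device_list, gmc.xgmi.head)
				amdgpu_device_unlock_adev(tmp_adev);
			goto skip_recovery;
		}
	}

Note no mirrored list is needed -- the hive list is still intact at this point
since nothing was rotated yet.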


Andrey


>
>
> I'm OK to separate the locking in amdgpu_device_lock_adev here, I'll do some 
> tests and update the code later.
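
Sounds good. Roughly what I had in mind, for illustration (untested sketch;
amdgpu_device_trylock_adev and amdgpu_device_set_mp1_state are made-up names,
and I'm assuming the current lock_adev body is basically the in_gpu_reset
cmpxchg + reset_sem down_write + mp1_state setup):

	/* Pure locking part, safe to call while walking the hive list. */
	static bool amdgpu_device_trylock_adev(struct amdgpu_device *adev,
					       struct amdgpu_hive_info *hive)
	{
		if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
			return false;

		if (hive)
			down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
		else
			down_write(&adev->reset_sem);

		return true;
	}

	/* The remaining state setup, done only once every device in the
	 * hive has been successfully locked. */
	static void amdgpu_device_set_mp1_state(struct amdgpu_device *adev)
	{
		atomic_inc(&adev->gpu_reset_counter);

		switch (amdgpu_asic_reset_method(adev)) {
		case AMD_RESET_METHOD_MODE1:
			adev->mp1_state = PP_MP1_STATE_SHUTDOWN;
			break;
		case AMD_RESET_METHOD_MODE2:
			adev->mp1_state = PP_MP1_STATE_RESET;
			break;
		default:
			adev->mp1_state = PP_MP1_STATE_NONE;
			break;
		}
	}

That keeps the list walk to pure trylocks and leaves mp1_state and the reset
counter out of the failure path.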
>
> Thanks & Regards,
> Horace.
> --------------------------------------------------------------------------------
> *From:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* January 19, 2021 22:33
> *To:* Chen, Horace <Horace.Chen at amd.com>; amd-gfx at lists.freedesktop.org 
> <amd-gfx at lists.freedesktop.org>
> *Cc:* Quan, Evan <Evan.Quan at amd.com>; Tuikov, Luben <Luben.Tuikov at amd.com>; 
> Koenig, Christian <Christian.Koenig at amd.com>; Deucher, Alexander 
> <Alexander.Deucher at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhang, Hawking 
> <Hawking.Zhang at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xu, Feifei 
> <Feifei.Xu at amd.com>; Wang, Kevin(Yang) <Kevin1.Wang at amd.com>; Xiaojie Yuan 
> <xiaojie.yuan at amd.com>
> *Subject:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
>
> On 1/19/21 7:22 AM, Horace Chen wrote:
> > Fix a race issue when jobs on 2 rings time out simultaneously.
> >
> > If 2 rings time out at the same time, amdgpu_device_gpu_recover
> > will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> > by 2 local linked lists, which may cause a wild pointer issue
> > while iterating.
> >
> > Lock the device early to prevent the node from being added to 2
> > different lists.
> >
> > Signed-off-by: Horace Chen <horace.chen at amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
> >   1 file changed, 30 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 4d434803fb49..9574da3abc32 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >        int i, r = 0;
> >        bool need_emergency_restart = false;
> >        bool audio_suspended = false;
> > +     bool get_dev_lock = false;
> >
> >        /*
> >         * Special case: RAS triggered and full reset isn't supported
> > @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >         * Build list of devices to reset.
> >         * In case we are in XGMI hive mode, resort the device list
> >         * to put adev in the 1st position.
> > +      *
> > +      * Lock the device before we try to operate on the linked list.
> > +      * If we didn't get the device lock, don't touch the linked list
> > +      * since others may be iterating over it.
> >         */
> >        INIT_LIST_HEAD(&device_list);
> >        if (adev->gmc.xgmi.num_physical_nodes > 1) {
> >                if (!hive)
> >                        return -ENODEV;
> > -             if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> > -                     list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> > -             device_list_handle = &hive->device_list;
> > +
> > +             list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> > +                     get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> > +                     if (!get_dev_lock)
> > +                             break;
>
>
> What about unlocking all the devices you already locked if the break
> happens in the middle of the iteration?
> Note that at skip_recovery: we don't do it. BTW, I see this issue is already in
> the current code.
>
> Also, maybe now it's better to separate the actual locking in
> amdgpu_device_lock_adev
> from the other stuff going on there, since I don't think you would want to
> toggle things like adev->mp1_state back and forth, and the function name isn't
> descriptive of that other stuff anyway.
>
> Andrey
>
>
> > +             }
> > +             if (get_dev_lock) {
> > +                     if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> > +                             list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> > +                     device_list_handle = &hive->device_list;
> > +             }
> >        } else {
> > -             list_add_tail(&adev->gmc.xgmi.head, &device_list);
> > -             device_list_handle = &device_list;
> > +             get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> > +             tmp_adev = adev;
> > +             if (get_dev_lock) {
> > +                     list_add_tail(&adev->gmc.xgmi.head, &device_list);
> > +                     device_list_handle = &device_list;
> > +             }
> > +     }
> > +
> > +     if (!get_dev_lock) {
> > +             dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> > +                                     job ? job->base.id : -1);
> > +             r = 0;
> > +             /* even if we skipped this reset, we still need to set the job to guilty */
> > +             goto skip_recovery;
> >        }
> >
> >        /* block all schedulers and reset given job's ring */
> >        list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> > -             if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> > -                     dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> > -                               job ? job->base.id : -1);
> > -                     r = 0;
> > -                     goto skip_recovery;
> > -             }
> > -
> >                /*
> >                 * Try to put the audio codec into suspend state
> >                 * before gpu reset started.