Re: Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
Andrey Grodzovsky
Andrey.Grodzovsky at amd.com
Tue Jan 19 16:58:28 UTC 2021
On 1/19/21 11:39 AM, Chen, Horace wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
>
> Hi Andrey,
>
> I think the list in the XGMI hive won't be broken in the middle if we lock the
> device before we change the list. If 2 devices in 1 hive enter the function,
> they will follow the same sequence when locking the devices, so one of them
> will definitely stop at the first device. I added the device iteration here
> just to lock all devices in the hive, since we will change the device order
> in the hive soon after.
I didn't mean break in the sense of breaking the list itself, I just meant the
literal 'break' instruction
that terminates the iteration once you fail to lock a particular device.
>
> The reason I break the iteration in the middle is that the list could be
> changed during the iteration without any lock held. That is quite bad, since
> I'm fixing one such issue here. For the XGMI hive, there are 2 locks
> protecting the list: one is the device lock I changed here; the other, just
> before my change, is the hive->lock that protects the hive.
>
> Even if that bad thing really happened, I think walking back through the list
> is also very dangerous, since we don't know what the list will finally be,
> unless we stack the devices we have iterated through on a mirrored list. That
> would be a big change.
Not sure we are on the same page. My concern is: say your XGMI hive consists
of 2 devices, and you managed to successfully call amdgpu_device_lock_adev for
dev1 but then failed for dev2. In this case you will bail out without
releasing dev1, no ?
Andrey
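For concreteness, the all-or-nothing locking under discussion can be sketched as follows. This is a stand-alone illustration, not the amdgpu code: the `struct dev` array and `try_lock()` are stand-ins for `hive->device_list` and `amdgpu_device_lock_adev()`; in the kernel the unwind would typically use `list_for_each_entry_continue_reverse()` over the hive list.

```c
/* Illustrative sketch: try to lock every device in the hive, and if any
 * trylock fails, release the ones already taken, in reverse order,
 * before bailing out. */
#include <stdbool.h>

struct dev {
    const char *name;
    bool locked;    /* stands in for the per-device reset lock */
    bool lockable;  /* simulates whether the trylock would succeed */
};

static bool try_lock(struct dev *d)
{
    if (!d->lockable)
        return false;
    d->locked = true;
    return true;
}

static void unlock(struct dev *d)
{
    d->locked = false;
}

/* Lock all n devices or none: on failure at index i, unwind [0, i). */
static bool lock_all_or_none(struct dev *devs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (!try_lock(&devs[i])) {
            /* Unwind: release the devices locked so far, in reverse. */
            while (--i >= 0)
                unlock(&devs[i]);
            return false;
        }
    }
    return true;
}
```

If the last device refuses the lock, the earlier devices end up unlocked again, which is exactly the release that is missing when the loop just breaks and jumps to skip_recovery.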
>
>
> I'm OK to separate the locking in amdgpu_device_lock_adev here; I'll do some
> tests and update the code later.
>
> Thanks & Regards,
> Horace.
> --------------------------------------------------------------------------------
> *From:* Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> *Sent:* January 19, 2021 22:33
> *To:* Chen, Horace <Horace.Chen at amd.com>; amd-gfx at lists.freedesktop.org
> <amd-gfx at lists.freedesktop.org>
> *Cc:* Quan, Evan <Evan.Quan at amd.com>; Tuikov, Luben <Luben.Tuikov at amd.com>;
> Koenig, Christian <Christian.Koenig at amd.com>; Deucher, Alexander
> <Alexander.Deucher at amd.com>; Xiao, Jack <Jack.Xiao at amd.com>; Zhang, Hawking
> <Hawking.Zhang at amd.com>; Liu, Monk <Monk.Liu at amd.com>; Xu, Feifei
> <Feifei.Xu at amd.com>; Wang, Kevin(Yang) <Kevin1.Wang at amd.com>; Xiaojie Yuan
> <xiaojie.yuan at amd.com>
> *Subject:* Re: [PATCH 1/2] drm/amdgpu: race issue when jobs on 2 ring timeout
>
> On 1/19/21 7:22 AM, Horace Chen wrote:
> > Fix a race issue when jobs on 2 rings time out simultaneously.
> >
> > If 2 rings timed out at the same time, amdgpu_device_gpu_recover
> > will be reentered. Then the adev->gmc.xgmi.head will be grabbed
> > by 2 local linked lists, which may cause a wild pointer issue
> > during iteration.
> >
> > Lock the device early to prevent the node being added to 2 different
> > lists.
> >
> > Signed-off-by: Horace Chen <horace.chen at amd.com>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 42 +++++++++++++++-------
> > 1 file changed, 30 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 4d434803fb49..9574da3abc32 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4540,6 +4540,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> > int i, r = 0;
> > bool need_emergency_restart = false;
> > bool audio_suspended = false;
> > + bool get_dev_lock = false;
> >
> > /*
> > * Special case: RAS triggered and full reset isn't supported
> > @@ -4582,28 +4583,45 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
> *adev,
> > * Build list of devices to reset.
> > * In case we are in XGMI hive mode, resort the device list
> > * to put adev in the 1st position.
> > + *
> > + * Lock the device before we try to operate on the linked list.
> > + * If we didn't get the device lock, don't touch the linked list
> > + * since others may be iterating it.
> > + */
> > INIT_LIST_HEAD(&device_list);
> > if (adev->gmc.xgmi.num_physical_nodes > 1) {
> > if (!hive)
> > return -ENODEV;
> > - if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
> > - list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> > - device_list_handle = &hive->device_list;
> > +
> > + list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> > + get_dev_lock = amdgpu_device_lock_adev(tmp_adev, hive);
> > + if (!get_dev_lock)
> > + break;
>
>
> What about unlocking back all the devices you already locked if the break
> happens in the middle of the iteration ?
> Note that at skip_recovery: we don't do it. BTW, I see this issue is already
> in the current code.
>
> Also, maybe now it's better to separate the actual locking in
> amdgpu_device_lock_adev
> from the other stuff going on there, since I don't think you would want to
> toggle stuff
> like adev->mp1_state back and forth, and also the function name is not
> descriptive of
> the other stuff going on there anyway.
>
> Andrey
>
>
> > + }
> > + if (get_dev_lock) {
> > + if (!list_is_first(&adev->gmc.xgmi.head,
> &hive->device_list))
> > + list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
> > + device_list_handle = &hive->device_list;
> > + }
> > } else {
> > - list_add_tail(&adev->gmc.xgmi.head, &device_list);
> > - device_list_handle = &device_list;
> > + get_dev_lock = amdgpu_device_lock_adev(adev, hive);
> > + tmp_adev = adev;
> > + if (get_dev_lock) {
> > + list_add_tail(&adev->gmc.xgmi.head, &device_list);
> > + device_list_handle = &device_list;
> > + }
> > + }
> > +
> > + if (!get_dev_lock) {
> > + dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as
> another already in progress",
> > + job ? job->base.id : -1);
> > + r = 0;
> > + /* even if we skipped this reset, still need to set the job to
> guilty */
> > + goto skip_recovery;
> > }
> >
> > /* block all schedulers and reset given job's ring */
> > list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> > - if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> > - dev_info(tmp_adev->dev, "Bailing on TDR for
> s_job:%llx, as another already in progress",
> > - job ? job->base.id : -1);
> > - r = 0;
> > - goto skip_recovery;
> > - }
> > -
> > /*
> > * Try to put the audio codec into suspend state
> > * before gpu reset started.
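The other suggestion in the thread, splitting the pure locking out of amdgpu_device_lock_adev(), could look roughly like the sketch below. The types and helper names here are hypothetical stand-ins, not the real amdgpu structures; only mp1_state is taken from the discussion above.

```c
/* Hypothetical split of amdgpu_device_lock_adev() into a pure trylock
 * and a separate prepare step, so the lock can be probed for every hive
 * device without toggling per-device state back and forth. */
#include <stdbool.h>

enum mp1_state { MP1_STATE_NONE, MP1_STATE_RESET };

struct adev {
    bool in_reset;            /* stands in for the per-device reset lock */
    enum mp1_state mp1_state; /* field mentioned in the discussion */
};

/* Pure trylock: no side effects besides taking the lock itself. */
static bool adev_trylock_reset(struct adev *a)
{
    if (a->in_reset)
        return false;
    a->in_reset = true;
    return true;
}

/* Reset preparation, called only once ALL hive devices are locked. */
static void adev_prepare_reset(struct adev *a)
{
    a->mp1_state = MP1_STATE_RESET;
}

/* Undo both steps when recovery finishes or is abandoned. */
static void adev_unlock_reset(struct adev *a)
{
    a->mp1_state = MP1_STATE_NONE;
    a->in_reset = false;
}
```

With this split, the rollback path only has to call adev_unlock_reset() on devices whose trylock succeeded, and adev_prepare_reset() never runs for a recovery attempt that bails out.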