[PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal

Tue Feb 27 09:52:52 UTC 2018

> Is my understanding wrong ?...
The first thing that cancel_delayed_work_sync() does is it takes the 
work item from the TODO list.

The it makes sure that the work is currently not running on another CPU.

The return value just indicates if the work was still on the TODO list 
or not, e.g. when false is returned when the work was already done 
before the call was made, true otherwise.

So the while loop isn't necessary here.

Regards,
Christian.

Am 27.02.2018 um 10:09 schrieb Liu, Monk:
>> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
> I thought that only one call on cancel_delayed_work_sync() cannot guarantee that after this function returned, the work had been finished, so if the work is still pending (return you TRUE) there
> And the caller just continue and release driver, we would run to page fault in that work since all structures pointers accessed in that work is invalid now
>
> That's why I use a while() loop to keep calling cancel_delayed_work_sync() till the work is not pending any more
>
> Is my understanding wrong ?...
> /Monk
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月27日 16:42
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on bare-metal
>
> Am 27.02.2018 um 05:45 schrieb Liu, Monk:
>>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>> I already did it:
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> What do you mean never call "schedule()" this way ?
> Never use schedule() in a while loop as long as you don't work on core locking primitives or interrupt handling or stuff like that. In this particular case a single call to cancel_delayed_work_sync() should be sufficient.
>
> Additional to that I think you can drop the "if (!amdgpu_sriov_vf(adev))" check and just test for "if (r)" or something like that. In other words if there is an error we need to cancel the late init work independent if we are on SRIOV or bare metal.
>
> Regards,
> Christian.
>
>> Please show me how to do it to achieve the same goal and I can modify
>> my patch
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>> Sent: 2018年2月26日 18:08
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 09/22] drm/amdgpu: fix kmd reload bug on
>> bare-metal
>>
>> Am 26.02.2018 um 06:18 schrieb Monk Liu:
>>> issue:
>>> on bare-metal when doing kmd reload test, there are chance that
>>> kernel hit fatal error afer driver unloaded/reloaded
>>>
>>> fix:
>>> the cause is that those "idle work" not really stopped and if kmd was
>>> is unloaded too quick that were chance that "idle work" run after
>>> driver structures already released which introduces this issue.
>> In this case I think it would be much better to wait for the idle work before trying to unload the driver.
>>
>>> Change-Id: Idb0f7db771e7ca60dba925d1d0f48b1de08dc89e
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++++--
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c    | 3 ++-
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c    | 4 +++-
>>>     3 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 54145ec..69fb5e50 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -1419,7 +1419,8 @@ static int amdgpu_device_ip_late_init(struct amdgpu_device *adev)
>>>     		}
>>>     	}
>>>     
>>> -	mod_delayed_work(system_wq, &adev->late_init_work,
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		mod_delayed_work(system_wq, &adev->late_init_work,
>>>     			msecs_to_jiffies(AMDGPU_RESUME_MS));
>>>     
>>>     	amdgpu_device_fill_reset_magic(adev);
>>> @@ -2087,7 +2088,11 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>>>     		adev->firmware.gpu_info_fw = NULL;
>>>     	}
>>>     	adev->accel_working = false;
>>> -	cancel_delayed_work_sync(&adev->late_init_work);
>>> +
>>> +	if (!amdgpu_sriov_vf(adev))
>>> +		while (cancel_delayed_work_sync(&adev->late_init_work))
>>> +			schedule(); /* to make sure late_init_work really stopped */
>> Never use schedule() like that.
>>
>> Regards,
>> Christian.
>>
>>> +
>>>     	/* free i2c buses */
>>>     	if (!amdgpu_device_has_dc_support(adev))
>>>     		amdgpu_i2c_fini(adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> index caba610..337db57 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
>>> @@ -299,7 +299,8 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->uvd.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->uvd.idle_work))
>>> +			schedule(); /* to make sure idle work really stopped */
>>>     
>>>     	for (i = 0; i < adev->uvd.max_handles; ++i)
>>>     		if (atomic_read(&adev->uvd.handles[i]))
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> index a829350..2874fda 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vce.c
>>> @@ -243,7 +243,9 @@ int amdgpu_vce_suspend(struct amdgpu_device *adev)
>>>     		return 0;
>>>     
>>>     	if (!amdgpu_sriov_vf(adev))
>>> -		cancel_delayed_work_sync(&adev->vce.idle_work);
>>> +		while (cancel_delayed_work_sync(&adev->vce.idle_work))
>>> +			schedule(); /* to make sure the idle_work really stopped */
>>> +
>>>     	/* TODO: suspending running encoding sessions isn't supported */
>>>     	return -EINVAL;
>>>     }