[PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV

Liu, Monk Monk.Liu at amd.com
Mon Jun 29 09:03:10 UTC 2020


>>> We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?
Well I didn't notice this change explicitly for SRIOV, at least from the code it is quite a normal change 


>>> And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.
If you do the sequence like: load amdgpu and shutdown VM immediately , you will hit problems .

Virtual Machine's shutdown won't be blocked if the IB test is on the fly , e.g.: you can do "init 0" right after "modprobe amdgpu" and the "amdgpu_device_fini" won't be called 
Thus the flushing on IB test won't happen so the IB test is still running with the VM already shutdown and lead to GPU crash (usually VCN/VCE crash on invalid memory address)

 
_____________________________________
Monk Liu|GPU Virtualization Team |AMD


-----Original Message-----
From: Christian König <ckoenig.leichtzumerken at gmail.com> 
Sent: Monday, June 29, 2020 4:18 PM
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: make IB test synchronize with init for SRIOV

Am 29.06.20 um 09:11 schrieb Monk Liu:
> From: pengzhou <PengJu.Zhou at amd.com>
>
> issue:
> originally we kickoff IB test asynchronously with driver's init, thus 
> the IB test may still running when the driver loading done (modprobe amdgpu done).
> if we shutdown VM immediately after amdgpu driver loaded then GPU may 
> hang because the IB test is still running
>
> fix:
> make IB test synchronize with driver init thus it won't still running 
> when we shutdown the VM.

We explicitly added the asynchronously IB test for SRIOV to make driver load faster. Why is that now a problem?

And why would it help when the VM shuts down? We cancel/flush the test during driver unload/suspend as well.

>
> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 29 ++++++++++++++++++++++++-----
>   1 file changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 457f5d2..4f54660 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3292,8 +3292,16 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	/* must succeed. */
>   	amdgpu_ras_resume(adev);
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	r = sysfs_create_files(&adev->dev->kobj, amdgpu_dev_attributes);
>   	if (r) {
> @@ -3329,7 +3337,8 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
>   	int r;
>   
>   	DRM_INFO("amdgpu: finishing device.\n");
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);

You can drop this change, flushing a work which was never scheduled is harmless.

>   	adev->shutdown = true;
>   
>   	/* make sure IB test finished before entering exclusive mode @@ 
> -3425,7 +3434,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
>   	if (fbcon)
>   		amdgpu_fbdev_set_suspend(adev, 1);
>   
> -	cancel_delayed_work_sync(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work_sync(&adev->delayed_init_work);
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* turn off display hw */
> @@ -3528,8 +3538,16 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   	if (r)
>   		return r;
>   
> -	queue_delayed_work(system_wq, &adev->delayed_init_work,
> +	if (amdgpu_sriov_vf(adev)) {
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			DRM_ERROR("ib ring test failed (%d).\n", r);
> +			return r;
> +		}
> +	} else {
> +		queue_delayed_work(system_wq, &adev->delayed_init_work,
>   			   msecs_to_jiffies(AMDGPU_RESUME_MS));
> +	}
>   
>   	if (!amdgpu_device_has_dc_support(adev)) {
>   		/* pin cursors */
> @@ -3554,7 +3572,8 @@ int amdgpu_device_resume(struct drm_device *dev, bool fbcon)
>   		return r;
>   
>   	/* Make sure IB tests flushed */
> -	flush_delayed_work(&adev->delayed_init_work);
> +	if (!amdgpu_sriov_vf(adev))
> +		flush_delayed_work(&adev->delayed_init_work);
>   
>   	/* blat the mode back in */
>   	if (fbcon) {



More information about the amd-gfx mailing list