[PATCH] drm/amdgpu:remove debugfs file in amdgpu_device_finish

Tue Nov 7 09:56:48 UTC 2017

Hi Gary,

Well, that patch is nonsense to begin with.

amdgpu_device_init() does quite a lot of other initialization which is
not cleaned up by amdgpu_device_fini(), so the debugfs files are only
the tip of the iceberg here.

Please revert 2316518efc459928ad1d3d2d3511ea5fbda19475 and then we can 
try again from scratch.

What we need to do is return -EAGAIN from amdgpu_driver_load_kms(). Then
in amdgpu_pci_probe() we can catch that error and call
drm_dev_register() again, multiple times if necessary.

This way we can also optionally call pci_disable_device() /
pci_enable_device() between tries if appropriate.
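
To illustrate the control flow I have in mind, here is a rough userspace
sketch, not actual driver code. struct fake_dev, fake_register() and
fake_probe() are made-up stand-ins for the PCI device,
drm_dev_register() and amdgpu_pci_probe(). The limit of 3 retries is
simply taken from the patch quoted below, and the sleep between tries is
left out.

```c
#include <errno.h>
#include <stdio.h>

#define MAX_INIT_RETRIES 3 /* assumed limit, borrowed from the quoted patch */

/* Hypothetical stand-in for the device state. */
struct fake_dev {
	int attempts;     /* how often registration was tried */
	int fail_first_n; /* simulate this many exclusive mode timeouts */
	int enabled;      /* mirrors pci_enable_device()/pci_disable_device() */
};

/* Stand-in for drm_dev_register(), which ends up in the load callback. */
static int fake_register(struct fake_dev *d)
{
	d->attempts++;
	if (d->attempts <= d->fail_first_n)
		return -EAGAIN; /* init timed out, worth another try */
	return 0;
}

/* Stand-in for amdgpu_pci_probe(): the retry loop lives here. */
static int fake_probe(struct fake_dev *d)
{
	int r, retry = 0;

	d->enabled = 1;
	for (;;) {
		r = fake_register(d);
		if (r != -EAGAIN || retry++ >= MAX_INIT_RETRIES)
			break;
		/* Optionally disable and re-enable the device between tries. */
		d->enabled = 0;
		d->enabled = 1;
		fprintf(stderr, "retry init %d\n", retry);
	}
	if (r)
		d->enabled = 0; /* give up: leave the device disabled */
	return r;
}
```

The point is that each try starts from a freshly created device state in
the probe function, so nothing half-initialized from the previous
attempt needs to be unwound inside the load callback.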

Regards,
Christian.

On 07.11.2017 at 09:02, Sun, Gary wrote:
> Hi Christian,
>
> The feature is for GPU virtualization and has been checked in, you can refer to the following patch or commit 75b126427778218b36cfb68637e4f8d0e584b8ef.
>
>  From 2316518efc459928ad1d3d2d3511ea5fbda19475 Mon Sep 17 00:00:00 2001
> From: pding <Pixel.Ding at amd.com>
> Date: Mon, 23 Oct 2017 17:22:09 +0800
> Subject: [PATCH 001/121] drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)
>
> The exclusive mode has real-time limitation in reality, such like being
> done in 300ms. It's easy observed if running many VF/VMs in single host
> with heavy CPU workload.
>
> If we find the init fails due to exclusive mode timeout, try it again.
>
> v2:
>   - rewrite the condition for readable value.
>
> v3:
>   - fix typo, add comments for sleep
>
> Acked-by: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: pding <Pixel.Ding at amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> Signed-off-by: Gary Sun <Gary.Sun at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   10 ++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   15 +++++++++++++--
>   2 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 125f77d..385b10e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   
>   	r = amdgpu_init(adev);
>   	if (r) {
> +		/* failed in exclusive mode due to timeout */
> +		if (amdgpu_sriov_vf(adev) &&
> +		    !amdgpu_sriov_runtime(adev) &&
> +		    amdgpu_virt_mmio_blocked(adev) &&
> +		    !amdgpu_virt_wait_reset(adev)) {
> +			dev_err(adev->dev, "VF exclusive mode timeout\n");
> +			r = -EAGAIN;
> +			goto failed;
> +		}
>   		dev_err(adev->dev, "amdgpu_init failed\n");
>   		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
>   		amdgpu_fini(adev);
> @@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	amdgpu_vf_error_trans_all(adev);
>   	if (runtime)
>   		vga_switcheroo_fini_domain_pm_ops(adev->dev);
> +
>   	return r;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 720139e..f313eee 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>   int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>   {
>   	struct amdgpu_device *adev;
> -	int r, acpi_status;
> +	int r, acpi_status, retry = 0;
>   
>   #ifdef CONFIG_DRM_AMDGPU_SI
>   	if (!amdgpu_si_support) {
> @@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>   		}
>   	}
>   #endif
> +retry_init:
>   
>   	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
>   	if (adev == NULL) {
> @@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>   	 * VRAM allocation
>   	 */
>   	r = amdgpu_device_init(adev, dev, dev->pdev, flags);
> -	if (r) {
> +	if (r == -EAGAIN && ++retry <= 3) {
> +		adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
> +		adev->virt.ops = NULL;
> +		amdgpu_device_fini(adev);
> +		kfree(adev);
> +		dev->dev_private = NULL;
> +		/* Don't request EX mode too frequently which is attacking */
> +		msleep(5000);
> +		dev_err(&dev->pdev->dev, "retry init %d\n", retry);
> +		goto retry_init;
> +	} else if (r) {
>   		dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>   		goto out;
>   	}