[PATCH] drm/amdgpu:remove debugfs file in amdgpu_device_finish

Wed Nov 8 02:40:31 UTC 2017

Hi Christian,

The retry_init only handles the failure caused by exclusive mode timeout. It checks the MMIO to see if there’s exclusive mode timeout, and retry init if there’s, otherwise just return error.

For exclusive timeout case, the host layer issues a FLR on this VF so driver needn't cleanup hardware status, amdgpu_device_fini here just is used to cleanup the software.

It’s tested and proved working correctly. Although the debugfs files are only the tip of the iceberg, it’s the only issue we found in this version of retry_init.

— 
Sincerely Yours,
Pixel

On 07/11/2017, 5:56 PM, "Koenig, Christian" <Christian.Koenig at amd.com> wrote:

>Hi Gary,
>
>well that patch is nonsense to begin with.
>
>amdgpu_device_init() does quite a bunch of other initialization which is 
>not cleaned up by amdgpu_device_fini(), so the debugfs files are only 
>the tip of the iceberg here.
>
>Please revert 2316518efc459928ad1d3d2d3511ea5fbda19475 and then we can 
>try again from scratch.
>
>What we need to do is return -EAGAIN from amdgpu_driver_load_kms. Then 
>in amdgpu_pci_probe() we can catch that error and call 
>drm_dev_register() multiple times if necessary.
>
>This way we can also optionally pci_disable_device() / 
>pci_enable_device() between tries if appropriate.
>
>Regards,
>Christian.
>
>Am 07.11.2017 um 09:02 schrieb Sun, Gary:
>> Hi Christian,
>>
>> The feature is for GPU virtualization and has been checked in, you can refer to the following patch or commit 75b126427778218b36cfb68637e4f8d0e584b8ef.
>>
>>  From 2316518efc459928ad1d3d2d3511ea5fbda19475 Mon Sep 17 00:00:00 2001
>> From: pding <Pixel.Ding at amd.com>
>> Date: Mon, 23 Oct 2017 17:22:09 +0800
>> Subject: [PATCH 001/121] drm/amdgpu: retry init if it fails due to exclusive mode timeout (v3)
>>
>> The exclusive mode has real-time limitation in reality, such like being
>> done in 300ms. It's easy observed if running many VF/VMs in single host
>> with heavy CPU workload.
>>
>> If we find the init fails due to exclusive mode timeout, try it again.
>>
>> v2:
>>   - rewrite the condition for readable value.
>>
>> v3:
>>   - fix typo, add comments for sleep
>>
>> Acked-by: Alex Deucher <alexander.deucher at amd.com>
>> Signed-off-by: pding <Pixel.Ding at amd.com>
>> Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
>> Signed-off-by: Gary Sun <Gary.Sun at amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   10 ++++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |   15 +++++++++++++--
>>   2 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 125f77d..385b10e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2303,6 +2303,15 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>   
>>   	r = amdgpu_init(adev);
>>   	if (r) {
>> +		/* failed in exclusive mode due to timeout */
>> +		if (amdgpu_sriov_vf(adev) &&
>> +		    !amdgpu_sriov_runtime(adev) &&
>> +		    amdgpu_virt_mmio_blocked(adev) &&
>> +		    !amdgpu_virt_wait_reset(adev)) {
>> +			dev_err(adev->dev, "VF exclusive mode timeout\n");
>> +			r = -EAGAIN;
>> +			goto failed;
>> +		}
>>   		dev_err(adev->dev, "amdgpu_init failed\n");
>>   		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_AMDGPU_INIT_FAIL, 0, 0);
>>   		amdgpu_fini(adev);
>> @@ -2390,6 +2399,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>   	amdgpu_vf_error_trans_all(adev);
>>   	if (runtime)
>>   		vga_switcheroo_fini_domain_pm_ops(adev->dev);
>> +
>>   	return r;
>>   }
>>   
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 720139e..f313eee 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -86,7 +86,7 @@ void amdgpu_driver_unload_kms(struct drm_device *dev)
>>   int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>   {
>>   	struct amdgpu_device *adev;
>> -	int r, acpi_status;
>> +	int r, acpi_status, retry = 0;
>>   
>>   #ifdef CONFIG_DRM_AMDGPU_SI
>>   	if (!amdgpu_si_support) {
>> @@ -122,6 +122,7 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>   		}
>>   	}
>>   #endif
>> +retry_init:
>>   
>>   	adev = kzalloc(sizeof(struct amdgpu_device), GFP_KERNEL);
>>   	if (adev == NULL) {
>> @@ -144,7 +145,17 @@ int amdgpu_driver_load_kms(struct drm_device *dev, unsigned long flags)
>>   	 * VRAM allocation
>>   	 */
>>   	r = amdgpu_device_init(adev, dev, dev->pdev, flags);
>> -	if (r) {
>> +	if (r == -EAGAIN && ++retry <= 3) {
>> +		adev->virt.caps &= ~AMDGPU_SRIOV_CAPS_RUNTIME;
>> +		adev->virt.ops = NULL;
>> +		amdgpu_device_fini(adev);
>> +		kfree(adev);
>> +		dev->dev_private = NULL;
>> +		/* Don't request EX mode too frequently which is attacking */
>> +		msleep(5000);
>> +		dev_err(&dev->pdev->dev, "retry init %d\n", retry);
>> +		goto retry_init;
>> +	} else if (r) {
>>   		dev_err(&dev->pdev->dev, "Fatal error during GPU init\n");
>>   		goto out;
>>   	}
>
>