[PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
Liu, Monk
Monk.Liu at amd.com
Tue Oct 10 07:12:55 UTC 2017
Then the question is how we treat recovery if VRAM lost ?
-----Original Message-----
From: Koenig, Christian
Sent: 2017年10月10日 14:59
To: Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org; Daenzer, Michel <Michel.Daenzer at amd.com>
Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu reseted
As Nicolai explained that approach simply won't work.
The fd is used by more than just the closed source Vulkan driver and I think even by some components not developed by AMD (common X code?
Michel please comment as well).
So closing it and reopening it to handle a GPU reset is simply not an option.
Regards,
Christian.
Am 10.10.2017 um 06:26 schrieb Liu, Monk:
> After VRAM lost happens, all clients no matter radv/mesa/ogl is
> useless,
>
> Any drivers uses this FD should be denied by KMD after VRAM lost, and
> UMD can destroy/close this FD and re-open it and rebuild all resources
>
> That's the only option for VRAM lost case
>
>
>
> -----Original Message-----
> From: Nicolai Hähnle [mailto:nhaehnle at gmail.com]
> Sent: 2017年10月9日 19:01
> To: Liu, Monk <Monk.Liu at amd.com>; Koenig, Christian
> <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
> reseted
>
> On 09.10.2017 10:35, Liu, Monk wrote:
>> Please be aware that this policy is what the strict mode defined and
>> what customer want, And also please check VK spec, it defines that
>> after GPU reset all vk INSTANCE should close/release its
>> resource/device/ctx and all buffers, and call re-initvkinstance after
>> gpu reset
> Sorry, but you simply cannot implement a correct user-space implementation of those specs on top of this.
>
> It will break as soon as you have both OpenGL and Vulkan running in the same process (or heck, our Vulkan and radv :)), because both drivers will use the same fd.
>
> Cheers,
> Nicolai
>
>
>
>> So this whole approach is what just aligned with the spec, and to not
>> influence with current MESA/OGL client that's why I put the whole
>> approach into the strict mode And by default strict mode is not
>> selected
>>
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>> Sent: 2017年10月9日 16:26
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: Re: [PATCH 09/12] drm/amdgpu/sriov:return -ENODEV if gpu
>> reseted
>>
>> Am 30.09.2017 um 08:03 schrieb Monk Liu:
>>> for SRIOV strict mode gpu reset:
>>>
>>> In kms open we mark the latest adev->gpu_reset_counter in fpriv we
>>> return -ENODEV in cs_ioctl or info_ioctl if they found
>>> fpriv->gpu_reset_counter != adev->gpu_reset_counter.
>>>
>>> this way we prevent a potential bad process/FD from submitting cmds
>>> and notify userspace with -ENODEV.
>>>
>>> userspace should close all BO/ctx and re-open dri FD to re-create
>>> virtual memory system for this process
>> The whole aproach is a NAK from my side.
>>
>> We need to enable userspace to continue, not force it into process termination to recover. Otherwise we could send a SIGTERM in the first place.
>>
>> Regards,
>> Christian.
>>
>>> Change-Id: Ib4c179f28a3d0783837566f29de07fc14aa9b9a4
>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 5 +++++
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 +++++++
>>> 3 files changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index de9c164..b40d4ba 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -772,6 +772,7 @@ struct amdgpu_fpriv {
>>> struct idr bo_list_handles;
>>> struct amdgpu_ctx_mgr ctx_mgr;
>>> u32 vram_lost_counter;
>>> + int gpu_reset_counter;
>>> };
>>>
>>> /*
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> index 9467cf6..6a1515e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>> @@ -1199,6 +1199,11 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>>> if (amdgpu_kms_vram_lost(adev, fpriv))
>>> return -ENODEV;
>>>
>>> + if (amdgpu_sriov_vf(adev) &&
>>> + amdgpu_sriov_reset_level == 1 &&
>>> + fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> + return -ENODEV;
>>> +
>>> parser.adev = adev;
>>> parser.filp = filp;
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 282f45b..bd389cf 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -285,6 +285,11 @@ static int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file
>>> if (amdgpu_kms_vram_lost(adev, fpriv))
>>> return -ENODEV;
>>>
>>> + if (amdgpu_sriov_vf(adev) &&
>>> + amdgpu_sriov_reset_level == 1 &&
>>> + fpriv->gpu_reset_counter < atomic_read(&adev->gpu_reset_counter))
>>> + return -ENODEV;
>>> +
>>> switch (info->query) {
>>> case AMDGPU_INFO_ACCEL_WORKING:
>>> ui32 = adev->accel_working;
>>> @@ -824,6 +829,8 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>> goto out_suspend;
>>> }
>>>
>>> + fpriv->gpu_reset_counter = atomic_read(&adev->gpu_reset_counter);
>>> +
>>> r = amdgpu_vm_init(adev, &fpriv->vm,
>>> AMDGPU_VM_CONTEXT_GFX, 0);
>>> if (r) {
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>
> --
> Lerne, wie die Welt wirklich ist,
> Aber vergiss niemals, wie sie sein sollte.
More information about the amd-gfx
mailing list