[PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

Andrey Grodzovsky andrey.grodzovsky at amd.com
Wed Aug 17 14:11:11 UTC 2022


On 2022-08-17 10:01, Andrey Grodzovsky wrote:
>
> On 2022-08-17 09:44, Alex Deucher wrote:
>> On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas <YiPeng.Chai at amd.com> 
>> wrote:
>>> [AMD Official Use Only - General]
>>>
>>> Hi Alex:
>>>    When removing an amdgpu device, it may be difficult to change the 
>>> order of psp_hw_fini calls.
>>>
>>> 1. The drm_dev_unplug call is at the beginning in the 
>>> amdgpu_pci_remove function,  which makes the gpu device inaccessible 
>>> for userspace operations.  If the call to psp_hw_fini was moved 
>>> before drm_dev_unplug,  userspace could access the gpu device but 
>>> the psp might be removing. It has unknown issues.
>>>
>> +Andrey Grodzovsky
>>
>> We should fix the ordering in amdgpu_pci_remove() then I guess? There
>> are lots of places where drm_dev_enter() is used to protect access to
>> the hardware which could be similarly affected.
>>
>> Alex
>
>
> We probably can try to move drm_dev_unplug after 
> amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must 
> be the first thing we do in amdgpu_pci_remove and what impact it will 
> have but maybe give it a try.
> Also see if you can run libdrm hotplug test suite before and after the 
> change.
>
> Andrey


Thinking a bit more about this - i guess the main problem with this will 
be that in case of real hot unplug (which is hard to test unless you 
have a real GPU cage with extenal GPU) this move will cause trying to 
accesses HW registers
or MMIO ranges from VRAM BOs when HW is missing when you try to shut 
down the HW in HW fini IP block specific callbacks , this in turn will 
return garbage for reads (or all 1s maybe) which is what we probably 
were trying to avoid by putting drm_dev_unplug as the first thing. So 
it's probably a bit problematic.

Andrey


>
>
>>
>>> 2. psp_hw_fini is called by the .hw_fini iterator in 
>>> amdgpu_device_ip_fini_early, referring to the code starting from 
>>> amdgpu_pci_remove to .hw_fini is called,
>>>     there are many preparatory operations before calling .hw_fini,  
>>> which makes it very difficult to change the order of psp_hw_fini or 
>>> all block .hw_fini.
>>>
>>>     So can we do a workaround in psp_cmd_submit_buf when removing 
>>> amdgpu device?
>>>
>>> -----Original Message-----
>>> From: Alex Deucher <alexdeucher at gmail.com>
>>> Sent: Monday, August 15, 2022 10:22 PM
>>> To: Chai, Thomas <YiPeng.Chai at amd.com>
>>> Cc: amd-gfx at lists.freedesktop.org; Zhang, Hawking 
>>> <Hawking.Zhang at amd.com>; Chen, Guchun <Guchun.Chen at amd.com>; Chai, 
>>> Thomas <YiPeng.Chai at amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually 
>>> sent to psp when amdgpu is uninstalled
>>>
>>> On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai <YiPeng.Chai at amd.com> 
>>> wrote:
>>>> The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
>>>> unload messages to psp to terminate ras, asd and tmr.
>>>> But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
>>>> psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
>>>> static void amdgpu_pci_remove(struct pci_dev *pdev) {
>>>>          drm_dev_unplug
>>>>          ......
>>>> amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
>>>>                  ->.hw_fini->psp_hw_fini->...
>>>>                  ->psp_ta_unload->psp_cmd_submit_buf
>>>>          ......
>>>> }
>>>> The program will return when calling drm_dev_enter in
>>>> psp_cmd_submit_buf.
>>>>
>>>> So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
>>>> so that the TA unload messages can be sent to the psp when amdgpu is
>>>> uninstalled.
>>>>
>>>> Signed-off-by: YiPeng Chai <YiPeng.Chai at amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 ----
>>>>   1 file changed, 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> index b067ce45d226..0578d8d094a7 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
>>>> @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
>>>>          if (psp->adev->no_hw_access)
>>>>                  return 0;
>>>>
>>>> -       if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
>>>> -               return 0;
>>>> -
>>> This check is to prevent the hardware from being accessed if the 
>>> card is removed.  I think we need to fix the ordering elsewhere.
>>>
>>> Alex
>>>
>>>>          memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>>>>
>>>>          memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
>>>> psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct 
>>>> psp_context *psp,
>>>>          }
>>>>
>>>>   exit:
>>>> -       drm_dev_exit(idx);
>>>>          return ret;
>>>>   }
>>>>
>>>> -- 
>>>> 2.25.1
>>>>


More information about the amd-gfx mailing list