[PATCH v3 2/2] drm/admgpu: Present amdgpu_task_info in VM_FAULTS.

Mon Jul 9 10:14:48 UTC 2018

On 07/09/2018 04:55 PM, Christian König wrote:
> Am 09.07.2018 um 09:48 schrieb Zhang, Jerry (Junwei):
>> On 07/09/2018 03:04 PM, Christian König wrote:
>>> Am 09.07.2018 um 07:13 schrieb Zhang, Jerry (Junwei):
>>>> On 07/06/2018 03:27 AM, Andrey Grodzovsky wrote:
>>>>> Extract and present the reposnsible process and thread when
>>>>> VM_FAULT happens.
>>>>>
>>>>> v2: Use getter and setter functions.
>>>>>
>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |  4 ++++
>>>>>   drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c  | 10 +++++++---
>>>>>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  |  9 +++++++--
>>>>>   3 files changed, 18 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> index 7a625f3..609c8f5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>>>>> @@ -187,6 +187,10 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, void *data)
>>>>>       if (p->uf_entry.robj)
>>>>>           p->job->uf_addr = uf_offset;
>>>>>       kfree(chunk_array);
>>>>> +
>>>>> +    /* Use this opportunity to fill in task info for the vm */
>>>>> +    amdgpu_vm_set_task_info(vm);
>>>>> +
>>>>
>>>> Shall we set the task info when vm init?
>>>
>>> No, vm_init() is called from a completely different process which is later on user of the VM.
>>
>> Originally I thought UMD opened DRI node and create a VM by vm_init(), then every command submission
>> would be passed in the same VM initialized by vm_init().
>>
>> So that's different process?
>
> The display server, e.g. X or Wayland.
>
> See with DRI3 the process of opening a connection to the hardware is that the display server open the file descriptor and with that calls vm_init.
>
> And then this file descriptor is passed to the client processes through IPC.

Thanks to reply.
yes, it's likely to open amdgpu node in ddx when driver probe and pass it to other client.

While that looks like just in the process of initialization that X server loads ddx driver.
But when I start 2 glxgears, that kms_open()->vm_init() will be called twice, which looks related to App as well.
(anyway, I will check it more)

Even so, it sounds vm_init() should be created firstly, then we use that VM for process on every command submission.
So I thought to set the task info at the first time vm_init() and use that info on VM fault process func.

Jerry

>
> Regards,
> Christian.
>
>>
>>
>>>
>>>>
>>>>
>>>>>       return 0;
>>>>>
>>>>>   free_all_kdata:
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>> index 08753e7..75f3ffb 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
>>>>> @@ -46,7 +46,6 @@
>>>>>
>>>>>   #include "ivsrcid/ivsrcid_vislands30.h"
>>>>>
>>>>> -
>>>>>   static void gmc_v8_0_set_gmc_funcs(struct amdgpu_device *adev);
>>>>>   static void gmc_v8_0_set_irq_funcs(struct amdgpu_device *adev);
>>>>>   static int gmc_v8_0_wait_for_idle(void *handle);
>>>>> @@ -1449,8 +1448,13 @@ static int gmc_v8_0_process_interrupt(struct amdgpu_device *adev,
>>>>>           gmc_v8_0_set_fault_enable_default(adev, false);
>>>>>
>>>>>       if (printk_ratelimit()) {
>>>>> -        dev_err(adev->dev, "GPU fault detected: %d 0x%08x\n",
>>>>> -            entry->src_id, entry->src_data[0]);
>>>>> +        struct amdgpu_task_info task_info = { 0 };
>>>>> +
>>>>> +        amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>>>
>>>> Shall we find vm and get the vm->task_info directly instead of filling local variable task_info?
>>>> (current style is also OK for me, just for more info)
>>>
>>> No, we can't guarantee that the task_info pointer in the VM won't be freed after dropping the spinlock. So returning a copy here is the better approach.
>>>
>>>>
>>>> And we may also check the return value for "NULL" case, otherwise it may cause access errorin dev_err(),
>>>> if failed to find vm (although, it's most unlikely to happen).
>>>
>>> Since we use a copy of the task info we should never get a NULL pointer. The string should already be zero terminated with the "{ 0 }" initialization above.
>>
>> Thanks to explain that more.
>> Got it, fine for me.
>>
>> Jerry
>>
>>>
>>> Christian.
>>>
>>>>
>>>> Jerry
>>>>
>>>>> +
>>>>> +        dev_err(adev->dev, "GPU fault detected: %d 0x%08x for process %s pid %d thread %s pid %d\n",
>>>>> +            entry->src_id, entry->src_data[0], task_info.process_name,
>>>>> +            task_info.tgid, task_info.task_name, task_info.pid);
>>>>>           dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x%08X\n",
>>>>>               addr);
>>>>>           dev_err(adev->dev, " VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x%08X\n",
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>> index 691a659..9df94b4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>>>>> @@ -259,11 +259,16 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device *adev,
>>>>>       }
>>>>>
>>>>>       if (printk_ratelimit()) {
>>>>> +        struct amdgpu_task_info task_info = { 0 };
>>>>> +
>>>>> +        amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>>>> +
>>>>>           dev_err(adev->dev,
>>>>> -            "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u)\n",
>>>>> +            "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u, for process %s pid %d thread %s pid %d\n)\n",
>>>>>               entry->vmid_src ? "mmhub" : "gfxhub",
>>>>>               entry->src_id, entry->ring_id, entry->vmid,
>>>>> -            entry->pasid);
>>>>> +            entry->pasid, task_info.process_name, task_info.tgid,
>>>>> +            task_info.task_name, task_info.pid);
>>>>>           dev_err(adev->dev, "  at page 0x%016llx from %d\n",
>>>>>               addr, entry->client_id);
>>>>>           if (!amdgpu_sriov_vf(adev))
>>>>>
>>>
>