[PATCH] drm/amdgpu: Add more process info in VM for debug
Liu, Monk
Monk.Liu at amd.com
Mon Dec 17 09:02:55 UTC 2018
Hi Christian,
I think for some SRIOV customers they need this rich information,
Maybe we can use an kernel option to let user select if rich or simple information should be printed upon job TDR ?
In SRIOV branch we can set it enable by default while set by default disable for drm-next
/Monk
-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Huang, Trigger
Sent: Monday, December 17, 2018 4:27 PM
To: Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Qu, Jim <Jim.Qu at amd.com>
Subject: RE: [PATCH] drm/amdgpu: Add more process info in VM for debug
Hi Christian,
Yes, if the test machine is still there for debugging, we can login it and check a lot of things, such as ' ps -p 1 -o args ' as you suggested.
But sometimes, the system is not alive anymore, and we only got some log files (such as kern.log ) from QA or customers.
And at this time, the full command line information in dmesg is quite useful. Let's take the following message for example:
[ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572
[ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is as below:
[ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor --n-test-threads 4 When customer/QA reported that there is comp ring job timeout when running a big test case, vk-example, but the machine is not alive anymore, and can't login for debugging. Then we need to re-run the whole vk-example to reproduce this issue, and this will waste a lot of time.
But if we get the specific sub-base in the kern.log file when job timeout happened, then we can only try the specific one, here it is ' SDMA --mode goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor --n-test-threads 4'
maybe several minutes later, this issues is reproduced.
Thanks,
Trigger.
-----Original Message-----
From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Koenig, Christian
Sent: Monday, December 17, 2018 3:50 PM
To: Huang, Trigger <Trigger.Huang at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Qu, Jim <Jim.Qu at amd.com>
Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug
Hi Trigger,
> Does this make sense?
Sorry I wasn't clear enough. The key point that we don't want/need the full commandline of the process here is that we can already get that from the information we have.
E.g. "ps -p 1 -o args" gives you the command line of the process 1.
The only case where directly printing this into the logs is useful is when we run into a total system crash and in this case the processed is only the trigger, but not the root cause.
Regards,
Christian.
Am 17.12.18 um 04:03 schrieb Huang, Trigger:
> Hi Christian,
>
> Many thanks for pointing out the mistakes
>
> I have some comments as below, would you help to check again?
>
> First of all you can't get the process info during VM creating since that can happen in X as well.
> [Trigger]: Ok, I will keep the original logic, which is that set the vm info in cs . I will still invoke kfree(cmd_line) in amdgpu_vm_fini to avoid memory leak.
>
> Second when a timeout happen the VM structure might already be released, so using job->vm is illegal here. What we could try is to get the VM using the PASID.
> [Trigger]: Ok, I will do it in job timeout like what VMC page fault's
> handler does
>
> And last we don't want to keep the full command line around.
> [Trigger]: well, actually, the detailed command line is just what we want.
> For example, there are thousands of sub-cases of one big test case, and for each sub-case, the arguments may also be different.
> In some corner case, test machine is hung after running the big test case for several hours even several days, it is really painful to wait another several hours to reproduce it and debug.
> But if we know the last sub-case running on the test machine, then
> this issues *may* can be reproduced by only running the specific sub-case with specific arguments for several rounds, and in this situation, it will save us a lot time for reproducing and debugging.
> Does this make sense? If not, how about we add a parameter, such as amdgpu_vm_debug_verbose, to turn on/off the cmd line dumping?
>
>
> Thanks,
> Trigger
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken at gmail.com>
> Sent: Saturday, December 15, 2018 8:23 PM
> To: Huang, Trigger <Trigger.Huang at amd.com>;
> amd-gfx at lists.freedesktop.org
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>; Qu, Jim
> <Jim.Qu at amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug
>
> Am 15.12.18 um 09:56 schrieb Trigger Huang:
>> When debugging VMC page fault and ring hang issues, the detailed
>> process information is quite helpful, especially when the issue can
>> only be reproduced after a very long time running. With this
>> information, only run the specific sub-testcase may also will
>> reproduce the issue, which may save a lot of time for debugging.
>>
>> With this patch, the process information is similar as following.
>> When VMC page fault issue happened:
>> [ 142.978417] amdgpu 0000:00:08.0: [gfxhub] VMC page fault (src_id:0
>> ring:171 vmid:2 pasid:32769), [ 142.978542] amdgpu 0000:00:08.0: for process ocltst pid 1354 thread ocltst pid 1354, args:./ocltst -m oclperf.so -t OCLPerfDeviceEnqueueEvent,
>> [ 142.978693] amdgpu 0000:00:08.0: in page starting at address 0x0000000000000000 from 27
>>
>> When ring hang issue happened:
>> [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
>> comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572 [ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is as below:
>> [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
>> SDMA pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode
>> goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0
>> --frontend test_executor --n-test-threads 4
>>
>> Signed-off-by: Trigger Huang <Trigger.Huang at amd.com>
> Well NAK, we intentionally didn't do it this way.
>
> First of all you can't get the process info during VM creating since that can happen in X as well.
>
> Second when a timeout happen the VM structure might already be released, so using job->vm is illegal here. What we could try is to get the VM using the PASID.
>
> And last we don't want to keep the full command line around.
>
> The only valid addition I can see here is to print the thread info when the timeout happens.
>
> Regards,
> Christian.
>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 +
>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 11 ++++++-----
>> 5 files changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 1c49b82..1a2d0c9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -235,9 +235,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>> p->job->uf_addr = uf_offset;
>> kfree(chunk_array);
>>
>> - /* Use this opportunity to fill in task info for the vm */
>> - amdgpu_vm_set_task_info(vm);
>> -
>> return 0;
>>
>> free_all_kdata:
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index e0af44f..c75ecb3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -43,6 +43,14 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>> ring->fence_drv.sync_seq);
>>
>> + if (job->vm) {
>> + struct amdgpu_task_info *ti = &(job->vm->task_info);
>> +
>> + DRM_ERROR("The job's process information is as below:\n");
>> + DRM_ERROR("Process %s, thread %s, cmd line:%s\n",
>> + ti->process_name, ti->task_name, ti->cmd_line);
>> + }
>> +
>> if (amdgpu_device_should_recover_gpu(ring->adev))
>> amdgpu_device_gpu_recover(ring->adev, job);
>> }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index e73d152..24f3cbd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -30,6 +30,7 @@
>> #include <linux/idr.h>
>> #include <drm/drmP.h>
>> #include <drm/amdgpu_drm.h>
>> +#include <linux/string_helpers.h>
>> #include "amdgpu.h"
>> #include "amdgpu_trace.h"
>> #include "amdgpu_amdkfd.h"
>> @@ -3045,6 +3046,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>> goto error_free_root;
>>
>> vm->pasid = pasid;
>> +
>> + amdgpu_vm_set_task_info(vm);
>> }
>>
>> vm->fault_hash = init_fault_hash(); @@ -3223,6 +3226,9 @@ void
>> amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>> spin_lock_irqsave(&adev->vm_manager.pasid_lock, flags);
>> idr_remove(&adev->vm_manager.pasid_idr, vm->pasid);
>> spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>> +
>> + kfree(vm->task_info.cmd_line);
>> + vm->task_info.cmd_line = NULL;
>> }
>>
>> kfree(vm->fault_hash);
>> @@ -3391,6 +3397,11 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
>> vm->task_info.tgid = current->group_leader->pid;
>> get_task_comm(vm->task_info.process_name, current->group_leader);
>> }
>> +
>> + vm->task_info.cmd_line =
>> + kstrdup_quotable_cmdline(current, GFP_KERNEL);
>> + if (!vm->task_info.cmd_line)
>> + DRM_DEBUG_DRIVER("Failed to get cmdline!\n");
>> }
>> }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index e8dcfd5..9fab787 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -175,6 +175,7 @@ struct amdgpu_vm_pte_funcs {
>> struct amdgpu_task_info {
>> char process_name[TASK_COMM_LEN];
>> char task_name[TASK_COMM_LEN];
>> + char *cmd_line;
>> pid_t pid;
>> pid_t tgid;
>> };
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index bacdaef..c3e3558 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -325,11 +325,12 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device *adev,
>> amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>
>> dev_err(adev->dev,
>> - "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u, for process %s pid %d thread %s pid %d)\n",
>> - entry->vmid_src ? "mmhub" : "gfxhub",
>> - entry->src_id, entry->ring_id, entry->vmid,
>> - entry->pasid, task_info.process_name, task_info.tgid,
>> - task_info.task_name, task_info.pid);
>> + "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u), ",
>> + entry->vmid_src ? "mmhub" : "gfxhub", entry->src_id,
>> + entry->ring_id, entry->vmid, entry->pasid);
>> + dev_err(adev->dev, "for process %s pid %d thread %s pid %d, args:%s,",
>> + task_info.process_name, task_info.tgid,
>> + task_info.task_name, task_info.pid, task_info.cmd_line);
>> dev_err(adev->dev, " in page starting at address 0x%016llx from %d\n",
>> addr, entry->client_id);
>> if (!amdgpu_sriov_vf(adev))
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list