[PATCH] drm/amdgpu: trace amdgpu_job fence details

Fri Feb 24 18:11:01 UTC 2017

On 2017-02-24 03:13 AM, Christian König wrote:
> Am 23.02.2017 um 21:48 schrieb Andres Rodriguez:
>>
>>
>> On 2017-02-23 02:46 PM, Andres Rodriguez wrote:
>>>
>>>
>>> On 2017-02-23 03:20 AM, Christian König wrote:
>>>> Am 23.02.2017 um 00:47 schrieb Andres Rodriguez:
>>>>> This trace is intended to provide the required information to
>>>>> associate
>>>>> the completion of an amdgpu_job with its corresponding dma_fence_*
>>>>> tracepoints.
>>>>>
>>>>> Signed-off-by: Andres Rodriguez <andresx7 at gmail.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |  2 ++
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 22
>>>>> ++++++++++++++++++++++
>>>>>   2 files changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> index 86a1242..84a04e4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>>>>> @@ -177,6 +177,8 @@ static struct dma_fence *amdgpu_job_run(struct
>>>>> amd_sched_job *sched_job)
>>>>>       /* if gpu reset, hw fence will be replaced here */
>>>>>       dma_fence_put(job->fence);
>>>>>       job->fence = dma_fence_get(fence);
>>>>> +    trace_amdgpu_job_attach_fence(job);
>>>>> +
>>>>>       amdgpu_job_free_resources(job);
>>>>>       return fence;
>>>>>   }
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> index a18ae1e..0a61ed9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
>>>>> @@ -147,6 +147,28 @@ TRACE_EVENT(amdgpu_sched_run_job,
>>>>>                 __entry->fence, __entry->ring_name, __entry->num_ibs)
>>>>>   );
>>>>>   +TRACE_EVENT(amdgpu_job_attach_fence,
>>>>> +        TP_PROTO(struct amdgpu_job *job),
>>>>> +        TP_ARGS(job),
>>>>> +        TP_STRUCT__entry(
>>>>> +                 __field(struct amdgpu_device *, adev)
>>>>> +                 __field(struct amd_sched_job *, sched_job)
>>>>
>>>> Neither add adev nor sched_job pointer to your trace point.
>>> Would it be okay then to grab a reference to the scheduler fence in
>>> amdgpu_job_run and print the sched fence pointer here?
>>>
>>> That would give us enough info to correlate other trace points like
>>> amdgpu_cs_ioctl aor amdgpu_sched_run_job to the respective dma_fence
>>> tracepoints.
>>>
>>>
>>>>
>>>> The adev pointer is completely meaningless in the logs and the
>>>> scheduler job might already be freed when the printk is called.
>> Hey Christian,
>>
>> I was just double checking the lifetime of the scheduler job, and it
>> should still be valid during amdgpu_job_run. The lifetime of the job
>> is tied to the scheduler finished fence which won't be signaled until
>> amdgpu_job_run finishes even if the HW fence has signaled.
>>
>> This is assuming that the trace printks are synchronous.
>
> That assumption is incorrect.
>
> The parameters are stored on a ring buffer when the trace point is
> executed, but the string is composed later on when you request the trace
> content from userspace.
>
> Additional to that printing pointers which are heavily reallocated is a
> bad idea because once the job is freed the pointer will certainly be
> reused for another job.
>
> Using the numbers from the scheduler fence is the right approach here
> and you don't even need to grab another reference, just save the context
> and sequence number as identifier for the job.

Thanks for the explanation.

Using the scheduler fence as you suggested also means that we have the 
data available on the amdgpu_sched_run_job and we won't require an extra 
trace point. So that is overall cleaner.

I'll send a patch to use that approach instead. I won't be removing 
anything from amdgpu_sched_run_job though since I'm not sure of the 
backwards compat rules for trace events.

Regards,
Andres

>
> Regards,
> Christian.
>
>>
>> So I think printing the job and dropping adev should be okay if the
>> above holds true, unless I'm missing something else.
>>
>>>>
>>>> Just checked the source and a couple of other trace points get this
>>>> horrible wrong as well, so that isn't your fault. Going to put it on
>>>> my todo to fix those.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> +                 __string(timeline,
>>>>> job->fence->ops->get_timeline_name(job->fence))
>>>>> +                 __field(unsigned int, context)
>>>>> +                 __field(unsigned int, seqno)
>>>>> +                 ),
>>>>> +
>>>>> +        TP_fast_assign(
>>>>> +               __entry->adev = job->adev;
>>>>> +               __entry->sched_job = &job->base;
>>>>> +               __assign_str(timeline,
>>>>> job->fence->ops->get_timeline_name(job->fence))
>>>>> +               __entry->context = job->fence->context;
>>>>> +               __entry->seqno = job->fence->seqno;
>>>>> +               ),
>>>>> +        TP_printk("adev=%p, sched_job=%p, timeline:%s, context:%u,
>>>>> seqno:%u",
>>>>> +              __entry->adev, __entry->sched_job,
>>>>> +              __get_str(timeline), __entry->context, __entry->seqno)
>>>>> +);
>>>>>     TRACE_EVENT(amdgpu_vm_grab_id,
>>>>>           TP_PROTO(struct amdgpu_vm *vm, int ring, struct
>>>>> amdgpu_job *job),
>>>>
>>>>
>>>
>>
>