[PATCH] drm/amdgpu: fix the null pointer to get timeline by scheduler fence

Christian König christian.koenig at amd.com
Wed Aug 8 07:10:07 UTC 2018


Yeah that is a known issue, but this solution is not correct either.

See the scheduler where the job is execute on is simply not determined 
yet when we want to trace it.

So using the scheduler name from the entity is wrong as well.

We should probably move the reschedule from drm_sched_entity_push_job() 
to drm_sched_job_init() to fix that.

I will prepare a patch for that today,
Christian.

Am 08.08.2018 um 09:05 schrieb Huang Rui:
> We won't initialize fence scheduler in drm_sched_fence_create() anymore, so it
> will refer null fence scheduler if open trace event to get the timeline name.
> Actually, it is the scheduler name from the entity, so add a macro to replace
> legacy getting timeline name by job.
>
> [  212.844281] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [  212.852401] PGD 8000000427c13067 P4D 8000000427c13067 PUD 4235fc067 PMD 0
> [  212.859419] Oops: 0000 [#1] SMP PTI
> [  212.862981] CPU: 4 PID: 1520 Comm: amdgpu_test Tainted: G           OE     4.18.0-rc1-custom #1
> [  212.872194] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
> [  212.881704] RIP: 0010:drm_sched_fence_get_timeline_name+0x2b/0x30 [gpu_sched]
> [  212.888948] Code: 1f 44 00 00 48 8b 47 08 48 3d c0 b1 4f c0 74 13 48 83 ef 60 48 3d 60 b1 4f c0 b8 00 00 00 00 48 0f 45 f8 48 8b 87 e0 00 00 00 <48> 8b 40 18 c3 0f 1f 44 00 00 b8 01 00 00 00 c3 0f 1f 44 00 00 0f
> [  212.908162] RSP: 0018:ffffa3ed81f27af0 EFLAGS: 00010246
> [  212.913483] RAX: 0000000000000000 RBX: 0000000000070034 RCX: ffffa3ed81f27da8
> [  212.920735] RDX: ffff8f24ebfb5460 RSI: ffff8f24e40d3c00 RDI: ffff8f24ebfb5400
> [  212.928008] RBP: ffff8f24e40d3c00 R08: 0000000000000000 R09: ffffffffae4deafc
> [  212.935263] R10: ffffffffada000ed R11: 0000000000000001 R12: ffff8f24e891f898
> [  212.942558] R13: 0000000000000000 R14: ffff8f24ebc46000 R15: ffff8f24e3de97a8
> [  212.949796] FS:  00007ffff7fd2700(0000) GS:ffff8f24fed00000(0000) knlGS:0000000000000000
> [  212.958047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  212.963921] CR2: 0000000000000018 CR3: 0000000423422003 CR4: 00000000003606e0
> [  212.971201] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  212.978482] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  212.985720] Call Trace:
> [  212.988236]  trace_event_raw_event_amdgpu_cs_ioctl+0x4c/0x170 [amdgpu]
> [  212.994904]  ? amdgpu_ctx_add_fence+0xa9/0x110 [amdgpu]
> [  213.000246]  ? amdgpu_job_free_resources+0x4b/0x70 [amdgpu]
> [  213.005944]  amdgpu_cs_ioctl+0x16d1/0x1b50 [amdgpu]
> [  213.010920]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
> [  213.016354]  drm_ioctl_kernel+0x8a/0xd0 [drm]
> [  213.020794]  ? recalc_sigpending+0x17/0x50
> [  213.024965]  drm_ioctl+0x2d7/0x390 [drm]
> [  213.028979]  ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
> [  213.034366]  ? do_signal+0x36/0x700
> [  213.037928]  ? signal_wake_up_state+0x15/0x30
> [  213.042375]  amdgpu_drm_ioctl+0x46/0x80 [amdgpu]
>
> Signed-off-by: Huang Rui <ray.huang at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c    |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 10 ++++++----
>   2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index e12871d..be01e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -1247,7 +1247,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>   
>   	amdgpu_job_free_resources(job);
>   
> -	trace_amdgpu_cs_ioctl(job);
> +	trace_amdgpu_cs_ioctl(job, entity);
>   	amdgpu_vm_bo_trace_cs(&fpriv->vm, &p->ticket);
>   	priority = job->base.s_priority;
>   	drm_sched_entity_push_job(&job->base, entity);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> index 8c2dab2..25cdcb7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h
> @@ -36,6 +36,8 @@
>   
>   #define AMDGPU_JOB_GET_TIMELINE_NAME(job) \
>   	 job->base.s_fence->finished.ops->get_timeline_name(&job->base.s_fence->finished)
> +#define AMDGPU_GET_SCHED_NAME(entity) \
> +	 (entity->rq->sched->name)
>   
>   TRACE_EVENT(amdgpu_mm_rreg,
>   	    TP_PROTO(unsigned did, uint32_t reg, uint32_t value),
> @@ -161,11 +163,11 @@ TRACE_EVENT(amdgpu_cs,
>   );
>   
>   TRACE_EVENT(amdgpu_cs_ioctl,
> -	    TP_PROTO(struct amdgpu_job *job),
> -	    TP_ARGS(job),
> +	    TP_PROTO(struct amdgpu_job *job, struct drm_sched_entity *entity),
> +	    TP_ARGS(job, entity),
>   	    TP_STRUCT__entry(
>   			     __field(uint64_t, sched_job_id)
> -			     __string(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
> +			     __string(timeline, AMDGPU_GET_SCHED_NAME(entity))
>   			     __field(unsigned int, context)
>   			     __field(unsigned int, seqno)
>   			     __field(struct dma_fence *, fence)
> @@ -175,7 +177,7 @@ TRACE_EVENT(amdgpu_cs_ioctl,
>   
>   	    TP_fast_assign(
>   			   __entry->sched_job_id = job->base.id;
> -			   __assign_str(timeline, AMDGPU_JOB_GET_TIMELINE_NAME(job))
> +			   __assign_str(timeline, AMDGPU_GET_SCHED_NAME(entity))
>   			   __entry->context = job->base.s_fence->finished.context;
>   			   __entry->seqno = job->base.s_fence->finished.seqno;
>   			   __entry->ring_name = to_amdgpu_ring(job->base.sched)->name;



More information about the amd-gfx mailing list