[PATCH] drm/amdgpu: Fix the null pointer issue for tdr

Deng, Emily Emily.Deng at amd.com
Fri Nov 8 08:52:48 UTC 2019


Ping.....


Best wishes
Emily Deng



>-----Original Message-----
>From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Deng,
>Emily
>Sent: Friday, November 8, 2019 10:56 AM
>To: Koenig, Christian <Christian.Koenig at amd.com>; amd-
>gfx at lists.freedesktop.org
>Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>
>>-----Original Message-----
>>From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>Sent: Thursday, November 7, 2019 7:28 PM
>>To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>>Am 07.11.19 um 11:25 schrieb Emily Deng:
>>> When the job is already signaled, the s_fence is freed. This then
>>> leads to a null pointer dereference in amdgpu_device_gpu_recover.
>>
>>NAK, the s_fence is only set to NULL when the job is destroyed. See
>>drm_sched_job_cleanup().
>I know it is set to NULL in drm_sched_job_cleanup(). But there is a race: in
>one case, by the time amdgpu_device_gpu_recover is entered, the job has
>already gone through drm_sched_job_cleanup and is about to be freed.
>Sometimes amdgpu_device_gpu_recover is faster: at that point the job is not
>yet freed, but its s_fence is already NULL.
>>
>>When you see a job without an s_fence then that means the problem is
>>somewhere else.
>>
>>Regards,
>>Christian.
>>
>>>
>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c     | 11 ++++++-----
>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e6ce949..5a8f08e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>   	 *
>>>   	 * job->base holds a reference to parent fence
>>>   	 */
>>> -	if (job && job->base.s_fence->parent &&
>>> +	if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>>>   		job_signaled = true;
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 31809ca..56cc10e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>
>>>   			spin_lock(&rq->lock);
>>>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>> -				if (bad->s_fence->scheduled.context ==
>>> -				    entity->fence_context) {
>>> +				if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>> +				    entity->fence_context)) {
>>>   					if (atomic_read(&bad->karma) >
>>>   					    bad->sched->hang_limit)
>>>   						if (entity->guilty)
>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>   	 * This iteration is thread safe as sched thread is stopped.
>>>   	 */
>>>   	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>> -		if (s_job->s_fence->parent &&
>>> +		if (s_job->s_fence && s_job->s_fence->parent &&
>>>   		    dma_fence_remove_callback(s_job->s_fence->parent,
>>>   					      &s_job->cb)) {
>>>   			atomic_dec(&sched->hw_rq_count);
>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>   			 *
>>>   			 * Job is still alive so fence refcount at least 1
>>>   			 */
>>> -			dma_fence_wait(&s_job->s_fence->finished, false);
>>> +			if (s_job->s_fence)
>>> +				dma_fence_wait(&s_job->s_fence->finished, false);
>>>
>>>   			/*
>>>   			 * We must keep bad job alive for later use during
>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>   	 * GPU recovers can't run in parallel.
>>>   	 */
>>>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>> -		struct dma_fence *fence = s_job->s_fence->parent;
>>> +		struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>
>>>   		atomic_inc(&sched->hw_rq_count);
>>>
>
>_______________________________________________
>amd-gfx mailing list
>amd-gfx at lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx