[PATCH] drm/amdgpu: Fix the null pointer issue for tdr
Koenig, Christian
Christian.Koenig at amd.com
Fri Nov 8 09:07:30 UTC 2019
On 08.11.19 at 09:52, Deng, Emily wrote:
> Ping.....
You need to give me at least enough time to wake up :)
>
>
> Best wishes
> Emily Deng
>
>
>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Deng, Emily
>> Sent: Friday, November 8, 2019 10:56 AM
>> To: Koenig, Christian <Christian.Koenig at amd.com>; amd-gfx at lists.freedesktop.org
>> Subject: RE: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>
>>> -----Original Message-----
>>> From: Christian König <ckoenig.leichtzumerken at gmail.com>
>>> Sent: Thursday, November 7, 2019 7:28 PM
>>> To: Deng, Emily <Emily.Deng at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH] drm/amdgpu: Fix the null pointer issue for tdr
>>>
>>> On 07.11.19 at 11:25, Emily Deng wrote:
>>>> When the job is already signaled, the s_fence is freed. This then
>>>> causes a null pointer dereference in amdgpu_device_gpu_recover.
>>> NAK, the s_fence is only set to NULL when the job is destroyed. See
>>> drm_sched_job_cleanup().
>> I know it is set to NULL in drm_sched_job_cleanup. But in one case, by the
>> time amdgpu_device_gpu_recover is entered, the job is already in
>> drm_sched_job_cleanup and is about to be freed. Sometimes
>> amdgpu_device_gpu_recover is faster, though. At that point the job has not
>> been freed yet, but s_fence is already NULL.
No, that case can't happen. See here:
> drm_sched_job_cleanup(s_job);
>
> amdgpu_ring_priority_put(ring, s_job->s_priority);
> dma_fence_put(job->fence);
> amdgpu_sync_free(&job->sync);
> amdgpu_sync_free(&job->sched_sync);
> kfree(job);
The job itself is freed up directly after freeing the reference to the
s_fence.
So you are just papering over a much bigger problem here. This patch is
a clear NAK.
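
To make the ordering above concrete, here is a minimal user-space sketch. It is not the kernel code: the model_* names and the "freed" flag are hypothetical stand-ins, and the only things taken from this thread are that cleanup sets s_fence to NULL and that the job is freed right afterwards. It shows why a NULL check on s_fence cannot protect a concurrent reader: the only state in which the check fails is the one where the job memory is already going away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the scheduler fence and the job. */
struct model_fence { int refcount; };

struct model_job {
    struct model_fence *s_fence;
    bool freed; /* model only: real code would have called kfree(job) */
};

static struct model_job *model_job_create(void)
{
    struct model_job *job = calloc(1, sizeof(*job));
    assert(job);
    job->s_fence = calloc(1, sizeof(*job->s_fence));
    assert(job->s_fence);
    job->s_fence->refcount = 1;
    return job;
}

/* Step 1 of the free path: drop the fence reference, clear the pointer. */
static void model_job_cleanup(struct model_job *job)
{
    if (job->s_fence && --job->s_fence->refcount == 0)
        free(job->s_fence);
    job->s_fence = NULL;
}

/* Step 2 follows immediately: the job itself goes away. */
static void model_job_free(struct model_job *job)
{
    model_job_cleanup(job);
    job->freed = true; /* stands in for kfree(job) */
}

/* What the NULL check proposed in the patch can tell a reader. */
static bool null_check_passes(const struct model_job *job)
{
    return job->s_fence != NULL;
}
```

In this model, null_check_passes() returns false only after model_job_free() has run, that is, only when "freed" is already set. A reader that gets that far is touching a job it no longer owns, so the real fix has to be in the job's lifetime, not in an extra pointer check.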
Regards,
Christian.
>>> When you see a job without an s_fence then that means the problem is
>>> somewhere else.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Signed-off-by: Emily Deng <Emily.Deng at amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>>>> drivers/gpu/drm/scheduler/sched_main.c | 11 ++++++-----
>>>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index e6ce949..5a8f08e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4075,7 +4075,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>> *
>>>> * job->base holds a reference to parent fence
>>>> */
>>>> - if (job && job->base.s_fence->parent &&
>>>> + if (job && job->base.s_fence && job->base.s_fence->parent &&
>>>> dma_fence_is_signaled(job->base.s_fence->parent))
>>>> job_signaled = true;
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index 31809ca..56cc10e 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -334,8 +334,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>>>
>>>> spin_lock(&rq->lock);
>>>> list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>>>> - if (bad->s_fence->scheduled.context ==
>>>> - entity->fence_context) {
>>>> + if (bad->s_fence && (bad->s_fence->scheduled.context ==
>>>> + entity->fence_context)) {
>>>> if (atomic_read(&bad->karma) >
>>>> bad->sched->hang_limit)
>>>> if (entity->guilty)
>>>> @@ -376,7 +376,7 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>> * This iteration is thread safe as sched thread is stopped.
>>>> */
>>>> list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> - if (s_job->s_fence->parent &&
>>>> + if (s_job->s_fence && s_job->s_fence->parent &&
>>>> dma_fence_remove_callback(s_job->s_fence->parent,
>>>> &s_job->cb)) {
>>>> atomic_dec(&sched->hw_rq_count);
>>>> @@ -395,7 +395,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
>>>> *
>>>> * Job is still alive so fence refcount at least 1
>>>> */
>>>> - dma_fence_wait(&s_job->s_fence->finished, false);
>>>> + if (s_job->s_fence)
>>>> + dma_fence_wait(&s_job->s_fence->finished, false);
>>>> /*
>>>> * We must keep bad job alive for later use during
>>>> @@ -438,7 +439,7 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
>>>> * GPU recovers can't run in parallel.
>>>> */
>>>> list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>>>> - struct dma_fence *fence = s_job->s_fence->parent;
>>>> + struct dma_fence *fence = s_job->s_fence ? s_job->s_fence->parent : NULL;
>>>>
>>>> atomic_inc(&sched->hw_rq_count);
>>>>