regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Christian König christian.koenig at amd.com
Tue Oct 17 09:13:31 UTC 2017


Am 17.10.2017 um 10:57 schrieb Nicolai Hähnle:
> On 13.10.2017 10:46, Liu, Monk wrote:
>> Just revert Nicolai’s patch,if other routine want to reference 
>> s_fence, it should get the finished fence in the first place/time,
>>
>> For gpu_reset routine, it refers to s_fence only on those unfinished 
>> job in sched_hw_job_reset, so totally safe to refer to s_fence pointer
>>
>> I wonder what issue Nicolai met with and submitted this patch
>
> The original motivation of my patch was to catch accidental use of 
> job->s_fence after the fence was destroyed in amd_sched_process_job. 
> Basically, prevent a dangling pointer. I still think that's a 
> reasonable idea, though clearly my first attempt at it was just wrong.

It was very reasonable that you pointed out the problem, it's just that 
the solution to set it to NULL was wrong.

Instead we should just have increased the lifetime of the reference.

>
> In Christian's v2 patch, it might make sense to add
>
>     spin_unlock(&sched->job_list_lock);
> +    dma_fence_put(&s_job->s_fence->finished);
> +    s_job->s_fence = NULL;
>     sched->ops->free_job(s_job);
>
> ... though I'm not 100% certain about how the fence lifetimes work.

That's a bit pointless because sched->ops->free_job() will free s_job as 
the next thing we do here.

Regards,
Christian.

>
> Cheers,
> Nicolai
>
>
>>
>> BR Monk
>>
>> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On 
>> Behalf Of *Liu, Monk
>> *Sent:* 2017年10月13日16:40
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle 
>> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
>> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> I doubt it would always work fine…
>>
>> First, we have FENCE_TRACE reference s_fence->finished after 
>> “fence_signal(&fence->finished)”
>>
>> Second, we have trace_amd_sched_proess_job(s_fence) after 
>> “amd_sched_fence_finished()”,
>>
>> If you put the finished before free_job() and by coincidence the 
>> job_finish() get very soon executed you’ll have odds to hit wild 
>> pointer on above two cases
>>
>> BR Monk
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:17
>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Nicolai 
>> Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>; 
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> Yeah, that change is actually incorrect and should be reverted.
>>
>> What we really need to do is remove dropping sched_job->s_fence from 
>> amd_sched_process_job() into amd_sched_job_finish() directly before 
>> the call to free_job().
>>
>> Regards,
>> Christian.
>>
>> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>>
>>     commit d6c650c0a8f6f671e49553725e1db541376d95f2
>>     Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>     <mailto:nicolai.haehnle at amd.com>
>>     @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>>
>>                      fence = sched->ops->run_job(sched_job);
>>                      amd_sched_fence_scheduled(s_fence);
>>     +
>>     +               /* amd_sched_process_job drops the job's reference
>>     of the fence. */
>>     +               sched_job->s_fence = NULL;
>>     +
>>                      if (fence) {
>>                              s_fence->parent = dma_fence_get(fence);
>>                              r = dma_fence_add_callback(fence, 
>> &s_fence->cb,
>>
>>     Hi Nicolai
>>
>>     with this patch, you will break 
>> "amdgpu_sched_hw_job_reset()"routine:
>>
>>     void
>>
>>     amd_sched_hw_job_reset(structamd_gpu_scheduler
>>
>>     *sched)
>>
>>     {
>>
>>     structamd_sched_job
>>
>>     *s_job;
>>
>>     spin_lock(&sched->job_list_lock);
>>
>>     list_for_each_entry_reverse(s_job,
>>
>>     &sched->ring_mirror_list, node) {
>>
>>     if(s_job->s_fence->parent
>>
>>     &&
>>
>>     fence_remove_callback(s_job->s_fence->parent,
>>
>>                           &s_job->s_fence->cb))
>>
>>     {
>>
>>     fence_put(s_job->s_fence->parent);
>>
>>                  s_job->s_fence->parent
>>
>>     =
>>
>>     NULL;
>>
>>     atomic_dec(&sched->hw_rq_count);
>>
>>              }
>>
>>          }
>>
>>     spin_unlock(&sched->job_list_lock);
>>
>>     }
>>
>>     see that without sched_job->s_fence, you cannot remove the callback
>>     from its hw fence,
>>
>>     any idea??
>>
>>     BR Monk
>>
>
>



More information about the amd-gfx mailing list