regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Christian König christian.koenig at amd.com
Fri Oct 13 10:13:58 UTC 2017


Good point as well. How about the attached version?

This time we keep an extra reference in amd_sched_process_job() until we 
are sure that we don't need the s_fence any more.

Regards,
Christian.

Am 13.10.2017 um 11:13 schrieb Liu, Monk:
>
> your patch looks good,  do you think we should also do this:
>
>
>  void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
>  {
> -       int ret = fence_signal(&fence->scheduled);
> +       int ret;
> +
> +       fence_get(&fence->scheduled;)
> +       ret = fence_signal(&fence->scheduled);
>
>         if (!ret)
>                 FENCE_TRACE(&fence->scheduled, "signaled from irq 
> context\n");
>         else
>                 FENCE_TRACE(&fence->scheduled, "was already signaled\n");
> +       fence_put(&fence->scheduled);
>  }
>
>
> ------------------------------------------------------------------------
> *From:* Koenig, Christian
> *Sent:* Friday, October 13, 2017 5:00:27 PM
> *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> There is chance that free_job() called before that 
>> “trace_amd_sched_process_job”, correct?
>>
> Correct, but that is harmless.
>
> Take a look what trace_amd_sched_process_job actually does, it just 
> prints the pointer of the fence structure (but the pointer might be 
> stale at this point).
>
> Nevertheless you are right that this is really ugly.
>
> How about the attached patch to fix the issue?
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>>
>> The free_job() is called in sched_job_finish() which is queued on a 
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with 
>> sched_process_job()
>>
>> There is chance that free_job() called before that 
>> “trace_amd_sched_process_job”, correct?
>>
>> And if so the s_fence referred by it maybe a wild pointer
>>
>> BR Monk
>>
>> *From:*Liu, Monk
>> *Sent:* 2017年10月13日16:49
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle 
>> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
>> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> No that’s not true
>>
>> The free_job() is called in sched_job_finish() which is queued on a 
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with 
>> sched_process_job()
>>
>> How can you sure free_job() must before “trace_amd_sched_process_job”?
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:43
>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Nicolai 
>> Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>; 
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> The free_job() callback is called only way after the job has finished.
>>
>> That is one change actually made by you to the code :)
>>
>> Christian.
>>
>> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>>
>>     I doubt it would always work fine…
>>
>>     First, we have FENCE_TRACE reference s_fence->finished after
>>     “fence_signal(&fence->finished)”
>>
>>     Second, we have trace_amd_sched_proess_job(s_fence) after
>>     “amd_sched_fence_finished()”,
>>
>>     If you put the finished before free_job() and by coincidence the
>>     job_finish() get very soon executed you’ll have odds to hit wild
>>     pointer on above two cases
>>
>>     BR Monk
>>
>>     *From:*Koenig, Christian
>>     *Sent:* 2017年10月13日16:17
>>     *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>>     Nicolai Hähnle <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
>>     amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>>     *Subject:* Re: regression with
>>     d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>>     Yeah, that change is actually incorrect and should be reverted.
>>
>>     What we really need to do is remove dropping sched_job->s_fence
>>     from amd_sched_process_job() into amd_sched_job_finish() directly
>>     before the call to free_job().
>>
>>     Regards,
>>     Christian.
>>
>>     Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>>
>>         commit d6c650c0a8f6f671e49553725e1db541376d95f2
>>         Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
>>         <mailto:nicolai.haehnle at amd.com>
>>         @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>>
>>                         fence = sched->ops->run_job(sched_job);
>>         amd_sched_fence_scheduled(s_fence);
>>         +
>>         +               /* amd_sched_process_job drops the job's
>>         reference of the fence. */
>>         +               sched_job->s_fence = NULL;
>>         +
>>                         if (fence) {
>>                                 s_fence->parent = dma_fence_get(fence);
>>                                 r = dma_fence_add_callback(fence,
>>         &s_fence->cb,
>>
>>         Hi Nicolai
>>
>>         with this patch, you will break
>>         "amdgpu_sched_hw_job_reset()"routine:
>>
>>         void
>>
>>         amd_sched_hw_job_reset(structamd_gpu_scheduler
>>
>>         *sched)
>>
>>         {
>>
>>         structamd_sched_job
>>
>>         *s_job;
>>
>>         spin_lock(&sched->job_list_lock);
>>
>>         list_for_each_entry_reverse(s_job,
>>
>>         &sched->ring_mirror_list, node) {
>>
>>         if(s_job->s_fence->parent
>>
>>         &&
>>
>>         fence_remove_callback(s_job->s_fence->parent,
>>
>>         &s_job->s_fence->cb))
>>
>>         {
>>
>>         fence_put(s_job->s_fence->parent);
>>
>>                     s_job->s_fence->parent
>>
>>         =
>>
>>         NULL;
>>
>>         atomic_dec(&sched->hw_rq_count);
>>
>>                 }
>>
>>             }
>>
>>         spin_unlock(&sched->job_list_lock);
>>
>>         }
>>
>>         see that without sched_job->s_fence, you cannot remove the
>>         callback from its hw fence,
>>
>>         any idea??
>>
>>         BR Monk
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-amd-sched-fix-job-tear-down-order-v2.patch
Type: text/x-patch
Size: 1853 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.bin>


More information about the amd-gfx mailing list