regression with d6c650c0a8f6f671e49553725e1db541376d95f2

Tom St Denis tom.stdenis at amd.com
Fri Oct 13 11:10:54 UTC 2017


At the very least I don't get the hangs like I used to before the 
patchset so this patch isn't regressing any of that behaviour.

Tom


On 13/10/17 07:06 AM, Liu, Monk wrote:
> Yeah, this one looks good
> 
> You can put my reviewed-by on it
> 
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日18:14
> *To:* Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle <nhaehnle at gmail.com>; 
> amd-gfx at lists.freedesktop.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
> Good point as well. How about the attached version?
> 
> This time we keep an extra reference in amd_sched_process_job() until we 
> are sure that we don't need the s_fence any more.
> 
> Regards,
> Christian.
> 
> Am 13.10.2017 um 11:13 schrieb Liu, Monk:
> 
>     your patch looks good,  do you think we should also do this:
> 
>       void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
>       {
>     -       int ret = fence_signal(&fence->scheduled);
>     +       int ret;
>     +
>     +       fence_get(&fence->scheduled;)
>     +       ret = fence_signal(&fence->scheduled);
> 
>              if (!ret)
>                      FENCE_TRACE(&fence->scheduled, "signaled from irq
>     context\n");
>              else
>                      FENCE_TRACE(&fence->scheduled, "was already
>     signaled\n");
>     +       fence_put(&fence->scheduled);
>       }
> 
>     ------------------------------------------------------------------------
> 
>     *From:*Koenig, Christian
>     *Sent:* Friday, October 13, 2017 5:00:27 PM
>     *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org
>     <mailto:amd-gfx at lists.freedesktop.org>
>     *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>     Correct, but that is harmless.
> 
>     Take a look what trace_amd_sched_process_job actually does, it just
>     prints the pointer of the fence structure (but the pointer might be
>     stale at this point).
> 
>     Nevertheless you are right that this is really ugly.
> 
>     How about the attached patch to fix the issue?
> 
>     Regards,
>     Christian.
> 
>     Am 13.10.2017 um 10:51 schrieb Liu, Monk:
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         There is chance that free_job() called before that
>         “trace_amd_sched_process_job”, correct?
> 
>         And if so the s_fence referred by it maybe a wild pointer
> 
>         BR Monk
> 
>         *From:*Liu, Monk
>         *Sent:* 2017年10月13日16:49
>         *To:* Koenig, Christian <Christian.Koenig at amd.com>
>         <mailto:Christian.Koenig at amd.com>; Nicolai Hähnle
>         <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
>         amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>         *Subject:* RE: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         No that’s not true
> 
>         The free_job() is called in sched_job_finish() which is queued
>         on a WORK and scheduled from that “amd_sched_fence_finished()”
> 
>         So the finishing timing of free_job() is asynchronized with
>         sched_process_job()
> 
>         How can you sure free_job() must before
>         “trace_amd_sched_process_job” ?
> 
>         *From:*Koenig, Christian
>         *Sent:* 2017年10月13日16:43
>         *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>;
>         Nicolai Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>;
>         amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>         *Subject:* Re: regression with
>         d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>         The free_job() callback is called only way after the job has
>         finished.
> 
>         That is one change actually made by you to the code :)
> 
>         Christian.
> 
>         Am 13.10.2017 um 10:39 schrieb Liu, Monk:
> 
>             I doubt it would always work fine…
> 
>             First, we have FENCE_TRACE reference s_fence->finished after
>             “fence_signal(&fence->finished)”
> 
>             Second, we have trace_amd_sched_proess_job(s_fence) after
>             “amd_sched_fence_finished()”,
> 
>             If you put the finished before free_job() and by coincidence
>             the job_finish() get very soon executed you’ll have odds to
>             hit wild pointer on above two cases
> 
>             BR Monk
> 
>             *From:*Koenig, Christian
>             *Sent:* 2017年10月13日16:17
>             *To:* Liu, Monk <Monk.Liu at amd.com>
>             <mailto:Monk.Liu at amd.com>; Nicolai Hähnle
>             <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
>             amd-gfx at lists.freedesktop.org
>             <mailto:amd-gfx at lists.freedesktop.org>
>             *Subject:* Re: regression with
>             d6c650c0a8f6f671e49553725e1db541376d95f2
> 
>             Yeah, that change is actually incorrect and should be reverted.
> 
>             What we really need to do is remove dropping
>             sched_job->s_fence from amd_sched_process_job() into
>             amd_sched_job_finish() directly before the call to free_job().
> 
>             Regards,
>             Christian.
> 
>             Am 13.10.2017 um 09:24 schrieb Liu, Monk:
> 
>                 commit d6c650c0a8f6f671e49553725e1db541376d95f2
>                 Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
>                 <mailto:nicolai.haehnle at amd.com>
>                 @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
> 
>                                  fence = sched->ops->run_job(sched_job);
>                                  amd_sched_fence_scheduled(s_fence);
>                 +
>                 +               /* amd_sched_process_job drops the job's
>                 reference of the fence. */
>                 +               sched_job->s_fence = NULL;
>                 +
>                                  if (fence) {
>                                          s_fence->parent =
>                 dma_fence_get(fence);
>                                          r =
>                 dma_fence_add_callback(fence, &s_fence->cb,
> 
>                 Hi Nicolai
> 
>                 with this patch, you will break
>                 "amdgpu_sched_hw_job_reset()"routine:
> 
>                 void
> 
>                 amd_sched_hw_job_reset(struct amd_gpu_scheduler
> 
>                 *sched)
> 
>                 {
> 
>                      struct amd_sched_job
> 
>                 *s_job;
> 
>                      spin_lock(&sched->job_list_lock);
> 
>                      list_for_each_entry_reverse(s_job,
> 
>                 &sched->ring_mirror_list, node) {
> 
>                          if (s_job->s_fence->parent
> 
>                 &&
> 
>                           fence_remove_callback(s_job->s_fence->parent,
> 
>                                       &s_job->s_fence->cb))
> 
>                 {
> 
>                              fence_put(s_job->s_fence->parent);
> 
>                              s_job->s_fence->parent
> 
>                 =
> 
>                 NULL;
> 
>                              atomic_dec(&sched->hw_rq_count);
> 
>                          }
> 
>                      }
> 
>                      spin_unlock(&sched->job_list_lock);
> 
>                 }
> 
>                 see that without sched_job->s_fence, you cannot remove
>                 the callback from its hw fence,
> 
>                 any idea??
> 
>                 BR Monk
> 
> 
> 
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 



More information about the amd-gfx mailing list