regression with d6c650c0a8f6f671e49553725e1db541376d95f2
Liu, Monk
Monk.Liu at amd.com
Fri Oct 13 11:15:25 UTC 2017
Because you didn't try GPU reset feature
-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of Tom St Denis
Sent: 2017年10月13日 19:11
To: amd-gfx at lists.freedesktop.org
Subject: Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
At the very least I don't get the hangs like I used to before the patchset so this patch isn't regressing any of that behaviour.
Tom
On 13/10/17 07:06 AM, Liu, Monk wrote:
> Yeah, this one looks good
>
> You can put my reviewed-by on it
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日18:14
> *To:* Liu, Monk <Monk.Liu at amd.com>; Nicolai Hähnle
> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
> *Subject:* Re: regression with
> d6c650c0a8f6f671e49553725e1db541376d95f2
>
> Good point as well. How about the attached version?
>
> This time we keep an extra reference in amd_sched_process_job() until
> we are sure that we don't need the s_fence any more.
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 11:13 schrieb Liu, Monk:
>
> your patch looks good, do you think we should also do this:
>
> void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
> {
> - int ret = fence_signal(&fence->scheduled);
> + int ret;
> +
> + fence_get(&fence->scheduled;)
> + ret = fence_signal(&fence->scheduled);
>
> if (!ret)
> FENCE_TRACE(&fence->scheduled, "signaled from irq
> context\n");
> else
> FENCE_TRACE(&fence->scheduled, "was already
> signaled\n");
> + fence_put(&fence->scheduled);
> }
>
>
> ----------------------------------------------------------------------
> --
>
> *From:*Koenig, Christian
> *Sent:* Friday, October 13, 2017 5:00:27 PM
> *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: regression with
> d6c650c0a8f6f671e49553725e1db541376d95f2
>
> There is chance that free_job() called before that
> “trace_amd_sched_process_job”, correct?
>
> Correct, but that is harmless.
>
> Take a look what trace_amd_sched_process_job actually does, it just
> prints the pointer of the fence structure (but the pointer might be
> stale at this point).
>
> Nevertheless you are right that this is really ugly.
>
> How about the attached patch to fix the issue?
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>
> The free_job() is called in sched_job_finish() which is queued
> on a WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with
> sched_process_job()
>
> There is chance that free_job() called before that
> “trace_amd_sched_process_job”, correct?
>
> And if so the s_fence referred by it maybe a wild pointer
>
> BR Monk
>
> *From:*Liu, Monk
> *Sent:* 2017年10月13日16:49
> *To:* Koenig, Christian <Christian.Koenig at amd.com>
> <mailto:Christian.Koenig at amd.com>; Nicolai Hähnle
> <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* RE: regression with
> d6c650c0a8f6f671e49553725e1db541376d95f2
>
> No that’s not true
>
> The free_job() is called in sched_job_finish() which is queued
> on a WORK and scheduled from that “amd_sched_fence_finished()”
>
> So the finishing timing of free_job() is asynchronized with
> sched_process_job()
>
> How can you sure free_job() must before
> “trace_amd_sched_process_job” ?
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:43
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>;
> Nicolai Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>;
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: regression with
> d6c650c0a8f6f671e49553725e1db541376d95f2
>
> The free_job() callback is called only way after the job has
> finished.
>
> That is one change actually made by you to the code :)
>
> Christian.
>
> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>
> I doubt it would always work fine…
>
> First, we have FENCE_TRACE reference s_fence->finished after
> “fence_signal(&fence->finished)”
>
> Second, we have trace_amd_sched_proess_job(s_fence) after
> “amd_sched_fence_finished()”,
>
> If you put the finished before free_job() and by coincidence
> the job_finish() get very soon executed you’ll have odds to
> hit wild pointer on above two cases
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:17
> *To:* Liu, Monk <Monk.Liu at amd.com>
> <mailto:Monk.Liu at amd.com>; Nicolai Hähnle
> <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
> amd-gfx at lists.freedesktop.org
> <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: regression with
> d6c650c0a8f6f671e49553725e1db541376d95f2
>
> Yeah, that change is actually incorrect and should be reverted.
>
> What we really need to do is remove dropping
> sched_job->s_fence from amd_sched_process_job() into
> amd_sched_job_finish() directly before the call to free_job().
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>
> commit d6c650c0a8f6f671e49553725e1db541376d95f2
> Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
> <mailto:nicolai.haehnle at amd.com>
> @@ -611,6 +611,10 @@ static int amd_sched_main(void
> *param)
>
> fence = sched->ops->run_job(sched_job);
> amd_sched_fence_scheduled(s_fence);
> +
> + /* amd_sched_process_job drops the job's
> reference of the fence. */
> + sched_job->s_fence = NULL;
> +
> if (fence) {
> s_fence->parent =
> dma_fence_get(fence);
> r =
> dma_fence_add_callback(fence, &s_fence->cb,
>
> Hi Nicolai
>
> with this patch, you will break
> "amdgpu_sched_hw_job_reset()"routine:
>
> void
>
> amd_sched_hw_job_reset(struct amd_gpu_scheduler
>
> *sched)
>
> {
>
> struct amd_sched_job
>
> *s_job;
>
> spin_lock(&sched->job_list_lock);
>
> list_for_each_entry_reverse(s_job,
>
> &sched->ring_mirror_list, node) {
>
> if (s_job->s_fence->parent
>
> &&
>
>
> fence_remove_callback(s_job->s_fence->parent,
>
> &s_job->s_fence->cb))
>
> {
>
> fence_put(s_job->s_fence->parent);
>
> s_job->s_fence->parent
>
> =
>
> NULL;
>
> atomic_dec(&sched->hw_rq_count);
>
> }
>
> }
>
> spin_unlock(&sched->job_list_lock);
>
> }
>
> see that without sched_job->s_fence, you cannot remove
> the callback from its hw fence,
>
> any idea??
>
> BR Monk
>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx at lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
More information about the amd-gfx
mailing list