regression with d6c650c0a8f6f671e49553725e1db541376d95f2
Nicolai Hähnle
nhaehnle at gmail.com
Tue Oct 17 08:57:55 UTC 2017
On 13.10.2017 10:46, Liu, Monk wrote:
> Just revert Nicolai’s patch,if other routine want to reference s_fence,
> it should get the finished fence in the first place/time,
>
> For gpu_reset routine, it refers to s_fence only on those unfinished job
> in sched_hw_job_reset, so totally safe to refer to s_fence pointer
>
> I wonder what issue Nicolai met with and submitted this patch
The original motivation of my patch was to catch accidental use of
job->s_fence after the fence was destroyed in amd_sched_process_job.
Basically, prevent a dangling pointer. I still think that's a reasonable
idea, though clearly my first attempt at it was just wrong.
In Christian's v2 patch, it might make sense to add
spin_unlock(&sched->job_list_lock);
+ dma_fence_put(&s_job->s_fence->finished);
+ s_job->s_fence = NULL;
sched->ops->free_job(s_job);
... though I'm not 100% certain about how the fence lifetimes work.
Cheers,
Nicolai
>
> BR Monk
>
> *From:*amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] *On Behalf
> Of *Liu, Monk
> *Sent:* 2017年10月13日16:40
> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle
> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> I doubt it would always work fine…
>
> First, we have FENCE_TRACE reference s_fence->finished after
> “fence_signal(&fence->finished)”
>
> Second, we have trace_amd_sched_proess_job(s_fence) after
> “amd_sched_fence_finished()”,
>
> If you put the finished before free_job() and by coincidence the
> job_finish() get very soon executed you’ll have odds to hit wild pointer
> on above two cases
>
> BR Monk
>
> *From:*Koenig, Christian
> *Sent:* 2017年10月13日16:17
> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Nicolai
> Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>;
> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>
> Yeah, that change is actually incorrect and should be reverted.
>
> What we really need to do is remove dropping sched_job->s_fence from
> amd_sched_process_job() into amd_sched_job_finish() directly before the
> call to free_job().
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>
> commit d6c650c0a8f6f671e49553725e1db541376d95f2
> Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
> <mailto:nicolai.haehnle at amd.com>
> @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>
> fence = sched->ops->run_job(sched_job);
> amd_sched_fence_scheduled(s_fence);
> +
> + /* amd_sched_process_job drops the job's reference
> of the fence. */
> + sched_job->s_fence = NULL;
> +
> if (fence) {
> s_fence->parent = dma_fence_get(fence);
> r = dma_fence_add_callback(fence, &s_fence->cb,
>
> Hi Nicolai
>
> with this patch, you will break "amdgpu_sched_hw_job_reset()"routine:
>
> void
>
> amd_sched_hw_job_reset(structamd_gpu_scheduler
>
> *sched)
>
> {
>
> structamd_sched_job
>
> *s_job;
>
> spin_lock(&sched->job_list_lock);
>
> list_for_each_entry_reverse(s_job,
>
> &sched->ring_mirror_list, node) {
>
> if(s_job->s_fence->parent
>
> &&
>
> fence_remove_callback(s_job->s_fence->parent,
>
> &s_job->s_fence->cb))
>
> {
>
> fence_put(s_job->s_fence->parent);
>
> s_job->s_fence->parent
>
> =
>
> NULL;
>
> atomic_dec(&sched->hw_rq_count);
>
> }
>
> }
>
> spin_unlock(&sched->job_list_lock);
>
> }
>
> see that without sched_job->s_fence, you cannot remove the callback
> from its hw fence,
>
> any idea??
>
> BR Monk
>
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
More information about the amd-gfx
mailing list