regression with d6c650c0a8f6f671e49553725e1db541376d95f2
Christian König
christian.koenig at amd.com
Fri Oct 13 10:13:58 UTC 2017
Good point as well. How about the attached version?
This time we keep an extra reference in amd_sched_process_job() until we
are sure that we don't need the s_fence any more.
Regards,
Christian.
Am 13.10.2017 um 11:13 schrieb Liu, Monk:
>
> your patch looks good, do you think we should also do this:
>
>
> void amd_sched_fence_scheduled(struct amd_sched_fence *fence)
> {
> - int ret = fence_signal(&fence->scheduled);
> + int ret;
> +
> + fence_get(&fence->scheduled;)
> + ret = fence_signal(&fence->scheduled);
>
> if (!ret)
> FENCE_TRACE(&fence->scheduled, "signaled from irq
> context\n");
> else
> FENCE_TRACE(&fence->scheduled, "was already signaled\n");
> + fence_put(&fence->scheduled);
> }
>
>
> ------------------------------------------------------------------------
> *From:* Koenig, Christian
> *Sent:* Friday, October 13, 2017 5:00:27 PM
> *To:* Liu, Monk; Nicolai Hähnle; amd-gfx at lists.freedesktop.org
> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> There is chance that free_job() called before that
>> “trace_amd_sched_process_job”, correct?
>>
> Correct, but that is harmless.
>
> Take a look what trace_amd_sched_process_job actually does, it just
> prints the pointer of the fence structure (but the pointer might be
> stale at this point).
>
> Nevertheless you are right that this is really ugly.
>
> How about the attached patch to fix the issue?
>
> Regards,
> Christian.
>
> Am 13.10.2017 um 10:51 schrieb Liu, Monk:
>>
>> The free_job() is called in sched_job_finish() which is queued on a
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with
>> sched_process_job()
>>
>> There is chance that free_job() called before that
>> “trace_amd_sched_process_job”, correct?
>>
>> And if so the s_fence referred by it maybe a wild pointer
>>
>> BR Monk
>>
>> *From:*Liu, Monk
>> *Sent:* 2017年10月13日16:49
>> *To:* Koenig, Christian <Christian.Koenig at amd.com>; Nicolai Hähnle
>> <nhaehnle at gmail.com>; amd-gfx at lists.freedesktop.org
>> *Subject:* RE: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> No that’s not true
>>
>> The free_job() is called in sched_job_finish() which is queued on a
>> WORK and scheduled from that “amd_sched_fence_finished()”
>>
>> So the finishing timing of free_job() is asynchronized with
>> sched_process_job()
>>
>> How can you sure free_job() must before “trace_amd_sched_process_job”?
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:43
>> *To:* Liu, Monk <Monk.Liu at amd.com <mailto:Monk.Liu at amd.com>>; Nicolai
>> Hähnle <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>>;
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: regression with d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> The free_job() callback is called only way after the job has finished.
>>
>> That is one change actually made by you to the code :)
>>
>> Christian.
>>
>> Am 13.10.2017 um 10:39 schrieb Liu, Monk:
>>
>> I doubt it would always work fine…
>>
>> First, we have FENCE_TRACE reference s_fence->finished after
>> “fence_signal(&fence->finished)”
>>
>> Second, we have trace_amd_sched_proess_job(s_fence) after
>> “amd_sched_fence_finished()”,
>>
>> If you put the finished before free_job() and by coincidence the
>> job_finish() get very soon executed you’ll have odds to hit wild
>> pointer on above two cases
>>
>> BR Monk
>>
>> *From:*Koenig, Christian
>> *Sent:* 2017年10月13日16:17
>> *To:* Liu, Monk <Monk.Liu at amd.com> <mailto:Monk.Liu at amd.com>;
>> Nicolai Hähnle <nhaehnle at gmail.com> <mailto:nhaehnle at gmail.com>;
>> amd-gfx at lists.freedesktop.org <mailto:amd-gfx at lists.freedesktop.org>
>> *Subject:* Re: regression with
>> d6c650c0a8f6f671e49553725e1db541376d95f2
>>
>> Yeah, that change is actually incorrect and should be reverted.
>>
>> What we really need to do is remove dropping sched_job->s_fence
>> from amd_sched_process_job() into amd_sched_job_finish() directly
>> before the call to free_job().
>>
>> Regards,
>> Christian.
>>
>> Am 13.10.2017 um 09:24 schrieb Liu, Monk:
>>
>> commit d6c650c0a8f6f671e49553725e1db541376d95f2
>> Author: Nicolai Hähnle <nicolai.haehnle at amd.com>
>> <mailto:nicolai.haehnle at amd.com>
>> @@ -611,6 +611,10 @@ static int amd_sched_main(void *param)
>>
>> fence = sched->ops->run_job(sched_job);
>> amd_sched_fence_scheduled(s_fence);
>> +
>> + /* amd_sched_process_job drops the job's
>> reference of the fence. */
>> + sched_job->s_fence = NULL;
>> +
>> if (fence) {
>> s_fence->parent = dma_fence_get(fence);
>> r = dma_fence_add_callback(fence,
>> &s_fence->cb,
>>
>> Hi Nicolai
>>
>> with this patch, you will break
>> "amdgpu_sched_hw_job_reset()"routine:
>>
>> void
>>
>> amd_sched_hw_job_reset(structamd_gpu_scheduler
>>
>> *sched)
>>
>> {
>>
>> structamd_sched_job
>>
>> *s_job;
>>
>> spin_lock(&sched->job_list_lock);
>>
>> list_for_each_entry_reverse(s_job,
>>
>> &sched->ring_mirror_list, node) {
>>
>> if(s_job->s_fence->parent
>>
>> &&
>>
>> fence_remove_callback(s_job->s_fence->parent,
>>
>> &s_job->s_fence->cb))
>>
>> {
>>
>> fence_put(s_job->s_fence->parent);
>>
>> s_job->s_fence->parent
>>
>> =
>>
>> NULL;
>>
>> atomic_dec(&sched->hw_rq_count);
>>
>> }
>>
>> }
>>
>> spin_unlock(&sched->job_list_lock);
>>
>> }
>>
>> see that without sched_job->s_fence, you cannot remove the
>> callback from its hw fence,
>>
>> any idea??
>>
>> BR Monk
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-amd-sched-fix-job-tear-down-order-v2.patch
Type: text/x-patch
Size: 1853 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20171013/7f14204b/attachment-0001.bin>
More information about the amd-gfx
mailing list