[PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

Christian König christian.koenig at amd.com
Fri Oct 13 15:20:17 UTC 2017


On 13.10.2017 at 16:34, Michel Dänzer wrote:
> On 12/10/17 07:11 PM, Christian König wrote:
>> On 12.10.2017 at 18:49, Michel Dänzer wrote:
>>> On 12/10/17 01:00 PM, Michel Dänzer wrote:
>>>> [0] I also got this, but I don't know yet if it's related:
>>> No, that seems to be a separate issue; I can still reproduce it with the
>>> huge page related changes reverted. Unfortunately, it doesn't seem to
>>> happen reliably on every piglit run.
>> Can you enable KASAN in your kernel,
> KASAN caught something else at the beginning of piglit, see the attached
> dmesg excerpt. Not sure it's related though.
>
> amdgpu_job_free_cb+0x13d/0x160 decodes to:
>
> amd_sched_get_job_priority at .../drivers/gpu/drm/amd/amdgpu/../scheduler/gpu_scheduler.h:182
>
> static inline enum amd_sched_priority
> amd_sched_get_job_priority(struct amd_sched_job *job)
> {
> 	return (job->s_entity->rq - job->sched->sched_rq); <===
> }
>
>   (inlined by) amdgpu_job_free_cb at .../drivers/gpu/drm/amd/amdgpu/amdgpu_job.c:107
>
> 	amdgpu_ring_priority_put(job->ring, amd_sched_get_job_priority(s_job));

Sounds a lot like the code Andres added is buggy somehow. Going to take 
a look as well.
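
For readers following along, here is a minimal userspace sketch of what the quoted helper computes. All sketch_* names are hypothetical stand-ins, not the real scheduler structures. It only shows that the priority is the index of the entity's run queue inside the scheduler's run-queue array, and that the lookup dereferences job->s_entity, so it is only meaningful while that entity is still alive. Whether that is what KASAN flagged here is not established in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real structures carry far more state. */
enum sketch_priority {
	SKETCH_PRIO_MIN,
	SKETCH_PRIO_NORMAL,
	SKETCH_PRIO_HIGH,
	SKETCH_PRIO_MAX		/* number of run queues, not a real priority */
};

struct sketch_rq {
	int unused;
};

struct sketch_sched {
	/* One run queue per priority level, stored contiguously. */
	struct sketch_rq sched_rq[SKETCH_PRIO_MAX];
};

struct sketch_entity {
	struct sketch_rq *rq;	/* points into sched->sched_rq[] */
};

struct sketch_job {
	struct sketch_sched *sched;
	struct sketch_entity *s_entity;
};

/*
 * Same shape as the quoted helper: subtracting the array base from the
 * entity's run-queue pointer yields the array index, which is the priority.
 * Both job->s_entity and job->sched are dereferenced, so both must be valid.
 */
static enum sketch_priority
sketch_get_job_priority(const struct sketch_job *job)
{
	assert(job != NULL && job->s_entity != NULL && job->sched != NULL);
	return (enum sketch_priority)(job->s_entity->rq - job->sched->sched_rq);
}
```

With an entity whose rq points at sched_rq[SKETCH_PRIO_HIGH], the helper returns SKETCH_PRIO_HIGH.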

>> and please look up at which line number amdgpu_vm_bo_invalidate+0x88
>> is.
> Looks like it's this line:
>
> 		if (evicted && bo->tbo.resv == vm->root.base.bo->tbo.resv) {
>
> Maybe vm or vm->root.base.bo is NULL?
Ah, of course!

We need to reserve the page directory root when we release it; 
otherwise we can run into a race with somebody else trying to evict it.
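
The idea can be sketched in userspace C, under loud assumptions: a pthread mutex stands in for the BO reservation, and every sketch_* name is hypothetical. This is not the patch, only an illustration of why the release path should hold the same lock the eviction path takes before it looks at the VM's root.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_vm;

struct sketch_bo {
	pthread_mutex_t resv;	/* stands in for the reservation lock */
	struct sketch_vm *vm;	/* back pointer, NULL once detached */
};

struct sketch_vm {
	struct sketch_bo *root;	/* stands in for the page directory root */
};

static void sketch_init(struct sketch_vm *vm, struct sketch_bo *bo)
{
	pthread_mutex_init(&bo->resv, NULL);
	bo->vm = vm;
	vm->root = bo;
}

/*
 * Release path: detach the root while holding its reservation, so a
 * concurrent evictor holding the same lock never sees a half-torn-down VM.
 */
static void sketch_vm_release_root(struct sketch_vm *vm)
{
	struct sketch_bo *bo = vm->root;

	assert(bo != NULL);
	pthread_mutex_lock(&bo->resv);
	bo->vm = NULL;
	vm->root = NULL;
	pthread_mutex_unlock(&bo->resv);
}

/*
 * Eviction path: with the reservation held, the VM pointers are either
 * fully valid or already cleared. Returns true if the BO is still the
 * root of a live VM.
 */
static bool sketch_bo_invalidate(struct sketch_bo *bo)
{
	bool is_live_root;

	pthread_mutex_lock(&bo->resv);
	is_live_root = bo->vm != NULL && bo->vm->root == bo;
	pthread_mutex_unlock(&bo->resv);
	return is_live_root;
}
```

Without the lock in the release path, the evictor could read bo->vm or vm->root while they are being cleared, which matches the kind of NULL dereference suspected above.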

Going to send a patch in a minute,
Christian.

