[git pull] drm for 6.1-rc1
Dave Airlie
airlied at gmail.com
Fri Oct 7 02:45:32 UTC 2022
On Fri, 7 Oct 2022 at 09:45, Linus Torvalds
<torvalds at linux-foundation.org> wrote:
>
> On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airlied at gmail.com> wrote:
> >
> >
> > [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> > [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
>
> As far as I can tell, that's the line
>
> struct drm_gpu_scheduler *sched = s_fence->sched;
>
> where 's_fence' is NULL. The code is
>
>    0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>    5:   41 54                   push   %r12
>    7:   55                      push   %rbp
>    8:   53                      push   %rbx
>    9:   48 89 fb                mov    %rdi,%rbx
>    c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp    <-- trapping instruction
>   13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
>   1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax
>
> and that next 'lock decl' instruction would have been the
>
> atomic_dec(&sched->hw_rq_count);
>
> at the top of drm_sched_job_done().
>
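A minimal userspace sketch of why the fault address in the oops is 0x88: loading a member through a NULL struct pointer touches address 0 plus the member's offset. The struct below is hypothetical and only padded so that 'sched' sits at 0x88, matching the trapping 'mov 0x88(%rdi),%rbp'; it is not the real struct drm_sched_fence layout, and the toy_* names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real struct drm_sched_fence: the padding is
 * sized only so that 'sched' lands at offset 0x88, as seen in the oops. */
struct toy_sched_fence {
	unsigned char other_members[0x88];
	void *sched;
};

/* Address that a load of base->sched would touch, computed without
 * actually dereferencing the pointer. */
uintptr_t toy_sched_load_address(uintptr_t base)
{
	return base + offsetof(struct toy_sched_fence, sched);
}

/* With a NULL base, the load address is just the member offset. */
uintptr_t toy_null_fault_address(void)
{
	uintptr_t addr = toy_sched_load_address((uintptr_t)0);

	assert(addr == offsetof(struct toy_sched_fence, sched));
	return addr;
}
```

Under that assumed layout, toy_null_fault_address() yields 0x88, the same value as the "address: 0000000000000088" line in the report.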
> Now, as to *why* you'd have a NULL s_fence, it would seem that
> drm_sched_job_cleanup() was called with an active job. Looking at that
> code, it does
>
>         if (kref_read(&job->s_fence->finished.refcount)) {
>                 /* drm_sched_job_arm() has been called */
>                 dma_fence_put(&job->s_fence->finished);
>         ...
>
> but then it does
>
> job->s_fence = NULL;
>
> anyway, despite the job still being active. The logic of that kind of
> "fake refcount" escapes me. The above looks fundamentally racy, not to
> say pointless and wrong (a refcount is a _count_, not a flag, so there
> could be multiple references to it, what says that you can just
> decrement one of them and say "I'm done").
>
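To make the "count, not a flag" objection concrete, here is a small userspace sketch, assuming a toy refcount built on C11 atomics. Every toy_* name is hypothetical and none of this is the kernel's kref, dma_fence or drm_sched_job code; it only mirrors the shape of the quoted cleanup (test for nonzero, drop one reference, clear the pointer).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the fence and the job that points at it. */
struct toy_fence {
	atomic_int refcount;
};

struct toy_job {
	struct toy_fence *s_fence;
};

void toy_fence_get(struct toy_fence *f)
{
	assert(f != NULL);
	atomic_fetch_add(&f->refcount, 1);
}

/* Drops one reference and returns how many remain. */
int toy_fence_put(struct toy_fence *f)
{
	assert(f != NULL);
	return atomic_fetch_sub(&f->refcount, 1) - 1;
}

/* Same shape as the quoted cleanup: a nonzero count is treated as an
 * "armed" flag, one reference is dropped, and the pointer is cleared
 * regardless of how many references are still outstanding. */
int toy_job_cleanup(struct toy_job *job)
{
	int remaining = 0;

	if (atomic_load(&job->s_fence->refcount))
		remaining = toy_fence_put(job->s_fence);
	job->s_fence = NULL;
	return remaining;
}

/* Two holders, one cleanup. Returns 1 if the job's pointer is already
 * NULL while another reference to the fence is still live. */
int toy_cleanup_leaves_live_ref(void)
{
	struct toy_fence fence;
	struct toy_job job = { .s_fence = &fence };
	int remaining;

	atomic_init(&fence.refcount, 1);	/* "armed": first reference */
	toy_fence_get(&fence);			/* second holder, e.g. a completion path */
	remaining = toy_job_cleanup(&job);
	return job.s_fence == NULL && remaining == 1;
}
```

In this toy scenario the second holder still owns a reference after cleanup, yet anything that reaches the fence through job.s_fence now sees NULL. Whether that is the exact sequence behind the oops is not established in this thread.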
> Now, _why_ any of that happens, I have no idea. I'm just looking at
> the immediate "that pointer is NULL" thing, and reacting to what looks
> like a completely bogus refcount pattern.
>
> But that odd refcount pattern isn't new, so it's presumably some user
> on the amd gpu side that changed.
>
> The problem hasn't happened again for me, but that's not saying a lot,
> since it was very random to begin with.

I tracked the culprit down to a drm sched patch; I'll send you a pull
with a revert in it.

commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86
Author: Arvind Yadav <Arvind.Yadav at amd.com>
Date:   Wed Sep 14 22:13:20 2022 +0530

    drm/sched: Use parent fence instead of finished

    Using the parent fence instead of the finished fence
    to get the job status. This change is to avoid GPU
    scheduler timeout error which can cause GPU reset.

    Signed-off-by: Arvind Yadav <Arvind.Yadav at amd.com>
    Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky at amd.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20220914164321.2156-6-Arvind.Yadav@amd.com
    Signed-off-by: Christian König <christian.koenig at amd.com>

I'll let Arvind and Christian maybe work out what is going wrong there.
Dave.
>
> Linus