Losing completion interrupts with amdgpu on rx460

Matthew Macy mmacy at nextbsd.org
Wed Dec 28 08:35:32 UTC 2016


 ---- On Tue, 27 Dec 2016 12:51:37 -0800 Christian König <christian.koenig at amd.com> wrote ---- 
 > It's a well-known problem that the completion interrupts are notoriously
 > unreliable.
 > 
 > That's why we have a fallback timer in amdgpu_fence.c which kicks an 
 > extra hardware probe after a certain timeout. Please double check that 
 > this one is working as expected.

I'm digging into why the fallback process isn't signalling the straggling fences.


        do {
                last_seq = atomic_read(&ring->fence_drv.last_seq);
                seq = amdgpu_fence_read(ring);

        } while (atomic_cmpxchg(&drv->last_seq, last_seq, seq) != last_seq);

        if (seq != ring->fence_drv.sync_seq) {
                printf("rescheduling fallback for %s\n", ring->name);
                amdgpu_fence_schedule_fallback(ring);
        }
        if (unlikely(seq == last_seq)) {
                printf("seek == last_seq == %u skipping fence_process\n", seq);
                return;
        }

Dec 28 00:22:31 daleks kernel: &fence->finished at 79042060348 f 353#2026: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042062972 f 0#4598: signaled from process context
Dec 28 00:22:31 daleks kernel: &fence->scheduled at 79042069573 f 74#2353: signaled from irq context
Dec 28 00:22:31 daleks kernel: skipping fallback scheduling for gfx
Dec 28 00:22:31 daleks kernel: &fence->finished at 79042112606 f 75#2353: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042115268 f 0#4599: signaled from process context
Dec 28 00:22:31 daleks kernel: &fence->scheduled at 79042168961 f 352#2027: signaled from irq context
Dec 28 00:22:31 daleks kernel: skipping fallback scheduling for gfx
Dec 28 00:22:31 daleks kernel: &fence->finished at 79042234434 f 353#2027: signaled from irq context
Dec 28 00:22:31 daleks kernel: fence at 79042237108 f 0#4600: signaled from process context
Dec 28 00:22:31 daleks kernel: 353#2028 sleeping tid 100721 at 79042673751
Dec 28 00:22:31 daleks kernel: running fence fallback for sdma0
Dec 28 00:22:31 daleks kernel: seek == last_seq == 607 skipping fence_process
Dec 28 00:22:31 daleks kernel: running fence fallback for gfx
Dec 28 00:22:31 daleks kernel: seek == last_seq == 4600 skipping fence_process


It looks like the sequence numbers are saying that the device did in fact complete? Too tired to think about it further now.

 > 
 > Another possibility is that the memory where the fence is written 
 > doesn't have the proper attributes (e.g. USWC vs. cached vs. uncached).

The only places where I see memory attributes being set are in amdgpu_device_init for rmmio and the doorbell BAR mapping in amdgpu_doorbell_init. The ioremap function will remap the memory uncacheable. The driver is unmodified from Linus' tree as of "drm/amdgpu: add gart recovery by gtt list V2", about two thirds of the way through 4.9-rc1 (modulo git merge issues). Is there anywhere else I should be looking? Turning on INVARIANTS, which scribbles over memory on free (and thus aggressively flushes the cache), causes the hangs to take much longer to occur, leading me to believe that it may well be a memory typing issue.
 

Thanks for getting back to me. 

-M

P.S.

A bit of a tangent, but maybe you could also clarify whether I'm doing something wrong when replaying commits from Linus' tree. The way I get the changesets and the sequence is by doing:
% git format-patch v4.8..v4.9-rc1 drivers/gpu/drm/*.* drivers/gpu/drm/i915 drivers/gpu/drm/amd drivers/gpu/drm/radeon include/drm include/uapi/drm

'git am' fails much of the time even when there aren't conflicts, so what I do is git cherry-pick the changesets in the order that they show up in the generated patches. I frequently end up with empty commits, and sometimes the drivers don't end up with all the requisite changes merged in, so they don't compile.




 > Regards,
 > Christian.
 > 
 > Am 26.12.2016 um 02:54 schrieb Matthew Macy:
 > > I'm running an rx460 using the amdgpu driver from Linux 4.8 with Mesa 13/LLVM 3.9 and Xorg 1.18 on FreeBSD. It seems to largely perform pretty well.
 > >
 > > However, ever since I got Mesa working I will inevitably end up losing completion interrupts after X has been running for a brief period. I can bring the problem on more quickly by running glxgears with vblank_mode=0. It's a safe bet that the problem is with the linuxkpi. However, since this bug is manifesting itself in a very hardware-specific way I'm coming here for advice on what I can do to dump device state to better understand why it ceases to fire interrupts.
 > >
 > > I enabled FENCE_TRACE and added some logging to fence creation and fence_default_wait as well. The last interrupt in this particular excerpt is:
 > >
 > > "Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context"
 > >
 > > amdgpu_cs_wait goes on to sleep on 411#116530 and never wake up. Any guidance would be much appreciated. Thanks in advance.
 > >
 > >
 > >
 > >
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
 > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850212762 f 86#116944: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 864, wptr 880
 > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850251259 f 411#116528: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: fence at 210850253222 f 0#233742: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 880, wptr 880
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116529 411#116529 @210850271550
 > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 880, wptr 896
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:gfx_v8_0_eop_irq] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: IH: CP EOP
 > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850308909 f 87#116944: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: fence at 210850310670 f 0#233743: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 896, wptr 896
 > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850325151 f 410#116529: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: created fence 86#116945 87#116945 @210850375328
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl]
 > > Dec 22 22:36:22 daleks kernel: &fence->scheduled at 210850389385 f 86#116945: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_ih_process: rptr 896, wptr 912
 > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850416620 f 411#116529: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: fence at 210850418382 f 0#233744: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] created fence 410#116530 411#116530 @210850440720
 > > Dec 22 22:36:22 daleks kernel: amdgpu_ih_process: rptr 912, wptr 912
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 912, wptr 928
 > > Dec 22 22:36:22 daleks kernel: [drm:gfx_v8_0_eop_irq] IH: CP EOP
 > > Dec 22 22:36:22 daleks kernel: &fence->finished at 210850475397 f 87#116945: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: fence at 210850477167 f 0#233745: signaled from irq context
 > > Dec 22 22:36:22 daleks kernel: pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:amdgpu_ih_process] amdgpu_ih_process: rptr 928, wptr 928
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: created fence 86#116946 87#116946 @210850557790
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: created fence 410#116531 411#116531 @210850614023
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100699, dev=0xe200, auth=1, AMDGPU_CS
 > > Dec 22 22:36:22 daleks kernel: created fence 86#116947 87#116947 @210850719230
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] pid=100793, dev=0xe200, auth=1, AMDGPU_WAIT_CS
 > > Dec 22 22:36:22 daleks kernel: [drm:drm_ioctl] amdgpu_cs_wait on 411#116530
 > > Dec 22 22:36:22 daleks kernel: pid=100699, dev=0xe200, auth=1, AMDGPU_BO_LIST
 > > Dec 22 22:36:22 daleks kernel: 411#116530 sleeping tid 100793 at 210850747487
 > >
 > >
 > > -M
 > >
 > > _______________________________________________
 > > amd-gfx mailing list
 > > amd-gfx at lists.freedesktop.org
 > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
 > 
 > 



