GPU lockup CP stall for more than 10000msec on latest vanilla git

Markus Trippelsdorf markus at trippelsdorf.de
Tue Dec 18 08:12:38 PST 2012


On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote:
> On 18-12-12 14:38, Markus Trippelsdorf wrote:
> > On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote:
> >> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: 
> >>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
> >>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
> >>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
> >>>>> <markus at trippelsdorf.de> wrote:
> >>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
> >>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
> >>>>>>> <markus at trippelsdorf.de> wrote:
> >>>>>>>> As soon as I open the following website:
> >>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
> >>>>>>>>
> >>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
> >>>>>>> Is this a regression?  Most likely a 3D driver bug unless you are only
> >>>>>>> seeing it with specific kernels.  What browser are you using and do
> >>>>>>> you have hw accelerated webgl, etc. enabled?  If so, what version of
> >>>>>>> mesa are you using?
> >>>>>> This is a regression, because it is caused by yesterday's merge of
> >>>>>> drm-next by Linus. IOW I only see this bug when running a
> >>>>>> v3.7-9432-g9360b53 kernel.
> >>>>> Can you bisect?  I'm guessing it may be related to the new DMA rings.  Possibly:
> >>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
> >>>> Yes, the commit above causes the issue. 
> >>>>
> >>>>  2d6cc72  GPU lockups
> >>> With 2d6cc72 reverted I get:
> >>>
> >>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
> >> Probably a separate issue, can you bisect this one as well?
> > Yes. Git-bisect points to:
> >
> > 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit
> > commit 85b144f860176ec18db927d6d9ecdfb24d9c6483
> > Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
> > Date:   Thu Nov 29 11:36:54 2012 +0000
> >
> >     drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock
> >     held, v3
> >
> > (Please note that this bug is a little bit harder to reproduce, but
> > when you scroll up and down for ~10 seconds on the webpage mentioned
> > above it will trigger the oops.
> > So while I'm not 100% sure that the issue is caused by exactly this
> > commit, the vicinity should be right.)
> >
> Those dmesg warnings sound suspicious, looks like something is going
> very wrong there.
> 
> Can you revert the one before it, "drm/radeon: allow move_notify to be
> called without reservation"? Reservation should be held at this point;
> that commit got in accidentally.
> 
> I doubt that not holding a reservation is causing it, though, and I
> don't really see how that commit could cause it. So can you please
> double check that it never happened before that point, and only
> started at that commit?
> 
> Also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in
> ttm_bo_cleanup_refs_and_unlock for good measure, and a
> BUG_ON(spin_trylock(&bdev->fence_lock)); in ttm_bo_wait.
> 
> I really don't see how that specific commit can be wrong though, so
> awaiting your results first before I try to dig more into it.
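
For reference, the second assertion relies on spin_trylock() returning
nonzero only when it manages to take the lock, so
BUG_ON(spin_trylock(...)) fires exactly when the caller did not already
hold the lock. Below is a minimal user-space sketch of that idiom in
Python. It is not kernel code: assert_held and the threading.Lock
standing in for the fence lock are illustrative assumptions.

```python
import threading


def assert_held(lock):
    """Raise if `lock` is free, loosely mimicking BUG_ON(spin_trylock(&lock))."""
    if lock.acquire(blocking=False):
        # The trylock succeeded, so nobody was holding the lock.
        lock.release()
        raise AssertionError("lock was expected to be held")


# Hypothetical stand-in for bdev->fence_lock.
fence_lock = threading.Lock()

with fence_lock:
    # The lock is held here, so the check passes silently.
    assert_held(fence_lock)
```

Calling assert_held(fence_lock) outside the with block would raise,
which is the situation the BUG_ON is meant to catch.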

I just reran git-bisect on only your commits (from 1a1494def to 97a875cbd)
and landed on the same commit as above:

commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3)

So now I'm pretty sure it's specifically this commit that started the
issue.
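
For anyone unfamiliar with it, git-bisect is essentially a binary search
over the commits between a known-good and a known-bad revision. Below is
a minimal Python sketch of the idea. It assumes a linear history and a
failure that reproduces reliably, which is not quite true for this bug.
first_bad, the commit labels and is_bad are placeholders, not git's
actual implementation.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is assumed good
    and commits[-1] is assumed bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history in which everything from "c5" onwards is broken.
history = ["c%d" % i for i in range(8)]
print(first_bad(history, lambda c: int(c[1:]) >= 5))  # c5
```

If the test is flaky, a single false "good" answer sends the search into
the wrong half, which is why a hard-to-trigger oops can land a few
commits away from the real culprit.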

With your suggested debugging BUG_ONs added I still get:

Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
Dec 18 17:01:15 x4 kernel: Call Trace:
Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0
Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40
Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0
Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340
Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170
Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110
Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0
Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200
Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40
Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160
Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40
Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160
Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b
Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51e ]---
Dec 18 17:01:15 x4 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000024
Dec 18 17:01:15 x4 kernel: IP: [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: PGD 211d09067 PUD 211d52067 PMD 0
Dec 18 17:01:15 x4 kernel: Oops: 0002 [#1] SMP
Dec 18 17:01:15 x4 kernel: CPU 1
Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Tainted: G        W    3.7.0-rc7-00520-g85b144f-dirty #174 System manufacturer System Product Name/M4A78T-E
Dec 18 17:01:15 x4 kernel: RIP: 0010:[<ffffffff81296488>]  [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: RSP: 0018:ffff880211ddfaa8  EFLAGS: 00010203
Dec 18 17:01:15 x4 kernel: RAX: 0000000000000000 RBX: ffff8801f94e1c48 RCX: ffff880205de3128
Dec 18 17:01:15 x4 kernel: RDX: 0000000000000001 RSI: ffff8801f94e1df0 RDI: ffff8801f94e1df8
Dec 18 17:01:15 x4 kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
Dec 18 17:01:15 x4 kernel: R10: 0000000000000000 R11: ffff880216a766b8 R12: ffff880216a76590
Dec 18 17:01:15 x4 kernel: R13: ffffffff818383e0 R14: 0000000000000001 R15: ffff880215c83678
Dec 18 17:01:15 x4 kernel: FS:  00007fbcabc8c880(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
Dec 18 17:01:15 x4 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 18 17:01:15 x4 kernel: CR2: 0000000000000024 CR3: 0000000211d07000 CR4: 00000000000007e0
Dec 18 17:01:15 x4 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 18 17:01:15 x4 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 18 17:01:15 x4 kernel: Process X (pid: 157, threadinfo ffff880211dde000, task ffff880211dc0ba0)
Dec 18 17:01:15 x4 kernel: Stack:
Dec 18 17:01:15 x4 kernel: ffffffff8125d2e9 ffff8801f94e1c48 ffffffff8125e909 ffff880216a769b8
Dec 18 17:01:15 x4 kernel: 01ff880200000001 ffff8801f94e1c84 0000000000000001 ffff880216a766b8
Dec 18 17:01:15 x4 kernel: 0000000000000000 ffff880215c83678 ffff8801f94e1c48 ffffffff8125f17c
Dec 18 17:01:15 x4 kernel: Call Trace:
Dec 18 17:01:15 x4 kernel: [<ffffffff8125d2e9>] ? ttm_bo_cleanup_memtype_use+0x19/0x90
Dec 18 17:01:15 x4 kernel: [<ffffffff8125e909>] ? ttm_bo_cleanup_refs_and_unlock+0x139/0x2d0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0
Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340
Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170
Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110
Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0
Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200
Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40
Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160
Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150
Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
Dec 18 17:01:15 x4 kernel: [<ffffffff8111c310>] ? fsnotify_clear_marks_by_inode+0x20/0xd0
Dec 18 17:01:15 x4 kernel: [<ffffffff810fbc35>] ? __destroy_inode+0x15/0x60
Dec 18 17:01:15 x4 kernel: [<ffffffff810de220>] ? kmem_cache_free+0x10/0x90
Dec 18 17:01:15 x4 kernel: [<ffffffff810f8eaf>] ? dput+0x2f/0x300
Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
Dec 18 17:01:15 x4 kernel: [<ffffffff811005fb>] ? mntput_no_expire+0x7b/0x170
Dec 18 17:01:15 x4 kernel: [<ffffffff8107bb6b>] ? lg_global_unlock+0x3b/0x50
Dec 18 17:01:15 x4 kernel: [<ffffffff81071b9c>] ? task_work_run+0x8c/0xc0
Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b
Dec 18 17:01:15 x4 kernel: Code: 8b 44 24 04 48 83 c4 08 5b 5d 41 5c c3 66 0f 1f 44 00 00 48 8b 86 f0 01 00 00 48 81 c6 f0 01 00 00 48 39 f0 74 11 0f 1f 44 00 00 <c6> 40 24 00 48 8b 00 48 39 f0 75 f4 f3 c3 66 2e 0f 1f 84 00 00
Dec 18 17:01:15 x4 kernel: RIP  [<ffffffff81296488>] radeon_vm_bo_invalidate+0x18/0x30
Dec 18 17:01:15 x4 kernel: RSP <ffff880211ddfaa8>
Dec 18 17:01:15 x4 kernel: CR2: 0000000000000024
Dec 18 17:01:15 x4 kernel: ---[ end trace 485a2dd5755db51f ]---
Dec 18 17:01:15 x4 kernel: [drm:drm_release] *ERROR* Device busy: 1
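
As a sanity check on the oops above: the marked byte in the Code: line
is where the faulting instruction starts. The bytes c6 40 24 00 decode
to a one-byte store of 0 to 0x24(%rax), and RAX is 0 in the register
dump, which matches the faulting address in CR2. Below is a minimal
Python sketch of that arithmetic. The helper names are made up, and it
handles only this one instruction encoding.

```python
# Values copied from the oops above.
CODE_LINE = (
    "8b 44 24 04 48 83 c4 08 5b 5d 41 5c c3 66 0f 1f 44 00 00 48 8b 86 "
    "f0 01 00 00 48 81 c6 f0 01 00 00 48 39 f0 74 11 0f 1f 44 00 00 "
    "<c6> 40 24 00 48 8b 00 48 39 f0 75 f4 f3 c3 66 2e 0f 1f 84 00 00"
)
RAX = 0x0
CR2 = 0x24


def faulting_bytes(code_line):
    """Return the byte values starting at the <..> marked byte."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def store_address(insn, rax):
    """Effective address of 'movb $imm8, disp8(%rax)' (opcode c6 /0)."""
    opcode, modrm, disp8 = insn[0], insn[1], insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xC6 or reg != 0 or mod != 1 or rm != 0:
        raise ValueError("not the simple c6 /0 disp8(%rax) form")
    if disp8 >= 0x80:  # disp8 is a signed 8-bit displacement
        disp8 -= 0x100
    return (rax + disp8) & 0xFFFFFFFFFFFFFFFF


insn = faulting_bytes(CODE_LINE)
print(hex(store_address(insn, RAX)), hex(CR2))
```

So the write goes through a NULL pointer in RAX at offset 0x24.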

-- 
Markus

