GPU lockup CP stall for more than 10000msec on latest vanilla git

Maarten Lankhorst maarten.lankhorst at canonical.com
Wed Dec 19 06:31:50 PST 2012


Op 19-12-12 15:20, Markus Trippelsdorf schreef:
> On 2012.12.19 at 14:57 +0100, Maarten Lankhorst wrote:
>> Op 18-12-12 17:12, Markus Trippelsdorf schreef:
>>> With your supposed debugging BUG_ONs added I still get:
>>>
>>> Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
>>> Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
>>> Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
>>> Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
>>> Dec 18 17:01:15 x4 kernel: Call Trace:
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
>>> Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b
>> so the kref to fence is null here. This should be impossible and
>> indicates a bug in refcounting somewhere, or possibly memory
>> corruption.
>>
>> Lets first look where things could go wrong..
>>
>> sync_obj member requires fence_lock to be taken, but radeon code in
>> general doesn't do that, hm..
>>
>> I think radeon_cs_sync_rings needs to take fence_lock during the
>> iteration, then taking on a refcount to the fence, and
>> radeon_crtc_page_flip and radeon_move_blit are lacking refcount on
>> fence_lock as well.
>>
>> But that would probably still not explain why it crashes in
>> radeon_vm_bo_invalidate shortly after, so it seems just as likely that
>> it's operating on freed memory there or something.
>>
>> But none of the code touches refcounting for that bo, and I really
>> don't see how I messed up anything there.
>>
>> I seem to be able to reproduce it if I add a hack though, can you test
>> if you get the exact same issues if you apply this patch?
> Your patch doesn't apply unfortunately:
>
> markus at x4 linux % patch -p1 --dry-run < ~/maarten.patch
> checking file drivers/gpu/drm/ttm/ttm_bo.c
> Hunk #1 succeeded at 512 with fuzz 1.
> Hunk #6 FAILED at 814.
> 1 out of 6 hunks FAILED
> markus at x4 linux % git describe
> v3.7-10833-g752451f
> markus at x4 linux % 
It applies on top of the regressed commit. It should probably not be too hard to make it apply
manually on whatever you're using.

But the real fix will be "drm/ttm: fix delayed ttm_bo_cleanup_refs_and_unlock delayed handling",
which I cc'd you on. The patch I posted earlier in this thread will just aggressively stress test the codepath.

~Maarten



More information about the dri-devel mailing list