GPU lockup CP stall for more than 10000msec on latest vanilla git

Maarten Lankhorst maarten.lankhorst at canonical.com
Wed Dec 19 05:57:56 PST 2012


On 18-12-12 17:12, Markus Trippelsdorf wrote:
> On 2012.12.18 at 16:24 +0100, Maarten Lankhorst wrote:
>> On 18-12-12 14:38, Markus Trippelsdorf wrote:
>>> On 2012.12.18 at 12:20 +0100, Michel Dänzer wrote:
>>>> On Mon, 2012-12-17 at 23:55 +0100, Markus Trippelsdorf wrote: 
>>>>> On 2012.12.17 at 23:25 +0100, Markus Trippelsdorf wrote:
>>>>>> On 2012.12.17 at 17:00 -0500, Alex Deucher wrote:
>>>>>>> On Mon, Dec 17, 2012 at 4:48 PM, Markus Trippelsdorf
>>>>>>> <markus at trippelsdorf.de> wrote:
>>>>>>>> On 2012.12.17 at 16:32 -0500, Alex Deucher wrote:
>>>>>>>>> On Mon, Dec 17, 2012 at 1:27 PM, Markus Trippelsdorf
>>>>>>>>> <markus at trippelsdorf.de> wrote:
>>>>>>>>>> As soon as I open the following website:
>>>>>>>>>> http://www.boston.com/bigpicture/2012/12/2012_year_in_pictures_part_i.html
>>>>>>>>>>
>>>>>>>>>> my Radeon RS780 stalls (GPU lockup) leaving the machine unusable:
>>>>>>>>> Is this a regression?  Most likely a 3D driver bug unless you are only
>>>>>>>>> seeing it with specific kernels.  What browser are you using and do
>>>>>>>>> you have hw accelerated webgl, etc. enabled?  If so, what version of
>>>>>>>>> mesa are you using?
>>>>>>>> This is a regression, because it is caused by yesterday's merge of
>>>>>>>> drm-next by Linus. IOW I only see this bug when running a
>>>>>>>> v3.7-9432-g9360b53 kernel.
>>>>>>> Can you bisect?  I'm guessing it may be related to the new DMA rings.  Possibly:
>>>>>>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=2d6cc7296d4ee128ab0fa3b715f0afde511f49c2
>>>>>> Yes, the commit above causes the issue. 
>>>>>>
>>>>>>  2d6cc72  GPU lockups
>>>>> With 2d6cc72 reverted I get:
>>>>>
>>>>> Dec 17 23:09:35 x4 kernel: ------------[ cut here ]------------
>>>> Probably a separate issue, can you bisect this one as well?
>>> Yes. Git-bisect points to:
>>>
>>> 85b144f860176ec18db927d6d9ecdfb24d9c6483 is the first bad commit
>>> commit 85b144f860176ec18db927d6d9ecdfb24d9c6483
>>> Author: Maarten Lankhorst <maarten.lankhorst at canonical.com>
>>> Date:   Thu Nov 29 11:36:54 2012 +0000
>>>
>>>     drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock
>>>     held, v3
>>>
>>> (Please note that this bug is a little bit harder to reproduce. But
>>> when you scroll up and down for ~10 seconds on the webpage mentioned
>>> above it will trigger the oops.
>>> So while I'm not 100% sure that the issue is caused by exactly this
>>> commit, the vicinity should be right)
>>>
>> Those dmesg warnings sound suspicious; it looks like something is going
>> very wrong there.
>>
>> Can you revert the one before it, "drm/radeon: allow move_notify to be
>> called without reservation"? Reservation should be held at this point;
>> that commit got in accidentally.
>>
>> I doubt not holding a reservation is causing it, though; I don't really
>> see how that commit could cause it. So can you please double-check that
>> the problem never happened before that point, and only started at that
>> commit?
>>
>> Also slap in a BUG_ON(!ttm_bo_is_reserved(bo)) in
>> ttm_bo_cleanup_refs_and_unlock for good measure, and a
>> BUG_ON(spin_trylock(&bdev->fence_lock)); in ttm_bo_wait.
>>
>> I really don't see how that specific commit can be wrong though, so I'm
>> awaiting your results before I try to dig into it further.
> I just reran git-bisect on only your commits (from 1a1494def to 97a875cbd)
> and I landed on the same commit as above:
>
> commit 85b144f86 (drm/ttm: call ttm_bo_cleanup_refs with reservation and lru lock held, v3)
>
> So now I'm pretty sure it's specifically this commit that started the
> issue.
>
> With your suggested debugging BUG_ONs added I still get:
>
> Dec 18 17:01:15 x4 kernel: ------------[ cut here ]------------
> Dec 18 17:01:15 x4 kernel: WARNING: at include/linux/kref.h:42 radeon_fence_ref+0x2c/0x40()
> Dec 18 17:01:15 x4 kernel: Hardware name: System Product Name
> Dec 18 17:01:15 x4 kernel: Pid: 157, comm: X Not tainted 3.7.0-rc7-00520-g85b144f-dirty #174
> Dec 18 17:01:15 x4 kernel: Call Trace:
> Dec 18 17:01:15 x4 kernel: [<ffffffff81058c84>] ? warn_slowpath_common+0x74/0xb0
> Dec 18 17:01:15 x4 kernel: [<ffffffff8129273c>] ? radeon_fence_ref+0x2c/0x40
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125e95c>] ? ttm_bo_cleanup_refs_and_unlock+0x18c/0x2d0
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f17c>] ? ttm_mem_evict_first+0x1dc/0x2a0
> Dec 18 17:01:15 x4 kernel: [<ffffffff81264452>] ? ttm_bo_man_get_node+0x62/0xb0
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125f4ce>] ? ttm_bo_mem_space+0x28e/0x340
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fb0c>] ? ttm_bo_move_buffer+0xfc/0x170
> Dec 18 17:01:15 x4 kernel: [<ffffffff810de172>] ? kmem_cache_alloc+0xb2/0xc0
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125fc15>] ? ttm_bo_validate+0x95/0x110
> Dec 18 17:01:15 x4 kernel: [<ffffffff8125ff7c>] ? ttm_bo_init+0x2ec/0x3b0
> Dec 18 17:01:15 x4 kernel: [<ffffffff8129419a>] ? radeon_bo_create+0x18a/0x200
> Dec 18 17:01:15 x4 kernel: [<ffffffff81293e80>] ? radeon_bo_clear_va+0x40/0x40
> Dec 18 17:01:15 x4 kernel: [<ffffffff812a5342>] ? radeon_gem_object_create+0x92/0x160
> Dec 18 17:01:15 x4 kernel: [<ffffffff812a575c>] ? radeon_gem_create_ioctl+0x6c/0x150
> Dec 18 17:01:15 x4 kernel: [<ffffffff812a529f>] ? radeon_gem_object_free+0x2f/0x40
> Dec 18 17:01:15 x4 kernel: [<ffffffff81246b60>] ? drm_ioctl+0x420/0x4f0
> Dec 18 17:01:15 x4 kernel: [<ffffffff812a56f0>] ? radeon_gem_pwrite_ioctl+0x20/0x20
> Dec 18 17:01:15 x4 kernel: [<ffffffff810f53a4>] ? do_vfs_ioctl+0x2e4/0x4e0
> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5588>] ? vfs_read+0x118/0x160
> Dec 18 17:01:15 x4 kernel: [<ffffffff810f55ec>] ? sys_ioctl+0x4c/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff810e5851>] ? sys_read+0x51/0xa0
> Dec 18 17:01:15 x4 kernel: [<ffffffff814b0612>] ? system_call_fastpath+0x16/0x1b
So the kref on the fence is already zero here. This should be impossible and indicates a refcounting bug somewhere, or possibly memory corruption.
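
For context, the WARNING at kref.h:42 is the sanity check in kref_get(): it fires when someone takes a reference on an object whose refcount has already dropped to zero. Roughly, paraphrasing the 3.7-era helper (not our code):

/* include/linux/kref.h, approximately: the WARN_ON trips when the object we
 * kref_get() is already at refcount 0, i.e. it has been or is being freed. */
static inline void kref_get(struct kref *kref)
{
	WARN_ON(!atomic_read(&kref->refcount));
	atomic_inc(&kref->refcount);
}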

Let's first look at where things could go wrong...

The sync_obj member requires fence_lock to be held, but the radeon code in general doesn't do that, hm...

I think radeon_cs_sync_rings needs to take fence_lock during the iteration and then take a refcount on the fence,
and radeon_crtc_page_flip and radeon_move_blit are missing the fence_lock around their fence refcounting as well.
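
Something along these lines is what I have in mind; this is just a sketch of the locking pattern, not a tested patch, and the helper name is made up:

/* Sketch only: take a reference to a bo's fence under bdev->fence_lock before
 * using it, so the fence cannot be freed under us.  radeon would do this
 * wherever it currently reads bo->sync_obj without the lock. */
static struct radeon_fence *radeon_bo_fence_get(struct ttm_buffer_object *bo)
{
	struct radeon_fence *fence = NULL;

	spin_lock(&bo->bdev->fence_lock);
	if (bo->sync_obj)
		fence = radeon_fence_ref(bo->sync_obj);
	spin_unlock(&bo->bdev->fence_lock);

	return fence; /* drop with radeon_fence_unref(&fence) when done */
}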

But that would probably still not explain why it crashes in radeon_vm_bo_invalidate shortly after,
so it seems just as likely that it's operating on freed memory there or something.

But none of the code touches refcounting for that bo, and I really don't see how I messed up anything there.

I seem to be able to reproduce it if I add a hack, though. Can you test
whether you get the exact same issues if you apply this patch?

I call it "aggressively evict the MRU buffer, and never call ddestroy", and for me it triggers merely by starting X. :-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 0bf66f9..9a8f0d8 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -512,6 +512,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 	spin_lock(&glob->lru_lock);
 	ret = ttm_bo_reserve_locked(bo, false, true, false, 0);
 
+	goto skip;
 	spin_lock(&bdev->fence_lock);
 	(void) ttm_bo_wait(bo, false, false, true);
 	if (!ret && !bo->sync_obj && 0) {
@@ -529,6 +530,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		sync_obj = driver->sync_obj_ref(bo->sync_obj);
 	spin_unlock(&bdev->fence_lock);
 
+skip:
 	if (!ret) {
 		atomic_set(&bo->reserved, 0);
 		wake_up_all(&bo->event_queue);
@@ -542,8 +544,7 @@ static void ttm_bo_cleanup_refs_or_queue(struct ttm_buffer_object *bo)
 		driver->sync_obj_flush(sync_obj);
 		driver->sync_obj_unref(&sync_obj);
 	}
-	schedule_delayed_work(&bdev->wq,
-			      ((HZ / 100) < 1) ? 1 : HZ / 100);
+	schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 
 /**
@@ -699,8 +700,7 @@ static void ttm_bo_delayed_workqueue(struct work_struct *work)
 	    container_of(work, struct ttm_bo_device, wq.work);
 
 	if (ttm_bo_delayed_delete(bdev, false)) {
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 	}
 }
 
@@ -743,8 +743,7 @@ EXPORT_SYMBOL(ttm_bo_lock_delayed_workqueue);
 void ttm_bo_unlock_delayed_workqueue(struct ttm_bo_device *bdev, int resched)
 {
 	if (resched)
-		schedule_delayed_work(&bdev->wq,
-				      ((HZ / 100) < 1) ? 1 : HZ / 100);
+		schedule_delayed_work(&bdev->wq, HZ * 100);
 }
 EXPORT_SYMBOL(ttm_bo_unlock_delayed_workqueue);
 
@@ -815,12 +814,15 @@ static int ttm_mem_evict_first(struct ttm_bo_device *bdev,
 
 retry:
 	spin_lock(&glob->lru_lock);
-	if (list_empty(&man->lru)) {
-		spin_unlock(&glob->lru_lock);
-		return -EBUSY;
-	}
+	if (list_empty(&bdev->ddestroy)) {
+		if (list_empty(&man->lru)) {
+			spin_unlock(&glob->lru_lock);
+			return -EBUSY;
+		}
+		bo = list_entry(man->lru.prev, struct ttm_buffer_object, lru);
+	} else
+		bo = list_entry(bdev->ddestroy.prev, struct ttm_buffer_object, ddestroy);
 
-	bo = list_first_entry(&man->lru, struct ttm_buffer_object, lru);
 	kref_get(&bo->list_kref);
 
 	if (!list_empty(&bo->ddestroy)) {




