GPU lockup+blackout on 2.6.35 with r600...

Daniel J Blueman daniel.blueman at gmail.com
Mon Aug 9 06:41:24 PDT 2010


On 8 August 2010 17:36, Daniel J Blueman <daniel.blueman at gmail.com> wrote:
> On 5 August 2010 16:23, Alex Deucher <alexdeucher at gmail.com> wrote:
>> On Thu, Aug 5, 2010 at 9:50 AM, Jerome Glisse <glisse at freedesktop.org> wrote:
>>> On 08/05/2010 06:52 AM, Daniel J Blueman wrote:
>>>>
>>>> After around 10-30 mins of typical usage (ie cairo rendering to
>>>> surfaces + compiz) on my Dell Studio 15 (model 1557), my R600 radeon
>>>> locks up, giving a blank screen.
>>>>
>>>> The only information I have is a backtrace [1]. 'radeontool regs'
>>>> doesn't give anything - what information, /sys files etc would be
>>>> useful here, and would it be useful to log a freedesktop.org bug
>>>> report at this stage, assuming I haven't found a reliable reproducer?
>>>>
>>>> OS is Ubuntu 10.10 development release, mesa 7.8.2, libdrm2 2.4.21.
>>>>
>>>> Thanks,
>>>>   Daniel
>>>>
>>>> --- [1]
>>>>
>>>> radeon 0000:02:00.0: GPU lockup CP stall for more than 1000msec
>>>> ------------[ cut here ]------------
>>>> WARNING: at
>>>> /home/kernel-ppa/COD/linux/drivers/gpu/drm/radeon/radeon_fence.c:235
>>>> radeon_fence_wait+0x2b7/0x320 [radeon]()
>>>> Hardware name: Studio 1557
>>>> GPU lockup (waiting for 0x00008C35 last fence id 0x00008C33)
>>>> Modules linked in: binfmt_misc parport_pc ppdev kvm_intel kvm
>>>> microcode snd_hda_codec_atihdmi joydev ipt_REJECT xt_limit xt_tcpudp
>>>> ipt_addrtype xt_state snd_hda_codec_idt ip6table_filter ip6_tables
>>>> snd_hda_intel nf_nat_irc nf_conntrack_irc snd_hda_codec nf_nat_ftp
>>>> nf_nat snd_hwdep nf_conntrack_ipv4 snd_seq_midi arc4 snd_pcm
>>>> snd_rawmidi nf_defrag_ipv4 radeon nf_conntrack_ftp nf_conntrack
>>>> iptable_filter snd_seq_midi_event snd_seq ip_tables ttm iwlagn video
>>>> snd_timer uvcvideo x_tables snd_seq_device output drm_kms_helper lp
>>>> dell_laptop iwlcore drm videodev dcdbas dell_wmi v4l1_compat mac80211
>>>> i7core_edac parport v4l2_compat_ioctl32 snd psmouse edac_core
>>>> i2c_algo_bit soundcore cfg80211 snd_page_alloc serio_raw sdhci_pci
>>>> sdhci ahci libahci led_class r8169 mii btrfs zlib_deflate crc32c
>>>> libcrc32c
>>>> Pid: 1624, comm: Xorg Tainted: G      D     2.6.35-020635-generic #020635
>>>> Call Trace:
>>>> ? radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> warn_slowpath_common+0x90/0xc0
>>>> warn_slowpath_fmt+0x6e/0x70
>>>> ? schedule_timeout+0x15a/0x2e0
>>>> ? r600_irq_set+0x27d/0xc00 [radeon]
>>>> ? radeon_ring_commit+0xa3/0xb0 [radeon]
>>>> ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
>>>> ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
>>>> radeon_fence_wait+0x2b7/0x320 [radeon]
>>>> ? autoremove_wake_function+0x0/0x40
>>>> radeon_sync_obj_wait+0x11/0x20 [radeon]
>>>> ttm_bo_wait+0x102/0x1b0 [ttm]
>>>> ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
>>>> radeon_move_blit+0x124/0x170 [radeon]
>>>> radeon_bo_move+0xda/0x1a0 [radeon]
>>>> ttm_bo_handle_move_mem+0xec/0x370 [ttm]
>>>> ttm_bo_evict+0x1cc/0x270 [ttm]
>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>> ttm_mem_evict_first+0xed/0x180 [ttm]
>>>> ? ttm_bo_man_get_node+0xd5/0xe0 [ttm]
>>>> ttm_bo_mem_force_space+0xab/0x110 [ttm]
>>>> ttm_bo_mem_space+0x305/0x370 [ttm]
>>>> ttm_bo_move_buffer+0xcf/0x140 [ttm]
>>>> ? drm_mm_split_at_start+0x1d/0x80 [drm]
>>>> ttm_bo_validate+0xd5/0x100 [ttm]
>>>> ttm_bo_init+0x1f2/0x240 [ttm]
>>>> radeon_bo_create+0x121/0x240 [radeon]
>>>> ? radeon_ttm_bo_destroy+0x0/0x80 [radeon]
>>>> radeon_gem_object_create+0x89/0x100 [radeon]
>>>> ? drm_gem_handle_create+0x8b/0xa0 [drm]
>>>> radeon_gem_create_ioctl+0x58/0xe0 [radeon]
>>>> drm_ioctl+0x283/0x460 [drm]
>>>> ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
>>>> ? do_readv_writev+0x16b/0x1e0
>>>> vfs_ioctl+0x3a/0xc0
>>>> do_vfs_ioctl+0x6d/0x1f0
>>>> sys_ioctl+0x87/0xa0
>>>> ? sys_writev+0x55/0xb0
>>>> system_call_fastpath+0x16/0x1b
>>>> ---[ end trace c0bc12025fa8386c ]---
>>>> radeon 0000:02:00.0: GPU softreset
>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0xA0003028
>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>> radeon 0000:02:00.0:   R_008020_GRBM_SOFT_RESET=0x00007FEE
>>>> radeon 0000:02:00.0: R_008020_GRBM_SOFT_RESET=0x00000001
>>>> radeon 0000:02:00.0:   R_008010_GRBM_STATUS=0x00003028
>>>> radeon 0000:02:00.0:   R_008014_GRBM_STATUS2=0x00000002
>>>> radeon 0000:02:00.0:   R_000E50_SRBM_STATUS=0x200000C0
>>>> radeon 0000:02:00.0: GPU reset succeed
>>>> Clocks initialized !
>>>> ring test succeeded in 1 usecs
>>>> ib test succeeded in 1 usecs
>>>
>>> So it keeps resetting? According to the log there is a GPU lockup but
>>> then a successful GPU reset, so the GPU should resume fine (that's what
>>> the log says). Best is to open a bug and attach the full dmesg + lspci
>>> -vv output and the context in which the lockup happens.
>>
>> Perhaps the lockup timeout is too short, or interrupts are being
>> delivered late, or something? There's an fdo or kernel bugzilla entry
>> where removing the lockup timeout fixed the issue. Does something
>> like this:
>>
>> --- a/drivers/gpu/drm/radeon/radeon_fence.c
>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
>> @@ -237,9 +237,11 @@ retry:
>>                         * as signaled for now
>>                         */
>>                        rdev->gpu_lockup = true;
>> +#if 0
>>                        r = radeon_gpu_reset(rdev);
>>                        if (r)
>>                                return r;
>> +#endif
>>                        WREG32(rdev->fence_drv.scratch_reg, fence->seq);
>>                        rdev->gpu_lockup = false;
>>                }
>>
>> fix the issue?
>
> Correction, RV710.
>
> Well, RADEON_FENCE_JIFFIES_TIMEOUT plus 1000ms is already
> excruciatingly long for the command processor not to have responded,
> but the issue I'm really hitting here is that X is left hanging after
> the GPU reset. The GPU lockup was being reported at:
>
> [<ffffffffa03b66d7>] ? radeon_fence_wait+0x2b7/0x320 [radeon]
> [<ffffffff81061370>] warn_slowpath_common+0x90/0xc0
> [<ffffffff8106149e>] warn_slowpath_fmt+0x6e/0x70
> [<ffffffff81575f8a>] ? schedule_timeout+0x15a/0x2e0
> [<ffffffffa03f5a9d>] ? r600_irq_set+0x27d/0xc00 [radeon]
> [<ffffffffa03ce1f3>] ? radeon_ring_commit+0xa3/0xb0 [radeon]
> [<ffffffffa03d5412>] ? r100_gpu_cp_is_lockup+0xc2/0xd0 [radeon]
> [<ffffffffa040063b>] ? r600_gpu_is_lockup+0x1cb/0x220 [radeon]
> [<ffffffffa03b66d7>] radeon_fence_wait+0x2b7/0x320 [radeon]
> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>
> X consistently hangs after GPU reset at (with sysrq-T):
>
> [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
> [<ffffffff8157647b>] mutex_lock+0x2b/0x50
> [<ffffffffa0415aba>] radeon_pm_set_clocks+0x3a/0x2d0 [radeon]
> [<ffffffff812abc51>] ? idr_get_new_above+0x11/0x40
> [<ffffffffa01679ab>] ? drm_mode_object_get+0x6b/0xc0 [drm]
> [<ffffffff811410ae>] ? kmem_cache_alloc_notrace+0x8e/0xd0
> [<ffffffffa041662d>] radeon_pm_compute_clocks+0xcd/0x240 [radeon]
> [<ffffffffa03a9d21>] atombios_crtc_mode_fixup+0x31/0x60 [radeon]
> [<ffffffffa02ba13a>] drm_crtc_helper_set_mode+0x16a/0x370 [drm_kms_helper]
> [<ffffffffa02ba3bd>] drm_helper_resume_force_mode+0x7d/0x150 [drm_kms_helper]
> [<ffffffffa03a2f35>] radeon_gpu_reset+0x85/0xd0 [radeon]
> [<ffffffffa03b66e7>] radeon_fence_wait+0x2c7/0x320 [radeon]
> [<ffffffff8107fcd0>] ? autoremove_wake_function+0x0/0x40
> [<ffffffffa03b74a1>] radeon_sync_obj_wait+0x11/0x20 [radeon]
> [<ffffffffa02ff332>] ttm_bo_wait+0x102/0x1b0 [ttm]
> [<ffffffffa0302ede>] ttm_bo_move_accel_cleanup+0x19e/0x230 [ttm]
> [<ffffffffa03b75f4>] radeon_move_blit+0x124/0x170 [radeon]
>
> I couldn't reproduce it on demand (though I did hit it a number of
> times over the last two weeks), so I have been analysing what we have.
> The kernel the backtrace is against isn't built with debug info, so:
>
> radeon_pm_set_clocks():
>   8ca80:       55                      push   %rbp
>   8ca81:       48 89 e5                mov    %rsp,%rbp
>   8ca84:       41 57                   push   %r15
>   8ca86:       41 56                   push   %r14
>   8ca88:       41 55                   push   %r13
>   8ca8a:       41 54                   push   %r12
>   8ca8c:       53                      push   %rbx
>   8ca8d:       48 83 ec 48             sub    $0x48,%rsp
>   8ca91:       e8 00 00 00 00          callq  8ca96 <radeon_pm_set_clocks+0x16>
>   8ca96:       49 89 fc                mov    %rdi,%r12
>   8ca99:       48 8b 7f 08             mov    0x8(%rdi),%rdi
>   8ca9d:       48 83 c7 20             add    $0x20,%rdi
>   8caa1:       e8 00 00 00 00          callq  8caa6 <radeon_pm_set_clocks+0x26>
>   8caa6:       49 8d 84 24 a0 1c 00    lea    0x1ca0(%r12),%rax
>   8caad:       00
>   8caae:       48 89 c7                mov    %rax,%rdi
>   8cab1:       48 89 45 90             mov    %rax,-0x70(%rbp)
>   8cab5:       e8 00 00 00 00          callq  8caba <radeon_pm_set_clocks+0x3a>
> -> 8caba:       49 8d 8c 24 e8 09 00    lea    0x9e8(%r12),%rcx
>   8cac1:       00
>   8cac2:       48 89 cf                mov    %rcx,%rdi
>   8cac5:       48 89 4d 98             mov    %rcx,-0x68(%rbp)
>   8cac9:       e8 00 00 00 00          callq  8cace <radeon_pm_set_clocks+0x4e>
>   8cace:       41 83 7c 24 78 1a       cmpl   $0x1a,0x78(%r12)
>   8cad4:       0f 86 46 01 00 00       jbe    8cc20
> <radeon_pm_set_clocks+0x1a0>
>   8cada:       41 80 bc 24 d0 0d 00    cmpb   $0x0,0xdd0(%r12)
>   8cae1:       00 00
>
> This corresponds to:
>
> static void radeon_pm_set_clocks(struct radeon_device *rdev)
> {
>        int i;
>
>        mutex_lock(&rdev->ddev->struct_mutex);
>        mutex_lock(&rdev->vram_mutex);
> ->      mutex_lock(&rdev->cp.mutex);
>
> We know %r12 holds the rdev struct pointer (%rdi, the first function
> argument, gets loaded into %r12), and 0x9e8 is plausible as the offset
> of cp.mutex, with vram_mutex further out at a plausible 0x1ca0, so it
> seems the code is trying to acquire cp.mutex.
>
> From the kernel log, the command processor ring test completed after
> the start of the GPU reset; that is the last call site to unlock
> cp.mutex, ie
> <radeon_fence_wait->radeon_gpu_reset->radeon_resume->rv770_resume->rv770_startup->r600_cp_resume->radeon_ring_test->r600_ring_test->radeon_ring_unlock_commit>
> before the hang at
> <radeon_fence_wait->radeon_gpu_reset->drm_helper_resume_force_mode->drm_crtc_helper_set_mode->mode_fixup->atombios_crtc_mode_fixup->radeon_pm_compute_clocks->radeon_pm_set_clocks>.
>
> Looking through all the calls between these two spots, I just can't
> see anything that _could_ have locked cp.mutex, unless I'm misreading
> the disassembly of mutex_lock.
>
> Any thoughts?

Well, looking further back down the call chain, radeon_bo_create takes
the vram_mutex across the ttm_bo_init call, so I was duped by the
compiler's instruction scheduling.

I looked at various changes to the locking scheme that did address
the X hang (and thus the blackouts), but every attempt triggered
lockdep violations somewhere, so there is no minimal fix without
revisiting the recent locking scheme changes [1,2,3].

I'll pass on some other minor fixes I've cooked up in the meantime.
Note that I am running with the kernel boot options 'dynpm=1
radeon.dynclks=1' to stop everything getting toasty; let me know if
you need more info for this issue.

Thanks,
  Daniel

[1] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37917
[2] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37923
[3] http://kerneltrap.org/mailarchive/git-commits-head/2010/5/21/37937
-- 
Daniel J Blueman


More information about the dri-devel mailing list