amdgpu display corruption and hang on AMD A10-9620P
Carlo Caione
carlo at endlessm.com
Thu Jun 15 06:46:16 UTC 2017
On Mon, Jun 12, 2017 at 12:24 PM, Carlo Caione <carlo at endlessm.com> wrote:
> On Tue, May 9, 2017 at 7:03 PM, Deucher, Alexander
> <Alexander.Deucher at amd.com> wrote:
>>> -----Original Message-----
>>> From: Daniel Drake [mailto:drake at endlessm.com]
>>> Sent: Tuesday, May 09, 2017 12:55 PM
>>> To: dri-devel; amd-gfx at lists.freedesktop.org; Deucher, Alexander
>>> Cc: Chris Chiu; Linux Upstreaming Team
>>> Subject: amdgpu display corruption and hang on AMD A10-9620P
>>>
>>> Hi,
>>>
>>> We are working with new laptops that have the AMD Bristol Ridge
>>> chipset with this SoC:
>>>
>>> AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
>>>
>>> I think this is the Bristol Ridge chipset.
>>>
>>> During boot, the display becomes unusable at the point where the
>>> amdgpu driver loads. You can see at least two horizontal lines of
>>> garbage at this point. We have reproduced on 4.8, 4.10 and linus
>>> master (early 4.12).
>>>
>>> Photo: http://pasteboard.co/qrC9mh4p.jpg
>>>
>>> Getting logs is tricky because the system appears to freeze at that point.
>>>
>>> Is this a known issue? Anything we can do to help diagnosis?
>>
>> I'm not aware of any specific issues. Please file a bug and attach your logs (https://bugs.freedesktop.org) along with information about the system.
>
> Opened https://bugs.freedesktop.org/show_bug.cgi?id=101387 to trace
> this bug. I also have attached there the full log we get when
> modprobing amdgpu.
> Reporting here only the trace for the sake of documentation (full log
> attached to the bug opened on freedesktop)
>
> [ 80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942
> [ 80.766937]
> [ 80.766408] Kernel panic - not syncing: stack-protector: Kernel
> stack is corrupted in: ffffffffc0c88942
> [ 80.766408]
> [ 80.766428] CPU: 1 PID: 1594 Comm: modprobe Not tainted 4.11.3+ #2
> [ 80.766431] Hardware name: Acer Aspire A515-41G/Wartortle_BS, BIOS
> V0.09 04/19/2017
> [ 80.766434] Call Trace:
> [ 80.766445] dump_stack+0x63/0x90
> [ 80.766451] panic+0xe8/0x236
> [ 80.766526] ? amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [ 80.766537] __stack_chk_fail+0x1b/0x20
> [ 80.766571] amdgpu_atombios_crtc_powergate_init+0x52/0x60 [amdgpu]
> [ 80.766610] dce_v11_0_hw_init+0x3e/0x2d0 [amdgpu]
> [ 80.766643] amdgpu_device_init+0xe23/0x13c0 [amdgpu]
> [ 80.766647] ? kmalloc_order+0x18/0x40
> [ 80.766650] ? kmalloc_order_trace+0x24/0xa0
> [ 80.766683] amdgpu_driver_load_kms+0x5d/0x240 [amdgpu]
> [ 80.766708] drm_dev_register+0x148/0x1e0 [drm]
> [ 80.766721] drm_get_pci_dev+0xa0/0x160 [drm]
> [ 80.766754] amdgpu_pci_probe+0xb9/0xf0 [amdgpu]
> [ 80.766759] local_pci_probe+0x45/0xa0
> [ 80.766762] pci_device_probe+0xf4/0x150
> [ 80.766768] driver_probe_device+0x2c5/0x470
> [ 80.766772] __driver_attach+0xdf/0xf0
> [ 80.766776] ? driver_probe_device+0x470/0x470
> [ 80.766780] bus_for_each_dev+0x6c/0xc0
> [ 80.766784] driver_attach+0x1e/0x20
> [ 80.766787] bus_add_driver+0x45/0x270
> [ 80.766790] ? 0xffffffffc09a8000
> [ 80.766794] driver_register+0x60/0xe0
> [ 80.766796] ? 0xffffffffc09a8000
> [ 80.766799] __pci_register_driver+0x4c/0x50
> [ 80.766811] drm_pci_init+0xed/0x100 [drm]
> [ 80.766816] ? vga_switcheroo_register_handler+0x6c/0x90
> [ 80.766819] ? 0xffffffffc09a8000
> [ 80.766850] amdgpu_init+0x9b/0xac [amdgpu]
> [ 80.766855] do_one_initcall+0x53/0x1c0
> [ 80.766860] ? __vunmap+0x81/0xd0
> [ 80.766865] ? kmem_cache_alloc_trace+0xdb/0x1b0
> [ 80.766868] ? kfree+0x161/0x170
> [ 80.766876] do_init_module+0x60/0x202
> [ 80.766881] load_module+0x2612/0x29f0
> [ 80.766885] SYSC_finit_module+0xa6/0xf0
> [ 80.766888] ? SYSC_finit_module+0xa6/0xf0
> [ 80.766892] SyS_finit_module+0xe/0x10
> [ 80.766896] entry_SYSCALL_64_fastpath+0x1e/0xad
> [ 80.766899] RIP: 0033:0x7fa525e60709
> [ 80.766902] RSP: 002b:00007fff2f5bbbf8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [ 80.766905] RAX: ffffffffffffffda RBX: 00007fa526129760 RCX: 00007fa525e60709
> [ 80.766908] RDX: 0000000000000000 RSI: 000055f51f1c9439 RDI: 000000000000000b
> [ 80.766910] RBP: 0000000000000070 R08: 0000000000000000 R09: 000055f51fcd83f0
> [ 80.766913] R10: 000000000000000b R11: 0000000000000246 R12: 000055f51fcd9ff0
> [ 80.766915] R13: 0000000000000007 R14: 00007fa5261297b8 R15: 0000000000002710
> [ 80.766931] Kernel Offset: 0x22800000 from 0xffffffff81000000
> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 80.766937] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: ffffffffc0c88942

Trying to move this discussion here for more visibility. This is what
is happening.

In amdgpu_atombios_crtc_powergate_init() we declare
ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args as the parameter space.
This is 32 bytes wide and is passed down to the atombios interpreter
in ctx->ps.

When amdgpu_atombios_crtc_powergate_init() is called, it triggers the
parsing of the command table with index == 13 [>> execute C5C0 (len
589, WS 0, PS 0)]. During the execution of this table several
CALL_TABLE (op == 82) instructions are executed. In more detail, we
first jump to the table with index == 78 [>> execute F166 (len 588,
WS 0, PS 8)], then to the table with index == 51 [>> execute F446
(len 465, WS 4, PS 4)] and then to the table with index == 75
[>> execute F6CC (len 1330, WS 4, PS 0)] before finally reaching the
EOT for table 13. By the time we return to
amdgpu_atombios_crtc_powergate_init() the stack is already corrupted.

The corruption happens during the execution of the code in table 75
[>> execute F6CC (len 1330, WS 4, PS 0)]. In this table a MOVE_PS is
executed with a destination index == 1, which accesses ctx->ps[idx]
and causes the stack corruption.
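
To illustrate the kind of write involved, here is a minimal sketch of a
bounds-checked parameter-space store. It is not the kernel code: struct
ps_window and ps_write_checked() are made-up names, and the idea that a
callee's window starts some dwords into the caller's buffer is an
assumption about the interpreter, not something taken from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a parameter-space window; not the kernel's struct. */
struct ps_window {
    uint32_t *base;    /* first dword of the buffer the caller declared */
    size_t    ndwords; /* dwords actually reserved in that buffer */
    size_t    shift;   /* dwords assumed to be claimed by outer tables */
};

/*
 * Store val at ps[idx] as seen by the current table.
 * Returns 0 on success, -1 if the store would land outside the buffer
 * (an unchecked store at that point would scribble over the caller's stack).
 */
int ps_write_checked(struct ps_window *w, size_t idx, uint32_t val)
{
    assert(w != NULL && w->base != NULL);

    if (w->shift > w->ndwords || idx >= w->ndwords - w->shift)
        return -1;

    w->base[w->shift + idx] = val;
    return 0;
}
```

A check of this shape would turn the silent overwrite into an error
return at the MOVE_PS itself, which is easier to spot than a
stack-protector panic on function exit.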

My first guess here is that something is wrong in the atombios code.
Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the
table I basically see only *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS,
etc.) and just two *_PS instructions (MOVE_PS and OR_PS), which (guess
what) are the instructions causing the stack corruption. So the *_PS
opcodes in the atombios may be wrong and should actually be *_WS
opcodes.

Another possibility is that the atombios interpreter is doing
something wrong. After a CALL_TABLE, don't we need to allocate the
parameter space (ctx->ps) for the command table we are about to
execute, sized to match the PS size in that table's header? IIUC the
code in the kernel, ctx->ps is not reallocated when we jump to a
different table.
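
To make the question concrete, here is a rough sketch of the
bookkeeping I have in mind. The helper names are made up, and it
assumes that each CALL_TABLE advances the parameter-space window by
the PS size (in bytes) declared in the caller's header while the
underlying buffer stays the one declared by the original caller. I
have not verified that this matches the interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the PS byte sizes declared by each table on
 * the call path (outermost first), return the dword offset at which the
 * next callee's parameter space would start under the assumption above.
 */
size_t ps_dword_offset(const uint8_t *ps_bytes, size_t depth)
{
    size_t off = 0;

    assert(ps_bytes != NULL || depth == 0);
    for (size_t i = 0; i < depth; i++)
        off += ps_bytes[i] / 4; /* PS sizes are bytes, ps[] holds dwords */
    return off;
}

/* Returns 1 if a store to ps[idx] in the callee stays inside alloc_bytes. */
int ps_write_in_bounds(size_t alloc_bytes, size_t callee_off, size_t idx)
{
    return (callee_off + idx) < alloc_bytes / 4;
}
```

Feeding in the PS sizes of tables 13, 78 and 51 gives the assumed start
of table 75's window, and ps_write_in_bounds() then says whether index
1 still fits in whatever the original caller put on the stack.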
Thanks,
--
Carlo Caione | +39.340.80.30.096 | Endless