Stack out of bounds in KFD on Arcturus

Zeng, Oak Oak.Zeng at amd.com
Tue Oct 22 17:46:51 UTC 2019


Sorry, I searched my Kconfig and could not find a stack size option anymore. Maybe the kernel stack size is no longer configurable.

Can you try your kernel on Vega10, Vega20, or Navi10? We want to know whether this is an MI100-specific issue.

Oak
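
For reference, the stack size question above can be checked against a kernel config file. The following is a minimal Python sketch, not a definitive method. It assumes that on x86-64 kernels of this era the stack size is fixed at build time as PAGE_SIZE << (2 + 1 if CONFIG_KASAN else 0), i.e. 16 KiB without KASAN and 32 KiB with it. That rule is an assumption that should be verified against the headers of the kernel actually in use. The helper names (parse_kconfig, x86_64_thread_size) and the sample config text are hypothetical.

```python
import re

PAGE_SIZE = 4096  # 4 KiB pages on x86-64


def parse_kconfig(text):
    """Map CONFIG_* names to their values from .config-style text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"(CONFIG_\w+)=(.*)", line)
        if match:
            options[match.group(1)] = match.group(2).strip('"')
            continue
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        match = re.match(r"# (CONFIG_\w+) is not set", line)
        if match:
            options[match.group(1)] = "n"
    return options


def x86_64_thread_size(options):
    """Assumed rule: order 2 (16 KiB), plus one order when KASAN is on."""
    kasan_order = 1 if options.get("CONFIG_KASAN") == "y" else 0
    return PAGE_SIZE << (2 + kasan_order)


# Hypothetical config excerpt, for illustration only.
sample = """\
CONFIG_KASAN=y
# CONFIG_VMAP_STACK is not set
CONFIG_FRAME_WARN=2048
"""

opts = parse_kconfig(sample)
for name in ("CONFIG_KASAN", "CONFIG_VMAP_STACK", "CONFIG_FRAME_WARN"):
    print(name, "=", opts.get(name, "(absent)"))
print("assumed kernel stack size:", x86_64_thread_size(opts), "bytes")
```

On a running system the config text is commonly available as /boot/config-<release> or, when the kernel exports it, as /proc/config.gz. Reading either one into parse_kconfig() is left out here to keep the sketch self-contained.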

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com> 
Sent: Tuesday, October 22, 2019 1:28 PM
To: Zeng, Oak <Oak.Zeng at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
Cc: amd-gfx at lists.freedesktop.org
Subject: Re: Stack out of bounds in KFD on Arcturus

I don't know. Which Kconfig flag should I look at?

Andrey

On 10/22/19 1:17 PM, Zeng, Oak wrote:
> Sorry, I meant: is the kernel stack size 16 KB in your Kconfig?
>
> Oak
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
> Sent: Tuesday, October 22, 2019 12:49 PM
> To: Zeng, Oak <Oak.Zeng at amd.com>; Kuehling, Felix <Felix.Kuehling at amd.com>
> Cc: amd-gfx at lists.freedesktop.org
> Subject: Re: Stack out of bounds in KFD on Arcturus
>
> On 10/18/19 5:31 PM, Zeng, Oak wrote:
>
>> Hi Andrey,
>>
>> What is your system configuration? I didn’t see this issue before. Also see attached QA's configuration - you can compare to see any difference.
>
> Attached is my lshw
>
>> Also, I believe the default kernel stack size for x86-64 is 16 KB. Is that what your Kconfig has?
>
> What do you mean by asking whether this is my Kconfig? Is there a particular Kconfig flag that I can look for?
>
> Andrey
>
>
>> Regards,
>> Oak
>>
>> -----Original Message-----
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> On Behalf Of Kuehling, Felix
>> Sent: Friday, October 18, 2019 4:55 PM
>> To: Grodzovsky, Andrey <Andrey.Grodzovsky at amd.com>
>> Cc: amd-gfx at lists.freedesktop.org
>> Subject: Re: Stack out of bounds in KFD on Arcturus
>>
>> On 2019-10-17 6:38 p.m., Grodzovsky, Andrey wrote:
>>> Not that I am aware of. Is there a special Kconfig flag that determines the
>>> stack size?
>> I remember there used to be a Kconfig option to force a 4KB kernel stack. I don't see it in the current kernel any more.
>>
>> I don't have time to work on this myself. I'll create a ticket and see if I can find someone to investigate.
>>
>> Thanks,
>>      Felix
>>
>>
>>> Andrey
>>>
>>> On 10/17/19 5:29 PM, Kuehling, Felix wrote:
>>>> I don't see why this problem would be specific to Arcturus. I don't 
>>>> see any excessive allocations on the stack either. Also the code 
>>>> involved here hasn't changed recently.
>>>>
>>>> Are you using some weird kernel config with a smaller stack? Is it 
>>>> specific to a compiler version or some optimization flags? I've 
>>>> sometimes seen function inlining cause excessive stack usage.
>>>>
>>>> Regards,
>>>>        Felix
>>>>
>>>> On 2019-10-17 4:09 p.m., Grodzovsky, Andrey wrote:
>>>>> Hi Felix, I see this on boot when working with Arcturus.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>> [  103.602092] kfd kfd: Allocated 3969056 bytes on gart
>>>>> [  103.610769] ==================================================================
>>>>> [  103.611469] BUG: KASAN: stack-out-of-bounds in kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>> [  103.611646] Read of size 4 at addr ffff8883cb19ee38 by task modprobe/1122
>>>>>
>>>>> [  103.611836] CPU: 3 PID: 1122 Comm: modprobe Tainted: G O 5.3.0-rc3+ #45
>>>>> [  103.611847] Hardware name: System manufacturer System Product Name/Z170-PRO, BIOS 1902 06/27/2016
>>>>> [  103.611856] Call Trace:
>>>>> [  103.611879]  dump_stack+0x71/0xab
>>>>> [  103.611907]  print_address_description+0x1da/0x3c0
>>>>> [  103.612453]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>> [  103.612479]  __kasan_report+0x13f/0x1a0
>>>>> [  103.613022]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>> [  103.613580]  ? kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>> [  103.613604]  kasan_report+0xe/0x20
>>>>> [  103.614149]  kfd_create_vcrat_image_gpu+0x5db/0xb80 [amdgpu]
>>>>> [  103.614762]  ? kfd_fill_gpu_memory_affinity+0x110/0x110 [amdgpu]
>>>>> [  103.614796]  ? __alloc_pages_nodemask+0x2c9/0x560
>>>>> [  103.614824]  ? __alloc_pages_slowpath+0x1390/0x1390
>>>>> [  103.614898]  ? kmalloc_order+0x63/0x70
>>>>> [  103.615469]  kfd_create_crat_image_virtual+0x70c/0x770 [amdgpu]
>>>>> [  103.616054]  ? kfd_create_crat_image_acpi+0x1c0/0x1c0 [amdgpu]
>>>>> [  103.616095]  ? up_write+0x4b/0x70
>>>>> [  103.616649]  kfd_topology_add_device+0x98d/0xb10 [amdgpu]
>>>>> [  103.617207]  ? kfd_topology_shutdown+0x60/0x60 [amdgpu]
>>>>> [  103.617743]  ? start_cpsch+0x2ff/0x3a0 [amdgpu]
>>>>> [  103.617777]  ? mutex_lock_io_nested+0xac0/0xac0
>>>>> [  103.617807]  ? __mutex_unlock_slowpath+0xda/0x420
>>>>> [  103.617848]  ? __mutex_unlock_slowpath+0xda/0x420
>>>>> [  103.617877]  ? wait_for_completion+0x200/0x200
>>>>> [  103.618461]  ? start_cpsch+0x38b/0x3a0 [amdgpu]
>>>>> [  103.619011]  ? create_queue_cpsch+0x670/0x670 [amdgpu]
>>>>> [  103.619573]  ? kfd_iommu_device_init+0x92/0x1e0 [amdgpu]
>>>>> [  103.620112]  ? kfd_iommu_resume+0x2c/0x2c0 [amdgpu]
>>>>> [  103.620655]  ? kfd_iommu_check_device+0xf0/0xf0 [amdgpu]
>>>>> [  103.621228]  kgd2kfd_device_init+0x474/0x870 [amdgpu]
>>>>> [  103.621781]  amdgpu_amdkfd_device_init+0x291/0x390 [amdgpu]
>>>>> [  103.622329]  ? amdgpu_amdkfd_device_probe+0x90/0x90 [amdgpu]
>>>>> [  103.622344]  ? kmsg_dump_rewind_nolock+0x59/0x59
>>>>> [  103.622895]  ? amdgpu_ras_eeprom_test+0x71/0x90 [amdgpu]
>>>>> [  103.623424]  amdgpu_device_init+0x1bbe/0x2f00 [amdgpu]
>>>>> [  103.623819]  ? amdgpu_device_has_dc_support+0x30/0x30 [amdgpu]
>>>>> [  103.623842]  ? __isolate_free_page+0x290/0x290
>>>>> [  103.623852]  ? fs_reclaim_acquire.part.97+0x5/0x30
>>>>> [  103.623891]  ? __alloc_pages_nodemask+0x2c9/0x560
>>>>> [  103.623912]  ? __alloc_pages_slowpath+0x1390/0x1390
>>>>> [  103.623945]  ? kasan_unpoison_shadow+0x31/0x40
>>>>> [  103.623970]  ? kmalloc_order+0x63/0x70
>>>>> [  103.624337]  amdgpu_driver_load_kms+0xd9/0x430 [amdgpu]
>>>>> [  103.624690]  ? amdgpu_register_gpu_instance+0xe0/0xe0 [amdgpu]
>>>>> [  103.624756]  ? drm_dev_register+0x19c/0x310 [drm]
>>>>> [  103.624768]  ? __kasan_slab_free+0x133/0x160
>>>>> [  103.624849]  drm_dev_register+0x1f5/0x310 [drm]
>>>>> [  103.625212]  amdgpu_pci_probe+0x109/0x1f0 [amdgpu]
>>>>> [  103.625565]  ? amdgpu_pmops_runtime_idle+0xe0/0xe0 [amdgpu]
>>>>> [  103.625580]  local_pci_probe+0x74/0xd0
>>>>> [  103.625603]  pci_device_probe+0x1fa/0x310
>>>>> [  103.625620]  ? pci_device_remove+0x1c0/0x1c0
>>>>> [  103.625640]  ? sysfs_do_create_link_sd.isra.2+0x74/0xe0
>>>>> [  103.625673]  really_probe+0x367/0x5d0
>>>>> [  103.625700]  driver_probe_device+0x177/0x1b0
>>>>> [  103.625721]  device_driver_attach+0x8a/0x90
>>>>> [  103.625737]  ? device_driver_attach+0x90/0x90
>>>>> [  103.625746]  __driver_attach+0xeb/0x190
>>>>> [  103.625765]  ? device_driver_attach+0x90/0x90
>>>>> [  103.625773]  bus_for_each_dev+0xe4/0x160
>>>>> [  103.625789]  ? subsys_dev_iter_exit+0x10/0x10
>>>>> [  103.625829]  bus_add_driver+0x277/0x330
>>>>> [  103.625855]  driver_register+0xc6/0x1a0
>>>>> [  103.625866]  ? 0xffffffffa0d88000
>>>>> [  103.625880]  do_one_initcall+0xd3/0x334
>>>>> [  103.625895]  ? trace_event_raw_event_initcall_finish+0x150/0x150
>>>>> [  103.625911]  ? kasan_unpoison_shadow+0x31/0x40
>>>>> [  103.625924]  ? __kasan_kmalloc+0xd5/0xf0
>>>>> [  103.625946]  ? kmem_cache_alloc_trace+0x154/0x300
>>>>> [  103.625955]  ? kasan_unpoison_shadow+0x31/0x40
>>>>> [  103.625985]  do_init_module+0xec/0x354
>>>>> [  103.626011]  load_module+0x3c91/0x4980
>>>>> [  103.626118]  ? module_frob_arch_sections+0x20/0x20
>>>>> [  103.626132]  ? ima_read_file+0x10/0x10
>>>>> [  103.626142]  ? vfs_read+0x127/0x190
>>>>> [  103.626163]  ? kernel_read+0x95/0xb0
>>>>> [  103.626187]  ? kernel_read_file+0x1a5/0x340
>>>>> [  103.626277]  ? __do_sys_finit_module+0x175/0x1b0
>>>>> [  103.626287]  __do_sys_finit_module+0x175/0x1b0
>>>>> [  103.626301]  ? __ia32_sys_init_module+0x40/0x40
>>>>> [  103.626338]  ? lock_downgrade+0x390/0x390
>>>>> [  103.626396]  ? vtime_user_exit+0xc8/0xe0
>>>>> [  103.626423]  do_syscall_64+0x7d/0x250
>>>>> [  103.626440]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> [  103.626450] RIP: 0033:0x7f09984854d9
>>>>> [  103.626461] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 29 2c 00 f7 d8 64 89 01 48
>>>>> [  103.626468] RSP: 002b:00007ffc42896008 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
>>>>> [  103.626479] RAX: ffffffffffffffda RBX: 0000559a52495400 RCX: 00007f09984854d9
>>>>> [  103.626486] RDX: 0000000000000000 RSI: 0000559a52499900 RDI: 0000000000000006
>>>>> [  103.626493] RBP: 0000559a52499900 R08: 0000000000000000 R09: 0000000000000000
>>>>> [  103.626500] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
>>>>> [  103.626508] R13: 0000559a52499b30 R14: 0000000000040000 R15: 0000000000000013
>>>>>
>>>>> [  103.626592] The buggy address belongs to the page:
>>>>> [  103.626665] page:ffffea000f2c6780 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>> [  103.626675] flags: 0x2ffff0000000000()
>>>>> [  103.626686] raw: 02ffff0000000000 0000000000000000 ffffea000f2c6788 0000000000000000
>>>>> [  103.626696] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>>>>> [  103.626702] page dumped because: kasan: bad access detected
>>>>>
>>>>> [  103.626742] addr ffff8883cb19ee38 is located in stack of task modprobe/1122 at offset 264 in frame:
>>>>> [  103.627233]  kfd_create_vcrat_image_gpu+0x0/0xb80 [amdgpu]
>>>>>
>>>>> [  103.627346] this frame has 3 objects:
>>>>> [  103.627405]  [32, 36) 'avail_size'
>>>>> [  103.627410]  [96, 120) 'local_mem_info'
>>>>> [  103.627466]  [160, 264) 'cu_info'
>>>>>
>>>>> [  103.627602] Memory state around the buggy address:
>>>>> [  103.627675]  ffff8883cb19ed00: 00 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2
>>>>> [  103.627780]  ffff8883cb19ed80: f2 f2 00 00 00 f4 f2 f2 f2 f2 00 00 00 00 00 00
>>>>> [  103.627885] >ffff8883cb19ee00: 00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00
>>>>> [  103.627989]                                         ^
>>>>> [  103.628065]  ffff8883cb19ee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>>>> [  103.628169]  ffff8883cb19ef00: f1 f1 f1 f1 00 f4 f4 f4 f3 f3 f3 f3 00 00 00 00
>>>>> [  103.628273] ==================================================================
>>>>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
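
The KASAN report quoted in the original message can be decoded by hand. The following is a minimal Python sketch that uses only values copied from that report. It assumes generic KASAN's usual granularity of one shadow byte per 8 bytes of memory. The helper name classify is made up for illustration and is not part of any kernel tooling.

```python
# Values copied from the KASAN report quoted in this thread.
BUGGY_ADDR = 0xFFFF8883CB19EE38   # "Read of size 4 at addr ..."
ACCESS_SIZE = 4
FRAME_OFFSET = 264                # "at offset 264 in frame"
FRAME_OBJECTS = {                 # "this frame has 3 objects"
    "avail_size": (32, 36),
    "local_mem_info": (96, 120),
    "cu_info": (160, 264),
}
SHADOW_ROW_BASE = 0xFFFF8883CB19EE00  # the row marked with '>'
SHADOW_ROW = "00 00 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3 00 00".split()

KASAN_GRANULE = 8  # assumption: one shadow byte covers 8 bytes of memory


def classify(offset, size, objects):
    """Say whether an access lies inside an object or past the end of one."""
    for name, (start, end) in objects.items():
        if start <= offset and offset + size <= end:
            return name, "inside"
    # Otherwise report the closest object that ends at or before the access.
    below = [(end, name) for name, (_, end) in objects.items() if end <= offset]
    if below:
        end, name = max(below)
        return name, f"{offset - end} bytes past the end"
    return None, "before the first object"


frame_base = BUGGY_ADDR - FRAME_OFFSET
shadow_index = (BUGGY_ADDR - SHADOW_ROW_BASE) // KASAN_GRANULE
name, relation = classify(FRAME_OFFSET, ACCESS_SIZE, FRAME_OBJECTS)

print("frame base:", hex(frame_base))
print("shadow byte for the buggy address:", SHADOW_ROW[shadow_index])
print("access is", relation, "of", name)
```

Under those assumptions the 4-byte read starts exactly where 'cu_info' ends (frame offset 264), and the matching shadow byte is a nonzero redzone marker rather than 00. That suggests a read just past the end of that object, not an exhausted kernel stack.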

