[Mesa-dev] issue about context reference

Zhu Yijun lovemrd at gmail.com
Sat Oct 3 02:30:13 UTC 2020


I am currently on staging/20.1; I will try to reproduce it with mesa/master.

Marek Olšák <maraeo at gmail.com> wrote on Thu, Oct 1, 2020 at 2:57 AM:

>
> Hi,
>
> Does the issue happen with mesa/master?
>
> Marek
>
>
> On Mon, Sep 28, 2020 at 3:11 AM Zhu Yijun <lovemrd at gmail.com> wrote:
>>
>> hi all,
>>
>> I use QEMU/KVM to boot several Android guests with virgl and run APKs;
>> after a few hours the host kernel triggers the OOM killer.
>>
>> 1. From /proc/meminfo, 'SUnreclaim' is the largest consumer:
>>
>> MemTotal: 16553672 kB
>> MemFree: 128688 kB
>> MemAvailable: 34648 kB
>> Slab: 10169908 kB
>> SReclaimable: 64632 kB
>> SUnreclaim: 10105276 kB
>>
>> 2. From slabinfo, 'kmalloc-8192' uses nearly 5 GB of memory, the
>> largest part of the slab usage:
>>
>> kmalloc-8192 566782 566782 8192 4 8 : tunables 0 0 0 : slabdata 141697 141697 0
>>
>> 3. Then I appended 'slub_debug=U,kmalloc-8192' to the host kernel
>> command line to reproduce this issue. After a few minutes of testing
>> with only one Android guest, I found the call count of amdgpu_ctx_free
>> to be much lower than that of amdgpu_ctx_alloc. This is why the
>> 'kmalloc-8192' slab memory grows continuously.
>>
>> # cat /sys/kernel/slab/kmalloc-8192/alloc_calls
>>       2 __vring_new_virtqueue+0x64/0x188 [virtio_ring]
>> age=2779779/2779779/2779779 pid=1069 cpus=19 nodes=0
>>       1 rd_alloc_device+0x34/0x48 [target_core_mod] age=2776755
>> pid=1969 cpus=20 nodes=0
>>       2 mb_cache_create+0x7c/0x128 [mbcache]
>> age=2777018/2777221/2777425 pid=1186-1810 cpus=3,36 nodes=0
>>       2 ext4_fill_super+0x128/0x25b0 [ext4]
>> age=2777019/2777222/2777426 pid=1186-1810 cpus=3,36 nodes=0
>>       2 svc_rqst_alloc+0x3c/0x170 [sunrpc] age=2775427/2775462/2775497
>> pid=2346-2636 cpus=36-37 nodes=0
>>       2 cache_create_net+0x4c/0xc0 [sunrpc]
>> age=2737590/2757403/2777217 pid=1280-4987 cpus=20,44 nodes=0
>>       2 rpc_alloc_iostats+0x2c/0x60 [sunrpc]
>> age=2775494/2775495/2775497 pid=2346 cpus=36 nodes=0
>>    1570 amdgpu_ctx_init+0xb4/0x2a0 [amdgpu] age=30110/314435/1914218
>> pid=63167 cpus=1-7,9-10,16-20,23,27,29-35,40-47,52,60,63,95,118,120,122-123
>> nodes=0
>>    1570 amdgpu_ctx_ioctl+0x198/0x2f8 [amdgpu] age=30110/314435/1914218
>> pid=63167 cpus=1-7,9-10,16-20,23,27,29-35,40-47,52,60,63,95,118,120,122-123
>> nodes=0
>>       2 gfx_v8_0_init_microcode+0x290/0x740 [amdgpu]
>> age=2776838/2776924/2777011 pid=660 cpus=64 nodes=0
>>       2 construct+0xe0/0x4b8 [amdgpu] age=2776819/2776901/2776983
>> pid=660 cpus=64 nodes=0
>>       2 mod_freesync_create+0x68/0x1d0 [amdgpu]
>> age=2776819/2776901/2776983 pid=660 cpus=64 nodes=0
>>       1 kvm_set_irq_routing+0xa8/0x2c8 [kvm_arm_0] age=1909635
>> pid=63172 cpus=56 nodes=0
>>       1 fat_fill_super+0x5c/0xc20 [fat] age=2777014 pid=1817 cpus=49 nodes=0
>>      11 cgroup1_mount+0x180/0x4e0 age=2779901/2779901/2779911 pid=1
>> cpus=1 nodes=0
>>      12 kvmalloc_node+0x64/0xa8 age=35454/1370665/2776188
>> pid=2176-63167 cpus=2,23,34,42,44 nodes=0
>>     128 zswap_dstmem_prepare+0x48/0x78 age=2780252/2780252/2780252
>> pid=1 cpus=19 nodes=0
>>       1 register_leaf_sysctl_tables+0x9c/0x1d0 age=2786535 pid=0 cpus=0 nodes=0
>>       2 do_register_framebuffer+0x298/0x300
>> age=2779680/2783032/2786385 pid=1-656 cpus=0,5 nodes=0
>>       1 vc_do_resize+0xb4/0x570 age=2786385 pid=1 cpus=5 nodes=0
>>       5 vc_allocate+0x144/0x218 age=2776216/2776219/2776224 pid=2019
>> cpus=40 nodes=0
>>       8 arm_smmu_device_probe+0x2d8/0x640 age=2780865/2780894/2780924
>> pid=1 cpus=0 nodes=0
>>       4 __usb_create_hcd+0x44/0x258 age=2780467/2780534/2780599
>> pid=5-660 cpus=0,64 nodes=0
>>       2 xhci_alloc_virt_device+0x9c/0x308 age=2780463/2780476/2780489
>> pid=5-656 cpus=0 nodes=0
>>       1 hid_add_field+0x120/0x320 age=2780373 pid=1 cpus=19 nodes=0
>>       2 hid_allocate_device+0x2c/0x100 age=2780345/2780362/2780380
>> pid=1 cpus=19 nodes=0
>>       1 ipv4_sysctl_init_net+0x44/0x148 age=2737590 pid=4987 cpus=44 nodes=0
>>       1 ipv4_sysctl_init_net+0xa8/0x148 age=2737590 pid=4987 cpus=44 nodes=0
>>       1 ipv4_sysctl_init_net+0xf8/0x148 age=2780293 pid=1 cpus=19 nodes=0
>>       1 netlink_proto_init+0x60/0x19c age=2786498 pid=1 cpus=0 nodes=0
>>       1 ip_rt_init+0x3c/0x20c age=2786473 pid=1 cpus=3 nodes=0
>>       1 ip_rt_init+0x6c/0x20c age=2786472 pid=1 cpus=3 nodes=0
>>       1 udp_init+0xa0/0x108 age=2786472 pid=1 cpus=4 nodes=0
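As a rough sanity check, the live memory held by the two amdgpu context allocation sites can be computed directly from the counts in the dump above (first column = number of outstanding objects per call site); a minimal sketch:

```shell
# Counts copied from the alloc_calls dump: the two amdgpu call sites
# (amdgpu_ctx_init and amdgpu_ctx_ioctl) each hold 1570 live objects.
objects=$((1570 + 1570))
object_size=8192                        # kmalloc-8192 object size in bytes
live_kb=$((objects * object_size / 1024))
echo "amdgpu context slab usage: ${objects} objects, ${live_kb} kB"
```

This is about 25 MB after a few minutes with a single guest; with several guests running for hours it plausibly grows to the multi-gigabyte SUnreclaim seen in /proc/meminfo.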
>>
>> # cat /sys/kernel/slab/kmalloc-8192/free_calls
>>    1473 <not-available> age=4297679817 pid=0 cpus=0 nodes=0
>>      46 rpc_free+0x5c/0x80 [sunrpc] age=1760585/1918856/1935279
>> pid=33422-68056 cpus=32,34,38,40-42,48,55,57,59,61-63 nodes=0
>>       1 rpc_free_iostats+0x14/0x20 [sunrpc] age=2776482 pid=2346 cpus=36 nodes=0
>>     122 free_user_work+0x30/0x40 [ipmi_msghandler]
>> age=59465/347716/1905020 pid=781-128311 cpus=32-46,50,52,63 nodes=0
>>     740 amdgpu_ctx_fini+0x98/0xc8 [amdgpu] age=32012/286664/1910687
>> pid=63167-63222
>> cpus=1-11,16-24,27,29-35,40,42-45,47,52,60,63,95,118,120,122-123
>> nodes=0
>>     719 amdgpu_ctx_fini+0xb0/0xc8 [amdgpu] age=31957/287696/1910687
>> pid=63167-63222
>> cpus=1-7,10-11,13,16-24,27,29-35,40-47,52,57,60,63,95,118,120,122-123
>> nodes=0
>>       1 dc_release_state+0x3c/0x48 [amdgpu] age=2777920 pid=660 cpus=64 nodes=0
>>     115 kvfree+0x38/0x40 age=31170/406614/2777214 pid=2026-63167
>> cpus=0-1,6-8,11,22,24-25,27,29-31,34-37,40,42-45,49,63,95,118,123
>> nodes=0
>>       4 cryptomgr_probe+0xe4/0xf0 age=2778011/2781965/2787371
>> pid=727-1808 cpus=6,10,12,17 nodes=0
>>     112 skb_free_head+0x2c/0x38 age=31864/385450/2776417
>> pid=2649-130896 cpus=8,12,22,30,32,36,38-40,42-49,51,54,56,58-62
>> nodes=0
>>      11 do_name+0x68/0x258 age=2787385/2787385/2787385 pid=1 cpus=4 nodes=0
>>       1 unpack_to_rootfs+0x27c/0x2bc age=2787385 pid=1 cpus=4 nodes=0
>>
>> 4. To analyze this issue further, I added some debug info to QEMU,
>> virglrenderer, mesa and libdrm. I found that the contexts are
>> created and destroyed via vrend_renderer_create_sub_ctx/
>> vrend_renderer_destroy_sub_ctx in virglrenderer, and the calls to
>> these two functions look normal (the gap between the create and
>> destroy counts grows a little but stays nearly constant during the
>> test period). However, in mesa (19.3 on my system), when
>> amdgpu_ctx_destroy is called, many contexts still have a reference
>> count greater than 1, so the unref never reaches the amdgpu driver
>> to free the slab memory.
>>
>> static inline void amdgpu_ctx_unref(struct amdgpu_ctx *ctx)
>> {
>>    if (p_atomic_dec_zero(&ctx->refcount)) {
>>       amdgpu_cs_ctx_free(ctx->ctx);
>>       amdgpu_bo_free(ctx->user_fence_bo);
>>       FREE(ctx);
>>    }
>> }
>>
>> The ctx->refcount in mesa is maintained by amdgpu_fence_create and
>> amdgpu_fence_reference, which are invoked by higher-level OpenGL
>> commands. I'm not familiar with this logic, so I hope someone can give
>> some advice on this issue. Thanks!
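The lifecycle described above can be sketched as follows. This is a minimal illustration, not mesa's actual code: the struct and helper names are hypothetical stand-ins for the winsys context and fence paths. The point it shows is that each fence created on a context takes its own reference, destroying the context only drops the creation reference, and the kernel context is not freed until every fence is also released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for mesa's amdgpu winsys context. */
struct ctx {
    atomic_int refcount;
    int freed;  /* stands in for amdgpu_cs_ctx_free() + FREE(ctx) */
};

static struct ctx *ctx_create(void)
{
    struct ctx *c = calloc(1, sizeof(*c));
    atomic_init(&c->refcount, 1);  /* creation reference */
    return c;
}

/* Mirrors the amdgpu_ctx_unref() quoted above: the context is only
 * torn down when the count drops to zero. */
static void ctx_unref(struct ctx *c)
{
    if (atomic_fetch_sub(&c->refcount, 1) == 1)
        c->freed = 1;
}

/* A fence takes an extra context reference when created and must drop
 * it when the fence itself is released. */
static void fence_create(struct ctx *c)  { atomic_fetch_add(&c->refcount, 1); }
static void fence_release(struct ctx *c) { ctx_unref(c); }
```

With two outstanding fences, the destroy-time unref leaves refcount == 2 and nothing is freed; if fences are never released, the context and its 8 KB kernel allocation stay live forever, which matches the growing kmalloc-8192 slab.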
>>
>> Yijun
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

