[Mesa-dev] issue about context reference
Zhu Yijun
lovemrd at gmail.com
Sun Sep 27 09:57:13 UTC 2020
hi all,
I use qemu/kvm to boot some Android guests with virgl and run APKs;
after several hours the host kernel invokes the OOM killer.
1. From /proc/meminfo, I see that 'SUnreclaim' is the largest part:
MemTotal: 16553672 kB
MemFree: 128688 kB
MemAvailable: 34648 kB
Slab: 10169908 kB
SReclaimable: 64632 kB
SUnreclaim: 10105276 kB
2. From slabinfo, 'kmalloc-8192' uses nearly 5 GB of memory
(566782 objects * 8192 bytes is about 4.6 GB), which is the largest
part of the slab usage:
kmalloc-8192 566782 566782 8192 4 8 : tunables 0 0 0 : slabdata 141697 141697 0
3. I then appended 'slub_debug=U,kmalloc-8192' to the host kernel command
line to reproduce this issue, and after running the test for a few minutes
with only one Android guest I found that amdgpu_ctx_free is called far less
often than amdgpu_ctx_alloc (in the dumps below, the amdgpu_ctx allocation
sites account for 1570 + 1570 objects, while amdgpu_ctx_fini shows up for
only 740 + 719 frees). This is why the 'kmalloc-8192' slab memory keeps
growing.
#cat /sys/kernel/slab/kmalloc-8192/alloc_calls
2 __vring_new_virtqueue+0x64/0x188 [virtio_ring]
age=2779779/2779779/2779779 pid=1069 cpus=19 nodes=0
1 rd_alloc_device+0x34/0x48 [target_core_mod] age=2776755
pid=1969 cpus=20 nodes=0
2 mb_cache_create+0x7c/0x128 [mbcache]
age=2777018/2777221/2777425 pid=1186-1810 cpus=3,36 nodes=0
2 ext4_fill_super+0x128/0x25b0 [ext4]
age=2777019/2777222/2777426 pid=1186-1810 cpus=3,36 nodes=0
2 svc_rqst_alloc+0x3c/0x170 [sunrpc] age=2775427/2775462/2775497
pid=2346-2636 cpus=36-37 nodes=0
2 cache_create_net+0x4c/0xc0 [sunrpc]
age=2737590/2757403/2777217 pid=1280-4987 cpus=20,44 nodes=0
2 rpc_alloc_iostats+0x2c/0x60 [sunrpc]
age=2775494/2775495/2775497 pid=2346 cpus=36 nodes=0
1570 amdgpu_ctx_init+0xb4/0x2a0 [amdgpu] age=30110/314435/1914218
pid=63167 cpus=1-7,9-10,16-20,23,27,29-35,40-47,52,60,63,95,118,120,122-123
nodes=0
1570 amdgpu_ctx_ioctl+0x198/0x2f8 [amdgpu] age=30110/314435/1914218
pid=63167 cpus=1-7,9-10,16-20,23,27,29-35,40-47,52,60,63,95,118,120,122-123
nodes=0
2 gfx_v8_0_init_microcode+0x290/0x740 [amdgpu]
age=2776838/2776924/2777011 pid=660 cpus=64 nodes=0
2 construct+0xe0/0x4b8 [amdgpu] age=2776819/2776901/2776983
pid=660 cpus=64 nodes=0
2 mod_freesync_create+0x68/0x1d0 [amdgpu]
age=2776819/2776901/2776983 pid=660 cpus=64 nodes=0
1 kvm_set_irq_routing+0xa8/0x2c8 [kvm_arm_0] age=1909635
pid=63172 cpus=56 nodes=0
1 fat_fill_super+0x5c/0xc20 [fat] age=2777014 pid=1817 cpus=49 nodes=0
11 cgroup1_mount+0x180/0x4e0 age=2779901/2779901/2779911 pid=1
cpus=1 nodes=0
12 kvmalloc_node+0x64/0xa8 age=35454/1370665/2776188
pid=2176-63167 cpus=2,23,34,42,44 nodes=0
128 zswap_dstmem_prepare+0x48/0x78 age=2780252/2780252/2780252
pid=1 cpus=19 nodes=0
1 register_leaf_sysctl_tables+0x9c/0x1d0 age=2786535 pid=0 cpus=0 nodes=0
2 do_register_framebuffer+0x298/0x300
age=2779680/2783032/2786385 pid=1-656 cpus=0,5 nodes=0
1 vc_do_resize+0xb4/0x570 age=2786385 pid=1 cpus=5 nodes=0
5 vc_allocate+0x144/0x218 age=2776216/2776219/2776224 pid=2019
cpus=40 nodes=0
8 arm_smmu_device_probe+0x2d8/0x640 age=2780865/2780894/2780924
pid=1 cpus=0 nodes=0
4 __usb_create_hcd+0x44/0x258 age=2780467/2780534/2780599
pid=5-660 cpus=0,64 nodes=0
2 xhci_alloc_virt_device+0x9c/0x308 age=2780463/2780476/2780489
pid=5-656 cpus=0 nodes=0
1 hid_add_field+0x120/0x320 age=2780373 pid=1 cpus=19 nodes=0
2 hid_allocate_device+0x2c/0x100 age=2780345/2780362/2780380
pid=1 cpus=19 nodes=0
1 ipv4_sysctl_init_net+0x44/0x148 age=2737590 pid=4987 cpus=44 nodes=0
1 ipv4_sysctl_init_net+0xa8/0x148 age=2737590 pid=4987 cpus=44 nodes=0
1 ipv4_sysctl_init_net+0xf8/0x148 age=2780293 pid=1 cpus=19 nodes=0
1 netlink_proto_init+0x60/0x19c age=2786498 pid=1 cpus=0 nodes=0
1 ip_rt_init+0x3c/0x20c age=2786473 pid=1 cpus=3 nodes=0
1 ip_rt_init+0x6c/0x20c age=2786472 pid=1 cpus=3 nodes=0
1 udp_init+0xa0/0x108 age=2786472 pid=1 cpus=4 nodes=0
#cat /sys/kernel/slab/kmalloc-8192/free_calls
1473 <not-available> age=4297679817 pid=0 cpus=0 nodes=0
46 rpc_free+0x5c/0x80 [sunrpc] age=1760585/1918856/1935279
pid=33422-68056 cpus=32,34,38,40-42,48,55,57,59,61-63 nodes=0
1 rpc_free_iostats+0x14/0x20 [sunrpc] age=2776482 pid=2346 cpus=36 nodes=0
122 free_user_work+0x30/0x40 [ipmi_msghandler]
age=59465/347716/1905020 pid=781-128311 cpus=32-46,50,52,63 nodes=0
740 amdgpu_ctx_fini+0x98/0xc8 [amdgpu] age=32012/286664/1910687
pid=63167-63222
cpus=1-11,16-24,27,29-35,40,42-45,47,52,60,63,95,118,120,122-123
nodes=0
719 amdgpu_ctx_fini+0xb0/0xc8 [amdgpu] age=31957/287696/1910687
pid=63167-63222
cpus=1-7,10-11,13,16-24,27,29-35,40-47,52,57,60,63,95,118,120,122-123
nodes=0
1 dc_release_state+0x3c/0x48 [amdgpu] age=2777920 pid=660 cpus=64 nodes=0
115 kvfree+0x38/0x40 age=31170/406614/2777214 pid=2026-63167
cpus=0-1,6-8,11,22,24-25,27,29-31,34-37,40,42-45,49,63,95,118,123
nodes=0
4 cryptomgr_probe+0xe4/0xf0 age=2778011/2781965/2787371
pid=727-1808 cpus=6,10,12,17 nodes=0
112 skb_free_head+0x2c/0x38 age=31864/385450/2776417
pid=2649-130896 cpus=8,12,22,30,32,36,38-40,42-49,51,54,56,58-62
nodes=0
11 do_name+0x68/0x258 age=2787385/2787385/2787385 pid=1 cpus=4 nodes=0
1 unpack_to_rootfs+0x27c/0x2bc age=2787385 pid=1 cpus=4 nodes=0
4. To analyze this issue further, I added some debug info in qemu,
virglrenderer, mesa and libdrm. I found that the contexts correspond to
vrend_renderer_create_sub_ctx/vrend_renderer_destroy_sub_ctx in
virglrenderer, and the calls to these two functions look normal (the
difference between the create and destroy counts grows a little but stays
nearly constant during the test period). However, in mesa (19.3 on my
system), when amdgpu_ctx_destroy is called, many contexts' reference
counts are not 1, so it never goes down into the amdgpu driver to free
the slab memory:
static inline void amdgpu_ctx_unref(struct amdgpu_ctx *ctx)
{
   if (p_atomic_dec_zero(&ctx->refcount)) {
      amdgpu_cs_ctx_free(ctx->ctx);
      amdgpu_bo_free(ctx->user_fence_bo);
      FREE(ctx);
   }
}
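As far as I can tell (I may be misreading the code), a context reference is
taken when a fence is created on the context and given back when the last
reference to that fence is dropped. Below is a minimal standalone model of
that relationship; the names are purely illustrative and it is not the real
mesa code:

/*
 * Toy model of the refcounting as I understand it: a fence pins its
 * context with one reference, and releasing the last fence reference
 * returns it. If a fence reference leaks, the context refcount never
 * reaches zero and the context (plus the kernel-side memory behind it)
 * is never freed.
 */
#include <stdio.h>
#include <stdlib.h>

struct ctx {
   int refcount;              /* plays the role of amdgpu_ctx::refcount */
};

struct fence {
   int refcount;
   struct ctx *ctx;           /* each fence holds one context reference */
};

static void ctx_unref(struct ctx *c)
{
   if (--c->refcount == 0) {  /* same shape as amdgpu_ctx_unref() above */
      printf("context freed\n");
      free(c);
   }
}

static struct fence *fence_create(struct ctx *c)
{
   struct fence *f = calloc(1, sizeof(*f));
   f->refcount = 1;
   f->ctx = c;
   c->refcount++;             /* fence takes a context reference */
   return f;
}

static void fence_unref(struct fence *f)
{
   if (--f->refcount == 0) {
      ctx_unref(f->ctx);      /* last fence ref returns the context ref */
      free(f);
   }
}

int main(void)
{
   struct ctx *c = calloc(1, sizeof(*c));
   c->refcount = 1;           /* the context's own reference */

   struct fence *f = fence_create(c);

   ctx_unref(c);              /* "context destroy": 2 -> 1, nothing is freed */
   fence_unref(f);            /* without this the context leaks forever */
   return 0;
}

So if some fence reference is never released along the guest/virgl path,
ctx->refcount never drops to zero in amdgpu_ctx_unref, which would match
what I see.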
In mesa, ctx->refcount is maintained by amdgpu_fence_create and
amdgpu_fence_reference, which are invoked by the higher-level OpenGL
commands. I'm not familiar with this logic, so I hope someone can give
some advice about this issue. Thanks!
Yijun