Reporting a use-after-free in amdgpu and null-ptr-deref

Thu Feb 22 02:42:56 UTC 2024

Hi Vitaly,

Thank you for looking into this issue!

We reproduced this issue on a Radeon RX 580 (Polaris 20) passed through
to a QEMU (4.0.0) VM via VFIO.
All bugs are reproducible on the recent 6.8-rc4 Linux kernel (
https://github.com/torvalds/linux/tree/v6.8-rc4), which I have just
re-verified with the previously sent reproducer programs.

Below are the QEMU arguments we used, followed by the in-VM output of
lspci -vvv and /proc/cpuinfo.
Should you need any more information, please let us know.

*QEMU arguments*
qemu-system-x86_64 \
-m 2G \
-cpu host \
-kernel $KERNEL \
-append "console=ttyS0 root=/dev/sda earlyprintk=serial net.ifnames=0" \
-drive file=$DRIVE_FILE,format=qcow2 \
-enable-kvm \
-device vfio-pci,host=$PCI_ADDR,id=gpu,multifunction=on,x-vga=on \
-nographic

*root at qemu:~# lspci -vvv*
00:03.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 47)
        Subsystem: Gigabyte Technology Co., Ltd Ellesmere [Radeon RX 470/480/570/570X/580/580X/59]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <-
        Latency: 0
        Interrupt: pin A routed to IRQ 24
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at f0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at c000 [size=256]
        Region 5: Memory at feb80000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s (downgraded), Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, Max1
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ OBFF Disabled,
                         AtomicOpsCtl: ReqEn+
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPha-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee01004  Data: 0021
        Kernel driver in use: amdgpu

*root at snapuzz:~# cat /proc/cpuinfo*
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 183
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
stepping        : 1
microcode       : 0x1
cpu MHz         : 2995.200
cache size      : 16384 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 31
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflushs
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexprioritg
bugs            : spectre_v1 spectre_v2 spec_store_bypass mds swapgs itlb_multihit mmio_unknown eb
bogomips        : 5990.40
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

On Thu, Feb 22, 2024 at 10:57 AM vitaly prosyak <vprosyak at amd.com> wrote:

> Hi Joonkyo,
>
> Thanks for your report!
>
> I reproduced the first issue with 'amdgpu_gem_userptr_ioctl' when KASAN is
> enabled, but I could not reproduce the other two issues.
>
> Could you indicate which ASIC you used to reproduce the issue?
>
> Could you provide details of your system?
>
> Your response is much appreciated.
>
> Vitaly
>
>
> I placed your findings below to keep the context and track the issues.
>
>
> ===========================================================================================================================
>
> Reporting a slab-use-after-free in amdgpu.eml
>
> Subject:
> Reporting a slab-use-after-free in amdgpu
>
> From:
> Joonkyo Jung <joonkyoj at yonsei.ac.kr>
>
> Date:
> 2024-02-16, 04:22
>
> To:
> alexander.deucher at amd.com, christian.koenig at amd.com, Xinhui.Pan at amd.com
>
> CC:
> amd-gfx at lists.freedesktop.org, Dokyung Song <dokyungs at yonsei.ac.kr>, jisoo.jang at yonsei.ac.kr, yw9865 at yonsei.ac.kr
>
> Hello,
>
> We would like to report a slab-use-after-free bug in the AMDGPU DRM driver
> in the Linux kernel v6.8-rc4 that we found with our customized Syzkaller.
> The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in succession.
>
> In amdgpu_bo_move, struct ttm_resource *old_mem = bo->resource is assigned.
> As the alloc and free stack traces show, within that same call to amdgpu_bo_move,
> amdgpu_move_blit ends up freeing bo->resource in ttm_bo_move_accel_cleanup via ttm_bo_wait_free_node(bo, man->use_tt).
> amdgpu_bo_move nevertheless continues after that and reaches
> trace_amdgpu_bo_move(abo, new_mem->mem_type, old_mem->mem_type) at the end,
> which dereferences the freed old_mem and causes the use-after-free bug.
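To make the ordering problem above concrete, here is a minimal user-space sketch of the same pattern. It is only a model: struct resource, move_and_free_old and move_then_trace are hypothetical names standing in for ttm_resource, the blit/cleanup path and amdgpu_bo_move, and it is not the driver's code or its fix. It shows one common way to avoid this class of bug, namely copying the needed field before the call that may free the object.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ttm_resource; only mem_type is modelled. */
struct resource {
    int mem_type;
};

/* Hypothetical stand-in for the move step that ends up freeing the old resource. */
static void move_and_free_old(struct resource **slot, int new_type)
{
    struct resource *fresh = malloc(sizeof(*fresh));

    assert(fresh != NULL);
    fresh->mem_type = new_type;
    free(*slot);            /* the old resource must not be touched after this */
    *slot = fresh;
}

/* Returns the old memory type, captured before the free, so that the
 * "trace" step at the end never reads freed memory. */
static int move_then_trace(struct resource **slot, int new_type)
{
    int old_type = (*slot)->mem_type;   /* copy the value, not the pointer */

    move_and_free_old(slot, new_type);
    return old_type;                    /* plain int, still valid here */
}
```

Keeping a struct resource * across move_and_free_old and reading its mem_type afterwards would reproduce the shape of the access that KASAN flags in the report.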
>
> Steps to reproduce are as below.
> union drm_amdgpu_gem_create *arg1;
>
> arg1 = malloc(sizeof(union drm_amdgpu_gem_create));
> arg1->in.bo_size = 0x8;
> arg1->in.alignment = 0x0;
> arg1->in.domains = 0x4;
> arg1->in.domain_flags = 0x9;
> ioctl(fd, 0xc0206440, arg1);
>
> arg1->in.bo_size = 0x7fffffff;
> arg1->in.alignment = 0x0;
> arg1->in.domains = 0x4;
> arg1->in.domain_flags = 0x9;
> ioctl(fd, 0xc0206440, arg1);
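For readers who want to check the raw request number used above, the following sketch splits a Linux ioctl number into its fields, assuming the asm-generic layout used on x86 (8-bit nr, 8-bit type, 14-bit size, 2-bit direction). The helper names are hypothetical; the kernel's own _IOC_NR, _IOC_TYPE, _IOC_SIZE and _IOC_DIR macros in <linux/ioctl.h> do the same job.

```c
#include <assert.h>
#include <stdint.h>

/* Field extractors for the asm-generic ioctl layout (assumed, as on x86). */
static uint32_t ioc_nr(uint32_t cmd)   { return cmd & 0xffu; }
static uint32_t ioc_type(uint32_t cmd) { return (cmd >> 8) & 0xffu; }
static uint32_t ioc_size(uint32_t cmd) { return (cmd >> 16) & 0x3fffu; }
static uint32_t ioc_dir(uint32_t cmd)  { return (cmd >> 30) & 0x3u; }

/* Inverse of the extractors: build a request number from its fields. */
static uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    assert(dir < 4 && type < 256 && nr < 256 && size < (1u << 14));
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}
```

Applied to 0xc0206440, the fields come out as direction 3 (read and write), type 'd' (0x64), nr 0x40 and size 0x20, which is consistent with a read/write DRM ioctl carrying a 32-byte argument.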
>
> The KASAN report is as follows:
> ==================================================================
> BUG: KASAN: slab-use-after-free in amdgpu_bo_move+0x1479/0x1550
> Read of size 4 at addr ffff88800f5bee80 by task syz-executor/219
> Call Trace:
>  <TASK>
>  amdgpu_bo_move+0x1479/0x1550
>  ttm_bo_handle_move_mem+0x4d0/0x700
>  ttm_mem_evict_first+0x945/0x1230
>  ttm_bo_mem_space+0x6c7/0x940
>  ttm_bo_validate+0x286/0x650
>  ttm_bo_init_reserved+0x34c/0x490
>  amdgpu_bo_create+0x94b/0x1610
>  amdgpu_bo_create_user+0xa3/0x130
>  amdgpu_gem_create_ioctl+0x4bc/0xc10
>  drm_ioctl_kernel+0x300/0x410
>  drm_ioctl+0x648/0xb30
>  amdgpu_drm_ioctl+0xc8/0x160
>  </TASK>
>
> Allocated by task 219:
>  kmalloc_trace+0x211/0x390
>  amdgpu_vram_mgr_new+0x1d6/0xbe0
>  ttm_resource_alloc+0xfd/0x1e0
>  ttm_bo_mem_space+0x255/0x940
>  ttm_bo_validate+0x286/0x650
>  ttm_bo_init_reserved+0x34c/0x490
>  amdgpu_bo_create+0x94b/0x1610
>  amdgpu_bo_create_user+0xa3/0x130
>  amdgpu_gem_create_ioctl+0x4bc/0xc10
>  drm_ioctl_kernel+0x300/0x410
>  drm_ioctl+0x648/0xb30
>  amdgpu_drm_ioctl+0xc8/0x160
>
> Freed by task 219:
>  kfree+0x111/0x2d0
>  ttm_resource_free+0x17e/0x1e0
>  ttm_bo_move_accel_cleanup+0x77e/0x9b0
>  amdgpu_move_blit+0x3db/0x670
>  amdgpu_bo_move+0xfa2/0x1550
>  ttm_bo_handle_move_mem+0x4d0/0x700
>  ttm_mem_evict_first+0x945/0x1230
>  ttm_bo_mem_space+0x6c7/0x940
>  ttm_bo_validate+0x286/0x650
>  ttm_bo_init_reserved+0x34c/0x490
>  amdgpu_bo_create+0x94b/0x1610
>  amdgpu_bo_create_user+0xa3/0x130
>  amdgpu_gem_create_ioctl+0x4bc/0xc10
>  drm_ioctl_kernel+0x300/0x410
>  drm_ioctl+0x648/0xb30
>  amdgpu_drm_ioctl+0xc8/0x160
>
> The buggy address belongs to the object at ffff88800f5bee70
>  which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 16 bytes inside of
>  freed 96-byte region [ffff88800f5bee70, ffff88800f5beed0)
>
> Should you need any more information, please do not hesitate to contact us.
>
> Best regards,
> Joonkyo Jung
>
> Reporting a null-ptr-deref in amdgpu.eml
>
> Subject:
> Reporting a null-ptr-deref in amdgpu
>
> From:
> Joonkyo Jung <joonkyoj at yonsei.ac.kr>
>
> Date:
> 2024-02-16, 04:20
>
> To:
> alexander.deucher at amd.com, christian.koenig at amd.com, Xinhui.Pan at amd.com
>
> CC:
> Dokyung Song <dokyungs at yonsei.ac.kr>, jisoo.jang at yonsei.ac.kr, yw9865 at yonsei.ac.kr, amd-gfx at lists.freedesktop.org
>
> Hello,
>
> We would like to report a null-ptr-deref bug in the AMDGPU DRM driver in
> the Linux kernel v6.8-rc4 that we found with our customized Syzkaller.
> The bug can be triggered by sending two ioctls to the AMDGPU DRM driver in succession.
>
> The first ioctl, amdgpu_ctx_ioctl, creates a ctx and returns ctx_id = 1 to userspace.
> The second ioctl, which can be any ioctl that eventually calls
> amdgpu_ctx_get_entity, carries this ctx_id and passes the context check.
> Here, for example, we use drm_amdgpu_wait_cs.
> The validations in amdgpu_ctx_get_entity can also be passed with user-provided values from the ioctl arguments.
> This eventually leads to drm_sched_entity_init, where the null-ptr-deref
> triggers on RCU_INIT_POINTER(entity->last_scheduled, NULL);
>
> Steps to reproduce are as below.
> union drm_amdgpu_ctx *arg1;
> union drm_amdgpu_wait_cs *arg2;
>
> arg1 = malloc(sizeof(union drm_amdgpu_ctx));
> arg2 = malloc(sizeof(union drm_amdgpu_wait_cs));
>
> arg1->in.op = 0x1;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0x140106442, arg1);
>
> arg2->in.handle = 0x0;
> arg2->in.timeout = 0x2000000000000;
> arg2->in.ip_type = 0x9;
> arg2->in.ip_instance = 0x0;
> arg2->in.ring = 0x0;
> arg2->in.ctx_id = 0x1;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc0206449, arg2);
>
> The KASAN report is as follows:
> general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
> Call Trace:
>  <TASK>
>  ? drm_sched_entity_init+0x16e/0x650
>  ? drm_sched_entity_init+0x208/0x650
>  amdgpu_ctx_get_entity+0x944/0xc30
>  amdgpu_cs_wait_ioctl+0x13d/0x3f0
>  drm_ioctl_kernel+0x300/0x410
>  drm_ioctl+0x648/0xb30
>  amdgpu_drm_ioctl+0xc8/0x160
>  </TASK>
>
> Should you need any more information, please do not hesitate to contact us.
>
> Best regards,
> Joonkyo Jung
>
> Reporting a use-after-free in amdgpu.eml
>
> Subject:
> Reporting a use-after-free in amdgpu
>
> From:
> 정준교 <joonkyoj at yonsei.ac.kr>
>
> Date:
> 2024-02-14, 21:34
>
> To:
> alexander.deucher at amd.com, christian.koenig at amd.com, Xinhui.Pan at amd.com
>
> CC:
> amd-gfx at lists.freedesktop.org, Dokyung Song <dokyungs at yonsei.ac.kr>, jisoo.jang at yonsei.ac.kr, yw9865 at yonsei.ac.kr
>
> Hello,
>
> We would like to report a use-after-free bug in the AMDGPU DRM driver in
> the Linux kernel 6.2 that we found with our customized Syzkaller.
> The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver, with invalid addr and size.
> We have tested that this bug can still be triggered in the latest RC kernel, v6.8-rc4.
>
> Steps to reproduce are as below.
>
> struct drm_amdgpu_gem_userptr *arg;
> arg = malloc(sizeof(struct drm_amdgpu_gem_userptr));
> arg->addr = 0xffffffffffff0000;
> arg->size = 0x80000000;
> arg->flags = 0x7;
> ioctl(AMDGPU_renderD128_DEVICE_FILE, 0xc1186451, arg);
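One property of the addr and size values above that is easy to verify is that addr + size does not fit in 64 bits. The sketch below checks exactly that. The helper names are hypothetical, and this is not meant to suggest that the driver performs (or lacks) this particular check, nor that it is the root cause of the crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the unsigned sum addr + size overflows 64 bits. */
static bool range_wraps(uint64_t addr, uint64_t size)
{
    return addr + size < addr;
}

/* End of the range; only meaningful when the sum does not overflow. */
static uint64_t range_end(uint64_t addr, uint64_t size)
{
    assert(!range_wraps(addr, size));
    return addr + size;
}
```

With addr = 0xffffffffffff0000 and size = 0x80000000, range_wraps returns true: the sum wraps around to 0x7fff0000.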
>
> The KASAN report is as follows:
> ==================================================================
> BUG: KASAN: use-after-free in switch_mm_irqs_off+0x89d/0xb70
> Read of size 8 at addr ffff88801f17bc00 by task syz-executor/386
> Call Trace:
> <TASK>
> switch_mm_irqs_off+0x89d/0xb70
> __schedule+0xa62/0x2630
> preempt_schedule_common+0x45/0xd0
> vfree+0x4d/0x60
> ttm_tt_fini+0xdf/0x1c0
> amdgpu_ttm_backend_destroy+0x9f/0xe0
> ttm_bo_cleanup_memtype_use+0x142/0x1f0
> ttm_bo_release+0x67d/0xc00
> ttm_bo_put+0x7c/0xa0
> amdgpu_bo_unref+0x3b/0x80
> amdgpu_gem_object_free+0x7f/0xc0
> drm_gem_object_free+0x5d/0x90
> amdgpu_gem_userptr_ioctl+0x452/0x7e0
> drm_ioctl_kernel+0x284/0x500
> drm_ioctl+0x55e/0xa50
> amdgpu_drm_ioctl+0xe3/0x1d0
> </TASK>
>
> Allocated by task 385:
> kmem_cache_alloc+0x174/0x300
> copy_process+0x32d1/0x6640
> kernel_clone+0xcd/0x690
>
> Freed by task 386:
> kmem_cache_free+0x13b/0x550
> mmu_interval_notifier_remove+0x4c8/0x610
> amdgpu_hmm_unregister+0x47/0x90
> amdgpu_gem_object_free+0x75/0xc0
> drm_gem_object_free+0x5d/0x90
> amdgpu_gem_userptr_ioctl+0x452/0x7e0
> drm_ioctl_kernel+0x284/0x500
> drm_ioctl+0x55e/0xa50
> amdgpu_drm_ioctl+0xe3/0x1d0
>
> The buggy address belongs to the object at ffff88801f17bb80
> which belongs to the cache mm_struct of size 2016
> The buggy address is located 128 bytes inside of
> 2016-byte region [ffff88801f17bb80, ffff88801f17c360)
>
> The buggy address belongs to the physical page:
> page:000000002c2a61bd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1f178
> head:000000002c2a61bd order:3 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> memcg:ffff8880141aa301
> flags: 0x100000000010200(slab|head|node=0|zone=1)
> raw: 0100000000010200 ffff88800a44fc80 ffffea00006ca400 dead000000000004
> raw: 0000000000000000 00000000800f000f 00000001ffffffff ffff8880141aa301
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff88801f17bb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ffff88801f17bb80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88801f17bc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff88801f17bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88801f17bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
>
>