[PATCH v3 00/30] Introduce GPU SVM and Xe SVM implementation

Gwan-gyeong Mun gwan-gyeong.mun at intel.com
Fri Jan 17 09:47:41 UTC 2025


Hi,
The kernel oops I reported earlier was caused by my own mistake in
applying the review comments to the patch
"[v3,19/30] drm/xe: Add SVM device memory mirroring".
(The oops occurred because the xe_drm_pagemap_map_dma() and
xe_devm_add() functions were built as empty functions.)

The issue disappeared once the patch modifications were applied correctly,
so please disregard the previously reported kernel oops.
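
For anyone reading the quoted log: the faulting address in the oops line can
be reproduced from the register dump. The bytes at the marked position in the
Code: line (<65> 48 ff 00) decode to a 64-bit increment of the memory at
GS:[RAX], so the address accessed is the GS base plus RAX, truncated to 64
bits. The sketch below only redoes that arithmetic with the values from the
log. It assumes 48-bit virtual addresses (4-level paging), and the helper
names are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    va_bits=48 is an assumption (4-level paging on x86-64).
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def gs_relative(gs_base: int, offset: int) -> int:
    """Effective address of a GS-relative access, wrapped to 64 bits."""
    return (gs_base + offset) & MASK64


# Values copied from the register dump in the quoted oops.
GS_BASE = 0xFFFF88842FC80000  # GS: field
RAX = 0x4000000000000000      # RAX field

fault = gs_relative(GS_BASE, RAX)
print(hex(fault))              # 0x3fff88842fc80000
print(is_canonical(fault))     # False
print(is_canonical(GS_BASE))   # True
```

The result matches the address printed in the "general protection fault"
line, which suggests that the value in RAX was not a valid per-CPU offset
when the instruction ran.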

Br,

G.G.

On 1/7/25 2:19 PM, Gwan-gyeong Mun wrote:
> Hi Matthew Brost,
> 
> After applying this patch series, plus the changes listed below, on top
> of the latest drm-tip, I hit a kernel oops [3] while running the IGT
> test [1]. The oops prevents the IGT test from making progress.
> Could you please check the oops log below?
> 
> (1) apply the review comments on "[v3,05/30] drm/gpusvm: Add support for
> GPU Shared Virtual Memory"
> (2) apply the review comments on "[v3,15/30] drm/xe: Add unbind to SVM
> garbage collector"
> (3) drop the "[v3,27/30] drm/xe: Basic SVM BO eviction" patch
> 
> The kernel config used, the entire dmesg, and detailed information can 
> be found in [2].
> 
> br,
> 
> G.G.
> 
> [1] used igt command: xe_exec_system_allocator --run-subtest once-malloc
> [2] https://gitlab.freedesktop.org/elongbug/drm-tip/-/snippets/7823
> 
> [3] kernel oops dmesg
> [   51.365230] Console: switching to colour VGA+ 80x25
> [   51.367772] [IGT] xe_exec_system_allocator: executing
> [   51.383611] [IGT] xe_exec_system_allocator: starting subtest once-malloc
> [   51.386066] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]] 
> op=0, addr=0x0000000000000000, range=0x0001000000000000, 
> bo_offset_or_userptr=0x0000000000000000
> [   51.386171] xe 0000:00:04.0: [drm:vm_bind_ioctl_ops_create [xe]] MAP: 
> addr=0x0000000000000000, range=0x0001000000000000
> [   51.389429] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] PAGE 
> FAULT: asid=1, gpusvm=0xffff8881775e9188, vram=0,0, 
> seqno=9223372036854775807, start=0x005584e8400000, end=0x005584e8410000, 
> size=65536
> [   51.389529] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] 
> ALLOCATE VRAM: asid=1, gpusvm=0xffff8881775e9188, vram=0,0, 
> seqno=9223372036854775807, start=0x005584e8400000, end=0x005584e8410000, 
> size=65536
> [   51.389935] xe 0000:00:04.0: [drm:xe_svm_handle_pagefault [xe]] ALLOC 
> VRAM: asid=1, gpusvm=0xffff8881775e9188, pfn=3126960, npages=16
> [   51.390048] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] INVALIDATE: 
> asid=1, gpusvm=0xffff8881775e9188, seqno=3, start=0x00005584e8400000, 
> end=0x00005584e8410000, event=6
> [   51.390440] xe 0000:00:04.0: [drm:xe_svm_invalidate [xe]] NOTIFIER: 
> asid=1, gpusvm=0xffff8881775e9188, vram=0,0, seqno=9223372036854775807, 
> start=0x005584e8400000, end=0x005584e8410000, size=65536
> [   51.390948] Oops: general protection fault, probably for non-canonical address 0x3fff88842fc80000: 0000 [#1] PREEMPT SMP NOPTI
> [   51.391624] CPU: 1 UID: 0 PID: 76 Comm: kworker/u17:0 Not tainted 
> 6.13.0-rc4-drm-tip-test+ #48
> [   51.392088] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [   51.392527] Workqueue: xe_gt_page_fault_work_queue pf_queue_work_func 
> [xe]
> [   51.392947] RIP: 0010:zone_device_page_init+0x5d/0x240
> [   51.393228] Code: 04 dd ff e8 d5 d2 a1 00 5a 85 c0 0f 85 ba 00 00 00 
> e8 d7 bb df ff 85 c0 0f 84 9d 01 00 00 48 8b 45 38 a8 03 0f 85 ec 00 00 
> 00 <65> 48 ff 00 e8 aa d2 a1 00 85 c0 0f 85 0d 01 00 00 48 c7 c7 20 cb
> [   51.394247] RSP: 0018:ffffc9000039fb48 EFLAGS: 00010246
> [   51.394570] RAX: 4000000000000000 RBX: ffffea000bedac00 RCX: 
> 0000000000000000
> [   51.394950] RDX: 0000000000000046 RSI: ffffffff824c67b4 RDI: 
> ffffffff824e58f5
> [   51.395328] RBP: ffffea000bedac08 R08: 0000000000000015 R09: 
> 0000000000000004
> [   51.395709] R10: 0000000000000001 R11: 0000000000000004 R12: 
> 0000000000000001
> [   51.396093] R13: ffff888170fd8d40 R14: ffff88817f922640 R15: 
> ffffea000bedac00
> [   51.396472] FS:  0000000000000000(0000) GS:ffff88842fc80000(0000) 
> knlGS:0000000000000000
> [   51.396925] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.397237] CR2: 0000563d1f7ecbe4 CR3: 000000017c212000 CR4: 
> 0000000000750ef0
> [   51.397618] PKRU: 55555554
> [   51.397768] Call Trace:
> [   51.397904]  <TASK>
> [   51.398024]  ? __die_body.cold+0x19/0x26
> [   51.398238]  ? die_addr+0x38/0x60
> [   51.398420]  ? exc_general_protection+0x19e/0x450
> [   51.398678]  ? asm_exc_general_protection+0x22/0x30
> [   51.398942]  ? zone_device_page_init+0x5d/0x240
> [   51.399188]  ? zone_device_page_init+0x49/0x240
> [   51.399433]  drm_gpusvm_migrate_to_devmem+0x379/0x9e0 [drm_gpusvm]
> [   51.399768]  xe_svm_handle_pagefault+0x62c/0xa60 [xe]
> [   51.400110]  ? xe_vm_find_overlapping_vma+0xa4/0x1d0 [xe]
> [   51.400475]  pf_queue_work_func+0x1ba/0x450 [xe]
> [   51.400777]  process_one_work+0x1fe/0x580
> [   51.400996]  worker_thread+0x1d1/0x3b0
> [   51.401201]  ? __pfx_worker_thread+0x10/0x10
> [   51.401433]  kthread+0xeb/0x120
> [   51.401609]  ? __pfx_kthread+0x10/0x10
> [   51.401813]  ret_from_fork+0x2d/0x50
> [   51.402008]  ? __pfx_kthread+0x10/0x10
> [   51.402211]  ret_from_fork_asm+0x1a/0x30
> [   51.402427]  </TASK>
> [   51.402551] Modules linked in: xe drm_ttm_helper gpu_sched 
> drm_suballoc_helper drm_gpuvm drm_exec drm_gpusvm i2c_algo_bit drm_buddy 
> video wmi ttm drm_display_helper drm_kms_helper crct10dif_pclmul 
> crc32_pclmul e1000 ghash_clmulni_intel i2c_piix4 i2c_smbus fuse
> [   51.403779] ---[ end trace 0000000000000000 ]---
> [   51.404106] RIP: 0010:zone_device_page_init+0x5d/0x240
> [   51.404393] Code: 04 dd ff e8 d5 d2 a1 00 5a 85 c0 0f 85 ba 00 00 00 
> e8 d7 bb df ff 85 c0 0f 84 9d 01 00 00 48 8b 45 38 a8 03 0f 85 ec 00 00 
> 00 <65> 48 ff 00 e8 aa d2 a1 00 85 c0 0f 85 0d 01 00 00 48 c7 c7 20 cb
> [   51.405408] RSP: 0018:ffffc9000039fb48 EFLAGS: 00010246
> [   51.405725] RAX: 4000000000000000 RBX: ffffea000bedac00 RCX: 
> 0000000000000000
> [   51.406110] RDX: 0000000000000046 RSI: ffffffff824c67b4 RDI: 
> ffffffff824e58f5
> [   51.406518] RBP: ffffea000bedac08 R08: 0000000000000015 R09: 
> 0000000000000004
> [   51.406905] R10: 0000000000000001 R11: 0000000000000004 R12: 
> 0000000000000001
> [   51.407312] R13: ffff888170fd8d40 R14: ffff88817f922640 R15: 
> ffffea000bedac00
> [   51.407691] FS:  0000000000000000(0000) GS:ffff88842fc80000(0000) 
> knlGS:0000000000000000
> [   51.408135] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.408484] CR2: 0000563d1f7ecbe4 CR3: 000000017c212000 CR4: 
> 0000000000750ef0
> [   51.408877] PKRU: 55555554
> [   51.409047] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49
> [   51.409528] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 
> 76, name: kworker/u17:0
> [   51.409976] preempt_count: 0, expected: 0
> [   51.410212] RCU nest depth: 1, expected: 0
> [   51.410435] INFO: lockdep is turned off.
> [   51.410648] CPU: 1 UID: 0 PID: 76 Comm: kworker/u17:0 Tainted: G 
> D            6.13.0-rc4-drm-tip-test+ #48
> [   51.411180] Tainted: [D]=DIE
> [   51.411338] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [   51.411859] Workqueue: xe_gt_page_fault_work_queue pf_queue_work_func 
> [xe]
> [   51.412269] Call Trace:
> [   51.412404]  <TASK>
> [   51.412525]  dump_stack_lvl+0x69/0xa0
> [   51.412724]  __might_resched.cold+0xe5/0x120
> [   51.412956]  exit_signals+0x1a/0x360
> [   51.413150]  do_exit+0x122/0xbd0
> [   51.413328]  ? __pfx_worker_thread+0x10/0x10
> [   51.413562]  make_task_dead+0x88/0x90
> [   51.413783]  rewind_stack_and_make_dead+0x16/0x20
> [   51.414045] RIP: 0000:0x0
> [   51.414191] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
> [   51.414595] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 
> 0000000000000000
> [   51.414993] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
> 0000000000000000
> [   51.415369] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
> 0000000000000000
> [   51.415746] RBP: 0000000000000000 R08: 0000000000000000 R09: 
> 0000000000000000
> [   51.416123] R10: 0000000000000000 R11: 0000000000000000 R12: 
> 0000000000000000
> [   51.416501] R13: 0000000000000000 R14: 0000000000000000 R15: 
> 0000000000000000
> [   51.416899]  </TASK>
> 
> 
> On 12/18/24 1:33 AM, Matthew Brost wrote:
>> Version 3 of GPU SVM has been promoted from an RFC to a proper series.
>> Thanks to everyone (especially Sima and Thomas) for their numerous
>> reviews of revisions 1 and 2 and for helping to address many design issues.
>>
>> This version has been tested with IGT [1] on PVC, BMG, and LNL. Also
>> tested with level0 (UMD) PR [2].
>>
>> Major changes in v2:
>> - Dropped mmap write abuse
>> - Use core MM locking and retry loops instead of driver locking to avoid
>>    races
>> - Removed physical to virtual references
>> - Embedded structure/ops for drm_gpusvm_devmem
>> - Fixed mremap and fork issues
>> - Added DRM pagemap
>> - Included RFC documentation in the kernel doc
>>
>> Major changes in v3:
>> - Move GPU SVM and DRM pagemap to DRM level
>> - Mostly addresses Thomas's feedback, lots of small changes documented
>>    in each individual patch change log
>>
>> Known issues in v3:
>> - The check-pages step still exists. It was changed to a threshold in this
>>    version, which is better, but the cross-process page finding on small
>>    user allocations still needs to be root-caused.
>> - Dropped the documentation patch. It needs a fairly large rewrite and
>>    will be sent out independently once finished.
>>
>> Matt
>>
>> [1] https://patchwork.freedesktop.org/series/137545/#rev3
>> [2] https://github.com/intel/compute-runtime/pull/782
>>
>> Matthew Brost (27):
>>    drm/xe: Retry BO allocation
>>    mm/migrate: Add migrate_device_pfns
>>    mm/migrate: Trylock device page in do_swap_page
>>    drm/gpusvm: Add support for GPU Shared Virtual Memory
>>    drm/xe: Select DRM_GPUSVM Kconfig
>>    drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR flag
>>    drm/xe: Add SVM init / close / fini to faulting VMs
>>    drm/xe: Nuke VM's mapping upon close
>>    drm/xe: Add SVM range invalidation and page fault handler
>>    drm/gpuvm: Add DRM_GPUVA_OP_DRIVER
>>    drm/xe: Add (re)bind to SVM page fault handler
>>    drm/xe: Add SVM garbage collector
>>    drm/xe: Add unbind to SVM garbage collector
>>    drm/xe: Do not allow CPU address mirror VMA unbind if the GPU has
>>      bindings
>>    drm/xe: Enable CPU address mirror uAPI
>>    drm/xe: Add migrate layer functions for SVM support
>>    drm/xe: Add SVM device memory mirroring
>>    drm/xe: Add drm_gpusvm_devmem to xe_bo
>>    drm/xe: Add GPUSVM device memory copy vfunc functions
>>    drm/xe: Add Xe SVM populate_devmem_pfn GPU SVM vfunc
>>    drm/xe: Add Xe SVM devmem_release GPU SVM vfunc
>>    drm/xe: Add BO flags required for SVM
>>    drm/xe: Add SVM VRAM migration
>>    drm/xe: Basic SVM BO eviction
>>    drm/xe: Add SVM debug
>>    drm/xe: Add modparam for SVM notifier size
>>    drm/xe: Add always_migrate_to_vram modparam
>>
>> Thomas Hellström (3):
>>    drm/pagemap: Add DRM pagemap
>>    drm/xe: Add dma_addr res cursor
>>    drm/xe: Add drm_pagemap ops to SVM
>>
>>   drivers/gpu/drm/Kconfig                     |    8 +
>>   drivers/gpu/drm/Makefile                    |    1 +
>>   drivers/gpu/drm/drm_gpusvm.c                | 2220 +++++++++++++++++++
>>   drivers/gpu/drm/xe/Kconfig                  |   10 +
>>   drivers/gpu/drm/xe/Makefile                 |    1 +
>>   drivers/gpu/drm/xe/xe_bo.c                  |   20 +-
>>   drivers/gpu/drm/xe/xe_bo.h                  |    1 +
>>   drivers/gpu/drm/xe/xe_bo_types.h            |    4 +
>>   drivers/gpu/drm/xe/xe_device_types.h        |   15 +
>>   drivers/gpu/drm/xe/xe_gt_pagefault.c        |   17 +-
>>   drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c |   24 +
>>   drivers/gpu/drm/xe/xe_gt_tlb_invalidation.h |    2 +
>>   drivers/gpu/drm/xe/xe_migrate.c             |  175 ++
>>   drivers/gpu/drm/xe/xe_migrate.h             |   10 +
>>   drivers/gpu/drm/xe/xe_module.c              |    7 +
>>   drivers/gpu/drm/xe/xe_module.h              |    2 +
>>   drivers/gpu/drm/xe/xe_pt.c                  |  393 +++-
>>   drivers/gpu/drm/xe/xe_pt.h                  |    5 +
>>   drivers/gpu/drm/xe/xe_pt_types.h            |    2 +
>>   drivers/gpu/drm/xe/xe_res_cursor.h          |  116 +-
>>   drivers/gpu/drm/xe/xe_svm.c                 |  948 ++++++++
>>   drivers/gpu/drm/xe/xe_svm.h                 |   83 +
>>   drivers/gpu/drm/xe/xe_tile.c                |    5 +
>>   drivers/gpu/drm/xe/xe_vm.c                  |  375 +++-
>>   drivers/gpu/drm/xe/xe_vm.h                  |   15 +-
>>   drivers/gpu/drm/xe/xe_vm_types.h            |   57 +
>>   include/drm/drm_gpusvm.h                    |  445 ++++
>>   include/drm/drm_gpuvm.h                     |    5 +
>>   include/drm/drm_pagemap.h                   |  103 +
>>   include/linux/migrate.h                     |    1 +
>>   include/uapi/drm/xe_drm.h                   |   19 +-
>>   mm/memory.c                                 |   13 +-
>>   mm/migrate_device.c                         |  116 +-
>>   33 files changed, 5061 insertions(+), 157 deletions(-)
>>   create mode 100644 drivers/gpu/drm/drm_gpusvm.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_svm.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_svm.h
>>   create mode 100644 include/drm/drm_gpusvm.h
>>   create mode 100644 include/drm/drm_pagemap.h
>>
> 


