[PATCH 00/21] GPU debug support (eudebug)

Gwan-gyeong Mun gwan-gyeong.mun at intel.com
Mon Jul 29 08:27:29 UTC 2024



On 7/27/24 8:23 AM, Matthew Brost wrote:
> On Fri, Jul 26, 2024 at 05:07:57PM +0300, Mika Kuoppala wrote:
>> Hi,
>>
>> We (Intel eudebug kernel team) would like to submit this
>> patchset to enable debug support for Intel GPU devices.
>>
>> The aim is to allow Level-Zero + GDB (or some other tool)
>> to attach to xe driver in order to receive information
>> about relevant driver resources, hardware events and to allow debug
>> related hardware control. End goal is full debug capability
>> of supported Intel hardware, see [4].
>>
>> Debugger first opens a connection to a device through
>> drm ioctl with debug target process as a pid. This will
>> return a dedicated file descriptor used for debugging
>> for further events and control.
>>
>> Xe internal resources that are considered essential
>> to debugger functionality are relayed as events to the
>> debugger. On debugger connection, all existing resources
>> are relayed to debugger (discovery) and from that
>> point onwards, as they are created/destroyed.
>>
>> uapi is extended to allow an application/lib to provide
>> debug metadata information. These are relayed as events
>> to the debugger so it can decode the program state.
>>
>> Along with the resource and metadata events, an event for
>> hardware exceptions, called EU attention, is provided.
>> The debugger, with the assistance of an exception handling
>> program called System Routine (short: SIP) provided
>> with the pipeline setup, can determine which specific
>> EU/thread and instruction encountered the breakpoint
>> or other exceptions.
>>
>> EU controlling ioctl interface is also introduced where
>> debugger can manipulate individual threads of the currently
>> active workload. This interface enables the debugger to
>> interrupt all threads on demand, check their current state
>> and resume them individually.
>>
>> The intent is to provide a similar but not API compatible
>> functionality as in out-of-tree i915 debugger support:
>> https://dgpu-docs.intel.com/driver/gpu-debugging.html
>>
>> For xe the aim is to have all support merged in upstream,
>> starting with this series, with Lunarlake being the first targeted
>> hardware.
>>
>> I have split the events into xe_drm_eudebug.h instead of
>> pushing everything into xe_drm.h, in order to help
>> distinguish what is controlled by which descriptor.
>> If it's through the original xe fd, it is in xe_drm.h and
>> if it's through the opened debugger connection fd, it
>> is in xe_drm_eudebug.h.
>>
> 
> Looking through the series, I do have a question wrt GPU faults and
> eudebug. I don't see any interaction there. Without knowing how eudebug
> works, it seems like setting a break point on a GPU access to a virtual
> address is something a debugger would want. On a faulting device, this
> is something we should be able to support. This really comes into play
> once we have SVM as the UMD won't be issuing binds either. Curious about
> your thoughts here.
> 
> If this is something that is required, in particular with SVM, this is
> something the SVM and eudebug teams need to collaborate on early to make
> sure both designs work with each other.
> 
Hi Matt,

Here's a quick scenario of what I understand to happen when a breakpoint 
is set on an EU thread.

The breakpoint behavior of eudebug is that if the breakpoint bit is on 
for an EU instruction, the EU thread will jump from AIP to SIP mode with 
a breakpoint exception before that instruction is executed.
The SIP shader can load/store the ARF (and GRF) registers of the EU 
thread, including the AIP of the EU thread, in memory that communicates 
with the debugger.
When the SIP shader of the EU thread executes the sync.host 
instruction, it sets the bit of the corresponding thread in the TD_ATT 
(MMIO) register and the EU thread stops.
The KMD periodically checks the TD_ATT register and notifies the debug UMD.
To resume the EU thread, the set bit in TD_ATT simply needs to be cleared.
When the debug UMD sends the EU thread resume event to the KMD, the KMD 
clears the corresponding bit in the TD_ATT register.
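
To make the sequence above easier to follow, here is a toy user-space model 
of it in C. It is only a sketch of my understanding, not code from the 
series: td_att is a plain variable standing in for the MMIO register, the 
32-thread width is an arbitrary assumption, and sip_sync_host(), 
kmd_scan_attention() and kmd_resume_thread() are hypothetical names that do 
not exist in the xe driver.

```c
/* Toy model of the attention flow. All names are hypothetical. */
#include <assert.h>
#include <stdint.h>

#define NUM_THREADS 32u	/* assumed width, for illustration only */

/* Stand-in for the memory-mapped TD_ATT register: one bit per EU thread. */
static uint32_t td_att;

/* SIP side: the thread reaches sync.host, raises its bit and halts. */
static void sip_sync_host(unsigned int thread)
{
	assert(thread < NUM_THREADS);
	td_att |= UINT32_C(1) << thread;
}

/*
 * KMD side: one pass of the periodic scan. Returns the bits raised since
 * the previous pass, i.e. what would be reported to the debug UMD.
 */
static uint32_t kmd_scan_attention(uint32_t *last_seen)
{
	uint32_t now = td_att;
	uint32_t fresh = now & ~*last_seen;

	*last_seen = now;
	return fresh;
}

/* KMD side: a resume request from the debug UMD clears the thread's bit. */
static void kmd_resume_thread(unsigned int thread, uint32_t *last_seen)
{
	uint32_t bit;

	assert(thread < NUM_THREADS);
	bit = UINT32_C(1) << thread;
	td_att &= ~bit;
	*last_seen &= ~bit;	/* a later hit on this thread is reported again */
}
```

In this model, a breakpoint hit on thread 3 makes the next scan return 
0x8, a second scan returns 0 because nothing new happened, and after 
kmd_resume_thread(3, ...) the bit is clear so the thread can hit the 
breakpoint and be reported again.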

It is my understanding that this EU thread breakpoint scenario does not 
conflict with the recoverable pagefault behavior used by (HMM-based) SVM, 
and that the two can work together without any additional design 
changes.

Apart from this, handling the unrecoverable pagefault that occurs when an 
EU thread accesses an unallocated PPGTT virtual address requires 
additional implementation work.

Mika, Joonas, Dominik, Jonathan, if I have misunderstood or missed 
something, or if you have any additional thoughts, could you please share them?

Br,

G.G.
> Matt
> 
>> Latest code can be found in:
>> [1] https://gitlab.freedesktop.org/miku/kernel/-/tree/eudebug-dev
>>
>> With the associated IGT tests:
>> [2] https://gitlab.freedesktop.org/cmanszew/igt-gpu-tools/-/tree/eudebug-dev
>>
>> The user for this uapi:
>> [3] https://github.com/intel/compute-runtime
>> Event loop and thread control interaction can be found at:
>> https://github.com/intel/compute-runtime/tree/master/level_zero/tools/source/debug/linux/xe
>> And the wrappers in:
>> https://github.com/intel/compute-runtime/tree/master/shared/source/os_interface/linux/xe
>> https://github.com/intel/compute-runtime/blob/master/shared/source/os_interface/linux/xe/ioctl_helper_xe_debugger.cpp
>> Note that the XE support is disabled by default and you will need
>> NEO_ENABLE_XE_EU_DEBUG_SUPPORT enabled in order to test.
>>
>> GDB support:
>> [4]: https://sourceware.org/pipermail/gdb-patches/2024-July/210264.html
>>
>> Thank you in advance for any comments and insight.
>>
>>
>> Andrzej Hajda (1):
>>    drm/xe/eudebug: implement userptr_vma access
>>
>> Christoph Manszewski (3):
>>    drm/xe/eudebug: Add vm bind and vm bind ops
>>    drm/xe/eudebug: Dynamically toggle debugger functionality
>>    drm/xe/eudebug_test: Introduce xe_eudebug wa kunit test
>>
>> Dominik Grzegorzek (10):
>>    drm/xe: Export xe_hw_engine's mmio accessors
>>    drm/xe: Move and export xe_hw_engine lookup.
>>    drm/xe/eudebug: Introduce exec_queue events
>>    drm/xe/eudebug: hw enablement for eudebug
>>    drm/xe: Add EUDEBUG_ENABLE exec queue property
>>    drm/xe/eudebug: Introduce per device attention scan worker
>>    drm/xe/eudebug: Introduce EU control interface
>>    drm/xe: Debug metadata create/destroy ioctls
>>    drm/xe: Attach debug metadata to vma
>>    drm/xe/eudebug: Add debug metadata support for xe_eudebug
>>
>> Jonathan Cavitt (1):
>>    drm/xe/eudebug: Use ptrace_may_access for xe_eudebug_attach
>>
>> Mika Kuoppala (6):
>>    drm/xe/eudebug: Introduce eudebug support
>>    kernel: export ptrace_may_access
>>    drm/xe/eudebug: Introduce discovery for resources
>>    drm/xe/eudebug: Add UFENCE events with acks
>>    drm/xe/eudebug: vm open/pread/pwrite
>>    drm/xe/eudebug: Implement vm_bind_op discovery
>>
>>   drivers/gpu/drm/xe/Makefile                  |    5 +-
>>   drivers/gpu/drm/xe/regs/xe_engine_regs.h     |    8 +
>>   drivers/gpu/drm/xe/regs/xe_gt_regs.h         |   43 +
>>   drivers/gpu/drm/xe/tests/xe_eudebug.c        |  170 +
>>   drivers/gpu/drm/xe/tests/xe_live_test_mod.c  |    2 +
>>   drivers/gpu/drm/xe/xe_debug_metadata.c       |  125 +
>>   drivers/gpu/drm/xe/xe_debug_metadata.h       |   25 +
>>   drivers/gpu/drm/xe/xe_debug_metadata_types.h |   28 +
>>   drivers/gpu/drm/xe/xe_device.c               |   47 +-
>>   drivers/gpu/drm/xe/xe_device_types.h         |   45 +
>>   drivers/gpu/drm/xe/xe_eudebug.c              | 3841 ++++++++++++++++++
>>   drivers/gpu/drm/xe/xe_eudebug.h              |   51 +
>>   drivers/gpu/drm/xe/xe_eudebug_types.h        |  326 ++
>>   drivers/gpu/drm/xe/xe_exec.c                 |    2 +-
>>   drivers/gpu/drm/xe/xe_exec_queue.c           |   80 +-
>>   drivers/gpu/drm/xe/xe_exec_queue_types.h     |    7 +
>>   drivers/gpu/drm/xe/xe_gt_debug.c             |  152 +
>>   drivers/gpu/drm/xe/xe_gt_debug.h             |   27 +
>>   drivers/gpu/drm/xe/xe_hw_engine.c            |   39 +-
>>   drivers/gpu/drm/xe/xe_hw_engine.h            |   11 +
>>   drivers/gpu/drm/xe/xe_lrc.c                  |   16 +-
>>   drivers/gpu/drm/xe/xe_lrc.h                  |    4 +-
>>   drivers/gpu/drm/xe/xe_reg_sr.c               |   21 +-
>>   drivers/gpu/drm/xe/xe_reg_sr.h               |    4 +-
>>   drivers/gpu/drm/xe/xe_rtp.c                  |    2 +-
>>   drivers/gpu/drm/xe/xe_rtp_types.h            |    1 +
>>   drivers/gpu/drm/xe/xe_sync.c                 |   49 +-
>>   drivers/gpu/drm/xe/xe_sync.h                 |    8 +-
>>   drivers/gpu/drm/xe/xe_sync_types.h           |   26 +-
>>   drivers/gpu/drm/xe/xe_vm.c                   |  227 +-
>>   drivers/gpu/drm/xe/xe_vm_types.h             |   26 +
>>   include/uapi/drm/xe_drm.h                    |   96 +-
>>   include/uapi/drm/xe_drm_eudebug.h            |  226 ++
>>   kernel/ptrace.c                              |    1 +
>>   34 files changed, 5655 insertions(+), 86 deletions(-)
>>   create mode 100644 drivers/gpu/drm/xe/tests/xe_eudebug.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_debug_metadata.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_debug_metadata.h
>>   create mode 100644 drivers/gpu/drm/xe/xe_debug_metadata_types.h
>>   create mode 100644 drivers/gpu/drm/xe/xe_eudebug.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_eudebug.h
>>   create mode 100644 drivers/gpu/drm/xe/xe_eudebug_types.h
>>   create mode 100644 drivers/gpu/drm/xe/xe_gt_debug.c
>>   create mode 100644 drivers/gpu/drm/xe/xe_gt_debug.h
>>   create mode 100644 include/uapi/drm/xe_drm_eudebug.h
>>
>> -- 
>> 2.34.1
>>

