<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">Hi</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">Some suggestions in addition to the ones from Rodrigo.</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> Add an initial documentation page for user mode queues.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> </p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> Signed-off-by: Alex Deucher <alexander.deucher@amd.com></p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> ---</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> Documentation/gpu/amdgpu/index.rst | 1 +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> Documentation/gpu/amdgpu/userq.rst | 196 +++++++++++++++++++++++++++++</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> 2 files changed, 197 insertions(+)</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> create mode 100644 Documentation/gpu/amdgpu/userq.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> </p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> index bb2894b5edaf2..45523e9860fc5 100644</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> --- a/Documentation/gpu/amdgpu/index.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +++ b/Documentation/gpu/amdgpu/index.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> @@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> module-parameters</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> gc/index</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> display/index</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> + userq</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> flashing</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> xgmi</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> ras</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> diff --git a/Documentation/gpu/amdgpu/userq.rst b/Documentation/gpu/amdgpu/userq.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> new file mode 100644</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> index 0000000000000..53e6b053f652f</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> --- /dev/null</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +++ b/Documentation/gpu/amdgpu/userq.rst</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> @@ -0,0 +1,196 @@</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +==================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> + User Mode Queues</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +==================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Introduction</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +============</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Similar to the KFD, GPU engine queues move into userspace. The idea is to let</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +user processes manage their submissions to the GPU engines directly, bypassing</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +IOCTL calls to the driver to submit work. This reduces overhead and also allows</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the GPU to submit work to itself. Applications can set up work graphs of jobs</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +across multiple GPU engines without needing trips through the CPU.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +UMDs directly interface with firmware via per application shared memory areas.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The main vehicle for this is queue. A queue is a ring buffer with a read</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">is queue -> is a queue</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +pointer (rptr) and a write pointer (wptr). The UMD writes IP specific packets</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +into the queue and the firmware processes those packets, kicking off work on the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +GPU engines. The CPU in the application (or another queue or device) updates</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the wptr to tell the firmware how far into the ring buffer to process packets</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +and the rtpr provides feedback to the UMD on how far the firmware has progressed</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">rtpr -> rptr</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +in executing those packets. When the wptr and the rptr are equal, the queue is</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +idle.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Theory of Operation</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +===================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The various engines on modern AMD GPUs support multiple queues per engine with a</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +scheduling firmware which handles dynamically scheduling user queues on the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +available hardware queue slots. When the number of user queues outnumbers the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +available hardware queue slots, the scheduling firmware dynamically maps and</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +unmaps queues based on priority and time quanta. The state of each user queue</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +is managed in the kernel driver in an MQD (Memory Queue Descriptor). This is a</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +buffer in GPU accessible memory that stores the state of a user queue. The</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +scheduling firmware uses the MQD to load the queue state into an HQD (Hardware</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Queue Descriptor) when a user queue is mapped. Each user queue requires a</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +number of additional buffers which represent the ring buffer and any metadata</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +needed by the engine for runtime operation. On most engines this consists of</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +to userspace), a wrptr buffer (where the application will write the wptr for the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">wrptr -> wptr</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +firmware to fetch it), and a doorbell. A doorbell is a piece of the device's</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +MMIO BAR which can be mapped to specific user queues. Writing to the doorbell</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +wakes the firmware and causes it to fetch the wptr and start processing the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +packets in the queue. Each 4K page of the doorbell BAR supports specific offset</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +ranges for specific engines. The doorbell of a queue most be mapped into the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">most -> must</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +aperture aligned to the IP used by the queue (e.g., GFX, VCN, SDMA, etc.).</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +These doorbell apertures are set up via NBIO registers. Doorbells are 32 bit or</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +64 bit (depending on the engine) chunks of the doorbell BAR. A 4K doorbell page</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +provides 512 64-bit doorbells for up to 512 user queues. A subset of each page</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +is reserved for each IP type supported on the device. The user can query the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +doorbell ranges for each IP via the INFO IOCTL.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +When an application wants to create a user queue, it allocates the the necessary</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">the the -> the</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +These can be separate buffers or all part of one larger buffer. The application</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +would map the buffer(s) into its GPUVM and use the GPU virtual addresses of for</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">addresses of for -> addresses for</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the areas of memory they want t use for the user queue. They would also</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">t use -> to use</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +allocate a doorbell page for the doorbells used by the user queues. The</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +application would then populate the MQD in the USERQ IOCTL structure with the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +GPU virtual addresses and doorbell index they want to use. The user can also</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +specify the attributes for the user queue (priority, whether the queue is secure</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +for protected content, etc.). The application would then call the USERQ</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +create IOCTL to create the queue from using the specified MQD. The</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">create IOCTL -> CREATE IOCTL</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">from using -> using</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel driver then validates the MQD provided by the application and translates</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the MQD into the engine specific MQD format for the IP. The IP specific MQD</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +would be allocated and the queue would be added to the run list maintained by</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the scheduling firmware. Once the queue has been created, the application can</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +write packets directly into the queue, update the wptr, and write to the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +doorbell offset to kick off work in the user queue.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +When the application is done with the user queue, it would call the USERQ</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +FREE IOCTL to destroy it. The kernel driver would preempt the queue and</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +remove it from the scheduling firmware's run list. Then the IP specific MQD</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +would be freed and the user queue state would be cleaned up.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Some engines may require the aggregated doorbell to if the engine does not</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">doorbell to if -> doorbell if</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +support doorbells from unmapped queues. The aggregated doorbell is a special</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +page of doorbell space which wakes the scheduler. In cases where the engine may</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +be oversubscribed, some queues may not be mapped. If the doorbell is rung when</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the queue is not mapped, the engine firmware may miss the request. Some</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +scheduling firmware may work around this my polling wptr shadows when the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">my polling -> by polling</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +hardware is oversubscribed, other engines may support doorbell updates from</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +unmapped queues. In the event that one of these options is not available, the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel driver will map a page of aggregated doorbell space into each GPUVM</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +space. The UMD will then update the doorbell and wptr as normal and then write</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +to the aggregated doorbell as well.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Special Packets</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +---------------</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +In order to support legacy implicit synchronization, as well as mixed user and</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel queues, we need a synchronization mechanism that is secure. Because</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel queues or memory management tasks depend on kernel fences, we need a way</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +for user queues to update memory that the kernel can use for a fence, that can't</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +be messed with by a bad actor. To support this, we've added protected fence</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +packet. This packet works by writing the a monotonically increasing value to</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">the a -> a</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">added protected fence packet -> added a protected fence packet</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +a memory location that is only the privileged clients have write access to.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">is only -> only</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +User queues only have read access. When this packet is executed, the memory</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +location is updated and other queues (kernel or user) can see the results.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Memory Management</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +=================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +It is assumed that all buffers mapped into the GPUVM space for the process are</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +valid when engines on the GPU are running. The kernel driver will only allow</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +user queues to run when all buffers are mapped. If there is a memory event that</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +requires buffer migration, the kernel driver will preempt the user queues,</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +migrate buffers to where they need to be, update the GPUVM page tables and</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +invaldidate the TLB, and then resume the user queues.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">invaldidate -> invalidate</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Interaction with Kernel Queues</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +==============================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Depending on the IP and the scheduling firmware, you can enable kernel queues</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +and user queues at the same time, However, you are limited by the HQD slots.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">same time, However -> same time. However</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Kernel queues are always mapped so any work the goes into kernel queues will</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">the goes -> that goes</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +take priority. This limits the available HQD slots for user queues.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Not all IPs will support user queues on all GPUs. As such, UMDs will need to</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +support both user queues and kernel queues depending on the IP. For example, a</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +and VPE. UMDs need to support both. The kernel driver provides a way to</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +determine if user queues and kernel queues are supported on a per IP basis.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +UMDs can query this information via the INFO IOCTL and determine whether to use</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel queues or user queues for each IP.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Queue Resets</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +============</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +For most engines, queues can be reset individually. GFX, compute, and SDMA</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +queues can be reset individually. When a hung queue is detected, it can be</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">The first two sentences say the same thing; maybe drop one of them.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +reset either via the scheduling firmware or MMIO. Since there are no kernel</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +fences for most user queues, they will usually only be detected when some other</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +event happens; e.g., a memory event which requires migration of buffers. When</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +the queues are preempted, if the queue is hung, the preemption will fail.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Driver will them look up the queues that failed to preempt and reset them and</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">them -> then</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +record which queues are hung.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +On the UMD side, we will add an USERQ QUERY_STATUS IOCTL to query the queue</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">an USERQ -> a USERQ</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +status. UMD will provide the queue id in the IOCTL and the kernel driver</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +will check if it has already recorded the queue as hung (e.g., due to failed</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +peemption) and report back the status.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">peemption -> preemption</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +IOCTL Interfaces</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +GPU virtual addresses used for queues and related data (rptrs, wptrs, context</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +save areas, etc.) should be validated by the kernel mode driver to prevent the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +user from specifying invalid GPU virtual addresses. If the user provides</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +invalid GPU virtual addresses or doorbell indicies, the IOCTL should return an</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">indicies -> indices</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +error message. These buffers should also be tracked in the kernel driver so</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +that if the user attempts to unmap the buffer(s) from the GPUVM, the umap call</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">umap -> unmap</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +would return an error.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +INFO</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +----</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +There are several new INFO queries related to user queues in order to query the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +size of user queue meta data needed for a user queue (e.g., context save areas</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">meta data -> metadata</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +or shadow buffers), and whether kernel or user queues or both are supported</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +for each IP type.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +USERQ</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +-----</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The USERQ IOCTL is used for creating, freeing, and querying the status of user</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +queues. It supports 3 opcodes:</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +1. CREATE - Create a user queue. The application provides a MQD-like structure</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> + that devices the type of queue and associated metadata and flags for that</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">devices -> describes/defines</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> + queue type. Returns the queue id.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +2. FREE - Free a user queue.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +3. QUERY_STATRUS - Query that status of a queue. Used to check if the queue is</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">QUERY_STATRUS -> QUERY_STATUS</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">that status -> the status</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> + healthy or not. E.g., if the queue has been reset. (WIP)</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +USERQ_SIGNAL</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +------------</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +USERQ_WAIT</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +----------</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The USERQ_WAIT IOCTL is used to provide a list of sync object to be waited on.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">sync object -> sync objects</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Kernel and User Queues</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +======================</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +In order to properly validate and test performance, we have a driver option to</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +select what type of queues are enabled (kernel queues, user queues or both).</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +The user_queue driver parameter allows you to enable kernel queues only (0),</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +user queues and kernel queues (1), and user queues only (2). Enabling user</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +queues only will free up static queue assignments that would otherwise be used</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +by kernel queues for use by the scheduling firmware. Some kernel queues are</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +required for kernel driver operation and they will always be created. When the</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +kernel queues are not enabled, they are not registered with the drm scheduler</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +and the CS IOCTL will reject any incoming command submissions which target those</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +queue types. Kernel queues only mirrors the behavior on all existing GPUs.</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +Enabling both queues allows for backwards compatibility with old userspace while</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> +still supporting user queues.</p>
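Could also give the concrete module parameter syntax here, e.g.:

```shell
# Values per the text: 0 = kernel queues only, 1 = both, 2 = user queues only.
modprobe amdgpu user_queue=1
```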
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">> </p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">Have a great time,</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">Jure Repinc</p>
<br /><p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">-- </p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;"> Jabber/XMPP: JLP@jabber.org</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;"> Matrix: @jlp:matrix.org</p>
<p style="margin-top:0;margin-bottom:0;margin-left:0;margin-right:0;">Mastodon/ActivityPub: @JRepin@mstdn.io</p>
<br /></body>
</html>