[PATCH] Documentation: add initial documentation for user queues

Mon Jun 30 19:11:19 UTC 2025

On Tue, May 27, 2025 at 4:46 PM Rodrigo Siqueira <siqueira at igalia.com> wrote:
>
> Hi Alex,
>
> Some comments and questions follow.
>
> On 05/02, Alex Deucher wrote:
> > Add an initial documentation page for user mode queues.
> >
> > Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> > ---
> >  Documentation/gpu/amdgpu/index.rst |   1 +
> >  Documentation/gpu/amdgpu/userq.rst | 196 +++++++++++++++++++++++++++++
> >  2 files changed, 197 insertions(+)
> >  create mode 100644 Documentation/gpu/amdgpu/userq.rst
> >
> > diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> > index bb2894b5edaf2..45523e9860fc5 100644
> > --- a/Documentation/gpu/amdgpu/index.rst
> > +++ b/Documentation/gpu/amdgpu/index.rst
> > @@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
> >     module-parameters
> >     gc/index
> >     display/index
> > +   userq
> >     flashing
> >     xgmi
> >     ras
> > diff --git a/Documentation/gpu/amdgpu/userq.rst b/Documentation/gpu/amdgpu/userq.rst
> > new file mode 100644
> > index 0000000000000..53e6b053f652f
> > --- /dev/null
> > +++ b/Documentation/gpu/amdgpu/userq.rst
> > @@ -0,0 +1,196 @@
> > +==================
> > + User Mode Queues
> > +==================
> > +
> > +Introduction
> > +============
> > +
> > +Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
> > +user processes manage their submissions to the GPU engines directly, bypassing
> > +IOCTL calls to the driver to submit work.  This reduces overhead and also allows
> > +the GPU to submit work to itself.  Applications can set up work graphs of jobs
> > +across multiple GPU engines without needing trips through the CPU.
> > +
> > +UMDs directly interface with firmware via per application shared memory areas.
> > +The main vehicle for this is a queue.  A queue is a ring buffer with a read
> > +pointer (rptr) and a write pointer (wptr).  The UMD writes IP specific packets
> > +into the queue and the firmware processes those packets, kicking off work on the
> > +GPU engines.  The CPU in the application (or another queue or device) updates
> > +the wptr to tell the firmware how far into the ring buffer to process packets
> > +and the rptr provides feedback to the UMD on how far the firmware has progressed
> > +in executing those packets.  When the wptr and the rptr are equal, the queue is
> > +idle.
> > +
> > +Theory of Operation
> > +===================
> > +
> > +The various engines on modern AMD GPUs support multiple queues per engine with a
> > +scheduling firmware which handles dynamically scheduling user queues on the
> > +available hardware queue slots.  When the number of user queues outnumbers the
> > +available hardware queue slots, the scheduling firmware dynamically maps and
> > +unmaps queues based on priority and time quanta.  The state of each user queue
> > +is managed in the kernel driver in an MQD (Memory Queue Descriptor).  This is a
> > +buffer in GPU accessible memory that stores the state of a user queue.  The
> > +scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
> > +Queue Descriptor) when a user queue is mapped.  Each user queue requires a
> > +number of additional buffers which represent the ring buffer and any metadata
> > +needed by the engine for runtime operation.  On most engines this consists of
> > +the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
> > +to userspace), a wptr buffer (where the application will write the wptr for the
> > +firmware to fetch it), and a doorbell.  A doorbell is a piece of the device's
>
> In this part, you started to explain about the doorbell; consider adding
> a new paragraph here.

Added some additional info here.

>
> Another idea could be to create a dedicated page to explain doorbells
> and move all the general doorbell information from this patch to the new
> page. I think there is no kernel-doc about amdgpu doorbells.
>
> > +MMIO BAR which can be mapped to specific user queues.  Writing to the doorbell
> > +wakes the firmware and causes it to fetch the wptr and start processing the
> > +packets in the queue. Each 4K page of the doorbell BAR supports specific offset
> > +ranges for specific engines.  The doorbell of a queue most be mapped into the
>
> /most/must/

Fixed.

>
> > +aperture aligned to the IP used by the queue (e.g., GFX, VCN, SDMA, etc.).
> > +These doorbell apertures are set up via NBIO registers.  Doorbells are 32 bit or
> > +64 bit (depending on the engine) chunks of the doorbell BAR.  A 4K doorbell page
> > +provides 512 64-bit doorbells for up to 512 user queues.  A subset of each page
> > +is reserved for each IP type supported on the device.  The user can query the
> > +doorbell ranges for each IP via the INFO IOCTL.
>
> The first time that I read this, I was confused about the IOCTL part;
> however, at the end of this patch, I noticed that you explained the
> IOCTL part. Perhaps add a mention in parenthesis so the reader can see
> more details about this info in the "IOCTL Interfaces" section.

Updated.
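
As a quick sanity check on the doorbell arithmetic quoted above (this is
not part of the patch), here is a small Python sketch.  PAGE_SIZE,
doorbells_per_page() and doorbell_byte_offset() are made-up names for
illustration, not driver or UAPI symbols.  The 32-bit case is just the
same arithmetic applied to a narrower doorbell; whether any given engine
lays out its doorbells this way is an assumption.

```python
PAGE_SIZE = 4096  # one 4K doorbell page, as described in the text


def doorbells_per_page(doorbell_bits):
    """Number of doorbells of the given width that fit in one page."""
    if doorbell_bits not in (32, 64):
        raise ValueError("doorbells are 32 or 64 bits wide")
    return PAGE_SIZE // (doorbell_bits // 8)


def doorbell_byte_offset(index, doorbell_bits=64):
    """Byte offset of doorbell `index` within its page (hypothetical layout)."""
    if not 0 <= index < doorbells_per_page(doorbell_bits):
        raise IndexError("doorbell index outside the page")
    return index * (doorbell_bits // 8)
```

With 64-bit doorbells this gives the 512 doorbells per page mentioned in
the text.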

>
> > +
> > +When an application wants to create a user queue, it allocates the necessary
> > +buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
> > +These can be separate buffers or all part of one larger buffer.  The application
> > +would map the buffer(s) into its GPUVM and use the GPU virtual addresses for
> > +the areas of memory they want t use for the user queue.  They would also
>
> /t/to/

Fixed.

>
> > +allocate a doorbell page for the doorbells used by the user queues.  The
> > +application would then populate the MQD in the USERQ IOCTL structure with the
> > +GPU virtual addresses and doorbell index they want to use.  The user can also
> > +specify the attributes for the user queue (priority, whether the queue is secure
> > +for protected content, etc.).  The application would then call the USERQ
> > +create IOCTL to create the queue using the specified MQD.  The
> > +kernel driver then validates the MQD provided by the application and translates
> > +the MQD into the engine specific MQD format for the IP.  The IP specific MQD
> > +would be allocated and the queue would be added to the run list maintained by
> > +the scheduling firmware.  Once the queue has been created, the application can
> > +write packets directly into the queue, update the wptr, and write to the
> > +doorbell offset to kick off work in the user queue.
> > +
> > +When the application is done with the user queue, it would call the USERQ
> > +FREE IOCTL to destroy it.  The kernel driver would preempt the queue and
> > +remove it from the scheduling firmware's run list.  Then the IP specific MQD
> > +would be freed and the user queue state would be cleaned up.
>
> Is it possible to add some pseudo-code that summarizes the programming
> model described here?

I'm not sure I understand what you are asking for here.
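
If the request is for a toy model of the flow described in those
paragraphs, a rough sketch might look like the following.  It is plain
Python, does not touch the real IOCTLs or packet formats, and every
name in it (UserQueue, write_packet(), ring_doorbell(),
firmware_process()) is hypothetical.  It only models the ring buffer
bookkeeping: the application writes packets and advances the wptr, the
doorbell write tells the "firmware" how far to process, and the queue
is idle when rptr equals wptr.

```python
class UserQueue:
    """Toy model of a user queue: a ring buffer plus rptr/wptr."""

    def __init__(self, num_slots):
        self.ring = [None] * num_slots
        self.rptr = 0      # advanced by the (simulated) firmware
        self.wptr = 0      # advanced by the application
        self.doorbell = 0  # last wptr value written to the doorbell

    def is_idle(self):
        # The text defines idle as rptr == wptr.
        return self.rptr == self.wptr

    def free_slots(self):
        return len(self.ring) - (self.wptr - self.rptr)

    def write_packet(self, packet):
        """Application side: place a packet in the ring and bump wptr."""
        if self.free_slots() == 0:
            raise BufferError("ring buffer is full")
        self.ring[self.wptr % len(self.ring)] = packet
        self.wptr += 1

    def ring_doorbell(self):
        """Application side: tell the firmware how far it may process."""
        self.doorbell = self.wptr

    def firmware_process(self):
        """Firmware stand-in: consume packets up to the doorbell value."""
        done = []
        while self.rptr < self.doorbell:
            done.append(self.ring[self.rptr % len(self.ring)])
            self.rptr += 1
        return done
```

In this model, packets written without a doorbell write are simply not
picked up, which is the same reason the real queue needs the doorbell
write after the wptr update.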

>
> > +
> > +Some engines may require the aggregated doorbell to if the engine does not
>
> /to/too/ or /to//?

Fixed.

>
> Do you know which engines require the aggregated doorbell? Can this
> information be retrieved via IOCTL?  I think this information can be
> helpful for userspace implementation.

No IPs which currently support user queues require the aggregated
doorbell.  VCN will likely be the first IP that needs it.

>
> > +support doorbells from unmapped queues.  The aggregated doorbell is a special
> > +page of doorbell space which wakes the scheduler.  In cases where the engine may
> > +be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
> > +the queue is not mapped, the engine firmware may miss the request.  Some
> > +scheduling firmware may work around this by polling wptr shadows when the
> > +hardware is oversubscribed; other engines may support doorbell updates from
> > +unmapped queues.  In the event that one of these options is not available, the
> > +kernel driver will map a page of aggregated doorbell space into each GPUVM
> > +space.  The UMD will then update the doorbell and wptr as normal and then write
> > +to the aggregated doorbell as well.
> > +
> > +Special Packets
> > +---------------
> > +
> > +In order to support legacy implicit synchronization, as well as mixed user and
> > +kernel queues, we need a synchronization mechanism that is secure.  Because
> > +kernel queues or memory management tasks depend on kernel fences, we need a way
> > +for user queues to update memory that the kernel can use for a fence, that can't
> > +be messed with by a bad actor.  To support this, we've added a protected fence
> > +packet.  This packet works by writing a monotonically increasing value to
> > +a memory location that only privileged clients have write access to.
> > +User queues only have read access.  When this packet is executed, the memory
> > +location is updated and other queues (kernel or user) can see the results.
>
> Does the driver handle this packet? I mean, does the driver insert it
> without the userspace request? What is the packet name? How can I find
> it in the kernel?

The actual packet format varies from IP to IP (GFX/Compute, SDMA, VCN,
etc.), but the behavior is the same.  The packet submission is handled
in userspace.  The kernel driver just sets up the privileged memory
used for each user queue when the application creates the queue.
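
To illustrate the behavior only (not any real packet format), here is
a toy Python model of the protected fence.  The ProtectedFence class
and its methods are hypothetical, and the `privileged` flag stands in
for the access control that the hardware and kernel actually enforce
on the fence memory.

```python
class ProtectedFence:
    """Toy model: memory readable by anyone, writable only when privileged."""

    def __init__(self):
        self._value = 0

    def read(self):
        # User queues (and everyone else) have read access.
        return self._value

    def signal(self, new_value, privileged):
        """Model of the fence packet updating the protected location."""
        if not privileged:
            raise PermissionError("user queues only have read access")
        if new_value <= self._value:
            raise ValueError("fence value must increase monotonically")
        self._value = new_value

    def is_signaled(self, wait_value):
        # A waiter (kernel or user queue) compares against the fence value.
        return self._value >= wait_value
```

Because the value only ever increases, a waiter can check for "fence
N has passed" with a single comparison.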

>
> > +
> > +Memory Management
> > +=================
> > +
> > +It is assumed that all buffers mapped into the GPUVM space for the process are
> > +valid when engines on the GPU are running.  The kernel driver will only allow
> > +user queues to run when all buffers are mapped.  If there is a memory event that
> > +requires buffer migration, the kernel driver will preempt the user queues,
> > +migrate buffers to where they need to be, update the GPUVM page tables and
> > +invalidate the TLB, and then resume the user queues.
> > +
> > +Interaction with Kernel Queues
> > +==============================
> > +
> > +Depending on the IP and the scheduling firmware, you can enable kernel queues
> > +and user queues at the same time,  However, you are limited by the HQD slots.
>
> /However/however/

Fixed.

>
> > +Kernel queues are always mapped so any work that goes into kernel queues will
> > +take priority.  This limits the available HQD slots for user queues.
> > +
> > +Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
> > +support both user queues and kernel queues depending on the IP.  For example, a
> > +GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
> > +and VPE.  UMDs need to support both.  The kernel driver provides a way to
> > +determine if user queues and kernel queues are supported on a per IP basis.
> > +UMDs can query this information via the INFO IOCTL and determine whether to use
> > +kernel queues or user queues for each IP.
> > +
> > +Queue Resets
> > +============
> > +
> > +For most engines, queues can be reset individually.  GFX, compute, and SDMA
> > +queues can be reset individually.  When a hung queue is detected, it can be
> > +reset either via the scheduling firmware or MMIO.  Since there are no kernel
> > +fences for most user queues, hangs will usually only be detected when some other
> > +event happens; e.g., a memory event which requires migration of buffers.  When
> > +the queues are preempted, if the queue is hung, the preemption will fail.
> > +The driver will then look up the queues that failed to preempt and reset them and
> > +record which queues are hung.
> > +
> > +
> > +On the UMD side, we will add a USERQ QUERY_STATUS IOCTL to query the queue
> > +status.  UMD will provide the queue id in the IOCTL and the kernel driver
> > +will check if it has already recorded the queue as hung (e.g., due to failed
> > +preemption) and report back the status.
> > +
> > +IOCTL Interfaces
> > +================
> > +
> > +GPU virtual addresses used for queues and related data (rptrs, wptrs, context
> > +save areas, etc.) should be validated by the kernel mode driver to prevent the
> > +user from specifying invalid GPU virtual addresses.  If the user provides
> > +invalid GPU virtual addresses or doorbell indices, the IOCTL should return an
> > +error message.  These buffers should also be tracked in the kernel driver so
> > +that if the user attempts to unmap the buffer(s) from the GPUVM, the unmap call
> > +would return an error.
> > +
> > +INFO
> > +----
> > +There are several new INFO queries related to user queues in order to query the
> > +size of user queue metadata needed for a user queue (e.g., context save areas
> > +or shadow buffers), and whether kernel or user queues or both are supported
> > +for each IP type.
> > +
> > +USERQ
> > +-----
> > +The USERQ IOCTL is used for creating, freeing, and querying the status of user
> > +queues.  It supports 3 opcodes:
> > +
> > +1. CREATE - Create a user queue.  The application provides an MQD-like structure
> > +   that defines the type of queue and associated metadata and flags for that
> > +   queue type.  Returns the queue id.
> > +2. FREE - Free a user queue.
> > +3. QUERY_STATRUS - Query the status of a queue.  Used to check if the queue is
>
> /QUERY_STATRUS/QUERY_STATUS/?

Fixed.

Thanks,

Alex

>
> Thanks
> Siqueira
>
> > +   healthy or not.  E.g., if the queue has been reset. (WIP)
> > +
> > +USERQ_SIGNAL
> > +------------
> > +The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.
> > +
> > +USERQ_WAIT
> > +----------
> > +The USERQ_WAIT IOCTL is used to provide a list of sync objects to be waited on.
> > +
> > +Kernel and User Queues
> > +======================
> > +
> > +In order to properly validate and test performance, we have a driver option to
> > +select what type of queues are enabled (kernel queues, user queues or both).
> > +The user_queue driver parameter allows you to enable kernel queues only (0),
> > +user queues and kernel queues (1), and user queues only (2).  Enabling user
> > +queues only will free up static queue assignments that would otherwise be used
> > +by kernel queues for use by the scheduling firmware.  Some kernel queues are
> > +required for kernel driver operation and they will always be created.  When the
> > +kernel queues are not enabled, they are not registered with the drm scheduler
> > +and the CS IOCTL will reject any incoming command submissions which target those
> > +queue types.  Kernel queues only mirrors the behavior on all existing GPUs.
> > +Enabling both queues allows for backwards compatibility with old userspace while
> > +still supporting user queues.
> > --
> > 2.49.0
> >
>
> --
> Rodrigo Siqueira