[PATCH] Documentation: add initial documenation for user queues

Thu Jul 3 21:06:05 UTC 2025

On 06/30, Alex Deucher wrote:
> On Tue, May 27, 2025 at 4:46 PM Rodrigo Siqueira <siqueira at igalia.com> wrote:
> >
> > Hi Alex,
> >
> > Follow some comments and questions.
> >
> > On 05/02, Alex Deucher wrote:
> > > Add an initial documentation page for user mode queues.
> > >
> > > Signed-off-by: Alex Deucher <alexander.deucher at amd.com>
> > > ---
> > >  Documentation/gpu/amdgpu/index.rst |   1 +
> > >  Documentation/gpu/amdgpu/userq.rst | 196 +++++++++++++++++++++++++++++
> > >  2 files changed, 197 insertions(+)
> > >  create mode 100644 Documentation/gpu/amdgpu/userq.rst
> > >
> > > diff --git a/Documentation/gpu/amdgpu/index.rst b/Documentation/gpu/amdgpu/index.rst
> > > index bb2894b5edaf2..45523e9860fc5 100644
> > > --- a/Documentation/gpu/amdgpu/index.rst
> > > +++ b/Documentation/gpu/amdgpu/index.rst
> > > @@ -12,6 +12,7 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
> > >     module-parameters
> > >     gc/index
> > >     display/index
> > > +   userq
> > >     flashing
> > >     xgmi
> > >     ras
> > > diff --git a/Documentation/gpu/amdgpu/userq.rst b/Documentation/gpu/amdgpu/userq.rst
> > > new file mode 100644
> > > index 0000000000000..53e6b053f652f
> > > --- /dev/null
> > > +++ b/Documentation/gpu/amdgpu/userq.rst
> > > @@ -0,0 +1,196 @@
> > > +==================
> > > + User Mode Queues
> > > +==================
> > > +
> > > +Introduction
> > > +============
> > > +
> > > +Similar to the KFD, GPU engine queues move into userspace.  The idea is to let
> > > +user processes manage their submissions to the GPU engines directly, bypassing
> > > +IOCTL calls to the driver to submit work.  This reduces overhead and also allows
> > > +the GPU to submit work to itself.  Applications can set up work graphs of jobs
> > > +across multiple GPU engines without needing trips through the CPU.
> > > +
> > > +UMDs directly interface with firmware via per application shared memory areas.
> > > +The main vehicle for this is queue.  A queue is a ring buffer with a read
> > > +pointer (rptr) and a write pointer (wptr).  The UMD writes IP specific packets
> > > +into the queue and the firmware processes those packets, kicking off work on the
> > > +GPU engines.  The CPU in the application (or another queue or device) updates
> > > +the wptr to tell the firmware how far into the ring buffer to process packets
> > > +and the rtpr provides feedback to the UMD on how far the firmware has progressed
> > > +in executing those packets.  When the wptr and the rptr are equal, the queue is
> > > +idle.
> > > +
> > > +Theory of Operation
> > > +===================
> > > +
> > > +The various engines on modern AMD GPUs support multiple queues per engine with a
> > > +scheduling firmware which handles dynamically scheduling user queues on the
> > > +available hardware queue slots.  When the number of user queues outnumbers the
> > > +available hardware queue slots, the scheduling firmware dynamically maps and
> > > +unmaps queues based on priority and time quanta.  The state of each user queue
> > > +is managed in the kernel driver in an MQD (Memory Queue Descriptor).  This is a
> > > +buffer in GPU accessible memory that stores the state of a user queue.  The
> > > +scheduling firmware uses the MQD to load the queue state into an HQD (Hardware
> > > +Queue Descriptor) when a user queue is mapped.  Each user queue requires a
> > > +number of additional buffers which represent the ring buffer and any metadata
> > > +needed by the engine for runtime operation.  On most engines this consists of
> > > +the ring buffer itself, a rptr buffer (where the firmware will shadow the rptr
> > > +to userspace), a wrptr buffer (where the application will write the wptr for the
> > > +firmware to fetch it), and a doorbell.  A doorbell is a piece of the device's
> >
> > In this part, you started to explain about the doorbell; consider adding
> > a new paragraph here.
> 
> Added some additional info here.
> 
> >
> > Another idea could be to create a dedicated page to explain doorbells
> > and move all the general doorbell information from this patch to the new
> > page. I think there is no kernel-doc about amdgpu doorbells.
> >
> > > +MMIO BAR which can be mapped to specific user queues.  Writing to the doorbell
> > > +wakes the firmware and causes it to fetch the wptr and start processing the
> > > +packets in the queue. Each 4K page of the doorbell BAR supports specific offset
> > > +ranges for specific engines.  The doorbell of a queue most be mapped into the
> >
> > /most/must/
> 
> Fixed.
> 
> >
> > > +aperture aligned to the IP used by the queue (e.g., GFX, VCN, SDMA, etc.).
> > > +These doorbell apertures are set up via NBIO registers.  Doorbells are 32 bit or
> > > +64 bit (depending on the engine) chunks of the doorbell BAR.  A 4K doorbell page
> > > +provides 512 64-bit doorbells for up to 512 user queues.  A subset of each page
> > > +is reserved for each IP type supported on the device.  The user can query the
> > > +doorbell ranges for each IP via the INFO IOCTL.
> >
> > The first time that I read this, I was confused about the IOCTL part;
> > however, at the end of this patch, I noticed that you explained the
> > IOCTL part. Perhaps add a mention in parenthesis so the reader can see
> > more details about this info in the "IOCTL Interfaces" section.
> 
> Updated.
> 
> >
> > > +
> > > +When an application wants to create a user queue, it allocates the the necessary
> > > +buffers for the queue (ring buffer, wptr and rptr, context save areas, etc.).
> > > +These can be separate buffers or all part of one larger buffer.  The application
> > > +would map the buffer(s) into its GPUVM and use the GPU virtual addresses of for
> > > +the areas of memory they want t use for the user queue.  They would also
> >
> > /t/to/
> 
> Fixed.
> 
> >
> > > +allocate a doorbell page for the doorbells used by the user queues.  The
> > > +application would then populate the MQD in the USERQ IOCTL structure with the
> > > +GPU virtual addresses and doorbell index they want to use.  The user can also
> > > +specify the attributes for the user queue (priority, whether the queue is secure
> > > +for protected content, etc.).  The application would then call the USERQ
> > > +create IOCTL to create the queue from using the specified MQD.  The
> > > +kernel driver then validates the MQD provided by the application and translates
> > > +the MQD into the engine specific MQD format for the IP.  The IP specific MQD
> > > +would be allocated and the queue would be added to the run list maintained by
> > > +the scheduling firmware.  Once the queue has been created, the application can
> > > +write packets directly into the queue, update the wptr, and write to the
> > > +doorbell offset to kick off work in the user queue.
> > > +
> > > +When the application is done with the user queue, it would call the USERQ
> > > +FREE IOCTL to destroy it.  The kernel driver would preempt the queue and
> > > +remove it from the scheduling firmware's run list.  Then the IP specific MQD
> > > +would be freed and the user queue state would be cleaned up.
> >
> > Is it possible to add some pseudo-code that summarizes the programming
> > model described here?
> 
> I'm not sure I understand what you are asking for here.

Hi Alex,

You can ignore my question.

> 
> >
> > > +
> > > +Some engines may require the aggregated doorbell to if the engine does not
> >
> > /to/too/ or /to//?
> 
> Fixed.
> 
> >
> > Do you know which engines requires the aggreted doorbell? Can this
> > information be retrieved via IOCTL?  I think this information can be
> > helpful for userspace implementation.
> 
> No IPs which currently support user queues require the aggregated
> doorbell.  VCN likely will be the first IP
> that needs it.
> 
> >
> > > +support doorbells from unmapped queues.  The aggregated doorbell is a special
> > > +page of doorbell space which wakes the scheduler.  In cases where the engine may
> > > +be oversubscribed, some queues may not be mapped.  If the doorbell is rung when
> > > +the queue is not mapped, the engine firmware may miss the request.  Some
> > > +scheduling firmware may work around this my polling wptr shadows when the

/my/by/ ?

> > > +hardware is oversubscribed, other engines may support doorbell updates from
> > > +unmapped queues.  In the event that one of these options is not available, the
> > > +kernel driver will map a page of aggregated doorbell space into each GPUVM
> > > +space.  The UMD will then update the doorbell and wptr as normal and then write
> > > +to the aggregated doorbell as well.
> > > +
> > > +Special Packets
> > > +---------------
> > > +
> > > +In order to support legacy implicit synchronization, as well as mixed user and
> > > +kernel queues, we need a synchronization mechanism that is secure.  Because
> > > +kernel queues or memory management tasks depend on kernel fences, we need a way
> > > +for user queues to update memory that the kernel can use for a fence, that can't
> > > +be messed with by a bad actor.  To support this, we've added protected fence
> > > +packet.  This packet works by writing the a monotonically increasing value to
> > > +a memory location that is only the privileged clients have write access to.
> > > +User queues only have read access.  When this packet is executed, the memory
> > > +location is updated and other queues (kernel or user) can see the results.
> >
> > Does the driver handle this packet? I mean, does the driver insert it
> > without the userspace request? What is the packet name? How can I find
> > it in the kernel?
> 
> The actual packet format varies from IP to IP (GFX/Compute, SDMA, VCN,
> etc.), but the behavior is the same.  The packet submission is handled
> in userspace.  The kernel driver just sets up the privileged memory
> used for each user queue when it sets them up when the application
> creates them.

Could you include this additional information in the new version?

Thanks

> 
> >
> > > +
> > > +Memory Management
> > > +=================
> > > +
> > > +It is assumed that all buffers mapped into the GPUVM space for the process are
> > > +valid when engines on the GPU are running.  The kernel driver will only allow
> > > +user queues to run when all buffers are mapped.  If there is a memory event that
> > > +requires buffer migration, the kernel driver will preempt the user queues,
> > > +migrate buffers to where they need to be, update the GPUVM page tables and
> > > +invaldidate the TLB, and then resume the user queues.
> > > +
> > > +Interaction with Kernel Queues
> > > +==============================
> > > +
> > > +Depending on the IP and the scheduling firmware, you can enable kernel queues
> > > +and user queues at the same time,  However, you are limited by the HQD slots.
> >
> > /However/however/
> 
> Fixed.
> 
> >
> > > +Kernel queues are always mapped so any work the goes into kernel queues will
> > > +take priority.  This limits the available HQD slots for user queues.
> > > +
> > > +Not all IPs will support user queues on all GPUs.  As such, UMDs will need to
> > > +support both user queues and kernel queues depending on the IP.  For example, a
> > > +GPU may support user queues for GFX, compute, and SDMA, but not for VCN, JPEG,
> > > +and VPE.  UMDs need to support both.  The kernel driver provides a way to
> > > +determine if user queues and kernel queues are supported on a per IP basis.
> > > +UMDs can query this information via the INFO IOCTL and determine whether to use
> > > +kernel queues or user queues for each IP.
> > > +
> > > +Queue Resets
> > > +============
> > > +
> > > +For most engines, queues can be reset individually.  GFX, compute, and SDMA
> > > +queues can be reset individually.  When a hung queue is detected, it can be
> > > +reset either via the scheduling firmware or MMIO.  Since there are no kernel
> > > +fences for most user queues, they will usually only be detected when some other
> > > +event happens; e.g., a memory event which requires migration of buffers.  When
> > > +the queues are preempted, if the queue is hung, the preemption will fail.
> > > +Driver will them look up the queues that failed to preempt and reset them and
> > > +record which queues are hung.
> > > +
> > > +
> > > +On the UMD side, we will add an USERQ QUERY_STATUS IOCTL to query the queue
> > > +status.  UMD will provide the queue id in the IOCTL and the kernel driver
> > > +will check if it has already recorded the queue as hung (e.g., due to failed
> > > +peemption) and report back the status.
> > > +
> > > +IOCTL Interfaces
> > > +================
> > > +
> > > +GPU virtual addresses used for queues and related data (rptrs, wptrs, context
> > > +save areas, etc.) should be validated by the kernel mode driver to prevent the
> > > +user from specifying invalid GPU virtual addresses.  If the user provides
> > > +invalid GPU virtual addresses or doorbell indicies, the IOCTL should return an
> > > +error message.  These buffers should also be tracked in the kernel driver so
> > > +that if the user attempts to unmap the buffer(s) from the GPUVM, the umap call
> > > +would return an error.
> > > +
> > > +INFO
> > > +----
> > > +There are several new INFO queries related to user queues in order to query the
> > > +size of user queue meta data needed for a user queue (e.g., context save areas
> > > +or shadow buffers), and whether kernel or user queues or both are supported
> > > +for each IP type.
> > > +
> > > +USERQ
> > > +-----
> > > +The USERQ IOCTL is used for creating, freeing, and querying the status of user
> > > +queues.  It supports 3 opcodes:
> > > +
> > > +1. CREATE - Create a user queue.  The application provides a MQD-like structure
> > > +   that devices the type of queue and associated metadata and flags for that
> > > +   queue type.  Returns the queue id.
> > > +2. FREE - Free a user queue.
> > > +3. QUERY_STATRUS - Query that status of a queue.  Used to check if the queue is
> >
> > /QUERY_STATRUS/QUERY_STATUS/?
> 
> Fixed.
> 
> Thanks,
> 
> Alex
> 
> >
> > Thanks
> > Siqueira
> >
> > > +   healthy or not.  E.g., if the queue has been reset. (WIP)
> > > +
> > > +USERQ_SIGNAL
> > > +------------
> > > +The USERQ_SIGNAL IOCTL is used to provide a list of sync objects to be signaled.
> > > +
> > > +USERQ_WAIT
> > > +----------
> > > +The USERQ_WAIT IOCTL is used to provide a list of sync object to be waited on.
> > > +
> > > +Kernel and User Queues
> > > +======================
> > > +
> > > +In order to properly validate and test performance, we have a driver option to
> > > +select what type of queues are enabled (kernel queues, user queues or both).
> > > +The user_queue driver parameter allows you to enable kernel queues only (0),
> > > +user queues and kernel queues (1), and user queues only (2).  Enabling user
> > > +queues only will free up static queue assignments that would otherwise be used
> > > +by kernel queues for use by the scheduling firmware.  Some kernel queues are
> > > +required for kernel driver operation and they will always be created.  When the
> > > +kernel queues are not enabled, they are not registered with the drm scheduler
> > > +and the CS IOCTL will reject any incoming command submissions which target those
> > > +queue types.  Kernel queues only mirrors the behavior on all existing GPUs.
> > > +Enabling both queues allows for backwards compatibility with old userspace while
> > > +still supporting user queues.
> > > --
> > > 2.49.0
> > >
> >
> > --
> > Rodrigo Siqueira

-- 
Rodrigo Siqueira