[RFC] Mechanism for high priority scheduling in amdgpu

Pierre-Loup A. Griffais pgriffais at valvesoftware.com
Sat Dec 17 22:05:58 UTC 2016


Hi Serguei,

I'm also working on bringing up our VR runtime on top of amdgpu; see 
replies inline.

On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
> Andres,
>
>>  For current VR workloads we have 3 separate processes running actually:
> So we could have a potential memory overcommit case, or do you do partitioning
> on your own?  I would think that there is a need to avoid overcommit in the VR case to
> prevent any BO migration.

You're entirely correct; currently the VR runtime is setting up 
prioritized CPU scheduling for its VR compositor, we're working on 
prioritized GPU scheduling and pre-emption (e.g. this thread), and in the 
future it will make sense to do work in order to make sure that its 
memory allocations do not get evicted, to prevent any unwelcome 
additional latency in the event of needing to perform just-in-time 
reprojection.

> BTW: Do you mean __real__ processes or threads?
> Based on my understanding, sharing BOs between different processes
> could introduce additional synchronization constraints.  BTW: I am not sure
> if we are able to share Vulkan sync objects across the process boundary.

They are different processes; it is important for the compositor that is 
responsible for quality-of-service features such as consistently 
presenting distorted frames with the right latency, reprojection, etc, 
to be separate from the main application.

Currently we are using unreleased cross-process memory and semaphore 
extensions to fetch updated eye images from the client application, but 
the just-in-time reprojection discussed here does not actually have any 
direct interactions with cross-process resource sharing, since it's 
achieved by using whatever is the latest, most up-to-date eye images 
that have already been sent by the client application, which are already 
available to use without additional synchronization.

>
>>    3) System compositor (we are looking at approaches to remove this overhead)
> Yes,  IMHO the best is to run in  "full screen mode".

Yes, we are working on mechanisms to present directly to the headset 
display without any intermediaries as a separate effort.

>
>>  The latency is our main concern,
> I would assume that this is a known problem (at least for compute usage).
> It looks like amdgpu / kernel submission is rather CPU intensive (at least
> in the default configuration).

As long as it's a consistent cost, it shouldn't be an issue. However, if 
there are high degrees of variance then that would be troublesome and we 
would need to account for the worst case.

Hopefully the requirements and approach we described make sense; we're 
looking forward to your feedback and suggestions.

Thanks!
  - Pierre-Loup

>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> From: Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 10:00 PM
> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hey Serguei,
>
>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I understand (by simplifying),
>> some scheduling is per pipe.  I know about the current allocation scheme but I do not think
>> that it is ideal.  I would assume that we need to switch to dynamic partitioning
>> of resources based on the workload, otherwise we will have resource conflicts
>> between Vulkan compute and OpenCL.
>
> I agree the partitioning isn't ideal. I'm hoping we can start with a solution that assumes that
> only pipe0 has any work and the other pipes are idle (no HSA/ROCm running on the system).
>
> This should be more or less the use case we expect from VR users.
>
> I agree the split is currently not ideal, but I'd like to consider that a separate task, because
> making it dynamic is not straightforward :P
>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
>> involved.  I would assume that in the case of VR we will have one main
>> application ("console" mode(?)) so we could temporarily "ignore"
>> OpenCL/ROCm needs when VR is running.
>
> Correct, this is why we want to enable the high priority compute queue through
> libdrm-amdgpu, so that we can expose it through Vulkan later.
>
> For current VR workloads we actually have 3 separate processes running:
>     1) Game process
>     2) VR Compositor (this is the process that will require high priority queue)
>     3) System compositor (we are looking at approaches to remove this overhead)
>
> For now I think it is okay to assume no OpenCL/ROCm running simultaneously, but
> I would also like to be able to address this case in the future (cross-pipe priorities).
>
>> [Serguei]  The problem with pre-emption of graphics task:  (a) it may take time so
>> latency may suffer
>
> The latency is our main concern, we want something that is predictable. A good
> illustration of what the reprojection scheduling looks like can be found here:
> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>
>> (b) to preempt we need to have different "context" - we want
>> to guarantee that submissions from the same context will be executed in order.
>
> This is okay, as the reprojection work doesn't have dependencies on the game context, and it
> even happens in a separate process.
>
>> BTW: (a) Do you want  "preempt" and later resume or do you want "preempt" and
>> "cancel/abort"
>
> Preempt the game with the compositor task and then resume it.
>
>> (b) Vulkan is a generic API and could be used for graphics as well as
>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>
> Yeah, the plan is to use vulkan compute. But if you figure out a way for us to get
> a guaranteed execution time using vulkan graphics, then I'll take you out for a beer :)
>
> Regards,
> Andres
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
> Sent: Friday, December 16, 2016 9:13 PM
> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Andres,
>
> Please see inline (as [Serguei])
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> From: Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 8:29 PM
> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Serguei,
>
> Thanks for the feedback. Answers inline as [AR].
>
> Regards,
> Andres
>
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
> Sent: Friday, December 16, 2016 8:15 PM
> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Andres,
>
>
> Quick comments:
>
> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
> to the high-priority queue when it is in use and "free" them later
> (we do not want to take CUs away from e.g. a graphics task forever and degrade
> graphics performance).
>
> Otherwise we could have a scenario where a long graphics task (or low-priority
> compute) takes all (extra) CUs and high-priority work waits for the needed resources.
> It will not be visible with "NOP" packets but only when you submit a "real" compute task,
> so I would recommend not using "NOP" packets at all for testing.
>
> It (CU assignment) could be done relatively easily when everything goes via the kernel
> (e.g. as part of frame submission) but I must admit that I am not sure
> about the best way for user level submissions (amdkfd).
>
> [AR] I wasn't aware of this part of the programming sequence. Thanks for the heads up!
> Is this similar to the CU masking programming?
> [Serguei] Yes. To simplify: the problem is that the "scheduler", when deciding which
> queue to run, will check if there are enough resources and, if not, it will begin
> to check other queues with lower priority.
>
> 2) I would recommend dedicating the whole pipe to the high-priority queue and having
> nothing there except it.
>
> [AR] I'm guessing in this context you mean pipe = queue? (as opposed to the MEC definition
> of pipe, which is a grouping of queues). I say this because amdgpu only has access to 1 pipe,
> and the rest are statically partitioned for amdkfd usage.
>
> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I understand (by simplifying),
> some scheduling is per pipe.  I know about the current allocation scheme but I do not think
> that it is ideal.  I would assume that we need to switch to dynamic partitioning
> of resources based on the workload, otherwise we will have resource conflicts
> between Vulkan compute and OpenCL.
>
>
> BTW: Which user level API do you want to use for compute: Vulkan or OpenCL?
>
> [AR] Vulkan
>
> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will not be
> involved.  I would assume that in the case of VR we will have one main
> application ("console" mode(?)) so we could temporarily "ignore"
> OpenCL/ROCm needs when VR is running.
>
>>  we will not be able to provide a solution compatible with GFX workloads.
> I assume that you are talking about graphics? Am I right?
>
> [AR] Yeah, my understanding is that pre-empting the currently running graphics job and scheduling in
> something else using mid-buffer pre-emption has some cases where it doesn't work well. But if with
> Polaris10 it starts working well, it might be a better solution for us (because the whole reprojection
> work uses the Vulkan graphics stack at the moment, and porting it to compute is not trivial).
>
> [Serguei]  The problems with pre-emption of a graphics task:  (a) it may take time, so
> latency may suffer; (b) to preempt we need to have a different "context" - we want
> to guarantee that submissions from the same context will be executed in order.
> BTW: (a) Do you want to "preempt" and later resume, or do you want to "preempt" and
> "cancel/abort"?  (b) Vulkan is a generic API and could be used
> for graphics as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
>
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 6:15 PM
> To: amd-gfx at lists.freedesktop.org
> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Everyone,
>
> This RFC is also available as a gist here:
> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>
>
> We are interested in feedback for a mechanism to effectively schedule high
> priority VR reprojection tasks (also referred to as time-warping) for Polaris10
> running on the amdgpu kernel driver.
>
> Brief context:
> --------------
>
> The main objective of reprojection is to avoid motion sickness for VR users in
> scenarios where the game or application would fail to finish rendering a new
> frame in time for the next VBLANK. When this happens, the user's head movements
> are not reflected on the Head Mounted Display (HMD) for the duration of an
> extra frame. This extended mismatch between the inner ear and the eyes may
> cause the user to experience motion sickness.
>
> The VR compositor deals with this problem by fabricating a new frame using the
> user's updated head position in combination with the previous frames. This
> avoids a prolonged mismatch between the HMD output and the inner ear.
>
> Because of the adverse effects on the user, we require high confidence that the
> reprojection task will complete before the VBLANK interval, even if the GFX pipe
> is currently full of work from the game/application (which is most likely the case).
>
> For more details and illustrations, please refer to the following document:
> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>
>
> Requirements:
> -------------
>
> The mechanism must expose the following functionality:
>
>     * Job round trip time must be predictable, from submission to fence signal
>
>     * The mechanism must support compute workloads.
>
> Goals:
> ------
>
>     * The mechanism should provide low submission latencies
>
> Test: submitting a NOP packet through the mechanism on busy hardware should
> be equivalent to submitting a NOP on idle hardware.
>
> Nice to have:
> -------------
>
>     * The mechanism should also support GFX workloads.
>
> My understanding is that with the current hardware capabilities in Polaris10 we
> will not be able to provide a solution compatible with GFX workloads.
>
> But I would love to hear otherwise. So if anyone has an idea, approach or
> suggestion that will also be compatible with the GFX ring, please let us know
> about it.
>
>     * The above guarantees should also be respected by amdkfd workloads
>
> Would be good to have for consistency, but not strictly necessary as users running
> games are not traditionally running HPC workloads in the background.
>
> Proposed approach:
> ------------------
>
> Similar to the windows driver, we could expose a high priority compute queue to
> userspace.
>
> Submissions to this compute queue will be scheduled with high priority, and may
> acquire hardware resources previously in use by other queues.
>
> This can be achieved by taking advantage of the 'priority' field in the HQDs
> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
> register fields are:
>         * mmCP_HQD_PIPE_PRIORITY
>         * mmCP_HQD_QUEUE_PRIORITY
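As a rough sketch of what programming these two fields could look like. The register offsets, the fake MMIO accessors, and the priority values below are placeholders for illustration, not the real Polaris10 encodings; real code would also have to select the target queue via SRBM first.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder offsets, NOT the real register map. */
#define mmCP_HQD_PIPE_PRIORITY  0x1
#define mmCP_HQD_QUEUE_PRIORITY 0x2

/* Fake register file standing in for MMIO writes. */
static uint32_t regs[16];

static void wreg32(uint32_t reg, uint32_t val)
{
        regs[reg] = val;
}

static uint32_t rreg32(uint32_t reg)
{
        return regs[reg];
}

/*
 * Program both priority fields for the HQD previously selected via SRBM
 * (selection not shown). Higher values mean higher priority here.
 */
static void hqd_set_priority(uint32_t pipe_priority, uint32_t queue_priority)
{
        wreg32(mmCP_HQD_PIPE_PRIORITY, pipe_priority);
        wreg32(mmCP_HQD_QUEUE_PRIORITY, queue_priority);
}
```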
>
> Implementation approach 1 - static partitioning:
> ------------------------------------------------
>
> The amdgpu driver currently controls 8 compute queues from pipe0. We can
> statically partition these as follows:
>         * 7x regular
>         * 1x high priority
>
> The relevant priorities can be set so that submissions to the high priority
> ring will starve the other compute rings and the GFX ring.
>
> The amdgpu scheduler will only place jobs into the high priority rings if the
> context is marked as high priority, and a corresponding priority should be
> added to keep track of this information:
>      * AMD_SCHED_PRIORITY_KERNEL
>      * -> AMD_SCHED_PRIORITY_HIGH
>      * AMD_SCHED_PRIORITY_NORMAL
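A minimal sketch of how the static split could drive ring selection, assuming the 7x regular / 1x high priority partition above. The enum values and ring indices are illustrative, not the driver's actual layout.

```c
#include <assert.h>

/* Higher value = higher priority, mirroring the proposed ordering. */
enum amd_sched_priority {
        AMD_SCHED_PRIORITY_NORMAL = 0,
        AMD_SCHED_PRIORITY_HIGH,
        AMD_SCHED_PRIORITY_KERNEL,
};

#define NUM_COMPUTE_RINGS 8
#define HIGH_PRIO_RING    (NUM_COMPUTE_RINGS - 1)    /* the 1x high priority ring */

/*
 * Pick a compute ring for a job: normal contexts spread over rings 0..6
 * (the 7x regular rings), high priority and kernel contexts always land
 * on the dedicated high priority ring.
 */
static int pick_compute_ring(enum amd_sched_priority prio, int rr_counter)
{
        if (prio >= AMD_SCHED_PRIORITY_HIGH)
                return HIGH_PRIO_RING;
        return rr_counter % (NUM_COMPUTE_RINGS - 1);
}
```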
>
> The user will request a high priority context by setting an appropriate flag
> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>
> The setting is at a per-context level so that we can:
>     * Maintain a consistent FIFO ordering of all submissions to a context
>     * Create high priority and non-high priority contexts in the same process
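A sketch of the kernel-side validation for such a flag. AMDGPU_CTX_HIGH_PRIORITY is the proposed (not yet existing) flag, and the struct is trimmed to the two fields relevant here; only the general shape mirrors the existing drm_amdgpu_ctx_in uapi.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define AMDGPU_CTX_OP_ALLOC_CTX  1
#define AMDGPU_CTX_HIGH_PRIORITY (1u << 0)   /* proposed flag, not upstream */

/* Trimmed-down stand-in for the real uapi struct. */
struct drm_amdgpu_ctx_in {
        uint32_t op;
        uint32_t flags;
};

/*
 * Kernel-side check at context allocation: reject unknown flags so the
 * bit can be safely extended later, and record the requested priority.
 */
static int amdgpu_ctx_alloc_check(const struct drm_amdgpu_ctx_in *in,
                                  int *out_high_prio)
{
        if (in->flags & ~AMDGPU_CTX_HIGH_PRIORITY)
                return -EINVAL;
        *out_high_prio = !!(in->flags & AMDGPU_CTX_HIGH_PRIORITY);
        return 0;
}
```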
>
> Implementation approach 2 - dynamic priority programming:
> ---------------------------------------------------------
>
> Similar to the above, but instead of programming the priorities at
> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
> dynamically when scheduling a task.
>
> This would involve having a hardware specific callback from the scheduler to
> set the appropriate queue priority: set_priority(int ring, int index, int priority)
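A sketch of how that callback could be wired up. The ops-table name and the recording stub are illustrative; a real backend implementation would take the SRBM mutex around the register writes.

```c
#include <assert.h>

/* Per-IP backend hooks; only the proposed priority hook is shown. */
struct amd_sched_backend_ops {
        void (*set_priority)(int ring, int index, int priority);
};

/* Stub backend for illustration: just records the last request. */
static int last_ring, last_index, last_priority;

static void stub_set_priority(int ring, int index, int priority)
{
        /* Real code: lock SRBM mutex, select the queue, write the
         * mmCP_HQD_*_PRIORITY registers, unlock. */
        last_ring = ring;
        last_index = index;
        last_priority = priority;
}

/* Scheduler-side call site: skip backends that lack the hook. */
static int sched_apply_priority(const struct amd_sched_backend_ops *ops,
                                int ring, int index, int priority)
{
        if (!ops || !ops->set_priority)
                return -1;   /* backend has no dynamic priority support */
        ops->set_priority(ring, index, priority);
        return 0;
}
```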
>
> During this callback we would have to grab the SRBM mutex to perform the appropriate
> HW programming, and I'm not really sure if that is something we should be doing from
> the scheduler.
>
> On the positive side, this approach would allow us to program a range of
> priorities for jobs instead of a single "high priority" value, achieving
> something similar to the niceness API available for CPU scheduling.
>
> I'm not sure if this flexibility is something that we would need for our use
> case, but it might be useful in other scenarios (multiple users sharing compute
> time on a server).
>
> This approach would require a new int field in drm_amdgpu_ctx_in, or repurposing
> of the flags field.
>
> Known current obstacles:
> ------------------------
>
> The SQ is currently programmed to disregard the HQD priorities, and instead it picks
> jobs at random. Settings from the shader itself are also disregarded as this is
> considered a privileged field.
>
> Effectively we can get our compute wavefront launched ASAP, but we might not get the
> time we need on the SQ.
>
> The current programming would have to be changed to allow priority propagation
> from the HQD into the SQ.
>
> Generic approach for all HW IPs:
> --------------------------------
>
> For consistency purposes, the high priority context can be enabled for all HW IPs
> with support for the SW scheduler. This will function similarly to the current
> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of anything not
> committed to the HW queue.
>
> The benefits of requesting a high priority context for a non-compute queue will
> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in front of
> you), but having the API in place will allow us to easily improve the implementation
> in the future as new features become available in new hardware.
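The "jump ahead of anything not committed" behavior can be sketched as a scheduler scanning per-priority run queues from highest to lowest; names mirror the amd_sched_* convention but the layout is illustrative.

```c
#include <assert.h>

/* Higher index = higher priority. */
enum amd_sched_priority {
        AMD_SCHED_PRIORITY_NORMAL = 0,
        AMD_SCHED_PRIORITY_HIGH,
        AMD_SCHED_PRIORITY_KERNEL,
        AMD_SCHED_PRIORITY_MAX,
};

struct sched_rq {
        int num_jobs;   /* jobs waiting, i.e. not yet committed to HW */
};

/*
 * Scan run queues from highest priority down; the first non-empty one
 * wins, so a newly queued high priority job is picked before any
 * pending normal work, without touching jobs already on the HW queue.
 */
static int select_rq(const struct sched_rq rqs[AMD_SCHED_PRIORITY_MAX])
{
        for (int p = AMD_SCHED_PRIORITY_MAX - 1; p >= 0; p--)
                if (rqs[p].num_jobs > 0)
                        return p;
        return -1;   /* nothing to run */
}
```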
>
> Future steps:
> -------------
>
> Once we have an approach settled, I can take care of the implementation.
>
> Also, once the interface is mostly decided, we can start thinking about
> exposing the high priority queue through radv.
>
> Request for feedback:
> ---------------------
>
> We aren't married to any of the approaches outlined above. Our goal is to
> obtain a mechanism that will allow us to complete the reprojection job within a
> predictable amount of time. So if anyone has any suggestions for
> improvements or alternative strategies, we are more than happy to hear them.
>
> If any of the technical information above is also incorrect, feel free to point
> out my misunderstandings.
>
> Looking forward to hearing from you.
>
> Regards,
> Andres
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
>



More information about the amd-gfx mailing list