[RFC] Mechanism for high priority scheduling in amdgpu

Serguei Sagalovitch serguei.sagalovitch at amd.com
Mon Dec 19 14:37:52 UTC 2016


Hi Pierre-Loup,

 > As long as it's a consistent cost, it shouldn't be an issue.
But I assume that you still have some threshold below which
you cannot go?

 > Hopefully the requirements
Do you have any numbers in mind?

BTW: My understanding is that resolution should be 2160x1200
and refresh rate 90Hz. Am I right?


 > the user, we require high confidence that the reprojection task
 > will complete before the VBLANK interval.
So could we assume (to simplify) that the requirement is
the following: the opportunity to submit a task and have it
executed in less than one VBLANK interval, measured from the
moment Vulkan calls the kernel driver to the moment the GPU
h/w finishes processing?

Sincerely yours,
Serguei Sagalovitch

On 2016-12-17 05:05 PM, Pierre-Loup A. Griffais wrote:
> Hi Serguei,
>
> I'm also working on bringing up our VR runtime on top of amdgpu; 
> see replies inline.
>
> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>> Andres,
>>
>>>  For current VR workloads we have 3 separate processes running 
>>> actually:
>> So could we have a potential memory overcommit case, or do you do 
>> partitioning on your own?  I would think there is a need to avoid 
>> overcommit in the VR case to prevent any BO migration.
>
> You're entirely correct; currently the VR runtime is setting up 
> prioritized CPU scheduling for its VR compositor, we're working on 
> prioritized GPU scheduling and pre-emption (eg. this thread), and in 
> the future it will make sense to do work in order to make sure that 
> its memory allocations do not get evicted, to prevent any unwelcome 
> additional latency in the event of needing to perform just-in-time 
> reprojection.
>
>> BTW: Do you mean __real__ processes or threads?
>> Based on my understanding, sharing BOs between different processes
>> could introduce additional synchronization constraints.  BTW: I am 
>> not sure if we are able to share a Vulkan sync object across a 
>> process boundary.
>
> They are different processes; it is important for the compositor that 
> is responsible for quality-of-service features such as consistently 
> presenting distorted frames with the right latency, reprojection, etc, 
> to be separate from the main application.
>
> Currently we are using unreleased cross-process memory and semaphore 
> extensions to fetch updated eye images from the client application, 
> but the just-in-time reprojection discussed here does not actually 
> have any direct interactions with cross-process resource sharing, 
> since it's achieved by using whatever is the latest, most up-to-date 
> eye images that have already been sent by the client application, 
> which are already available to use without additional synchronization.
>
>>
>>>    3) System compositor (we are looking at approaches to remove this 
>>> overhead)
>> Yes,  IMHO the best is to run in  "full screen mode".
>
> Yes, we are working on mechanisms to present directly to the headset 
> display without any intermediaries as a separate effort.
>
>>
>>>  The latency is our main concern,
>> I would assume that this is the known problem (at least for compute 
>> usage).
>> It looks like amdgpu / kernel submission is rather CPU intensive 
>> (at least in the default configuration).
>
> As long as it's a consistent cost, it shouldn't be an issue. However, 
> if there are high degrees of variance, then that would be troublesome 
> and we would need to account for the worst case.
>
> Hopefully the requirements and approach we described make sense, we're 
> looking forward to your feedback and suggestions.
>
> Thanks!
>  - Pierre-Loup
>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 10:00 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hey Serguei,
>>
>>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I 
>>> understand (by simplifying) some scheduling is per pipe.  I know 
>>> about the current allocation scheme but I do not think that it is 
>>> ideal.  I would assume that we need to switch to dynamic 
>>> partitioning of resources based on the workload, otherwise we will 
>>> have resource conflicts between Vulkan compute and OpenCL.
>>
>> I agree the partitioning isn't ideal. I'm hoping we can start with a 
>> solution that assumes that
>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm 
>> running on the system).
>>
>> This should be more or less the use case we expect from VR users.
>>
>> I agree the split is currently not ideal, but I'd like to consider 
>> that a separate task, because making it dynamic is not 
>> straightforward :P
>>
>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd 
>>> will not be involved.  I would assume that in the case of VR we 
>>> will have one main application ("console" mode(?)) so we could 
>>> temporarily "ignore" OpenCL/ROCm needs while VR is running.
>>
>> Correct, this is why we want to enable the high priority compute 
>> queue through
>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>
>> For current VR workloads we have 3 separate processes running actually:
>>     1) Game process
>>     2) VR Compositor (this is the process that will require high 
>> priority queue)
>>     3) System compositor (we are looking at approaches to remove this 
>> overhead)
>>
>> For now I think it is okay to assume no OpenCL/ROCm running 
>> simultaneously, but
>> I would also like to be able to address this case in the future 
>> (cross-pipe priorities).
>>
>>> [Serguei]  The problem with pre-emption of a graphics task:  (a) it 
>>> may take time, so latency may suffer
>>
>> The latency is our main concern, we want something that is 
>> predictable. A good
>> illustration of what the reprojection scheduling looks like can be 
>> found here:
>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>
>>
>>> (b) to preempt we need to have a different "context" - we want
>>> to guarantee that submissions from the same context will be executed 
>>> in order.
>>
>> This is okay, as the reprojection work doesn't have dependencies on 
>> the game context, and it
>> even happens in a separate process.
>>
>>> BTW: (a) Do you want to "preempt" and later resume, or do you want 
>>> to "preempt" and "cancel/abort"?
>>
>> Preempt the game with the compositor task and then resume it.
>>
>>> (b) Vulkan is a generic API and could be used for graphics as well 
>>> as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>
>> Yeah, the plan is to use Vulkan compute. But if you figure out a way 
>> for us to get a guaranteed execution time using Vulkan graphics, 
>> then I'll take you out for a beer :)
>>
>> Regards,
>> Andres
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 9:13 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Andres,
>>
>> Please see inline (as [Serguei])
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>> From: Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 8:29 PM
>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Serguei,
>>
>> Thanks for the feedback. Answers inline as [AR].
>>
>> Regards,
>> Andres
>>
>> ________________________________________
>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>> Sent: Friday, December 16, 2016 8:15 PM
>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Andres,
>>
>>
>> Quick comments:
>>
>> 1) To minimize "bubbles", etc. we need to "force" CU assignments/binding
>> to the high-priority queue when it is in use and "free" them later
>> (we do not want to take CUs away from e.g. a graphics task forever 
>> and degrade graphics performance).
>>
>> Otherwise we could have a scenario where a long graphics task (or 
>> low-priority compute) takes all (extra) CUs and the high-priority 
>> queue waits for the needed resources. It will not be visible with 
>> "NOP" packets but only when you submit a "real" compute task, so I 
>> would recommend not using "NOP" packets at all for testing.
>>
>> It (CU assignment) could be done relatively easily when everything 
>> goes via the kernel (e.g. as part of frame submission), but I must 
>> admit that I am not sure about the best way for user level 
>> submissions (amdkfd).
>>
>> [AR] I wasn't aware of this part of the programming sequence. Thanks 
>> for the heads up!
>> Is this similar to the CU masking programming?
>> [Serguei] Yes. To simplify: the problem is that the "scheduler", 
>> when deciding which queue to run, will check if there are enough 
>> resources, and if not it will begin to check other queues with lower 
>> priority.
>>
>> 2) I would recommend dedicating the whole pipe to the high-priority 
>> queue and having nothing there except it.
>>
>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed 
>> to the MEC definition
>> of pipe, which is a grouping of queues). I say this because amdgpu 
>> only has access to 1 pipe,
>> and the rest are statically partitioned for amdkfd usage.
>>
>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I 
>> understand (by simplifying) some scheduling is per pipe.  I know 
>> about the current allocation scheme but I do not think that it is 
>> ideal.  I would assume that we need to switch to dynamic 
>> partitioning of resources based on the workload, otherwise we will 
>> have resource conflicts between Vulkan compute and OpenCL.
>>
>>
>> BTW: Which user level API do you want to use for compute: Vulkan or 
>> OpenCL?
>>
>> [AR] Vulkan
>>
>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd 
>> will not be involved.  I would assume that in the case of VR we will 
>> have one main application ("console" mode(?)) so we could 
>> temporarily "ignore" OpenCL/ROCm needs while VR is running.
>>
>>>  we will not be able to provide a solution compatible with GFX 
>>> workloads.
>> I assume that you are talking about graphics? Am I right?
>>
>> [AR] Yeah, my understanding is that pre-empting the currently running 
>> graphics job and scheduling in
>> something else using mid-buffer pre-emption has some cases where it 
>> doesn't work well. But if with
>> polaris10 it starts working well, it might be a better solution for 
>> us (because the whole reprojection
>> work uses the vulkan graphics stack at the moment, and porting it to 
>> compute is not trivial).
>>
>> [Serguei]  The problem with pre-emption of a graphics task:  (a) it 
>> may take time, so latency may suffer; (b) to preempt we need to have 
>> a different "context" - we want to guarantee that submissions from 
>> the same context will be executed in order.
>> BTW: (a) Do you want to "preempt" and later resume, or do you want 
>> to "preempt" and "cancel/abort"?  (b) Vulkan is a generic API and 
>> could be used for graphics as well as for plain compute tasks 
>> (VK_QUEUE_COMPUTE_BIT).
>>
>>
>> Sincerely yours,
>> Serguei Sagalovitch
>>
>>
>>
>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of 
>> Andres Rodriguez <andresr at valvesoftware.com>
>> Sent: December 16, 2016 6:15 PM
>> To: amd-gfx at lists.freedesktop.org
>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>
>> Hi Everyone,
>>
>> This RFC is also available as a gist here:
>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>
>> We are interested in feedback for a mechanism to effectively schedule 
>> high
>> priority VR reprojection tasks (also referred to as time-warping) for 
>> Polaris10
>> running on the amdgpu kernel driver.
>>
>> Brief context:
>> --------------
>>
>> The main objective of reprojection is to avoid motion sickness for VR 
>> users in
>> scenarios where the game or application would fail to finish 
>> rendering a new
>> frame in time for the next VBLANK. When this happens, the user's head 
>> movements
>> are not reflected on the Head Mounted Display (HMD) for the duration 
>> of an
>> extra frame. This extended mismatch between the inner ear and the 
>> eyes may
>> cause the user to experience motion sickness.
>>
>> The VR compositor deals with this problem by fabricating a new frame 
>> using the
>> user's updated head position in combination with the previous frames. 
>> This
>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>
>> Because of the adverse effects on the user, we require high 
>> confidence that the
>> reprojection task will complete before the VBLANK interval. Even if 
>> the GFX pipe
>> is currently full of work from the game/application (which is most 
>> likely the case).
>>
>> For more details and illustrations, please refer to the following 
>> document:
>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>
>>
>> Requirements:
>> -------------
>>
>> The mechanism must expose the following functionality:
>>
>>     * Job round trip time must be predictable, from submission to 
>> fence signal
>>
>>     * The mechanism must support compute workloads.
>>
>> Goals:
>> ------
>>
>>     * The mechanism should provide low submission latencies
>>
>> Test: submitting a NOP packet through the mechanism on busy hardware 
>> should
>> be equivalent to submitting a NOP on idle hardware.
>>
>> Nice to have:
>> -------------
>>
>>     * The mechanism should also support GFX workloads.
>>
>> My understanding is that with the current hardware capabilities in 
>> Polaris10 we will not be able to provide a solution compatible with 
>> GFX workloads.
>>
>> But I would love to hear otherwise. So if anyone has an idea, 
>> approach or
>> suggestion that will also be compatible with the GFX ring, please let 
>> us know
>> about it.
>>
>>     * The above guarantees should also be respected by amdkfd workloads
>>
>> Would be good to have for consistency, but not strictly necessary as 
>> users running
>> games are not traditionally running HPC workloads in the background.
>>
>> Proposed approach:
>> ------------------
>>
>> Similar to the windows driver, we could expose a high priority 
>> compute queue to
>> userspace.
>>
>> Submissions to this compute queue will be scheduled with high 
>> priority, and may
>> acquire hardware resources previously in use by other queues.
>>
>> This can be achieved by taking advantage of the 'priority' field in 
>> the HQDs
>> and could be programmed by amdgpu or the amdgpu scheduler. The relevant
>> register fields are:
>>         * mmCP_HQD_PIPE_PRIORITY
>>         * mmCP_HQD_QUEUE_PRIORITY
>>
>> Implementation approach 1 - static partitioning:
>> ------------------------------------------------
>>
>> The amdgpu driver currently controls 8 compute queues from pipe0. We can
>> statically partition these as follows:
>>         * 7x regular
>>         * 1x high priority
>>
>> The relevant priorities can be set so that submissions to the high 
>> priority
>> ring will starve the other compute rings and the GFX ring.
>>
>> The amdgpu scheduler will only place jobs into the high priority 
>> rings if the
>> context is marked as high priority. And a corresponding priority 
>> should be
>> added to keep track of this information:
>>      * AMD_SCHED_PRIORITY_KERNEL
>>      * -> AMD_SCHED_PRIORITY_HIGH
>>      * AMD_SCHED_PRIORITY_NORMAL
>>
>> The user will request a high priority context by setting an 
>> appropriate flag
>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>
>>
>> The setting is at a per-context level so that we can:
>>     * Maintain a consistent FIFO ordering of all submissions to a 
>> context
>>     * Create high priority and non-high priority contexts in the same 
>> process
>>
>> Implementation approach 2 - dynamic priority programming:
>> ---------------------------------------------------------
>>
>> Similar to the above, but instead of programming the priorities at
>> amdgpu_init() time, the SW scheduler will reprogram the queue priorities
>> dynamically when scheduling a task.
>>
>> This would involve having a hardware specific callback from the 
>> scheduler to
>> set the appropriate queue priority: set_priority(int ring, int index, 
>> int priority)
>>
>> During this callback we would have to grab the SRBM mutex to perform 
>> the appropriate
>> HW programming, and I'm not really sure if that is something we 
>> should be doing from
>> the scheduler.
>>
>> On the positive side, this approach would allow us to program a range of
>> priorities for jobs instead of a single "high priority" value, 
>> achieving something similar to the niceness API available for CPU 
>> scheduling.
>>
>> I'm not sure if this flexibility is something that we would need for 
>> our use
>> case, but it might be useful in other scenarios (multiple users 
>> sharing compute
>> time on a server).
>>
>> This approach would require a new int field in drm_amdgpu_ctx_in, or 
>> repurposing
>> of the flags field.
>>
>> Known current obstacles:
>> ------------------------
>>
>> The SQ is currently programmed to disregard the HQD priorities, and 
>> instead it picks
>> jobs at random. Settings from the shader itself are also disregarded 
>> as this is
>> considered a privileged field.
>>
>> Effectively we can get our compute wavefront launched ASAP, but we 
>> might not get the
>> time we need on the SQ.
>>
>> The current programming would have to be changed to allow priority 
>> propagation
>> from the HQD into the SQ.
>>
>> Generic approach for all HW IPs:
>> --------------------------------
>>
>> For consistency purposes, the high priority context can be enabled 
>> for all HW IPs
>> with support of the SW scheduler. This will function similarly to the 
>> current
>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of 
>> anything not committed to the HW queue.
>>
>> The benefits of requesting a high priority context for a non-compute 
>> queue will
>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in 
>> front of
>> you), but having the API in place will allow us to easily improve the 
>> implementation
>> in the future as new features become available in new hardware.
>>
>> Future steps:
>> -------------
>>
>> Once we have an approach settled, I can take care of the implementation.
>>
>> Also, once the interface is mostly decided, we can start thinking about
>> exposing the high priority queue through radv.
>>
>> Request for feedback:
>> ---------------------
>>
>> We aren't married to any of the approaches outlined above. Our goal 
>> is to
>> obtain a mechanism that will allow us to complete the reprojection 
>> job within a
>> predictable amount of time. So if anyone has any suggestions for
>> improvements or alternative strategies we are more than happy to hear 
>> them.
>>
>> If any of the technical information above is also incorrect, feel 
>> free to point
>> out my misunderstandings.
>>
>> Looking forward to hearing from you.
>>
>> Regards,
>> Andres
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>
>>
>>
>

Sincerely yours,
Serguei Sagalovitch


