[RFC] Mechanism for high priority scheduling in amdgpu

Mon Dec 19 05:11:49 UTC 2016


On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
> We're currently working with the open stack; I assume that a mechanism 
> could be exposed by both open and Pro Vulkan userspace drivers and 
> that the amdgpu kernel interface improvements we would pursue 
> following this discussion would let both drivers take advantage of the 
> feature, correct?
Of course.
Does the open stack have Vulkan support?

Regards,
David Zhou
>
> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>> By the way, are you using the all-open driver or the amdgpu-pro driver?
>>
>> +David Mao, who is working on our Vulkan driver.
>>
>> Regards,
>> David Zhou
>>
>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>> Hi Serguei,
>>>
>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>> see replies inline.
>>>
>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>> Andres,
>>>>
>>>>>  For current VR workloads we have 3 separate processes running
>>>>> actually:
>>>> So we could have a potential memory overcommit case, or do you do the
>>>> partitioning on your own?  I would think that there is a need to avoid
>>>> overcommit in the VR case to prevent any BO migration.
>>>
>>> You're entirely correct; currently the VR runtime is setting up
>>> prioritized CPU scheduling for its VR compositor, we're working on
>>> prioritized GPU scheduling and pre-emption (e.g. this thread), and in
>>> the future it will make sense to do work in order to make sure that
>>> its memory allocations do not get evicted, to prevent any unwelcome
>>> additional latency in the event of needing to perform just-in-time
>>> reprojection.
>>>
>>>> BTW: Do you mean __real__ processes or threads?
>>>> Based on my understanding, sharing BOs between different processes
>>>> could introduce additional synchronization constraints.  BTW: I am not
>>>> sure if we are able to share Vulkan sync objects across process
>>>> boundaries.
>>>
>>> They are different processes; it is important for the compositor that
>>> is responsible for quality-of-service features such as consistently
>>> presenting distorted frames with the right latency, reprojection, etc,
>>> to be separate from the main application.
>>>
>>> Currently we are using unreleased cross-process memory and semaphore
>>> extensions to fetch updated eye images from the client application,
>>> but the just-in-time reprojection discussed here does not actually
>>> have any direct interactions with cross-process resource sharing,
>>> since it's achieved by using whatever is the latest, most up-to-date
>>> eye images that have already been sent by the client application,
>>> which are already available to use without additional synchronization.
>>>
>>>>
>>>>>    3) System compositor (we are looking at approaches to remove this
>>>>> overhead)
>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>
>>> Yes, we are working on mechanisms to present directly to the headset
>>> display without any intermediaries as a separate effort.
>>>
>>>>
>>>>>  The latency is our main concern,
>>>> I would assume that this is a known problem (at least for compute
>>>> usage).
>>>> It looks like amdgpu / kernel submission is rather CPU intensive
>>>> (at least in the default configuration).
>>>
>>> As long as it's a consistent cost, it shouldn't be an issue. However, if
>>> there are high degrees of variance then that would be troublesome and we
>>> would need to account for the worst case.
>>>
>>> Hopefully the requirements and approach we described make sense, we're
>>> looking forward to your feedback and suggestions.
>>>
>>> Thanks!
>>>  - Pierre-Loup
>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 10:00 PM
>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hey Serguei,
>>>>
>>>>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I
>>>>> understand (simplifying), some scheduling is per pipe.  I know about
>>>>> the current allocation scheme but I do not think that it is ideal.
>>>>> I would assume that we need to switch to dynamic partitioning of
>>>>> resources based on the workload; otherwise we will have resource
>>>>> conflicts between Vulkan compute and OpenCL.
>>>>
>>>> I agree the partitioning isn't ideal. I'm hoping we can start with a
>>>> solution that assumes that
>>>> only pipe0 has any work and the other pipes are idle (no HSA/ROCm
>>>> running on the system).
>>>>
>>>> This should be more or less the use case we expect from VR users.
>>>>
>>>> I agree the split is currently not ideal, but I'd like to consider
>>>> that a separate task, because
>>>> making it dynamic is not straightforward :P
>>>>
>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>> will not be involved.  I would assume that in the case of VR we will
>>>>> have one main application ("console" mode(?)) so we could temporarily
>>>>> "ignore" OpenCL/ROCm needs when VR is running.
>>>>
>>>> Correct, this is why we want to enable the high priority compute
>>>> queue through
>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>
>>>> For current VR workloads we have 3 separate processes running 
>>>> actually:
>>>>     1) Game process
>>>>     2) VR Compositor (this is the process that will require high
>>>> priority queue)
>>>>     3) System compositor (we are looking at approaches to remove this
>>>> overhead)
>>>>
>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>> simultaneously, but
>>>> I would also like to be able to address this case in the future
>>>> (cross-pipe priorities).
>>>>
>>>>> [Serguei]  The problem with pre-emption of graphics task:  (a) it
>>>>> may take time so
>>>>> latency may suffer
>>>>
>>>> The latency is our main concern, we want something that is
>>>> predictable. A good
>>>> illustration of what the reprojection scheduling looks like can be
>>>> found here:
>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>
>>>>
>>>>
>>>>> (b) to preempt we need to have different "context" - we want
>>>>> to guarantee that submissions from the same context will be executed
>>>>> in order.
>>>>
>>>> This is okay, as the reprojection work doesn't have dependencies on
>>>> the game context, and it
>>>> even happens in a separate process.
>>>>
>>>>> BTW: (a) Do you want  "preempt" and later resume or do you want
>>>>> "preempt" and
>>>>> "cancel/abort"
>>>>
>>>> Preempt the game with the compositor task and then resume it.
>>>>
>>>>> (b) Vulkan is a generic API and could be used for graphics as well as
>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>> Yeah, the plan is to use Vulkan compute. But if you figure out a way
>>>> for us to get a guaranteed execution time using Vulkan graphics, then
>>>> I'll take you out for a beer :)
>>>>
>>>> Regards,
>>>> Andres
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Andres,
>>>>
>>>> Please see inline (as [Serguei])
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 8:29 PM
>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Serguei,
>>>>
>>>> Thanks for the feedback. Answers inline as [AR].
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> ________________________________________
>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Andres,
>>>>
>>>>
>>>> Quick comments:
>>>>
>>>> 1) To minimize "bubbles", etc. we need to "force" CU assignment/binding
>>>> to the high-priority queue while it is in use and "free" the CUs later
>>>> (we do not want to take CUs away from e.g. a graphics task forever and
>>>> degrade graphics performance).
>>>>
>>>> Otherwise we could have a scenario where a long graphics task (or
>>>> low-priority compute) takes all the (extra) CUs and the high-priority
>>>> queue waits for the resources it needs.
>>>> This will not be visible with "NOP" packets, only when you submit a
>>>> "real" compute task, so I would recommend not using "NOP" packets at
>>>> all for testing.
>>>>
>>>> It (CU assignment) could be done relatively easily when everything is
>>>> going via the kernel
>>>> (e.g. as part of frame submission) but I must admit that I am not sure
>>>> about the best way for user level submissions (amdkfd).
>>>>
>>>> [AR] I wasn't aware of this part of the programming sequence. Thanks
>>>> for the heads up!
>>>> Is this similar to the CU masking programming?
>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler", when
>>>> deciding which queue to run, will check if there are enough resources
>>>> and, if not, it will begin to check other queues with lower priority.
>>>>
>>>> 2) I would recommend dedicating the whole pipe to the high-priority
>>>> queue and having nothing else there.
>>>>
>>>> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
>>>> to the MEC definition
>>>> of pipe, which is a grouping of queues). I say this because amdgpu
>>>> only has access to 1 pipe,
>>>> and the rest are statically partitioned for amdkfd usage.
>>>>
>>>> [Serguei] No. I mean pipe :-)  as the MEC defines it.  As far as I
>>>> understand (simplifying), some scheduling is per pipe.  I know about
>>>> the current allocation scheme but I do not think that it is ideal.
>>>> I would assume that we need to switch to dynamic partitioning of
>>>> resources based on the workload; otherwise we will have resource
>>>> conflicts between Vulkan compute and OpenCL.
>>>>
>>>>
>>>> BTW: Which user level API do you want to use for compute: Vulkan or
>>>> OpenCL?
>>>>
>>>> [AR] Vulkan
>>>>
>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
>>>> not be involved.  I would assume that in the case of VR we will have
>>>> one main application ("console" mode(?)) so we could temporarily
>>>> "ignore" OpenCL/ROCm needs when VR is running.
>>>>
>>>>>  we will not be able to provide a solution compatible with GFX
>>>>> workloads.
>>>> I assume that you are talking about graphics? Am I right?
>>>>
>>>> [AR] Yeah, my understanding is that pre-empting the currently running
>>>> graphics job and scheduling in something else using mid-buffer
>>>> pre-emption has some cases where it doesn't work well. But if it
>>>> starts working well with Polaris10, it might be a better solution for
>>>> us (because the whole reprojection work uses the Vulkan graphics stack
>>>> at the moment, and porting it to compute is not trivial).
>>>>
>>>> [Serguei]  The problem with pre-emption of a graphics task: (a) it may
>>>> take time so latency may suffer (b) to preempt we need to have a
>>>> different "context" - we want to guarantee that submissions from the
>>>> same context will be executed in order.
>>>> BTW: (a) Do you want "preempt" and later resume or do you want
>>>> "preempt" and "cancel/abort"?  (b) Vulkan is a generic API and could
>>>> be used for graphics as well as for plain compute tasks
>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>>
>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>> Sent: December 16, 2016 6:15 PM
>>>> To: amd-gfx at lists.freedesktop.org
>>>> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>>>>
>>>> Hi Everyone,
>>>>
>>>> This RFC is also available as a gist here:
>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>
>>>> We are interested in feedback for a mechanism to effectively schedule
>>>> high
>>>> priority VR reprojection tasks (also referred to as time-warping) for
>>>> Polaris10
>>>> running on the amdgpu kernel driver.
>>>>
>>>> Brief context:
>>>> --------------
>>>>
>>>> The main objective of reprojection is to avoid motion sickness for VR
>>>> users in
>>>> scenarios where the game or application would fail to finish
>>>> rendering a new
>>>> frame in time for the next VBLANK. When this happens, the user's head
>>>> movements
>>>> are not reflected on the Head Mounted Display (HMD) for the duration
>>>> of an
>>>> extra frame. This extended mismatch between the inner ear and the
>>>> eyes may
>>>> cause the user to experience motion sickness.
>>>>
>>>> The VR compositor deals with this problem by fabricating a new frame
>>>> using the
>>>> user's updated head position in combination with the previous frames.
>>>> This
>>>> avoids a prolonged mismatch between the HMD output and the inner ear.
>>>>
>>>> Because of the adverse effects on the user, we require high
>>>> confidence that the
>>>> reprojection task will complete before the VBLANK interval. Even if
>>>> the GFX pipe
>>>> is currently full of work from the game/application (which is most
>>>> likely the case).
>>>>
>>>> For more details and illustrations, please refer to the following
>>>> document:
>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>
>>>> Requirements:
>>>> -------------
>>>>
>>>> The mechanism must expose the following functionality:
>>>>
>>>>     * Job round trip time must be predictable, from submission to
>>>> fence signal
>>>>
>>>>     * The mechanism must support compute workloads.
>>>>
>>>> Goals:
>>>> ------
>>>>
>>>>     * The mechanism should provide low submission latencies
>>>>
>>>> Test: submitting a NOP packet through the mechanism on busy hardware
>>>> should
>>>> be equivalent to submitting a NOP on idle hardware.
>>>>
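One way to turn the NOP test above into a pass/fail check is sketched below in C. It is only an illustration: the sample arrays are assumed to hold submission-to-fence latencies in nanoseconds collected by some external harness, and the helper names and the tolerance parameter are hypothetical, not part of amdgpu or libdrm. It compares worst-case values because the stated requirement is predictability, not average speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Worst-case (maximum) latency in a set of samples, in nanoseconds. */
static uint64_t sketch_worst_ns(const uint64_t *samples, size_t n)
{
    uint64_t worst = 0;

    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (samples[i] > worst)
            worst = samples[i];
    return worst;
}

/*
 * Returns 1 when the worst case measured on busy hardware is within
 * tolerance_ns of the worst case measured on idle hardware, else 0.
 * tolerance_ns is an assumed knob; the RFC does not define one.
 */
static int sketch_latency_equivalent(const uint64_t *idle, size_t n_idle,
                                     const uint64_t *busy, size_t n_busy,
                                     uint64_t tolerance_ns)
{
    uint64_t idle_worst = sketch_worst_ns(idle, n_idle);
    uint64_t busy_worst = sketch_worst_ns(busy, n_busy);

    if (busy_worst <= idle_worst)
        return 1;
    return (busy_worst - idle_worst) <= tolerance_ns;
}
```

As noted elsewhere in the thread, NOP packets will not expose CU contention, so the same comparison would also need to be run with a real compute job.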
>>>> Nice to have:
>>>> -------------
>>>>
>>>>     * The mechanism should also support GFX workloads.
>>>>
>>>> My understanding is that with the current hardware capabilities in
>>>> Polaris10 we
>>>> will not be able to provide a solution compatible with GFX workloads.
>>>>
>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>> approach or
>>>> suggestion that will also be compatible with the GFX ring, please let
>>>> us know
>>>> about it.
>>>>
>>>>     * The above guarantees should also be respected by amdkfd 
>>>> workloads
>>>>
>>>> Would be good to have for consistency, but not strictly necessary as
>>>> users running
>>>> games are not traditionally running HPC workloads in the background.
>>>>
>>>> Proposed approach:
>>>> ------------------
>>>>
>>>> Similar to the windows driver, we could expose a high priority
>>>> compute queue to
>>>> userspace.
>>>>
>>>> Submissions to this compute queue will be scheduled with high
>>>> priority, and may
>>>> acquire hardware resources previously in use by other queues.
>>>>
>>>> This can be achieved by taking advantage of the 'priority' field in
>>>> the HQDs
>>>> and could be programmed by amdgpu or the amdgpu scheduler. The 
>>>> relevant
>>>> register fields are:
>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>
>>>> Implementation approach 1 - static partitioning:
>>>> ------------------------------------------------
>>>>
>>>> The amdgpu driver currently controls 8 compute queues from pipe0. 
>>>> We can
>>>> statically partition these as follows:
>>>>         * 7x regular
>>>>         * 1x high priority
>>>>
>>>> The relevant priorities can be set so that submissions to the high
>>>> priority
>>>> ring will starve the other compute rings and the GFX ring.
>>>>
>>>> The amdgpu scheduler will only place jobs into the high priority
>>>> rings if the
>>>> context is marked as high priority. And a corresponding priority
>>>> should be
>>>> added to keep track of this information:
>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>
>>>> The user will request a high priority context by setting an
>>>> appropriate flag
>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>
>>>>
>>>>
>>>> The setting is at a per-context level so that we can:
>>>>     * Maintain a consistent FIFO ordering of all submissions to a
>>>> context
>>>>     * Create high priority and non-high priority contexts in the same
>>>> process
>>>>
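A minimal C sketch of the static partitioning described above, under stated assumptions: every SKETCH_ name is a hypothetical stand-in (the RFC itself only says "AMDGPU_CTX_HIGH_PRIORITY or similar"), the last of the 8 compute rings is arbitrarily chosen as the high priority one, and no real kernel structures are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; none of these names come from the kernel. */
#define SKETCH_CTX_HIGH_PRIORITY  (1u << 0)  /* assumed context flag bit */
#define SKETCH_NUM_COMPUTE_RINGS  8u         /* "8 compute queues from pipe0" */
#define SKETCH_NUM_REGULAR_RINGS  (SKETCH_NUM_COMPUTE_RINGS - 1u)
#define SKETCH_HIGH_PRIO_RING     (SKETCH_NUM_COMPUTE_RINGS - 1u)

/* Mirrors the ordering listed in the RFC: KERNEL, HIGH, NORMAL. */
enum sketch_priority {
    SKETCH_PRIORITY_KERNEL = 0,
    SKETCH_PRIORITY_HIGH,
    SKETCH_PRIORITY_NORMAL,
};

struct sketch_ctx {
    enum sketch_priority priority;  /* fixed when the context is created */
};

/* The priority is a property of the context, decided once from its flags. */
static void sketch_ctx_init(struct sketch_ctx *ctx, uint32_t flags)
{
    assert(ctx != NULL);
    ctx->priority = (flags & SKETCH_CTX_HIGH_PRIORITY)
                        ? SKETCH_PRIORITY_HIGH
                        : SKETCH_PRIORITY_NORMAL;
}

/*
 * High priority contexts always land on the reserved ring. Normal contexts
 * keep the ring they asked for, folded into the 7 regular rings so they can
 * never reach the reserved one.
 */
static uint32_t sketch_pick_ring(const struct sketch_ctx *ctx,
                                 uint32_t requested_ring)
{
    assert(ctx != NULL);
    if (ctx->priority == SKETCH_PRIORITY_HIGH)
        return SKETCH_HIGH_PRIO_RING;
    return requested_ring % SKETCH_NUM_REGULAR_RINGS;
}
```

Because the decision depends only on the context, a process can hold one context of each kind, and every submission from a given context keeps going to the same ring.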
>>>> Implementation approach 2 - dynamic priority programming:
>>>> ---------------------------------------------------------
>>>>
>>>> Similar to the above, but instead of programming the priorities at
>>>> amdgpu_init() time, the SW scheduler will reprogram the queue 
>>>> priorities
>>>> dynamically when scheduling a task.
>>>>
>>>> This would involve having a hardware specific callback from the
>>>> scheduler to
>>>> set the appropriate queue priority: set_priority(int ring, int index,
>>>> int priority)
>>>>
>>>> During this callback we would have to grab the SRBM mutex to perform
>>>> the appropriate
>>>> HW programming, and I'm not really sure if that is something we
>>>> should be doing from
>>>> the scheduler.
>>>>
>>>> On the positive side, this approach would allow us to program a 
>>>> range of
>>>> priorities for jobs instead of a single "high priority" value,
>>>> achieving
>>>> something similar to the niceness API available for CPU scheduling.
>>>>
>>>> I'm not sure if this flexibility is something that we would need for
>>>> our use
>>>> case, but it might be useful in other scenarios (multiple users
>>>> sharing compute
>>>> time on a server).
>>>>
>>>> This approach would require a new int field in drm_amdgpu_ctx_in, or
>>>> repurposing
>>>> of the flags field.
>>>>
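The dynamic approach could be sketched in C roughly as follows. The callback keeps the set_priority(int ring, int index, int priority) shape proposed above; everything else is an assumption made for illustration: the nice-style range of -20..19, a hardware priority range of 0..15, and a fake callback that records values in an array instead of touching registers. The meaning of index is not spelled out in the RFC, so it is simply passed through.

```c
#include <assert.h>

#define SKETCH_NICE_MIN     (-20)  /* most urgent, as with CPU niceness */
#define SKETCH_NICE_MAX     19     /* least urgent */
#define SKETCH_HW_PRIO_MAX  15     /* assumed HW priority range, 0..15 */
#define SKETCH_NUM_RINGS    8

/* Hardware-specific hook, shaped like the callback proposed in the RFC. */
struct sketch_ring_funcs {
    void (*set_priority)(int ring, int index, int priority);
};

/* Stand-in for register state: the last priority programmed per ring. */
static int sketch_programmed[SKETCH_NUM_RINGS];

/* Fake backend: records the value where a driver would program the HQD. */
static void sketch_fake_set_priority(int ring, int index, int priority)
{
    (void)index;
    assert(ring >= 0 && ring < SKETCH_NUM_RINGS);
    sketch_programmed[ring] = priority;
}

/* Map a nice-style value (lower = more urgent) to 0..SKETCH_HW_PRIO_MAX. */
static int sketch_nice_to_hw_priority(int nice)
{
    if (nice < SKETCH_NICE_MIN)
        nice = SKETCH_NICE_MIN;
    if (nice > SKETCH_NICE_MAX)
        nice = SKETCH_NICE_MAX;
    return ((SKETCH_NICE_MAX - nice) * SKETCH_HW_PRIO_MAX) /
           (SKETCH_NICE_MAX - SKETCH_NICE_MIN);
}

/*
 * Called before a job runs on a ring. Reprograms the ring only when the
 * priority actually changes, and returns 1 if it did so, 0 otherwise.
 */
static int sketch_prepare_ring(const struct sketch_ring_funcs *funcs,
                               int ring, int index, int nice)
{
    int prio = sketch_nice_to_hw_priority(nice);

    assert(funcs != 0 && funcs->set_priority != 0);
    assert(ring >= 0 && ring < SKETCH_NUM_RINGS);
    if (sketch_programmed[ring] == prio)
        return 0;
    funcs->set_priority(ring, index, prio);
    return 1;
}
```

Skipping redundant updates is a design choice of this sketch: it keeps the number of times a real callback would need the SRBM mutex down to the cases where the priority really changes.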
>>>> Known current obstacles:
>>>> ------------------------
>>>>
>>>> The SQ is currently programmed to disregard the HQD priorities, and
>>>> instead it picks
>>>> jobs at random. Settings from the shader itself are also disregarded
>>>> as this is
>>>> considered a privileged field.
>>>>
>>>> Effectively we can get our compute wavefront launched ASAP, but we
>>>> might not get the
>>>> time we need on the SQ.
>>>>
>>>> The current programming would have to be changed to allow priority
>>>> propagation
>>>> from the HQD into the SQ.
>>>>
>>>> Generic approach for all HW IPs:
>>>> --------------------------------
>>>>
>>>> For consistency purposes, the high priority context can be enabled
>>>> for all HW IPs
>>>> with support of the SW scheduler. This will function similarly to the
>>>> current
>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump ahead of
>>>> anything not
>>>> committed to the HW queue.
>>>>
>>>> The benefits of requesting a high priority context for a non-compute
>>>> queue will
>>>> be lesser (e.g. up to 10s of wait time if a GFX command is stuck in
>>>> front of
>>>> you), but having the API in place will allow us to easily improve the
>>>> implementation
>>>> in the future as new features become available in new hardware.
>>>>
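The "jump ahead of anything not committed to the HW queue" behaviour can be illustrated with a small C model of a software scheduler that keeps one FIFO per priority level and always drains the most urgent non-empty one. This is a toy sketch, not the amdgpu scheduler: the type names, the fixed queue depth, and the use of plain integer job ids are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RQ_DEPTH 8u  /* arbitrary capacity for the sketch */

/* Lower value = more urgent, following the ordering listed in the RFC. */
enum sketch_priority {
    SKETCH_PRIORITY_KERNEL = 0,
    SKETCH_PRIORITY_HIGH,
    SKETCH_PRIORITY_NORMAL,
    SKETCH_PRIORITY_COUNT,
};

/* One FIFO of pending job ids per priority level. */
struct sketch_rq {
    int jobs[SKETCH_RQ_DEPTH];
    unsigned head;
    unsigned count;
};

struct sketch_sched {
    struct sketch_rq rq[SKETCH_PRIORITY_COUNT];
};

/* Queue a job at the given priority; returns 0 on success, -1 if full. */
static int sketch_push(struct sketch_sched *s, int prio, int job_id)
{
    struct sketch_rq *rq;

    assert(s != NULL);
    assert(prio >= 0 && prio < SKETCH_PRIORITY_COUNT);
    rq = &s->rq[prio];
    if (rq->count == SKETCH_RQ_DEPTH)
        return -1;
    rq->jobs[(rq->head + rq->count) % SKETCH_RQ_DEPTH] = job_id;
    rq->count++;
    return 0;
}

/*
 * Pick the next job to hand to the hardware queue: the oldest job of the
 * most urgent non-empty level. Returns -1 when nothing is pending.
 * Jobs already handed out are not affected, which is why the benefit is
 * limited when a long GFX job is already running.
 */
static int sketch_pop_next(struct sketch_sched *s)
{
    assert(s != NULL);
    for (int p = 0; p < SKETCH_PRIORITY_COUNT; p++) {
        struct sketch_rq *rq = &s->rq[p];

        if (rq->count > 0) {
            int job = rq->jobs[rq->head];

            rq->head = (rq->head + 1u) % SKETCH_RQ_DEPTH;
            rq->count--;
            return job;
        }
    }
    return -1;
}
```

Within one priority level the order of submission is preserved, so a high priority context still sees its own jobs executed in FIFO order.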
>>>> Future steps:
>>>> -------------
>>>>
>>>> Once we have an approach settled, I can take care of the 
>>>> implementation.
>>>>
>>>> Also, once the interface is mostly decided, we can start thinking 
>>>> about
>>>> exposing the high priority queue through radv.
>>>>
>>>> Request for feedback:
>>>> ---------------------
>>>>
>>>> We aren't married to any of the approaches outlined above. Our goal
>>>> is to
>>>> obtain a mechanism that will allow us to complete the reprojection
>>>> job within a
>>>> predictable amount of time. So if anyone has any suggestions
>>>> for
>>>> improvements or alternative strategies we are more than happy to hear
>>>> them.
>>>>
>>>> If any of the technical information above is also incorrect, feel
>>>> free to point
>>>> out my misunderstandings.
>>>>
>>>> Looking forward to hearing from you.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx at lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>>>
>>>
>>
>