[RFC] Mechanism for high priority scheduling in amdgpu

Thu Dec 22 16:41:58 UTC 2016

Andres,

Did you measure the latency, etc. impact of __any__ compositor?

My understanding is that VR has pretty strict requirements related to QoS.

Sincerely yours,
Serguei Sagalovitch

On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
> Hey Christian,
>
> We are currently interested in X, but with some distros switching to 
> other compositors by default, we also need to consider those.
>
> We agree, running the full vrcompositor as root isn't something that 
> we want to do. Too many security concerns. Having a small root helper 
> that does the privilege escalation for us is the initial idea.
>
> For a long term approach, Pierre-Loup and Dave are working on dealing 
> with the "two compositors" scenario a little better in DRM+X. 
> Fullscreen isn't really a sufficient approach, since we don't want the 
> HMD to be used as part of the Desktop environment when a VR app is not 
> in use (this is extremely annoying).
>
> When the above is settled, we should have an auth mechanism besides 
> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the 
> HMD permanently away from X. Re-using that auth method to gate this 
> IOCTL is probably going to be the final solution.
>
> I propose to start with ROOT_ONLY since it should allow us to respect 
> kernel IOCTL compatibility guidelines with the most flexibility. Going 
> from a restrictive to a more flexible permission model would be 
> inclusive, but going from a general to a restrictive model may exclude 
> some apps that used to work.
>
> Regards,
> Andres
>
> On 12/22/2016 6:42 AM, Christian König wrote:
>> Hi Andres,
>>
>> well using root might cause stability and security problems as well. 
>> We worked quite hard to avoid exactly this for X.
>>
>> We could make this feature depend on the compositor being DRM master, 
>> but for example with X the X server is master (and e.g. can change 
>> resolutions etc..) and not the compositor.
>>
>> So another question is also what windowing system (if any) are you 
>> planning to use? X, Wayland, Flinger or something completely different ?
>>
>> Regards,
>> Christian.
>>
>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>> Hi Christian,
>>>
>>> That is definitely a concern. What we are currently thinking is to 
>>> make the high priority queues accessible to root only.
>>>
>>> Therefore if a non-root user attempts to set the high priority flag 
>>> on context allocation, we would fail the call and return EPERM.
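
The root-only gate described here could look roughly like the sketch below. It is only an illustration: `struct ctx_request`, `CTX_FLAG_HIGH_PRIORITY`, `ctx_alloc_check()` and the `caller_is_root` parameter are hypothetical names, and returning -EPERM is an assumption, since the thread does not settle on an errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the RFC only says "AMDGPU_CTX_HIGH_PRIORITY or similar". */
#define CTX_FLAG_HIGH_PRIORITY (1u << 0)

/* Hypothetical stand-in for the context allocation arguments. */
struct ctx_request {
    uint32_t flags;
};

/*
 * Returns 0 if the context may be created with the requested flags, or a
 * negative errno otherwise. caller_is_root stands in for whatever privilege
 * check the kernel would actually perform.
 */
int ctx_alloc_check(const struct ctx_request *req, bool caller_is_root)
{
    assert(req != NULL);

    if ((req->flags & CTX_FLAG_HIGH_PRIORITY) && !caller_is_root)
        return -EPERM; /* assumed errno, not confirmed by the thread */

    return 0;
}
```

Starting with the restrictive check keeps the option of relaxing it later without breaking callers that already pass.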
>>>
>>> Regards,
>>> Andres
>>>
>>>
>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>> to solve it?
>>>> Yeah, that problem came to my mind as well.
>>>>
>>>> Basically we need to restrict those high priority submissions to 
>>>> the VR compositor or otherwise any malfunctioning application could 
>>>> use it.
>>>>
>>>> Just think about some WebGL suddenly taking all our rendering away 
>>>> and we won't get anything drawn any more.
>>>>
>>>> Alex or Michel any ideas on that?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>> > If compute queue is occupied only by you, the efficiency
>>>>> > is equal with setting job queue to high priority I think.
>>>>> The only risk is the situation when graphics will take all
>>>>> needed CUs. But in any case it should be very good test.
>>>>>
>>>>> Andres/Pierre-Loup,
>>>>>
>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>
>>>>>
>>>>> BTW: If there is  non-VR application which will use high-priority
>>>>> h/w queue then VR application will suffer.  Any ideas how
>>>>> to solve it?
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>> Do you encounter the priority issue for compute queue with 
>>>>>> current driver?
>>>>>>
>>>>>> If compute queue is occupied only by you, the efficiency is equal 
>>>>>> with setting job queue to high priority I think.
>>>>>>
>>>>>> Regards,
>>>>>> David Zhou
>>>>>>
>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>
>>>>>>> I'm not sure if I'm asking for too much, but if we can 
>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the 
>>>>>>> vulkan level that would be great.
>>>>>>>
>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>
>>>>>>> - Andres
>>>>>>>
>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>> We're currently working with the open stack; I assume that a 
>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan 
>>>>>>>>> userspace drivers and that the amdgpu kernel interface 
>>>>>>>>> improvements we would pursue following this discussion would 
>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>> Of course.
>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro driver?
>>>>>>>>>>
>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>
>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of amdgpu;
>>>>>>>>>>> see replies inline.
>>>>>>>>>>>
>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>> Andres,
>>>>>>>>>>>>
>>>>>>>>>>>>>  For current VR workloads we have 3 separate processes 
>>>>>>>>>>>>> running
>>>>>>>>>>>>> actually:
>>>>>>>>>>>> So we could have a potential memory overcommit case, or do you do
>>>>>>>>>>>> partitioning on your own?  I would think that there is a need to
>>>>>>>>>>>> avoid overcommit in the VR case to prevent any BO migration.
>>>>>>>>>>>
>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting up
>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're 
>>>>>>>>>>> working on
>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this 
>>>>>>>>>>> thread), and in
>>>>>>>>>>> the future it will make sense to do work in order to make 
>>>>>>>>>>> sure that
>>>>>>>>>>> its memory allocations do not get evicted, to prevent any 
>>>>>>>>>>> unwelcome
>>>>>>>>>>> additional latency in the event of needing to perform 
>>>>>>>>>>> just-in-time
>>>>>>>>>>> reprojection.
>>>>>>>>>>>
>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>> Based on my understanding, sharing BOs between different processes
>>>>>>>>>>>> could introduce additional synchronization constraints. BTW: I am
>>>>>>>>>>>> not sure if we are able to share Vulkan sync objects across
>>>>>>>>>>>> process boundaries.
>>>>>>>>>>>
>>>>>>>>>>> They are different processes; it is important for the 
>>>>>>>>>>> compositor that
>>>>>>>>>>> is responsible for quality-of-service features such as 
>>>>>>>>>>> consistently
>>>>>>>>>>> presenting distorted frames with the right latency, 
>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>
>>>>>>>>>>> Currently we are using unreleased cross-process memory and 
>>>>>>>>>>> semaphore
>>>>>>>>>>> extensions to fetch updated eye images from the client 
>>>>>>>>>>> application,
>>>>>>>>>>> but the just-in-time reprojection discussed here does not 
>>>>>>>>>>> actually
>>>>>>>>>>> have any direct interactions with cross-process resource 
>>>>>>>>>>> sharing,
>>>>>>>>>>> since it's achieved by using whatever is the latest, most 
>>>>>>>>>>> up-to-date
>>>>>>>>>>> eye images that have already been sent by the client 
>>>>>>>>>>> application,
>>>>>>>>>>> which are already available to use without additional 
>>>>>>>>>>> synchronization.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to 
>>>>>>>>>>>>> remove this
>>>>>>>>>>>>> overhead)
>>>>>>>>>>>> Yes,  IMHO the best is to run in  "full screen mode".
>>>>>>>>>>>
>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the 
>>>>>>>>>>> headset
>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>  The latency is our main concern,
>>>>>>>>>>>> I would assume that this is the known problem (at least for 
>>>>>>>>>>>> compute
>>>>>>>>>>>> usage).
>>>>>>>>>>>> It looks like that amdgpu / kernel submission is rather CPU 
>>>>>>>>>>>> intensive
>>>>>>>>>>>> (at least
>>>>>>>>>>>> in the default configuration).
>>>>>>>>>>>
>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>> However, if there is a high degree of variance then that would be
>>>>>>>>>>> troublesome and we would need to account for the worst case.
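
One way to make "consistent cost" versus "high variance" concrete is to summarize measured submission round-trip times and compare the mean against the worst case. The sketch below assumes the samples are already collected in nanoseconds; `struct latency_stats` and `summarize_latency()` are hypothetical names, and the max-minus-min spread is only a crude stand-in for variance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct latency_stats {
    uint64_t min_ns;
    uint64_t max_ns;    /* worst case: what a deadline must be budgeted against */
    uint64_t mean_ns;
    uint64_t spread_ns; /* max - min, a crude indicator of variance */
};

/* Summarize a non-empty array of round-trip latency samples. */
struct latency_stats summarize_latency(const uint64_t *samples_ns, size_t count)
{
    assert(samples_ns != NULL && count > 0);

    struct latency_stats s = { samples_ns[0], samples_ns[0], 0, 0 };
    uint64_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        if (samples_ns[i] < s.min_ns)
            s.min_ns = samples_ns[i];
        if (samples_ns[i] > s.max_ns)
            s.max_ns = samples_ns[i];
        sum += samples_ns[i];
    }

    s.mean_ns = sum / count;
    s.spread_ns = s.max_ns - s.min_ns;
    return s;
}
```

A small spread means the mean is a usable budget; a large spread means the compositor has to plan around `max_ns` instead.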
>>>>>>>>>>>
>>>>>>>>>>> Hopefully the requirements and approach we described make 
>>>>>>>>>>> sense, we're
>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC defines it.  As far as I
>>>>>>>>>>>>> understand (simplifying), some scheduling is per pipe.  I know
>>>>>>>>>>>>> about the current allocation scheme but I do not think that it
>>>>>>>>>>>>> is ideal.  I would assume that we need to switch to dynamic
>>>>>>>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>>>>>>>> will have resource conflicts between Vulkan compute and OpenCL.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can 
>>>>>>>>>>>> start with a
>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no 
>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>
>>>>>>>>>>>> This should be more or less the use case we expect from VR 
>>>>>>>>>>>> users.
>>>>>>>>>>>>
>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to consider
>>>>>>>>>>>> that a separate task, because making it dynamic is not
>>>>>>>>>>>> straightforward :P
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>>>>>>> will not be involved.  I would assume that in the case of VR we
>>>>>>>>>>>>> will have one main application ("console" mode(?)) so we could
>>>>>>>>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Correct, this is why we want to enable the high priority 
>>>>>>>>>>>> compute
>>>>>>>>>>>> queue through
>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>>
>>>>>>>>>>>> For current VR workloads we have 3 separate processes 
>>>>>>>>>>>> running actually:
>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>     2) VR Compositor (this is the process that will require 
>>>>>>>>>>>> high
>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>     3) System compositor (we are looking at approaches to 
>>>>>>>>>>>> remove this
>>>>>>>>>>>> overhead)
>>>>>>>>>>>>
>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>> I would also like to be able to address this case in the 
>>>>>>>>>>>> future
>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>
>>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task:  
>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>
>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>> illustration of what the reprojection scheduling looks like 
>>>>>>>>>>>> can be
>>>>>>>>>>>> found here:
>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>> to guarantee that submissions from the same context will 
>>>>>>>>>>>>> be executed
>>>>>>>>>>>>> in order.
>>>>>>>>>>>>
>>>>>>>>>>>> This is okay, as the reprojection work doesn't have 
>>>>>>>>>>>> dependencies on
>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>
>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you 
>>>>>>>>>>>>> want
>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>
>>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>>
>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics 
>>>>>>>>>>>>> as well as
>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure 
>>>>>>>>>>>> out a way
>>>>>>>>>>>> for us to get
>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then 
>>>>>>>>>>>> I'll take you
>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>
>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>>
>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling 
>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Andres,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>> assignment/binding to the high-priority queue when it is in use
>>>>>>>>>>>> and "free" the CUs later (we do not want to take CUs from e.g. a
>>>>>>>>>>>> graphics task forever and degrade graphics performance).
>>>>>>>>>>>>
>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task (or
>>>>>>>>>>>> low-priority compute) takes all (extra) CUs and the high-priority
>>>>>>>>>>>> queue has to wait for the resources it needs.
>>>>>>>>>>>> This will not be visible with "NOP" packets, only when you submit
>>>>>>>>>>>> a "real" compute task, so I would recommend not using "NOP"
>>>>>>>>>>>> packets at all for testing.
>>>>>>>>>>>>
>>>>>>>>>>>> It (CU assignment) could be done relatively easily when everything
>>>>>>>>>>>> is going via the kernel (e.g. as part of frame submission), but I
>>>>>>>>>>>> must admit that I am not sure about the best way for user level
>>>>>>>>>>>> submissions (amdkfd).
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming 
>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the "scheduler",
>>>>>>>>>>>> when deciding which queue to run, will check if there are enough
>>>>>>>>>>>> resources, and if not it will begin to check other queues with
>>>>>>>>>>>> lower priority.
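
The selection behaviour described here can be modelled in a few lines to show why a high-priority queue can lose out when CUs are busy. This is a simplified model, not the hardware algorithm: `struct hw_queue`, `pick_queue()` and the idea of a single "CUs needed" number per queue are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct hw_queue {
    int priority;   /* larger value = higher priority (illustrative) */
    int cus_needed; /* CUs the queue's next job needs in order to launch */
    int has_work;   /* non-zero if the queue has a pending job */
};

/*
 * Return the index of the queue that would run next, or -1 if none fits.
 * The highest priority queue wins only if its next job fits in the free
 * CUs; otherwise a lower priority queue that does fit is chosen instead.
 */
int pick_queue(const struct hw_queue *queues, size_t count, int free_cus)
{
    assert(queues != NULL);

    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (!queues[i].has_work || queues[i].cus_needed > free_cus)
            continue; /* not runnable right now */
        if (best < 0 || queues[i].priority > queues[best].priority)
            best = (int)i;
    }
    return best;
}
```

With a high-priority queue needing 8 CUs and only 4 free, the low-priority queue is picked, which is the case that reserving CUs for the high-priority queue is meant to avoid.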
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? 
>>>>>>>>>>>> (as opposed
>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because 
>>>>>>>>>>>> amdgpu
>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as MEC defines it.  As far as I
>>>>>>>>>>>> understand (simplifying), some scheduling is per pipe.  I know
>>>>>>>>>>>> about the current allocation scheme but I do not think that it
>>>>>>>>>>>> is ideal.  I would assume that we need to switch to dynamic
>>>>>>>>>>>> partitioning of resources based on the workload, otherwise we
>>>>>>>>>>>> will have resource conflicts between Vulkan compute and OpenCL.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> BTW: Which user level API do you want to use for compute: 
>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd
>>>>>>>>>>>> will not be involved.  I would assume that in the case of VR we
>>>>>>>>>>>> will have one main application ("console" mode(?)) so we could
>>>>>>>>>>>> temporarily "ignore" OpenCL/ROCm needs when VR is running.
>>>>>>>>>>>>
>>>>>>>>>>>>>  we will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>> workloads.
>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>
>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the 
>>>>>>>>>>>> currently running
>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases 
>>>>>>>>>>>> where it
>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>> polaris10 it starts working well, it might be a better 
>>>>>>>>>>>> solution for
>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and 
>>>>>>>>>>>> porting it to
>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>
>>>>>>>>>>>> [Serguei]  The problem with pre-emption of graphics task: 
>>>>>>>>>>>> (a) it may
>>>>>>>>>>>> take time so
>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different 
>>>>>>>>>>>> "context"
>>>>>>>>>>>> - we want
>>>>>>>>>>>> to guarantee that submissions from the same context will be 
>>>>>>>>>>>> executed
>>>>>>>>>>>> in order.
>>>>>>>>>>>> BTW: (a) Do you want  "preempt" and later resume or do you 
>>>>>>>>>>>> want
>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>> "cancel/abort"?  (b) Vulkan is generic API and could be used
>>>>>>>>>>>> for graphics as well as for plain compute tasks 
>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on 
>>>>>>>>>>>> behalf of
>>>>>>>>>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in 
>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>> gist.github.com
>>>>>>>>>>>> [RFC] Mechanism for high priority scheduling in amdgpu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> We are interested in feedback for a mechanism to 
>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>> high
>>>>>>>>>>>> priority VR reprojection tasks (also referred to as 
>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>> Polaris10
>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>
>>>>>>>>>>>> Brief context:
>>>>>>>>>>>> --------------
>>>>>>>>>>>>
>>>>>>>>>>>> The main objective of reprojection is to avoid motion 
>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>> users in
>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>> rendering a new
>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the 
>>>>>>>>>>>> user's head
>>>>>>>>>>>> movements
>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the 
>>>>>>>>>>>> duration
>>>>>>>>>>>> of an
>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear 
>>>>>>>>>>>> and the
>>>>>>>>>>>> eyes may
>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>
>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a 
>>>>>>>>>>>> new frame
>>>>>>>>>>>> using the
>>>>>>>>>>>> user's updated head position in combination with the 
>>>>>>>>>>>> previous frames.
>>>>>>>>>>>> This
>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the 
>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>
>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>> confidence that the reprojection task will complete before the
>>>>>>>>>>>> VBLANK interval, even if the GFX pipe is currently full of work
>>>>>>>>>>>> from the game/application (which is most likely the case).
>>>>>>>>>>>>
>>>>>>>>>>>> For more details and illustrations, please refer to the 
>>>>>>>>>>>> following
>>>>>>>>>>>> document:
>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>> One of the most exciting new developments in GPU technology 
>>>>>>>>>>>> over the
>>>>>>>>>>>> past year has been the adoption of asynchronous shaders, 
>>>>>>>>>>>> which can
>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Requirements:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>
>>>>>>>>>>>>     * Job round trip time must be predictable, from 
>>>>>>>>>>>> submission to
>>>>>>>>>>>> fence signal
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> Goals:
>>>>>>>>>>>> ------
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>
>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy 
>>>>>>>>>>>> hardware
>>>>>>>>>>>> should
>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>
>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> My understanding is that with the current hardware capabilities in
>>>>>>>>>>>> Polaris10 we will not be able to provide a solution compatible
>>>>>>>>>>>> with GFX workloads.
>>>>>>>>>>>>
>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>>> approach or
>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring, 
>>>>>>>>>>>> please let
>>>>>>>>>>>> us know
>>>>>>>>>>>> about it.
>>>>>>>>>>>>
>>>>>>>>>>>>     * The above guarantees should also be respected by 
>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>
>>>>>>>>>>>> Would be good to have for consistency, but not strictly 
>>>>>>>>>>>> necessary as
>>>>>>>>>>>> users running
>>>>>>>>>>>> games are not traditionally running HPC workloads in the 
>>>>>>>>>>>> background.
>>>>>>>>>>>>
>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>> ------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Similar to the windows driver, we could expose a high priority
>>>>>>>>>>>> compute queue to
>>>>>>>>>>>> userspace.
>>>>>>>>>>>>
>>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>>> priority, and may
>>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>>
>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority' 
>>>>>>>>>>>> field in
>>>>>>>>>>>> the HQDs
>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler. 
>>>>>>>>>>>> The relevant
>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from 
>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>
>>>>>>>>>>>> The relevant priorities can be set so that submissions to 
>>>>>>>>>>>> the high
>>>>>>>>>>>> priority
>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>
>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high 
>>>>>>>>>>>> priority
>>>>>>>>>>>> rings if the
>>>>>>>>>>>> context is marked as high priority. And a corresponding 
>>>>>>>>>>>> priority
>>>>>>>>>>>> should be
>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
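
The static 7+1 split and the extra priority level can be sketched as follows. The enum names come from the list above, but their numeric order, the choice of ring 7 as the reserved ring, and the round-robin placement of normal contexts via `select_compute_ring()` are all assumptions made for illustration.

```c
#include <assert.h>

/* Priority levels as listed in the RFC; numeric values are illustrative. */
enum amd_sched_priority {
    AMD_SCHED_PRIORITY_KERNEL,
    AMD_SCHED_PRIORITY_HIGH, /* the proposed new level */
    AMD_SCHED_PRIORITY_NORMAL,
};

#define NUM_COMPUTE_RINGS 8 /* pipe0 queues controlled by amdgpu, per the RFC */
#define HIGH_PRIO_RING    7 /* hypothetical choice: reserve the last ring */

/*
 * Map a userspace context priority to a compute ring index. Normal contexts
 * are spread over the 7 regular rings using a caller-supplied counter; only
 * high priority contexts ever land on the reserved ring. KERNEL priority is
 * not selectable by userspace contexts in this sketch.
 */
int select_compute_ring(enum amd_sched_priority prio, unsigned int counter)
{
    assert(prio == AMD_SCHED_PRIORITY_HIGH || prio == AMD_SCHED_PRIORITY_NORMAL);

    if (prio == AMD_SCHED_PRIORITY_HIGH)
        return HIGH_PRIO_RING;

    return (int)(counter % (NUM_COMPUTE_RINGS - 1));
}
```

Because normal contexts never map to the reserved ring, the high priority ring is idle unless a high priority context has submitted work.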
>>>>>>>>>>>>
>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163 
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all 
>>>>>>>>>>>> submissions to a
>>>>>>>>>>>> context
>>>>>>>>>>>>     * Create high priority and non-high priority contexts 
>>>>>>>>>>>> in the same
>>>>>>>>>>>> process
>>>>>>>>>>>>
>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> Similar to the above, but instead of programming the priorities at
>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the queue
>>>>>>>>>>>> priorities dynamically when scheduling a task.
>>>>>>>>>>>>
>>>>>>>>>>>> This would involve having a hardware specific callback from 
>>>>>>>>>>>> the
>>>>>>>>>>>> scheduler to
>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring, 
>>>>>>>>>>>> int index,
>>>>>>>>>>>> int priority)
>>>>>>>>>>>>
>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex 
>>>>>>>>>>>> to perform
>>>>>>>>>>>> the appropriate
>>>>>>>>>>>> HW programming, and I'm not really sure if that is 
>>>>>>>>>>>> something we
>>>>>>>>>>>> should be doing from
>>>>>>>>>>>> the scheduler.
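
A rough shape for such a callback is sketched below. Only the `set_priority(int ring, int index, int priority)` signature comes from the RFC; reading `index` as a queue index, the `struct ring_hw_funcs` table, the mock register array and the counter that stands in for the SRBM mutex are assumptions, and no real register access is shown.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RINGS  8
#define MAX_QUEUES 8

/* Stand-in for per-queue HQD priority state; not real register access. */
int mock_queue_priority[MAX_RINGS][MAX_QUEUES];
/* Models the SRBM mutex as a counter so lock/unlock pairing is visible. */
int srbm_lock_depth;

/* Hardware specific hook with the signature suggested in the RFC. */
struct ring_hw_funcs {
    int (*set_priority)(int ring, int index, int priority);
};

int mock_set_priority(int ring, int index, int priority)
{
    if (ring < 0 || ring >= MAX_RINGS || index < 0 || index >= MAX_QUEUES)
        return -1;

    srbm_lock_depth++; /* a real driver would take the SRBM mutex here */
    mock_queue_priority[ring][index] = priority;
    srbm_lock_depth--; /* ...and release it here */
    return 0;
}

/* Called by the scheduler just before it pushes a job to the ring. */
int sched_apply_priority(const struct ring_hw_funcs *funcs,
                         int ring, int index, int priority)
{
    assert(funcs != NULL && funcs->set_priority != NULL);
    return funcs->set_priority(ring, index, priority);
}
```

Keeping the hook behind a function table leaves the scheduler free of hardware details, but every scheduled job now pays for the lock, which is the concern raised above.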
>>>>>>>>>>>>
>>>>>>>>>>>> On the positive side, this approach would allow us to program a
>>>>>>>>>>>> range of priorities for jobs instead of a single "high priority"
>>>>>>>>>>>> value, achieving something similar to the niceness API available
>>>>>>>>>>>> for CPU scheduling.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure if this flexibility is something that we would 
>>>>>>>>>>>> need for
>>>>>>>>>>>> our use
>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple 
>>>>>>>>>>>> users
>>>>>>>>>>>> sharing compute
>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>
>>>>>>>>>>>> This approach would require a new int field in 
>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>> repurposing
>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>
>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD 
>>>>>>>>>>>> priorities, and
>>>>>>>>>>>> instead it picks
>>>>>>>>>>>> jobs at random. Settings from the shader itself are also 
>>>>>>>>>>>> disregarded
>>>>>>>>>>>> as this is
>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>
>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP, 
>>>>>>>>>>>> but we
>>>>>>>>>>>> might not get the
>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>
>>>>>>>>>>>> The current programming would have to be changed to allow 
>>>>>>>>>>>> priority
>>>>>>>>>>>> propagation
>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>
>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> For consistency purposes, the high priority context can be 
>>>>>>>>>>>> enabled
>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>> with support of the SW scheduler. This will function 
>>>>>>>>>>>> similarly to the
>>>>>>>>>>>> current
>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump 
>>>>>>>>>>>> ahead of
>>>>>>>>>>>> anything not
>>>>>>>>>>>> committed to the HW queue.
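
The "jump ahead of anything not yet committed" behaviour amounts to one FIFO run queue per priority level, drained highest level first. The sketch below is a minimal model under that assumption; `struct sw_sched`, `sched_push()`, `sched_pop()` and the fixed queue depth are hypothetical, and job ids are assumed to be non-negative so that -1 can mean "nothing pending".

```c
#include <assert.h>
#include <stddef.h>

/* Lower value is served first; PRIO_HIGH is the proposed new level. */
enum { PRIO_KERNEL, PRIO_HIGH, PRIO_NORMAL, PRIO_COUNT };

#define QUEUE_DEPTH 16

struct run_queue {
    int jobs[QUEUE_DEPTH]; /* job ids in FIFO order */
    size_t head, tail;     /* free-running counters, indexed modulo depth */
};

struct sw_sched {
    struct run_queue rq[PRIO_COUNT];
};

/* Queue a job at the given priority; returns -1 if that queue is full. */
int sched_push(struct sw_sched *s, int prio, int job_id)
{
    assert(s != NULL && prio >= 0 && prio < PRIO_COUNT);

    struct run_queue *q = &s->rq[prio];
    if (q->tail - q->head == QUEUE_DEPTH)
        return -1;
    q->jobs[q->tail++ % QUEUE_DEPTH] = job_id;
    return 0;
}

/* Pop the next job to commit to the HW queue, or -1 if nothing is pending. */
int sched_pop(struct sw_sched *s)
{
    assert(s != NULL);

    for (int p = 0; p < PRIO_COUNT; p++) {
        struct run_queue *q = &s->rq[p];
        if (q->head != q->tail)
            return q->jobs[q->head++ % QUEUE_DEPTH];
    }
    return -1;
}
```

Jobs that were already popped are on the hardware queue and are not reordered, which is why the benefit for non-compute rings is limited to skipping the software backlog.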
>>>>>>>>>>>>
>>>>>>>>>>>> The benefits of requesting a high priority context for a 
>>>>>>>>>>>> non-compute
>>>>>>>>>>>> queue will
>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is 
>>>>>>>>>>>> stuck in
>>>>>>>>>>>> front of
>>>>>>>>>>>> you), but having the API in place will allow us to easily 
>>>>>>>>>>>> improve the
>>>>>>>>>>>> implementation
>>>>>>>>>>>> in the future as new features become available in new 
>>>>>>>>>>>> hardware.
>>>>>>>>>>>>
>>>>>>>>>>>> Future steps:
>>>>>>>>>>>> -------------
>>>>>>>>>>>>
>>>>>>>>>>>> Once we have an approach settled, I can take care of the 
>>>>>>>>>>>> implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, once the interface is mostly decided, we can start 
>>>>>>>>>>>> thinking about
>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>
>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>
>>>>>>>>>>>> We aren't married to any of the approaches outlined above. 
>>>>>>>>>>>> Our goal
>>>>>>>>>>>> is to
>>>>>>>>>>>> obtain a mechanism that will allow us to complete the 
>>>>>>>>>>>> reprojection
>>>>>>>>>>>> job within a
>>>>>>>>>>>> predictable amount of time. So if anyone has any 
>>>>>>>>>>>> suggestions for
>>>>>>>>>>>> improvements or alternative strategies we are more than 
>>>>>>>>>>>> happy to hear
>>>>>>>>>>>> them.
>>>>>>>>>>>>
>>>>>>>>>>>> If any of the technical information above is also 
>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>> free to point
>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>
>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Andres
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>> To see the collection of prior postings to the list, visit the
>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all 
>>>>>>>>>>>> the list
>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> Sincerely yours,
>>>>> Serguei Sagalovitch
>>>>>
>>>>
>>>>
>>>
>>
>