[RFC] Mechanism for high priority scheduling in amdgpu
Christian König
deathsimple at vodafone.de
Mon Jan 2 14:09:45 UTC 2017
> I hate to keep bringing up display topics in an unrelated
> conversation, but I'm not sure where you got "Application -> X server
> -> compositor -> X server" from.
Sorry for that. It just sounded like you were assuming something about
the current stack which looked like it wouldn't work currently.
But if you are willing to implement that stuff then I'm not really
concerned any more, especially since Dave is already involved.
Regards,
Christian.
On 2016-12-23 19:18, Pierre-Loup A. Griffais wrote:
> I hate to keep bringing up display topics in an unrelated
> conversation, but I'm not sure where you got "Application -> X server
> -> compositor -> X server" from. As I was saying before, we need to be
> presenting directly to the HMD display as no display server can be in
> the way, both for latency but also quality of service reasons (a buggy
> application cannot be allowed to accidentally display undistorted
> rendering into the HMD); we intend to do the necessary work for this,
> and the extent of X's (or a Wayland implementation, or any other
> display server) involvement will be to participate enough to know that
> the HMD display is off-limits. If you have more questions on the
> display aspect, or VR rendering in general, I'm happy to try to
> address them out-of-band from this conversation.
>
> On 12/23/2016 02:54 AM, Christian König wrote:
>>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop
>>> or display server.
>> Assuming that the HMD is attached to the rendering device in some
>> way you have the X server and the Compositor which both try to be DRM
>> master at the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds
>> like it will simply not work. Or is this what Andres mentioned below
>> that Dave is working on?
>>
>> In addition to that, a compositor in combination with X is a bit
>> counterproductive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor
>> -> X server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>> On 2016-12-22 20:54, Pierre-Loup A. Griffais wrote:
>>> Display concerns are a separate issue, and as Andres said we have
>>> other plans to address. But yes, in general you don't want another
>>> compositor in the way, so we'll be acquiring the HMD display directly,
>>> separate from any desktop or display server. Same with security, we
>>> can have a separate conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>> Andres,
>>>>
>>>> Did you measure latency, etc. impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor in root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want
>>>>> the HMD to be used as part of the Desktop environment when a VR app
>>>>> is not in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>>> from a restrictive to a more flexible permission model would be
>>>>> inclusive, but going from a general to a restrictive model may exclude
>>>>> some apps that used to work.
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>> Hi Andres,
>>>>>>
>>>>>> well, using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM
>>>>>> master,
>>>>>> but for example with X the X server is master (and e.g. can change
>>>>>> resolutions etc..) and not the compositor.
>>>>>>
>>>>>> So another question is also what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different ?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 2016-12-20 16:51, Andres Rodriguez wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore, if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return -EPERM.
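As a rough sketch of that gate (assuming the intended errno is the standard EPERM, and with ctx_alloc_check() as a purely illustrative stand-in for the real amdgpu context-creation path, not actual driver code):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical sketch of the proposed root-only gate; the names
 * ctx_priority and ctx_alloc_check() are illustrative, not the
 * actual amdgpu interface. */
enum ctx_priority { CTX_PRIORITY_NORMAL, CTX_PRIORITY_HIGH };

/* Return 0 on success, or -EPERM when a non-root caller requests
 * the high priority flag, as described above. */
static int ctx_alloc_check(enum ctx_priority prio, bool is_root)
{
    if (prio == CTX_PRIORITY_HIGH && !is_root)
        return -EPERM;
    return 0;
}
```

In the real driver the is_root decision would presumably come from something like capable(CAP_SYS_ADMIN) rather than a parameter.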
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application
>>>>>>>> could
>>>>>>>> use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL app suddenly taking all our rendering
>>>>>>>> resources away, and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 2016-12-19 15:48, Serguei Sagalovitch wrote:
>>>>>>>>> > If compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equal with setting job queue to high priority I think.
>>>>>>>>> The only risk is the situation where graphics takes all the needed
>>>>>>>>> CUs. But in any case it should be a very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it, or would it be a lot of work for you?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>> Do you encounter the priority issue for compute queue with
>>>>>>>>>> current driver?
>>>>>>>>>>
>>>>>>>>>> If the compute queue is occupied only by you, the efficiency is
>>>>>>>>>> equal to setting the job queue to high priority, I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>>>>>> Yes, vulkan is available on all-open through the mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro
>>>>>>>>>>>>>> driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on the bringing up our VR runtime on
>>>>>>>>>>>>>>> top of
>>>>>>>>>>>>>>> amgpu;
>>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>>> you do partitioning on your own? I would think that there is
>>>>>>>>>>>>>>>> a need to avoid overcommit in the VR case to prevent any BO
>>>>>>>>>>>>>>>> migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is
>>>>>>>>>>>>>>> setting up
>>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (eg. this
>>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>>>>>>>> constraints. BTW: I am not sure if we are able to share Vulkan
>>>>>>>>>>>>>>>> sync objects across the process boundary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The latency is our main concern,
>>>>>>>>>>>>>>>> I would assume that this is a known problem (at least for
>>>>>>>>>>>>>>>> compute usage).
>>>>>>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>>> intensive (at least in the default configuration).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>>> However, if there are high degrees of variance then that would
>>>>>>>>>>>>>>> be troublesome and we would need to account for the worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>>>> as I understand (by simplifying), some scheduling is per pipe.
>>>>>>>>>>>>>>>>> I know about the current allocation scheme but I do not think
>>>>>>>>>>>>>>>>> that it is ideal. I would assume that we need to switch to
>>>>>>>>>>>>>>>>> dynamic partitioning of resources based on the workload,
>>>>>>>>>>>>>>>>> otherwise we will have resource conflicts between Vulkan
>>>>>>>>>>>>>>>>> compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can start
>>>>>>>>>>>>>>>> with a solution that assumes that only pipe0 has any work and
>>>>>>>>>>>>>>>> the other pipes are idle (no HSA/ROCm running on the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider that a separate task, because making it dynamic is
>>>>>>>>>>>>>>>> not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the case
>>>>>>>>>>>>>>>>> of VR we will have one main application ("console" mode(?)),
>>>>>>>>>>>>>>>>> so we could temporarily "ignore" OpenCL/ROCm needs when VR is
>>>>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>> 1) Game process
>>>>>>>>>>>>>>>> 2) VR Compositor (this is the process that will
>>>>>>>>>>>>>>>> require
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>>>> (a) it may take time, so latency may suffer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern; we want something that is
>>>>>>>>>>>>>>>> predictable. A good illustration of what the reprojection
>>>>>>>>>>>>>>>> scheduling looks like can be found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) to preempt we need to have a different "context" - we
>>>>>>>>>>>>>>>>> want to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>>> be executed in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume, or do you
>>>>>>>>>>>>>>>>> want "preempt" and "cancel/abort"?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use Vulkan compute. But if you figure
>>>>>>>>>>>>>>>> out a way for us to get a guaranteed execution time using
>>>>>>>>>>>>>>>> Vulkan graphics, then I'll take you out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding to the high-priority queue when it will be
>>>>>>>>>>>>>>>> in use and "free" them later (we do not want to take CUs away
>>>>>>>>>>>>>>>> from e.g. the graphics task forever and degrade graphics
>>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics task
>>>>>>>>>>>>>>>> (or low-priority compute) takes all (extra) CUs and the
>>>>>>>>>>>>>>>> high-priority queue will wait for the needed resources.
>>>>>>>>>>>>>>>> This will not be visible with "NOP" packets but only when you
>>>>>>>>>>>>>>>> submit a "real" compute task, so I would recommend not using
>>>>>>>>>>>>>>>> "NOP" packets at all for testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>>>>>>>> everything is going via the kernel (e.g. as part of frame
>>>>>>>>>>>>>>>> submission), but I must admit that I am not sure about the
>>>>>>>>>>>>>>>> best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the
>>>>>>>>>>>>>>>> "scheduler", when deciding which queue to run, will check if
>>>>>>>>>>>>>>>> there are enough resources, and if not it will begin to check
>>>>>>>>>>>>>>>> other queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue? (as
>>>>>>>>>>>>>>>> opposed to the MEC definition of pipe, which is a grouping of
>>>>>>>>>>>>>>>> queues). I say this because amdgpu only has access to 1 pipe,
>>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>>> as I understand (by simplifying), some scheduling is per pipe.
>>>>>>>>>>>>>>>> I know about the current allocation scheme but I do not think
>>>>>>>>>>>>>>>> that it is ideal. I would assume that we need to switch to
>>>>>>>>>>>>>>>> dynamic partitioning of resources based on the workload,
>>>>>>>>>>>>>>>> otherwise we will have resource conflicts between Vulkan
>>>>>>>>>>>>>>>> compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the case
>>>>>>>>>>>>>>>> of VR we will have one main application ("console" mode(?)),
>>>>>>>>>>>>>>>> so we could temporarily "ignore" OpenCL/ROCm needs when VR is
>>>>>>>>>>>>>>>> running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>>> GFX workloads.
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>>> doesn't work well. But if it starts working well with
>>>>>>>>>>>>>>>> Polaris10, it might be a better solution for
>>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of a graphics task:
>>>>>>>>>>>>>>>> (a) it may take time, so latency may suffer; (b) to preempt we
>>>>>>>>>>>>>>>> need to have a different "context" - we want to guarantee that
>>>>>>>>>>>>>>>> submissions from the same context will be executed in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume, or do you
>>>>>>>>>>>>>>>> want "preempt" and "cancel/abort"? (b) Vulkan is a generic API
>>>>>>>>>>>>>>>> and could be used for graphics as well as for plain compute
>>>>>>>>>>>>>>>> tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to effectively
>>>>>>>>>>>>>>>> schedule high priority VR reprojection tasks (also referred to
>>>>>>>>>>>>>>>> as time-warping) for Polaris10 running on the amdgpu kernel
>>>>>>>>>>>>>>>> driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>>> scenarios where the game or application would fail to
>>>>>>>>>>>>>>>> finish
>>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>>> confidence that the reprojection task will complete before the
>>>>>>>>>>>>>>>> VBLANK interval, even if the GFX pipe is currently full of
>>>>>>>>>>>>>>>> work from the game/application (which is most likely the
>>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>>>>>> One of the most exciting new developments in GPU
>>>>>>>>>>>>>>>> technology
>>>>>>>>>>>>>>>> over the
>>>>>>>>>>>>>>>> past year has been the adoption of asynchronous shaders,
>>>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Job round trip time must be predictable, from
>>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism should provide low submission
>>>>>>>>>>>>>>>> latencies
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>>> hardware should be equivalent to submitting a NOP on idle
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The above guarantees should also be respected by
>>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would be good to have for consistency, but it is not
>>>>>>>>>>>>>>>> strictly necessary, as users running games are not
>>>>>>>>>>>>>>>> traditionally running HPC workloads in the background.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>>> acquire hardware resources previously in use by other
>>>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>> * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>> * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>>
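As a rough sketch of the register programming (placeholder offsets and a mocked register file here; the real driver uses the generated register headers, WREG32(), and must do this with the queue selected via srbm_select()):

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder offsets; the real values come from the generated
 * register headers for the ASIC. */
#define mmCP_HQD_PIPE_PRIORITY  0x1
#define mmCP_HQD_QUEUE_PRIORITY 0x2

/* Mock register file standing in for WREG32()/RREG32(). */
static uint32_t mock_regs[16];
static void wreg32(uint32_t reg, uint32_t val) { mock_regs[reg] = val; }

/* Program the priority of the currently selected HQD. In the real
 * driver this runs with the queue selected through the SRBM index
 * registers and under the SRBM mutex. */
static void hqd_set_priority(uint32_t pipe_prio, uint32_t queue_prio)
{
    wreg32(mmCP_HQD_PIPE_PRIORITY, pipe_prio);
    wreg32(mmCP_HQD_QUEUE_PRIORITY, queue_prio);
}
```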
>>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>> * 7x regular
>>>>>>>>>>>>>>>> * 1x high priority
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>> * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>> * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>> * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>>
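For illustration, the proposed level would slot in between the existing values roughly like this (a sketch only; names and ordering of the real enum live in the amdgpu scheduler headers):

```c
#include <assert.h>

/* Illustrative sketch of the scheduler priority levels with the
 * proposed AMD_SCHED_PRIORITY_HIGH inserted between NORMAL and
 * KERNEL. Not the actual gpu_scheduler.h definition. */
enum amd_sched_priority {
    AMD_SCHED_PRIORITY_NORMAL = 0,
    AMD_SCHED_PRIORITY_HIGH,    /* proposed new level */
    AMD_SCHED_PRIORITY_KERNEL,  /* highest, reserved for the kernel */
    AMD_SCHED_PRIORITY_MAX
};
```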
>>>>>>>>>>>>>>>> The user will request a high priority context by
>>>>>>>>>>>>>>>> setting an
>>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
>>>>>>>>>>>>>>>> similar):
>>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>>>> * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>> * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>>> process
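A hypothetical userspace request could look like the following (the struct is simplified and AMDGPU_CTX_HIGH_PRIORITY is the proposed flag, not an existing uapi define; the real drm_amdgpu_ctx_in is in include/uapi/drm/amdgpu_drm.h):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the uapi struct. */
struct drm_amdgpu_ctx_in {
    uint32_t op;
    uint32_t flags;
};

#define AMDGPU_CTX_OP_ALLOC_CTX  1         /* existing context op */
#define AMDGPU_CTX_HIGH_PRIORITY (1u << 0) /* proposed flag */

/* Fill the ioctl argument for allocating a context, optionally
 * marking it high priority. */
static void ctx_alloc_args(struct drm_amdgpu_ctx_in *in, int high_prio)
{
    in->op = AMDGPU_CTX_OP_ALLOC_CTX;
    in->flags = high_prio ? AMDGPU_CTX_HIGH_PRIORITY : 0;
}
```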
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>>> priorities at amdgpu_init() time, the SW scheduler will
>>>>>>>>>>>>>>>> reprogram the queue priorities dynamically when scheduling
>>>>>>>>>>>>>>>> a task.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would involve having a hardware specific callback
>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>>> int priority)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>>> the scheduler.
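The callback and its locking could be sketched as below (mutex and register writes are mocked; in the driver this would be the SRBM mutex taken around the HQD priority writes):

```c
#include <assert.h>

/* Mocked SRBM mutex and recorded state, standing in for the real
 * srbm_mutex plus mmCP_HQD_*_PRIORITY register writes. */
static int srbm_locked;
static int last_ring, last_index, last_priority;

static void srbm_lock(void)   { assert(!srbm_locked); srbm_locked = 1; }
static void srbm_unlock(void) { assert(srbm_locked);  srbm_locked = 0; }

/* Proposed HW-specific callback the scheduler would invoke when the
 * next job's context priority differs from the queue's current
 * programming. Signature is the one proposed in this RFC. */
static void set_priority(int ring, int index, int priority)
{
    srbm_lock();              /* HQD registers are SRBM-indexed */
    last_ring = ring;
    last_index = index;
    last_priority = priority; /* would write the HQD priority fields */
    srbm_unlock();
}
```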
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>>> program a range of priorities for jobs instead of a single
>>>>>>>>>>>>>>>> "high priority" value, achieving something similar to the
>>>>>>>>>>>>>>>> niceness API available for CPU scheduling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we
>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched
>>>>>>>>>>>>>>>> ASAP,
>>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>>> ahead of anything not committed to the HW queue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>>>> suggestions for improvements or alternative strategies we
>>>>>>>>>>>>>>>> are more than happy to hear them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>
>>
>