[RFC] Mechanism for high priority scheduling in amdgpu
Andres Rodriguez
andresx7 at gmail.com
Fri Dec 23 16:30:06 UTC 2016
I'm actually testing that out today.
Example prototype patch:
https://github.com/lostgoat/linux/commit/c9d88d409d8655d63aa8386edc66687a2ba64a12
My goal is to first implement this approach, then slowly work my way
towards the HW level optimizations.
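For reference, the scheduler-level approach can be sketched as a priority-aware run-queue selection. This is only an illustration: the names below are invented for the sketch and do not match the real amdgpu/gpu-scheduler symbols, which keep a job list per entity rather than a plain counter.

```c
/* Illustrative sketch of priority injection at the software scheduler
 * level: one run queue per priority level, and the scheduler always
 * services the highest non-empty level first. Names are hypothetical,
 * not actual amdgpu symbols. */
enum sched_priority { PRIO_HIGH, PRIO_NORMAL, PRIO_LOW, PRIO_COUNT };

struct run_queue {
    int jobs;   /* pending job count (stand-in for a real job list) */
};

struct scheduler {
    struct run_queue rq[PRIO_COUNT];
};

/* Returns the priority level to service next, or -1 if all idle. */
static int pick_next_rq(const struct scheduler *s)
{
    for (int p = PRIO_HIGH; p < PRIO_COUNT; p++)
        if (s->rq[p].jobs > 0)
            return p;
    return -1;
}
```

Note that this only reorders work still sitting in the software queues; anything already committed to the HW rings is out of its reach.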
The problem I expect to see with this approach is that there will still be
unpredictably long latencies depending on what has been committed to the HW
rings.
But it is definitely a good start.
Regards,
Andres
On Fri, Dec 23, 2016 at 11:20 AM, Bridgman, John <John.Bridgman at amd.com>
wrote:
> One question I just remembered - the amdgpu driver includes some scheduler
> logic which maintains per-process queues and therefore avoids loading up
> the primary ring with a ton of work.
>
>
> Has there been any experimentation with injecting priorities at that level
> rather than jumping straight to HW-level changes ?
> ------------------------------
> *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> Andres Rodriguez <andresx7 at gmail.com>
> *Sent:* December 23, 2016 11:13 AM
> *To:* Koenig, Christian
> *Cc:* Zhou, David(ChunMing); Huan, Alvin; Mao, David; Sagalovitch,
> Serguei; amd-gfx at lists.freedesktop.org; Andres Rodriguez; Pierre-Loup A.
> Griffais; Zhang, Hawking
>
> *Subject:* Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hey Christian,
>
> But yes, in general you don't want another compositor in the way, so we'll
>>> be acquiring the HMD display directly, separate from any desktop or display
>>> server.
>>
>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master at
>> the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that
>> Dave is working on?
>>
>
> You are correct on both statements. We can't have two DRM_MASTERs, so the
> current DRM+X does not support this use case. And this is what Dave and
> Pierre-Loup are currently working on.
>
> In addition to that, a compositor in combination with X is a bit
>> counterproductive when you want to keep the latency low.
>>
>
> One thing I'd like to correct is that our main goal is to get latency
> _predictable_; making it low is a secondary goal.
>
> The high priority queue feature addresses our main source of
> unpredictability: the scheduling latency when the hardware is already full
> of work from the game engine.
>
> The DirectMode feature addresses one of the latency sources: multiple
> (unnecessary) context switches to submit a surface to the DRM driver.
>
> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>>
>
> We are pretty enthusiastic about Wayland (and really glad to see Fedora 25
> use Wayland by default). Once we have everything working nicely under X
> (where most of the users are currently), I'm sure Pierre-Loup will be
> pushing us to get everything optimized under Wayland as well (which should
> be a lot simpler!).
>
> Ever since working with SurfaceFlinger on Android with explicit fencing
> I've been waiting for the day I can finally ditch X altogether :)
>
> Regards,
> Andres
>
>
> On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig at amd.com
> > wrote:
>
>> But yes, in general you don't want another compositor in the way, so
>>> we'll be acquiring the HMD display directly, separate from any desktop or
>>> display server.
>>>
>> Assuming that the HMD is attached to the rendering device in some way
>> you have the X server and the Compositor which both try to be DRM master at
>> the same time.
>>
>> Please correct me if that was fixed in the meantime, but that sounds like
>> it will simply not work. Or is this what Andres mentioned below that
>> Dave is working on?
>>
>> In addition to that, a compositor in combination with X is a bit
>> counterproductive when you want to keep the latency low.
>>
>> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
>> data to be displayed is from the Application -> X server -> compositor -> X
>> server.
>>
>> The extra step between X server and compositor just means extra latency
>> and for this use case you probably don't want that.
>>
>> Targeting something like Wayland, with XWayland when you need X
>> compatibility, sounds like the much better idea.
>>
>> Regards,
>> Christian.
>>
>>
>> On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>>
>>> Display concerns are a separate issue, and as Andres said we have other
>>> plans to address them. But yes, in general you don't want another compositor in
>>> the way, so we'll be acquiring the HMD display directly, separate from any
>>> desktop or display server. Same with security, we can have a separate
>>> conversation about that when the time comes.
>>>
>>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>>
>>>> Andres,
>>>>
>>>> Did you measure the latency (etc.) impact of __any__ compositor?
>>>>
>>>> My understanding is that VR has pretty strict requirements related to
>>>> QoS.
>>>>
>>>> Sincerely yours,
>>>> Serguei Sagalovitch
>>>>
>>>>
>>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>>
>>>>> Hey Christian,
>>>>>
>>>>> We are currently interested in X, but with some distros switching to
>>>>> other compositors by default, we also need to consider those.
>>>>>
>>>>> We agree, running the full vrcompositor as root isn't something that
>>>>> we want to do. Too many security concerns. Having a small root helper
>>>>> that does the privilege escalation for us is the initial idea.
>>>>>
>>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>>> with the "two compositors" scenario a little better in DRM+X.
>>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>>> in use (this is extremely annoying).
>>>>>
>>>>> When the above is settled, we should have an auth mechanism besides
>>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>>> IOCTL is probably going to be the final solution.
>>>>>
>>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>>> from a restrictive to a more flexible permission model would be
>>>>> backwards compatible, but going from a general to a restrictive model
>>>>> may exclude some apps that used to work.
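A minimal sketch of that ROOT_ONLY starting point at context allocation could look like the following. This is illustrative only: the flag names and the helper are assumptions (not real amdgpu UAPI), and in the kernel the root check would be something like capable(CAP_SYS_ADMIN) rather than a boolean parameter.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical context priority flags for the sketch. */
#define CTX_PRIORITY_NORMAL 0
#define CTX_PRIORITY_HIGH   1

/* Sketch of the proposed permission check at context allocation:
 * requesting high priority fails with -EPERM unless the caller is
 * privileged. The boolean stands in for an in-kernel capability or
 * future DRM auth check. */
static int validate_ctx_priority(int requested, bool caller_is_root)
{
    if (requested == CTX_PRIORITY_HIGH && !caller_is_root)
        return -EPERM;   /* fail the call, as proposed */
    return 0;            /* normal priority stays open to everyone */
}
```

Starting restrictive like this keeps the door open to swapping the boolean for the eventual HMD-takeover auth token without breaking existing users.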
>>>>>
>>>>> Regards,
>>>>> Andres
>>>>>
>>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>>
>>>>>> Hi Andres,
>>>>>>
>>>>>> well using root might cause stability and security problems as well.
>>>>>> We worked quite hard to avoid exactly this for X.
>>>>>>
>>>>>> We could make this feature depend on the compositor being DRM master,
>>>>>> but for example with X the X server is master (and e.g. can change
>>>>>> resolutions etc..) and not the compositor.
>>>>>>
>>>>>> So another question is also what windowing system (if any) are you
>>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>>> different ?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>>> make the high priority queues accessible to root only.
>>>>>>>
>>>>>>> Therefore, if a non-root user attempts to set the high priority flag
>>>>>>> on context allocation, we would fail the call and return -EPERM.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andres
>>>>>>>
>>>>>>>
>>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>>
>>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>>> the VR compositor or otherwise any malfunctioning application could
>>>>>>>> use it.
>>>>>>>>
>>>>>>>> Just think about some WebGL suddenly taking all our rendering away
>>>>>>>> and we won't get anything drawn any more.
>>>>>>>>
>>>>>>>> Alex or Michel any ideas on that?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>>>>>
>>>>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>>>>> > is equal to setting the job queue to high priority, I think.
>>>>>>>>> The only risk is the situation where graphics takes all the
>>>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>>>
>>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>>
>>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW: If there is a non-VR application which uses the high-priority
>>>>>>>>> h/w queue then the VR application will suffer. Any ideas how
>>>>>>>>> to solve it?
>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>>
>>>>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>>>>> the current driver?
>>>>>>>>>>
>>>>>>>>>> If the compute queue is occupied only by you, the efficiency is
>>>>>>>>>> equal to setting the job queue to high priority, I think.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> David Zhou
>>>>>>>>>>
>>>>>>>>>> On December 19, 2016 at 13:29, Andres Rodriguez wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, Vulkan is available on the all-open stack through the Mesa radv UMD.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>>> vulkan level that would be great.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>>
>>>>>>>>>>> - Andres
>>>>>>>>>>>
>>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On December 19, 2016 at 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>>
>>>>>>>>>>>> Of course.
>>>>>>>>>>>> Does open stack have Vulkan support?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> By the way, are you using all-open driver or amdgpu-pro
>>>>>>>>>>>>>> driver?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On December 18, 2016 at 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>>> you do partitioning on your own? I would think that there
>>>>>>>>>>>>>>>> is a need to avoid overcommit in the VR case to prevent any
>>>>>>>>>>>>>>>> BO migration.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting
>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this
>>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>>>>>>>> constraints. BTW: I am not sure if we are able to share
>>>>>>>>>>>>>>>> Vulkan sync objects across the process boundary.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would assume that this is the known problem (at least for
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> usage).
>>>>>>>>>>>>>>>> It looks like amdgpu kernel submission is rather CPU
>>>>>>>>>>>>>>>> intensive (at least in the default configuration).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>>> However, if
>>>>>>>>>>>>>>> there's high degrees of variance then that would be
>>>>>>>>>>>>>>> troublesome and we
>>>>>>>>>>>>>>> would need to account for the worst case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> - Pierre-Loup
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As
>>>>>>>>>>>>>>>>> far as I understand (by simplifying), some scheduling is
>>>>>>>>>>>>>>>>> per pipe. I know about the current allocation scheme but I
>>>>>>>>>>>>>>>>> do not think that it is ideal. I would assume that we need
>>>>>>>>>>>>>>>>> to switch to dynamic partitioning of resources based on
>>>>>>>>>>>>>>>>> the workload, otherwise we will have resource conflicts
>>>>>>>>>>>>>>>>> between Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>>> consider that a separate task, because making it dynamic is
>>>>>>>>>>>>>>>> not straightforward :P
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm
>>>>>>>>>>>>>>>>> needs when VR is running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan
>>>>>>>>>>>>>>>> later.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>> 1) Game process
>>>>>>>>>>>>>>>> 2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>> 3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>>> (a) it
>>>>>>>>>>>>>>>>> may take time so
>>>>>>>>>>>>>>>>> latency may suffer
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern, we want something that is
>>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) to preempt we need to have different "context" - we want
>>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will
>>>>>>>>>>>>>>>>> be executed
>>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>>> "cancel/abort"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume
>>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) Vulkan is generic API and could be used for graphics
>>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yeah, the plan is to use vulkan compute. But if you figure
>>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>>> a guaranteed execution time using vulkan graphics, then
>>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>>> assignments/binding to the high-priority queue when it is
>>>>>>>>>>>>>>>> in use and "free" them later (we do not want to take CUs
>>>>>>>>>>>>>>>> from e.g. the graphics task forever and degrade graphics
>>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics
>>>>>>>>>>>>>>>> task (or low-priority compute) takes all the (extra) CUs
>>>>>>>>>>>>>>>> and high-priority work will wait for the needed resources.
>>>>>>>>>>>>>>>> It will not be visible with "NOP" packets but only when you
>>>>>>>>>>>>>>>> submit a "real" compute task, so I would recommend not
>>>>>>>>>>>>>>>> using "NOP" packets at all for testing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It (CU assignment) could be relatively easily done when
>>>>>>>>>>>>>>>> everything is going via the kernel
>>>>>>>>>>>>>>>> (e.g. as part of frame submission) but I must admit that I
>>>>>>>>>>>>>>>> am not sure
>>>>>>>>>>>>>>>> about the best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the
>>>>>>>>>>>>>>>> "scheduler", when deciding which queue to run, will check
>>>>>>>>>>>>>>>> if there are enough resources, and if not it will begin to
>>>>>>>>>>>>>>>> check other queues with lower priority.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>>> as I understand (by simplifying), some scheduling is per
>>>>>>>>>>>>>>>> pipe. I know about the current allocation scheme but I do
>>>>>>>>>>>>>>>> not think that it is ideal. I would assume that we need to
>>>>>>>>>>>>>>>> switch to dynamic partitioning of resources based on the
>>>>>>>>>>>>>>>> workload, otherwise we will have resource conflicts between
>>>>>>>>>>>>>>>> Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs
>>>>>>>>>>>>>>>> when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>>> currently running
>>>>>>>>>>>>>>>> graphics job and scheduling in
>>>>>>>>>>>>>>>> something else using mid-buffer pre-emption has some cases
>>>>>>>>>>>>>>>> where it
>>>>>>>>>>>>>>>> doesn't work well. But if with
>>>>>>>>>>>>>>>> Polaris10 it starts working well, it might be a better
>>>>>>>>>>>>>>>> solution for
>>>>>>>>>>>>>>>> us (because the whole reprojection
>>>>>>>>>>>>>>>> work uses the vulkan graphics stack at the moment, and
>>>>>>>>>>>>>>>> porting it to
>>>>>>>>>>>>>>>> compute is not trivial).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of graphics task:
>>>>>>>>>>>>>>>> (a) it may
>>>>>>>>>>>>>>>> take time so
>>>>>>>>>>>>>>>> latency may suffer (b) to preempt we need to have different
>>>>>>>>>>>>>>>> "context"
>>>>>>>>>>>>>>>> - we want
>>>>>>>>>>>>>>>> to guarantee that submissions from the same context will be
>>>>>>>>>>>>>>>> executed
>>>>>>>>>>>>>>>> in order.
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want
>>>>>>>>>>>>>>>> "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"? (b) Vulkan is generic API and could be used
>>>>>>>>>>>>>>>> for graphics as well as for plain compute tasks
>>>>>>>>>>>>>>>> (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>>> confidence that the
>>>>>>>>>>>>>>>> reprojection task will complete before the VBLANK interval,
>>>>>>>>>>>>>>>> even if the GFX pipe is currently full of work from the
>>>>>>>>>>>>>>>> game/application (which is most likely the case).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Gaming: Asynchronous Shaders Evolved | Community
>>>>>>>>>>>>>>>> community.amd.com
>>>>>>>>>>>>>>>> One of the most exciting new developments in GPU technology
>>>>>>>>>>>>>>>> over the
>>>>>>>>>>>>>>>> past year has been the adoption of asynchronous shaders,
>>>>>>>>>>>>>>>> which can
>>>>>>>>>>>>>>>> make more efficient use of ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Job round trip time must be predictable, from
>>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>>> capabilities in Polaris10 we
>>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an
>>>>>>>>>>>>>>>> idea,
>>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * The above guarantees should also be respected by
>>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the windows driver, we could expose a high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with
>>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>>> acquire hardware resources previously in use by other
>>>>>>>>>>>>>>>> queues.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>>> field in the HQDs, which could be programmed by amdgpu or the
>>>>>>>>>>>>>>>> amdgpu scheduler. The relevant
>>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>> * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>> * mmCP_HQD_QUEUE_PRIORITY
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>> * 7x regular
>>>>>>>>>>>>>>>> * 1x high priority
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>>> priority rings if the
>>>>>>>>>>>>>>>> context is marked as high priority, and a corresponding
>>>>>>>>>>>>>>>> priority level should be
>>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>> * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>> * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>> * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>>> https://github.com/torvalds/li
>>>>>>>>>>>>>>>> nux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>>>> * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>> * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>>> process
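To make the static-partitioning idea concrete, here is a minimal sketch of how the per-context flag could flow into ring selection under a 7+1 split. The flag name, enum values, and round-robin mapping are all assumptions for illustration, not the real uapi or scheduler code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag for drm_amdgpu_ctx_in.flags; the real name and
 * value would be decided when the uapi change lands. */
#define AMDGPU_CTX_HIGH_PRIORITY (1u << 0)

enum amd_sched_priority {
	AMD_SCHED_PRIORITY_NORMAL,
	AMD_SCHED_PRIORITY_HIGH,   /* new level proposed above */
	AMD_SCHED_PRIORITY_KERNEL,
};

#define NUM_COMPUTE_RINGS 8
#define HIGH_PRIO_RING    (NUM_COMPUTE_RINGS - 1) /* one ring reserved */

/* Derive the context priority from the (assumed) creation flags. */
static enum amd_sched_priority ctx_flags_to_priority(uint32_t flags)
{
	return (flags & AMDGPU_CTX_HIGH_PRIORITY) ?
		AMD_SCHED_PRIORITY_HIGH : AMD_SCHED_PRIORITY_NORMAL;
}

/* Map a context to a compute ring: high-priority contexts always land
 * on the reserved ring, others spread over the remaining 7 rings. */
static int pick_compute_ring(enum amd_sched_priority prio,
			     unsigned int ctx_id)
{
	if (prio >= AMD_SCHED_PRIORITY_HIGH)
		return HIGH_PRIO_RING;
	return ctx_id % (NUM_COMPUTE_RINGS - 1);
}
```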
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>>> priorities at
>>>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This would involve having a hardware-specific callback from
>>>>>>>>>>>>>>>> the scheduler to
>>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>>> int priority)
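A rough sketch of how such a callback could be wired up is below; the struct, function names, and the stand-in backend are placeholders for illustration, not the real amdgpu_ring_funcs layout.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a HW-specific hook the scheduler could invoke when it
 * picks a job; names are placeholders, not the real driver API. */
struct ring_priority_funcs {
	void (*set_priority)(int ring, int index, int priority);
};

static int last_ring, last_index, last_priority;

/* Stand-in backend: a real implementation would grab the SRBM mutex,
 * select the queue, and write the CP_HQD_*_PRIORITY registers here.
 * This stub just records its arguments so the flow can be observed. */
static void gfx_v8_set_priority(int ring, int index, int priority)
{
	last_ring = ring;
	last_index = index;
	last_priority = priority;
}

/* Scheduler side: reprogram the queue priority before committing
 * the job to the hardware ring. */
static void sched_run_job(const struct ring_priority_funcs *funcs,
			  int ring, int index, int priority)
{
	if (funcs->set_priority)
		funcs->set_priority(ring, index, priority);
	/* ... emit the job to the ring here ... */
}
```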
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>>> value,
>>>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>>> with support of the SW scheduler. This will function
>>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>>> anything not committed to the HW queue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>>> be lesser (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> amd-gfx Info Page - lists.freedesktop.org
>>>>>>>>>>>>>>>> lists.freedesktop.org
>>>>>>>>>>>>>>>> To see the collection of prior postings to the list, visit
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> amd-gfx Archives. Using amd-gfx: To post a message to all
>>>>>>>>>>>>>>>> the list
>>>>>>>>>>>>>>>> members, send email ...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Sincerely yours,
>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>