[RFC] Mechanism for high priority scheduling in amdgpu

Andres Rodriguez andresx7 at gmail.com
Fri Dec 23 16:13:01 UTC 2016


Hey Christian,

>> But yes, in general you don't want another compositor in the way, so we'll
>> be acquiring the HMD display directly, separate from any desktop or display
>> server.
>
>
> Assuming that the HMD is attached to the rendering device in some way
> you have the X server and the Compositor which both try to be DRM master at
> the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds like
> it will simply not work. Or is this what Andres mentioned below that Dave
> is working on?
>

You are correct on both statements. We can't have two DRM_MASTERs, so the
current DRM+X does not support this use case. And this is what Dave and
Pierre-Loup are currently working on.

> Additionally, a compositor in combination with X is a bit
> counterproductive when you want to keep the latency low.
>

One thing I'd like to correct: our main goal is to make latency
_predictable_; making it low is a secondary goal.

The high priority queue feature addresses our main source of
unpredictability: the scheduling latency when the hardware is already full
of work from the game engine.

The DirectMode feature addresses one of the latency sources: multiple
(unnecessary) context switches to submit a surface to the DRM driver.

> Targeting something like Wayland, and using XWayland when you need X
> compatibility, sounds like the much better idea.
>

We are pretty enthusiastic about Wayland (and really glad to see Fedora 25
use Wayland by default). Once we have everything working nicely under X
(where most of the users are currently), I'm sure Pierre-Loup will be
pushing us to get everything optimized under Wayland as well (which should
be a lot simpler!).

Ever since working with SurfaceFlinger on Android with explicit fencing
I've been waiting for the day I can finally ditch X altogether :)

Regards,
Andres


On Fri, Dec 23, 2016 at 5:54 AM, Christian König <christian.koenig at amd.com>
wrote:

> But yes, in general you don't want another compositor in the way, so we'll
>> be acquiring the HMD display directly, separate from any desktop or display
>> server.
>>
> Assuming that the HMD is attached to the rendering device in some way
> you have the X server and the Compositor which both try to be DRM master at
> the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds like
> it will simply not work. Or is this what Andres mentioned below that Dave
> is working on?
>
> Additionally, a compositor in combination with X is a bit counterproductive
> when you want to keep the latency low.
>
> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered data
> to be displayed is from the Application -> X server -> compositor -> X
> server.
>
> The extra step between X server and compositor just means extra latency
> and for this use case you probably don't want that.
>
> Targeting something like Wayland and when you need X compatibility
> XWayland sounds like the much better idea.
>
> Regards,
> Christian.
>
>
> On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>
>> Display concerns are a separate issue, and as Andres said we have other
>> plans to address them. But yes, in general you don't want another compositor in
>> the way, so we'll be acquiring the HMD display directly, separate from any
>> desktop or display server. Same with security, we can have a separate
>> conversation about that when the time comes.
>>
>> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>>
>>> Andres,
>>>
>>> Did you measure the latency, etc. impact of __any__ compositor?
>>>
>>> My understanding is that VR has pretty strict requirements related to
>>> QoS.
>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>>>
>>>> Hey Christian,
>>>>
>>>> We are currently interested in X, but with some distros switching to
>>>> other compositors by default, we also need to consider those.
>>>>
>>>> We agree, running the full vrcompositor as root isn't something that
>>>> we want to do. Too many security concerns. Having a small root helper
>>>> that does the privilege escalation for us is the initial idea.
>>>>
>>>> For a long term approach, Pierre-Loup and Dave are working on dealing
>>>> with the "two compositors" scenario a little better in DRM+X.
>>>> Fullscreen isn't really a sufficient approach, since we don't want the
>>>> HMD to be used as part of the Desktop environment when a VR app is not
>>>> in use (this is extremely annoying).
>>>>
>>>> When the above is settled, we should have an auth mechanism besides
>>>> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
>>>> HMD permanently away from X. Re-using that auth method to gate this
>>>> IOCTL is probably going to be the final solution.
>>>>
>>>> I propose to start with ROOT_ONLY since it should allow us to respect
>>>> kernel IOCTL compatibility guidelines with the most flexibility. Going
>>>> from a restrictive to a more flexible permission model only adds
>>>> capabilities, but going from a general to a restrictive model may break
>>>> some apps that used to work.
>>>>
>>>> Regards,
>>>> Andres
>>>>
>>>> On 12/22/2016 6:42 AM, Christian König wrote:
>>>>
>>>>> Hi Andres,
>>>>>
>>>>> well using root might cause stability and security problems as well.
>>>>> We worked quite hard to avoid exactly this for X.
>>>>>
>>>>> We could make this feature depend on the compositor being DRM master,
>>>>> but for example with X the X server is master (and e.g. can change
>>>>> resolutions etc..) and not the compositor.
>>>>>
>>>>> So another question is also: what windowing system (if any) are you
>>>>> planning to use? X, Wayland, Flinger or something completely
>>>>> different?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> That is definitely a concern. What we are currently thinking is to
>>>>>> make the high priority queues accessible to root only.
>>>>>>
>>>>>> Therefore if a non-root user attempts to set the high priority flag
>>>>>> on context allocation, we would fail the call and return -EPERM.
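>>>>>>
>>>>>> To make that concrete, a minimal sketch of the gate in the
>>>>>> context-create IOCTL (the flag name is a placeholder, and
>>>>>> capable(CAP_SYS_ADMIN) stands in for the root-only check):
>>>>>>
>>>>>>     /* in amdgpu_ctx_ioctl(), sketch only */
>>>>>>     union drm_amdgpu_ctx *args = data;
>>>>>>
>>>>>>     if ((args->in.flags & AMDGPU_CTX_FLAG_HIGH_PRIORITY) &&
>>>>>>         !capable(CAP_SYS_ADMIN))
>>>>>>             return -EPERM;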
>>>>>>
>>>>>> Regards,
>>>>>> Andres
>>>>>>
>>>>>>
>>>>>> On 12/20/2016 7:56 AM, Christian König wrote:
>>>>>>
>>>>>>>> BTW: If there is a non-VR application which will use the
>>>>>>>> high-priority h/w queue, then the VR application will suffer.
>>>>>>>> Any ideas how to solve it?
>>>>>>>>
>>>>>>> Yeah, that problem came to my mind as well.
>>>>>>>
>>>>>>> Basically we need to restrict those high priority submissions to
>>>>>>> the VR compositor, or otherwise any malfunctioning application could
>>>>>>> use them.
>>>>>>>
>>>>>>> Just think about some WebGL app suddenly taking all our rendering
>>>>>>> resources away, so that we won't get anything drawn any more.
>>>>>>>
>>>>>>> Alex or Michel any ideas on that?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>>>>>>>
>>>>>>>> > If the compute queue is occupied only by you, the efficiency
>>>>>>>> > is equal to setting the job queue to high priority, I think.
>>>>>>>> The only risk is the situation where graphics takes all the
>>>>>>>> needed CUs. But in any case it should be a very good test.
>>>>>>>>
>>>>>>>> Andres/Pierre-Loup,
>>>>>>>>
>>>>>>>> Did you try to do it, or is it a lot of work for you?
>>>>>>>>
>>>>>>>>
>>>>>>>> BTW: If there is a non-VR application which will use the
>>>>>>>> high-priority h/w queue, then the VR application will suffer.
>>>>>>>> Any ideas how to solve it?
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>>>>>>>>
>>>>>>>>> Do you encounter the priority issue for the compute queue with
>>>>>>>>> the current driver?
>>>>>>>>>
>>>>>>>>> If the compute queue is occupied only by you, the efficiency is
>>>>>>>>> equal to setting the job queue to high priority, I think.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> David Zhou
>>>>>>>>>
>>>>>>>>> On 2016-12-19 13:29, Andres Rodriguez wrote:
>>>>>>>>>
>>>>>>>>>> Yes, Vulkan is available on the all-open stack through the Mesa radv UMD.
>>>>>>>>>>
>>>>>>>>>> I'm not sure if I'm asking for too much, but if we can
>>>>>>>>>> coordinate a similar interface in radv and amdgpu-pro at the
>>>>>>>>>> Vulkan level, that would be great.
>>>>>>>>>>
>>>>>>>>>> I'm not sure what that's going to be yet.
>>>>>>>>>>
>>>>>>>>>> - Andres
>>>>>>>>>>
>>>>>>>>>> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2016-12-19 11:33, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We're currently working with the open stack; I assume that a
>>>>>>>>>>>> mechanism could be exposed by both open and Pro Vulkan
>>>>>>>>>>>> userspace drivers and that the amdgpu kernel interface
>>>>>>>>>>>> improvements we would pursue following this discussion would
>>>>>>>>>>>> let both drivers take advantage of the feature, correct?
>>>>>>>>>>>>
>>>>>>>>>>> Of course.
>>>>>>>>>>> Does the open stack have Vulkan support?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> David Zhou
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> By the way, are you using the all-open driver or the amdgpu-pro driver?
>>>>>>>>>>>>>
>>>>>>>>>>>>> +David Mao, who is working on our Vulkan driver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> David Zhou
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2016-12-18 06:05, Pierre-Loup A. Griffais wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also working on bringing up our VR runtime on top of
>>>>>>>>>>>>>> amdgpu;
>>>>>>>>>>>>>> see replies inline.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>>> running
>>>>>>>>>>>>>>>> actually:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So we could have a potential memory overcommit case, or do
>>>>>>>>>>>>>>> you do the partitioning on your own? I would think that
>>>>>>>>>>>>>>> there is a need to avoid overcommit in the VR case to
>>>>>>>>>>>>>>> prevent any BO migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You're entirely correct; currently the VR runtime is setting
>>>>>>>>>>>>>> up
>>>>>>>>>>>>>> prioritized CPU scheduling for its VR compositor, we're
>>>>>>>>>>>>>> working on
>>>>>>>>>>>>>> prioritized GPU scheduling and pre-emption (e.g. this
>>>>>>>>>>>>>> thread), and in
>>>>>>>>>>>>>> the future it will make sense to do work in order to make
>>>>>>>>>>>>>> sure that
>>>>>>>>>>>>>> its memory allocations do not get evicted, to prevent any
>>>>>>>>>>>>>> unwelcome
>>>>>>>>>>>>>> additional latency in the event of needing to perform
>>>>>>>>>>>>>> just-in-time
>>>>>>>>>>>>>> reprojection.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Do you mean __real__ processes or threads?
>>>>>>>>>>>>>>> Based on my understanding, sharing BOs between different
>>>>>>>>>>>>>>> processes could introduce additional synchronization
>>>>>>>>>>>>>>> constraints. BTW: I am not sure if we are able to share
>>>>>>>>>>>>>>> Vulkan sync objects across process boundaries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> They are different processes; it is important for the
>>>>>>>>>>>>>> compositor that
>>>>>>>>>>>>>> is responsible for quality-of-service features such as
>>>>>>>>>>>>>> consistently
>>>>>>>>>>>>>> presenting distorted frames with the right latency,
>>>>>>>>>>>>>> reprojection, etc,
>>>>>>>>>>>>>> to be separate from the main application.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently we are using unreleased cross-process memory and
>>>>>>>>>>>>>> semaphore
>>>>>>>>>>>>>> extensions to fetch updated eye images from the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> but the just-in-time reprojection discussed here does not
>>>>>>>>>>>>>> actually
>>>>>>>>>>>>>> have any direct interactions with cross-process resource
>>>>>>>>>>>>>> sharing,
>>>>>>>>>>>>>> since it's achieved by using whatever is the latest, most
>>>>>>>>>>>>>> up-to-date
>>>>>>>>>>>>>> eye images that have already been sent by the client
>>>>>>>>>>>>>> application,
>>>>>>>>>>>>>> which are already available to use without additional
>>>>>>>>>>>>>> synchronization.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, IMHO the best is to run in "full screen mode".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, we are working on mechanisms to present directly to the
>>>>>>>>>>>>>> headset
>>>>>>>>>>>>>> display without any intermediaries as a separate effort.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The latency is our main concern,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would assume that this is a known problem (at least for
>>>>>>>>>>>>>>> compute usage).
>>>>>>>>>>>>>>> It looks like amdgpu / kernel submission is rather CPU
>>>>>>>>>>>>>>> intensive (at least in the default configuration).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As long as it's a consistent cost, it shouldn't be an issue.
>>>>>>>>>>>>>> However, if there are high degrees of variance then that
>>>>>>>>>>>>>> would be troublesome and we would need to account for the
>>>>>>>>>>>>>> worst case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hopefully the requirements and approach we described make
>>>>>>>>>>>>>> sense, we're
>>>>>>>>>>>>>> looking forward to your feedback and suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>  - Pierre-Loup
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 10:00 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>>> as I understand (by simplifying) some scheduling is per
>>>>>>>>>>>>>>>> pipe. I know about the current allocation scheme but I do
>>>>>>>>>>>>>>>> not think that it is ideal. I would assume that we need to
>>>>>>>>>>>>>>>> switch to dynamic partitioning of resources based on the
>>>>>>>>>>>>>>>> workload, otherwise we will have resource conflicts between
>>>>>>>>>>>>>>>> Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the partitioning isn't ideal. I'm hoping we can
>>>>>>>>>>>>>>> start with a
>>>>>>>>>>>>>>> solution that assumes that
>>>>>>>>>>>>>>> only pipe0 has any work and the other pipes are idle (no
>>>>>>>>>>>>>>> HSA/ROCm
>>>>>>>>>>>>>>> running on the system).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This should be more or less the use case we expect from VR
>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree the split is currently not ideal, but I'd like to
>>>>>>>>>>>>>>> consider
>>>>>>>>>>>>>>> that a separate task, because
>>>>>>>>>>>>>>> making it dynamic is not straightforward :P
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the
>>>>>>>>>>>>>>>> case of VR we will have one main application ("console"
>>>>>>>>>>>>>>>> mode(?)) so we could temporarily "ignore" OpenCL/ROCm needs
>>>>>>>>>>>>>>>> when VR is running.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Correct, this is why we want to enable the high priority
>>>>>>>>>>>>>>> compute
>>>>>>>>>>>>>>> queue through
>>>>>>>>>>>>>>> libdrm-amdgpu, so that we can expose it through Vulkan later.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For current VR workloads we have 3 separate processes
>>>>>>>>>>>>>>> running actually:
>>>>>>>>>>>>>>>     1) Game process
>>>>>>>>>>>>>>>     2) VR Compositor (this is the process that will require
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority queue)
>>>>>>>>>>>>>>>     3) System compositor (we are looking at approaches to
>>>>>>>>>>>>>>> remove this
>>>>>>>>>>>>>>> overhead)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I think it is okay to assume no OpenCL/ROCm running
>>>>>>>>>>>>>>> simultaneously, but
>>>>>>>>>>>>>>> I would also like to be able to address this case in the
>>>>>>>>>>>>>>> future
>>>>>>>>>>>>>>> (cross-pipe priorities).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of graphics tasks:
>>>>>>>>>>>>>>>> (a) it may take time so latency may suffer
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Latency is our main concern; we want something that is
>>>>>>>>>>>>>>> predictable. A good
>>>>>>>>>>>>>>> illustration of what the reprojection scheduling looks like
>>>>>>>>>>>>>>> can be
>>>>>>>>>>>>>>> found here:
>>>>>>>>>>>>>>> https://community.amd.com/servlet/JiveServlet/showImage/38-
>>>>>>>>>>>>>>> 1310-104754/pastedImage_3.png
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) to preempt we need to have a different "context" - we
>>>>>>>>>>>>>>>> want to guarantee that submissions from the same context
>>>>>>>>>>>>>>>> will be executed in order.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is okay, as the reprojection work doesn't have
>>>>>>>>>>>>>>> dependencies on
>>>>>>>>>>>>>>> the game context, and it
>>>>>>>>>>>>>>> even happens in a separate process.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume or do you
>>>>>>>>>>>>>>>> want "preempt" and
>>>>>>>>>>>>>>>> "cancel/abort"?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Preempt the game with the compositor task and then resume it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (b) Vulkan is a generic API and could be used for graphics
>>>>>>>>>>>>>>>> as well as
>>>>>>>>>>>>>>>> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, the plan is to use Vulkan compute. But if you figure
>>>>>>>>>>>>>>> out a way
>>>>>>>>>>>>>>> for us to get
>>>>>>>>>>>>>>> a guaranteed execution time using Vulkan graphics, then
>>>>>>>>>>>>>>> I'll take you
>>>>>>>>>>>>>>> out for a beer :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 9:13 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please see inline (as [Serguei])
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 8:29 PM
>>>>>>>>>>>>>>> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: RE: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serguei,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the feedback. Answers inline as [AR].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
>>>>>>>>>>>>>>> Sent: Friday, December 16, 2016 8:15 PM
>>>>>>>>>>>>>>> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: Re: [RFC] Mechanism for high priority scheduling
>>>>>>>>>>>>>>> in amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andres,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Quick comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) To minimize "bubbles", etc. we need to "force" CU
>>>>>>>>>>>>>>> assignments/binding to the high-priority queue when it is in
>>>>>>>>>>>>>>> use and "free" them later (we do not want to take CUs away
>>>>>>>>>>>>>>> from e.g. a graphics task forever and degrade graphics
>>>>>>>>>>>>>>> performance).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otherwise we could have a scenario where a long graphics
>>>>>>>>>>>>>>> task (or low-priority compute) takes all the (extra) CUs and
>>>>>>>>>>>>>>> high-priority work waits for the needed resources.
>>>>>>>>>>>>>>> It will not be visible with "NOP" packets but only when you
>>>>>>>>>>>>>>> submit a "real" compute task, so I would recommend not to
>>>>>>>>>>>>>>> use "NOP" packets at all for testing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It (CU assignment) could be done relatively easily when
>>>>>>>>>>>>>>> everything goes via the kernel (e.g. as part of frame
>>>>>>>>>>>>>>> submission) but I must admit that I am not sure about the
>>>>>>>>>>>>>>> best way for user level submissions (amdkfd).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I wasn't aware of this part of the programming
>>>>>>>>>>>>>>> sequence. Thanks
>>>>>>>>>>>>>>> for the heads up!
>>>>>>>>>>>>>>> Is this similar to the CU masking programming?
>>>>>>>>>>>>>>> [Serguei] Yes. To simplify: the problem is that the
>>>>>>>>>>>>>>> "scheduler", when deciding which queue to run, will check if
>>>>>>>>>>>>>>> there are enough resources, and if not it will begin to
>>>>>>>>>>>>>>> check other queues with lower priority.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2) I would recommend dedicating the whole pipe to the
>>>>>>>>>>>>>>> high-priority queue and having nothing there except it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] I'm guessing in this context you mean pipe = queue?
>>>>>>>>>>>>>>> (as opposed
>>>>>>>>>>>>>>> to the MEC definition
>>>>>>>>>>>>>>> of pipe, which is a grouping of queues). I say this because
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>> only has access to 1 pipe,
>>>>>>>>>>>>>>> and the rest are statically partitioned for amdkfd usage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] No. I mean pipe :-) as the MEC defines it. As far
>>>>>>>>>>>>>>> as I understand (by simplifying) some scheduling is per
>>>>>>>>>>>>>>> pipe. I know about the current allocation scheme but I do
>>>>>>>>>>>>>>> not think that it is ideal. I would assume that we need to
>>>>>>>>>>>>>>> switch to dynamic partitioning of resources based on the
>>>>>>>>>>>>>>> workload, otherwise we will have resource conflicts between
>>>>>>>>>>>>>>> Vulkan compute and OpenCL.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BTW: Which user level API do you want to use for compute:
>>>>>>>>>>>>>>> Vulkan or
>>>>>>>>>>>>>>> OpenCL?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Vulkan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] Vulkan works via amdgpu (kernel submissions) so
>>>>>>>>>>>>>>> amdkfd will not be involved. I would assume that in the case
>>>>>>>>>>>>>>> of VR we will have one main application ("console" mode(?))
>>>>>>>>>>>>>>> so we could temporarily "ignore" OpenCL/ROCm needs when VR
>>>>>>>>>>>>>>> is running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> we will not be able to provide a solution compatible with
>>>>>>>>>>>>>>>> GFX
>>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I assume that you are talking about graphics? Am I right?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [AR] Yeah, my understanding is that pre-empting the
>>>>>>>>>>>>>>> currently running graphics job and scheduling in something
>>>>>>>>>>>>>>> else using mid-buffer pre-emption has some cases where it
>>>>>>>>>>>>>>> doesn't work well. But if it starts working well with
>>>>>>>>>>>>>>> polaris10, it might be a better solution for us (because the
>>>>>>>>>>>>>>> whole reprojection work uses the Vulkan graphics stack at
>>>>>>>>>>>>>>> the moment, and porting it to compute is not trivial).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [Serguei] The problem with pre-emption of graphics tasks:
>>>>>>>>>>>>>>> (a) it may take time so latency may suffer (b) to preempt we
>>>>>>>>>>>>>>> need to have a different "context" - we want to guarantee
>>>>>>>>>>>>>>> that submissions from the same context will be executed in
>>>>>>>>>>>>>>> order.
>>>>>>>>>>>>>>> BTW: (a) Do you want "preempt" and later resume, or do you
>>>>>>>>>>>>>>> want "preempt" and "cancel/abort"? (b) Vulkan is a generic
>>>>>>>>>>>>>>> API and could be used for graphics as well as for plain
>>>>>>>>>>>>>>> compute tasks (VK_QUEUE_COMPUTE_BIT).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sincerely yours,
>>>>>>>>>>>>>>> Serguei Sagalovitch
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on
>>>>>>>>>>>>>>> behalf of
>>>>>>>>>>>>>>> Andres Rodriguez <andresr at valvesoftware.com>
>>>>>>>>>>>>>>> Sent: December 16, 2016 6:15 PM
>>>>>>>>>>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> Subject: [RFC] Mechanism for high priority scheduling in
>>>>>>>>>>>>>>> amdgpu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This RFC is also available as a gist here:
>>>>>>>>>>>>>>> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We are interested in feedback for a mechanism to
>>>>>>>>>>>>>>> effectively schedule
>>>>>>>>>>>>>>> high
>>>>>>>>>>>>>>> priority VR reprojection tasks (also referred to as
>>>>>>>>>>>>>>> time-warping) for
>>>>>>>>>>>>>>> Polaris10
>>>>>>>>>>>>>>> running on the amdgpu kernel driver.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Brief context:
>>>>>>>>>>>>>>> --------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The main objective of reprojection is to avoid motion
>>>>>>>>>>>>>>> sickness for VR
>>>>>>>>>>>>>>> users in
>>>>>>>>>>>>>>> scenarios where the game or application would fail to finish
>>>>>>>>>>>>>>> rendering a new
>>>>>>>>>>>>>>> frame in time for the next VBLANK. When this happens, the
>>>>>>>>>>>>>>> user's head
>>>>>>>>>>>>>>> movements
>>>>>>>>>>>>>>> are not reflected on the Head Mounted Display (HMD) for the
>>>>>>>>>>>>>>> duration
>>>>>>>>>>>>>>> of an
>>>>>>>>>>>>>>> extra frame. This extended mismatch between the inner ear
>>>>>>>>>>>>>>> and the
>>>>>>>>>>>>>>> eyes may
>>>>>>>>>>>>>>> cause the user to experience motion sickness.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The VR compositor deals with this problem by fabricating a
>>>>>>>>>>>>>>> new frame
>>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>>> user's updated head position in combination with the
>>>>>>>>>>>>>>> previous frames.
>>>>>>>>>>>>>>> This
>>>>>>>>>>>>>>> avoids a prolonged mismatch between the HMD output and the
>>>>>>>>>>>>>>> inner ear.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Because of the adverse effects on the user, we require high
>>>>>>>>>>>>>>> confidence that the reprojection task will complete before
>>>>>>>>>>>>>>> the VBLANK interval, even if the GFX pipe is currently full
>>>>>>>>>>>>>>> of work from the game/application (which is most likely the
>>>>>>>>>>>>>>> case).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details and illustrations, please refer to the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> document:
>>>>>>>>>>>>>>> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Requirements:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The mechanism must expose the following functionality:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * Job round trip time must be predictable, from
>>>>>>>>>>>>>>> submission to
>>>>>>>>>>>>>>> fence signal
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism must support compute workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Goals:
>>>>>>>>>>>>>>> ------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should provide low submission latencies
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Test: submitting a NOP packet through the mechanism on busy
>>>>>>>>>>>>>>> hardware
>>>>>>>>>>>>>>> should
>>>>>>>>>>>>>>> be equivalent to submitting a NOP on idle hardware.
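>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As a sketch of how that round trip could be measured through
>>>>>>>>>>>>>>> libdrm (assumes <amdgpu.h>, <amdgpu_drm.h> and <time.h>; the
>>>>>>>>>>>>>>> IB is pre-filled with compute NOP packets and the BO-list
>>>>>>>>>>>>>>> setup is elided for brevity):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     static uint64_t nop_round_trip_ns(amdgpu_context_handle ctx,
>>>>>>>>>>>>>>>                                       struct amdgpu_cs_ib_info *ib)
>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>             struct amdgpu_cs_request req = {0};
>>>>>>>>>>>>>>>             struct amdgpu_cs_fence fence = {0};
>>>>>>>>>>>>>>>             struct timespec t0, t1;
>>>>>>>>>>>>>>>             uint32_t expired = 0;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             req.ip_type = AMDGPU_HW_IP_COMPUTE;
>>>>>>>>>>>>>>>             req.number_of_ibs = 1;
>>>>>>>>>>>>>>>             req.ibs = ib;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>>>>>>>>>>>>             amdgpu_cs_submit(ctx, 0, &req, 1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             fence.context = ctx;
>>>>>>>>>>>>>>>             fence.ip_type = AMDGPU_HW_IP_COMPUTE;
>>>>>>>>>>>>>>>             fence.fence = req.seq_no;
>>>>>>>>>>>>>>>             amdgpu_cs_query_fence_status(&fence,
>>>>>>>>>>>>>>>                     AMDGPU_TIMEOUT_INFINITE, 0, &expired);
>>>>>>>>>>>>>>>             clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>             /* busy and idle hardware should report similar values */
>>>>>>>>>>>>>>>             return (t1.tv_sec - t0.tv_sec) * 1000000000ull +
>>>>>>>>>>>>>>>                    (t1.tv_nsec - t0.tv_nsec);
>>>>>>>>>>>>>>>     }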
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nice to have:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The mechanism should also support GFX workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My understanding is that with the current hardware
>>>>>>>>>>>>>>> capabilities in
>>>>>>>>>>>>>>> Polaris10 we
>>>>>>>>>>>>>>> will not be able to provide a solution compatible with GFX
>>>>>>>>>>>>>>> workloads.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But I would love to hear otherwise. So if anyone has an idea,
>>>>>>>>>>>>>>> approach or
>>>>>>>>>>>>>>> suggestion that will also be compatible with the GFX ring,
>>>>>>>>>>>>>>> please let
>>>>>>>>>>>>>>> us know
>>>>>>>>>>>>>>> about it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     * The above guarantees should also be respected by
>>>>>>>>>>>>>>> amdkfd workloads
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Would be good to have for consistency, but not strictly
>>>>>>>>>>>>>>> necessary as
>>>>>>>>>>>>>>> users running
>>>>>>>>>>>>>>> games are not traditionally running HPC workloads in the
>>>>>>>>>>>>>>> background.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Proposed approach:
>>>>>>>>>>>>>>> ------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the Windows driver, we could expose a high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> compute queue to
>>>>>>>>>>>>>>> userspace.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Submissions to this compute queue will be scheduled with high
>>>>>>>>>>>>>>> priority, and may
>>>>>>>>>>>>>>> acquire hardware resources previously in use by other queues.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This can be achieved by taking advantage of the 'priority'
>>>>>>>>>>>>>>> field in
>>>>>>>>>>>>>>> the HQDs
>>>>>>>>>>>>>>> and could be programmed by amdgpu or the amdgpu scheduler.
>>>>>>>>>>>>>>> The relevant
>>>>>>>>>>>>>>> register fields are:
>>>>>>>>>>>>>>>         * mmCP_HQD_PIPE_PRIORITY
>>>>>>>>>>>>>>>         * mmCP_HQD_QUEUE_PRIORITY
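>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As a rough sketch of the per-queue programming on VI (the
>>>>>>>>>>>>>>> function name and the priority encoding are assumptions on
>>>>>>>>>>>>>>> my part, not final code):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     static void gfx_v8_0_set_hqd_priority(struct amdgpu_device *adev,
>>>>>>>>>>>>>>>                                           struct amdgpu_ring *ring,
>>>>>>>>>>>>>>>                                           u32 priority)
>>>>>>>>>>>>>>>     {
>>>>>>>>>>>>>>>             /* select the queue's HQD, then program its priority */
>>>>>>>>>>>>>>>             mutex_lock(&adev->srbm_mutex);
>>>>>>>>>>>>>>>             vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
>>>>>>>>>>>>>>>             WREG32(mmCP_HQD_PIPE_PRIORITY, priority);
>>>>>>>>>>>>>>>             WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>>>>>>>>>>>>>>>             vi_srbm_select(adev, 0, 0, 0, 0);
>>>>>>>>>>>>>>>             mutex_unlock(&adev->srbm_mutex);
>>>>>>>>>>>>>>>     }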
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 1 - static partitioning:
>>>>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu driver currently controls 8 compute queues from
>>>>>>>>>>>>>>> pipe0. We can
>>>>>>>>>>>>>>> statically partition these as follows:
>>>>>>>>>>>>>>>         * 7x regular
>>>>>>>>>>>>>>>         * 1x high priority
>>>>>>>>>>>>>>>
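>>>>>>>>>>>>>>> E.g. (the ring index choice is arbitrary here, names are
>>>>>>>>>>>>>>> illustrative only):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     /* sketch: reserve the last pipe0 queue as high priority */
>>>>>>>>>>>>>>>     #define AMDGPU_MAX_COMPUTE_RINGS      8
>>>>>>>>>>>>>>>     #define AMDGPU_HIGH_PRIO_COMPUTE_RING (AMDGPU_MAX_COMPUTE_RINGS - 1)
>>>>>>>>>>>>>>>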
>>>>>>>>>>>>>>> The relevant priorities can be set so that submissions to
>>>>>>>>>>>>>>> the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> ring will starve the other compute rings and the GFX ring.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The amdgpu scheduler will only place jobs into the high
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> rings if the
>>>>>>>>>>>>>>> context is marked as high priority. And a corresponding
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> should be
>>>>>>>>>>>>>>> added to keep track of this information:
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_KERNEL
>>>>>>>>>>>>>>>      * -> AMD_SCHED_PRIORITY_HIGH
>>>>>>>>>>>>>>>      * AMD_SCHED_PRIORITY_NORMAL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The user will request a high priority context by setting an
>>>>>>>>>>>>>>> appropriate flag
>>>>>>>>>>>>>>> in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or similar):
>>>>>>>>>>>>>>> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
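>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> e.g. (bit position hypothetical):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     /* uapi sketch for amdgpu_drm.h */
>>>>>>>>>>>>>>>     #define AMDGPU_CTX_FLAG_HIGH_PRIORITY (1 << 0)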
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The setting is at a per-context level so that we can:
>>>>>>>>>>>>>>>     * Maintain a consistent FIFO ordering of all
>>>>>>>>>>>>>>> submissions to a
>>>>>>>>>>>>>>> context
>>>>>>>>>>>>>>>     * Create high priority and non-high priority contexts
>>>>>>>>>>>>>>> in the same
>>>>>>>>>>>>>>> process
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Implementation approach 2 - dynamic priority programming:
>>>>>>>>>>>>>>> ---------------------------------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Similar to the above, but instead of programming the
>>>>>>>>>>>>>>> priorities at
>>>>>>>>>>>>>>> amdgpu_init() time, the SW scheduler will reprogram the
>>>>>>>>>>>>>>> queue priorities
>>>>>>>>>>>>>>> dynamically when scheduling a task.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This would involve having a hardware specific callback from
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> scheduler to
>>>>>>>>>>>>>>> set the appropriate queue priority: set_priority(int ring,
>>>>>>>>>>>>>>> int index,
>>>>>>>>>>>>>>> int priority)
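>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One possible shape for that hook (sketch; the existing ops
>>>>>>>>>>>>>>> are roughly as in today's gpu_scheduler.h, and the
>>>>>>>>>>>>>>> set_priority member is new and hypothetical):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     struct amd_sched_backend_ops {
>>>>>>>>>>>>>>>             struct fence *(*dependency)(struct amd_sched_job *sched_job);
>>>>>>>>>>>>>>>             struct fence *(*run_job)(struct amd_sched_job *sched_job);
>>>>>>>>>>>>>>>             void (*timedout_job)(struct amd_sched_job *sched_job);
>>>>>>>>>>>>>>>             void (*free_job)(struct amd_sched_job *sched_job);
>>>>>>>>>>>>>>>             /* new: reprogram the HW queue priority for a job */
>>>>>>>>>>>>>>>             void (*set_priority)(struct amd_sched_rq *rq,
>>>>>>>>>>>>>>>                                  enum amd_sched_priority priority);
>>>>>>>>>>>>>>>     };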
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> During this callback we would have to grab the SRBM mutex
>>>>>>>>>>>>>>> to perform
>>>>>>>>>>>>>>> the appropriate
>>>>>>>>>>>>>>> HW programming, and I'm not really sure if that is
>>>>>>>>>>>>>>> something we
>>>>>>>>>>>>>>> should be doing from
>>>>>>>>>>>>>>> the scheduler.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On the positive side, this approach would allow us to
>>>>>>>>>>>>>>> program a range of
>>>>>>>>>>>>>>> priorities for jobs instead of a single "high priority"
>>>>>>>>>>>>>>> value",
>>>>>>>>>>>>>>> achieving
>>>>>>>>>>>>>>> something similar to the niceness API available for CPU
>>>>>>>>>>>>>>> scheduling.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure if this flexibility is something that we would
>>>>>>>>>>>>>>> need for
>>>>>>>>>>>>>>> our use
>>>>>>>>>>>>>>> case, but it might be useful in other scenarios (multiple
>>>>>>>>>>>>>>> users
>>>>>>>>>>>>>>> sharing compute
>>>>>>>>>>>>>>> time on a server).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This approach would require a new int field in
>>>>>>>>>>>>>>> drm_amdgpu_ctx_in, or
>>>>>>>>>>>>>>> repurposing
>>>>>>>>>>>>>>> of the flags field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Known current obstacles:
>>>>>>>>>>>>>>> ------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The SQ is currently programmed to disregard the HQD
>>>>>>>>>>>>>>> priorities, and
>>>>>>>>>>>>>>> instead it picks
>>>>>>>>>>>>>>> jobs at random. Settings from the shader itself are also
>>>>>>>>>>>>>>> disregarded
>>>>>>>>>>>>>>> as this is
>>>>>>>>>>>>>>> considered a privileged field.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Effectively we can get our compute wavefront launched ASAP,
>>>>>>>>>>>>>>> but we
>>>>>>>>>>>>>>> might not get the
>>>>>>>>>>>>>>> time we need on the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current programming would have to be changed to allow
>>>>>>>>>>>>>>> priority
>>>>>>>>>>>>>>> propagation
>>>>>>>>>>>>>>> from the HQD into the SQ.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Generic approach for all HW IPs:
>>>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For consistency purposes, the high priority context can be
>>>>>>>>>>>>>>> enabled
>>>>>>>>>>>>>>> for all HW IPs
>>>>>>>>>>>>>>> with SW scheduler support. This will function
>>>>>>>>>>>>>>> similarly to the
>>>>>>>>>>>>>>> current
>>>>>>>>>>>>>>> AMD_SCHED_PRIORITY_KERNEL priority, where the job can jump
>>>>>>>>>>>>>>> ahead of
>>>>>>>>>>>>>>> anything not
>>>>>>>>>>>>>>> committed to the HW queue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The benefits of requesting a high priority context for a
>>>>>>>>>>>>>>> non-compute
>>>>>>>>>>>>>>> queue will
>>>>>>>>>>>>>>> be smaller (e.g. up to 10s of wait time if a GFX command is
>>>>>>>>>>>>>>> stuck in
>>>>>>>>>>>>>>> front of
>>>>>>>>>>>>>>> you), but having the API in place will allow us to easily
>>>>>>>>>>>>>>> improve the
>>>>>>>>>>>>>>> implementation
>>>>>>>>>>>>>>> in the future as new features become available in new
>>>>>>>>>>>>>>> hardware.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Future steps:
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Once we have an approach settled, I can take care of the
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, once the interface is mostly decided, we can start
>>>>>>>>>>>>>>> thinking about
>>>>>>>>>>>>>>> exposing the high priority queue through radv.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Request for feedback:
>>>>>>>>>>>>>>> ---------------------
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We aren't married to any of the approaches outlined above.
>>>>>>>>>>>>>>> Our goal
>>>>>>>>>>>>>>> is to
>>>>>>>>>>>>>>> obtain a mechanism that will allow us to complete the
>>>>>>>>>>>>>>> reprojection
>>>>>>>>>>>>>>> job within a
>>>>>>>>>>>>>>> predictable amount of time. So if anyone has any
>>>>>>>>>>>>>>> suggestions for
>>>>>>>>>>>>>>> improvements or alternative strategies we are more than
>>>>>>>>>>>>>>> happy to hear
>>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If any of the technical information above is also
>>>>>>>>>>>>>>> incorrect, feel
>>>>>>>>>>>>>>> free to point
>>>>>>>>>>>>>>> out my misunderstandings.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to hearing from you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Andres
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> amd-gfx mailing list
>>>>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Serguei Sagalovitch
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx at lists.freedesktop.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> Sincerely yours,
>>> Serguei Sagalovitch
>>>
>>>
>>
>