[RFC] Mechanism for high priority scheduling in amdgpu
zhoucm1
david1.zhou at amd.com
Mon Dec 26 02:26:39 UTC 2016
Nice experiment, which is exactly what the SW scheduler can provide.
And as you said: "I.e. your context can be scheduled into the
HW queue ahead of any other context, but everything already committed
to the HW queue is executed in strict FIFO order."
If you want to keep latency consistent, you will need to enable the HW
priority queue feature.

Regards,
David Zhou

On December 24, 2016 at 06:20, Andres Rodriguez wrote:
> Hey John,
>
> I've collected a bit of data using high priority SW scheduler queues,
> thought you might be interested.
>
> Implementation as per the patch above.
>
> Control test 1
> ==============
>
> Sascha Willems mesh sample running on its own at regular priority
>
> Results
> -------
>
> Mesh: ~0.14ms per-frame latency
>
> Control test 2
> ==============
>
> Two Sascha Willems mesh samples running simultaneously at regular priority
>
> Results
> -------
>
> Mesh 1: ~0.26ms per-frame latency
> Mesh 2: ~0.26ms per-frame latency
>
> Test 1
> ======
>
> Two Sascha Willems mesh samples running simultaneously. One at high
> priority and the other running in a regular priority graphics context.
>
> Results
> -------
>
> Mesh High: 0.14 - 0.24ms per-frame latency
> Mesh Regular: 0.24 - 0.40ms per-frame latency
>
> Test 2
> ======
>
> Ten Sascha Willems mesh samples running simultaneously. One at high
> priority and the others running in a regular priority graphics context.
>
> Results
> -------
>
> Mesh High: 0.14 - 0.8ms per-frame latency
> Mesh Regular: 1.10 - 2.05ms per-frame latency
>
> Test 3
> ======
>
> Two Sascha Willems mesh samples running simultaneously. One at high
> priority and the other running in a regular priority graphics context.
>
> Also running Unigine Heaven at the Extreme preset @ 2560x1600
>
> Results
> -------
>
> Mesh High: 7 - 100ms per-frame latency (lots of fluctuation)
> Mesh Regular: 40 - 130ms per-frame latency (lots of fluctuation)
> Unigine Heaven: 20-40 fps
>
>
> Test 4
> ======
>
> Two Sascha Willems mesh samples running simultaneously. One at high
> priority and the other running in a regular priority graphics context.
>
> Also running Talos Principle @ 4K
>
> Results
> -------
>
> Mesh High: 0.14 - 3.97ms per-frame latency (mostly hovers around ~0.4ms)
> Mesh Regular: 0.43 - 8.11ms per-frame latency (lots of fluctuation)
> Talos: 24.8 fps AVG
>
> Observations
> ============
>
> The high priority queue based on the SW scheduler provides significant
> gains when paired with tasks that submit short duration commands into
> the queue. This can be observed in tests 1 and 2.
>
> When the pipe is full of long running commands, the effects are dampened.
> As observed in test 3, the per-frame latency suffers very large spikes,
> and the latencies are very inconsistent.
>
> Talos seems to be a better behaved game. It may be submitting shorter
> draw commands and the SW scheduler is able to interleave the rest of
> the work.
>
> The results seem consistent with the hypothetical advantages the SW
> scheduler should provide. I.e. your context can be scheduled into the
> HW queue ahead of any other context, but everything already committed
> to the HW queue is executed in strict FIFO order.
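>
> To make that concrete, the behavior I'm relying on boils down to
> something like the following sketch (paraphrased, not the actual
> gpu_scheduler code; rq_select() is an illustrative helper):
>
>     /* Paraphrased: the SW scheduler drains the highest priority
>      * runqueue first, but jobs already on the HW ring remain FIFO. */
>     static struct amd_sched_entity *
>     select_entity(struct amd_gpu_scheduler *sched)
>     {
>             int i;
>
>             /* assumed: index 0 is the highest priority runqueue */
>             for (i = 0; i < AMD_SCHED_MAX_PRIORITY; i++) {
>                     struct amd_sched_entity *e =
>                             rq_select(&sched->sched_rq[i]);
>                     if (e)
>                             return e; /* jumps ahead of lower priority */
>             }
>             return NULL;
>     }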
>
> In order to deal with cases similar to Test 3, we will need to take
> advantage of further features.
>
> Notes
> =====
>
> - Tests were run multiple times, and reboots were performed during tests.
> - The mesh sample isn't really designed for benchmarking, but it should
>   be decent for ballpark figures.
> - The high priority mesh app was run with default niceness and also with
>   niceness at -20. This had no effect on the results, so it was not added
>   above.
> - CPU usage was not saturated while running the tests.
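>
> (For anyone reproducing this: one way to gather comparable per-frame
> GPU-side numbers in a Vulkan sample is with timestamp queries, sketched
> below. timestamp_period is VkPhysicalDeviceLimits::timestampPeriod and
> the query pool setup is elided.)
>
>     /* Sketch: bracket the frame's work with two GPU timestamps. */
>     vkCmdResetQueryPool(cmd, query_pool, 0, 2);
>     vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
>                         query_pool, 0);
>     /* ... record the frame's draw commands ... */
>     vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
>                         query_pool, 1);
>
>     /* After the submission's fence signals: */
>     uint64_t ts[2];
>     vkGetQueryPoolResults(device, query_pool, 0, 2, sizeof(ts), ts,
>                           sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
>     double frame_ms = (ts[1] - ts[0]) * timestamp_period * 1e-6;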
>
> Regards,
> Andres
>
>
> On Fri, Dec 23, 2016 at 1:18 PM, Pierre-Loup A. Griffais
> <pgriffais at valvesoftware.com> wrote:
>
> I hate to keep bringing up display topics in an unrelated conversation,
> but I'm not sure where you got "Application -> X server -> compositor ->
> X server" from. As I was saying before, we need to be presenting
> directly to the HMD display, as no display server can be in the way,
> both for latency and for quality of service reasons (a buggy
> application cannot be allowed to accidentally display undistorted
> rendering into the HMD). We intend to do the necessary work for this,
> and the extent of X's (or a Wayland implementation's, or any other
> display server's) involvement will be to participate enough to know
> that the HMD display is off-limits. If you have more questions on the
> display aspect, or VR rendering in general, I'm happy to try to address
> them out-of-band from this conversation.
>
>
> On 12/23/2016 02:54 AM, Christian König wrote:
>
> > But yes, in general you don't want another compositor in the way, so
> > we'll be acquiring the HMD display directly, separate from any
> > desktop or display server.
>
> Assuming that the HMD is attached to the rendering device in some way,
> you have the X server and the compositor which both try to be DRM
> master at the same time.
>
> Please correct me if that was fixed in the meantime, but that sounds
> like it will simply not work. Or is this what Andres mentioned below
> that Dave is working on?
>
> Additionally, a compositor in combination with X is a bit
> counterproductive when you want to keep the latency low.
>
> E.g. the "normal" flow of a GL or Vulkan surface filled with rendered
> data to be displayed is from the Application -> X server -> compositor
> -> X server.
>
> The extra step between X server and compositor just means extra
> latency, and for this use case you probably don't want that.
>
> Targeting something like Wayland, with XWayland when you need X
> compatibility, sounds like the much better idea.
>
> Regards,
> Christian.
>
> On 22.12.2016 at 20:54, Pierre-Loup A. Griffais wrote:
>
> Display concerns are a separate issue, and as Andres said we have
> other plans to address them. But yes, in general you don't want
> another compositor in the way, so we'll be acquiring the HMD display
> directly, separate from any desktop or display server. Same with
> security; we can have a separate conversation about that when the
> time comes.
>
> On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:
>
> Andres,
>
> Did you measure latency, etc. impact of __any__ compositor?
>
> My understanding is that VR has pretty strict requirements related to
> QoS.
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> On 2016-12-22 11:35 AM, Andres Rodriguez wrote:
>
> Hey Christian,
>
> We are currently interested in X, but with some distros switching to
> other compositors by default, we also need to consider those.
>
> We agree, running the full vrcompositor as root isn't something that
> we want to do. Too many security concerns. Having a small root helper
> that does the privilege escalation for us is the initial idea.
>
> For a long term approach, Pierre-Loup and Dave are working on dealing
> with the "two compositors" scenario a little better in DRM+X.
> Fullscreen isn't really a sufficient approach, since we don't want the
> HMD to be used as part of the Desktop environment when a VR app is not
> in use (this is extremely annoying).
>
> When the above is settled, we should have an auth mechanism besides
> DRM_MASTER or DRM_AUTH that allows the vrcompositor to take over the
> HMD permanently away from X. Re-using that auth method to gate this
> IOCTL is probably going to be the final solution.
>
> I propose to start with ROOT_ONLY since it should allow us to respect
> kernel IOCTL compatibility guidelines with the most flexibility. Going
> from a restrictive to a more flexible permission model would be
> inclusive, but going from a general to a restrictive model may exclude
> some apps that used to work.
>
> Regards,
> Andres
>
> On 12/22/2016 6:42 AM, Christian König wrote:
>
> Hi Andres,
>
> well using root might cause stability and security problems as well.
> We worked quite hard to avoid exactly this for X.
>
> We could make this feature depend on the compositor being DRM master,
> but for example with X the X server is master (and e.g. can change
> resolutions etc.) and not the compositor.
>
> So another question is also what windowing system (if any) are you
> planning to use? X, Wayland, Flinger or something completely
> different?
>
> Regards,
> Christian.
>
> On 20.12.2016 at 16:51, Andres Rodriguez wrote:
>
> Hi Christian,
>
> That is definitely a concern. What we are currently thinking is to
> make the high priority queues accessible to root only.
>
> Therefore, if a non-root user attempts to set the high priority flag
> on context allocation, we would fail the call and return -EPERM.
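>
> As a sketch, the check could live in the context alloc path along
> these lines (flag and capability names are placeholders following the
> RFC's AMDGPU_CTX_HIGH_PRIORITY suggestion):
>
>     /* Sketch: reject non-root requests for a high priority context. */
>     static int amdgpu_ctx_priority_permitted(uint32_t flags)
>     {
>             if (!(flags & AMDGPU_CTX_HIGH_PRIORITY))
>                     return 0;
>             if (capable(CAP_SYS_ADMIN))
>                     return 0;
>             return -EPERM;
>     }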
>
> Regards,
> Andres
>
>
> On 12/20/2016 7:56 AM, Christian König wrote:
>
> > BTW: If there is non-VR application which will use high-priority
> > h/w queue then VR application will suffer. Any ideas how to solve it?
>
> Yeah, that problem came to my mind as well.
>
> Basically we need to restrict those high priority submissions to the
> VR compositor, or otherwise any malfunctioning application could use
> it.
>
> Just think about some WebGL suddenly taking all our rendering away and
> we won't get anything drawn any more.
>
> Alex or Michel, any ideas on that?
>
> Regards,
> Christian.
>
> On 19.12.2016 at 15:48, Serguei Sagalovitch wrote:
>
> > > If compute queue is occupied only by you, the efficiency
> > > is equal with setting job queue to high priority I think.
>
> The only risk is the situation when graphics will take all needed
> CUs. But in any case it should be a very good test.
>
> Andres/Pierre-Loup,
>
> Did you try to do it, or is it a lot of work for you?
>
> BTW: If there is a non-VR application which will use the high-priority
> h/w queue then the VR application will suffer. Any ideas how to solve
> it?
>
> Sincerely yours,
> Serguei Sagalovitch
>
> On 2016-12-19 12:50 AM, zhoucm1 wrote:
>
> Do you encounter the priority issue for the compute queue with the
> current driver?
>
> If the compute queue is occupied only by you, the efficiency is equal
> with setting the job queue to high priority, I think.
>
> Regards,
> David Zhou
>
> On December 19, 2016 at 13:29, Andres Rodriguez wrote:
>
> Yes, vulkan is available on all-open through the mesa radv UMD.
>
> I'm not sure if I'm asking for too much, but if we can coordinate a
> similar interface in radv and amdgpu-pro at the vulkan level, that
> would be great.
>
> I'm not sure what that's going to be yet.
>
> - Andres
>
> On 12/19/2016 12:11 AM, zhoucm1 wrote:
>
> On December 19, 2016 at 11:33, Pierre-Loup A. Griffais wrote:
>
> > We're currently working with the open stack; I assume that a
> > mechanism could be exposed by both open and Pro Vulkan userspace
> > drivers, and that the amdgpu kernel interface improvements we would
> > pursue following this discussion would let both drivers take
> > advantage of the feature, correct?
>
> Of course.
> Does the open stack have Vulkan support?
>
> Regards,
> David Zhou
>
>
> On 12/18/2016 07:26 PM, zhoucm1 wrote:
>
> By the way, are you using the all-open driver or the amdgpu-pro
> driver?
>
> +David Mao, who is working on our Vulkan driver.
>
> Regards,
> David Zhou
>
> On December 18, 2016 at 06:05, Pierre-Loup A. Griffais wrote:
>
> Hi Serguei,
>
> I'm also working on bringing up our VR runtime on top of amdgpu; see
> replies inline.
>
> On 12/16/2016 09:05 PM, Sagalovitch, Serguei wrote:
>
> > Andres,
> >
> > > For current VR workloads we have 3 separate processes running
> > > actually:
> >
> > So we could have a potential memory overcommit case, or do you do
> > partitioning on your own? I would think that there is a need to
> > avoid overcommit in the VR case to prevent any BO migration.
>
> You're entirely correct; currently the VR runtime is setting up
> prioritized CPU scheduling for its VR compositor, we're working on
> prioritized GPU scheduling and pre-emption (e.g. this thread), and in
> the future it will make sense to do work in order to make sure that
> its memory allocations do not get evicted, to prevent any unwelcome
> additional latency in the event of needing to perform just-in-time
> reprojection.
>
> > BTW: Do you mean __real__ processes or threads? Based on my
> > understanding, sharing BOs between different processes could
> > introduce additional synchronization constraints. btw: I am not
> > sure if we are able to share Vulkan sync. objects across the
> > process boundary.
>
> They are different processes; it is important for the compositor that
> is responsible for quality-of-service features such as consistently
> presenting distorted frames with the right latency, reprojection,
> etc., to be separate from the main application.
>
> Currently we are using unreleased cross-process memory and semaphore
> extensions to fetch updated eye images from the client application,
> but the just-in-time reprojection discussed here does not actually
> have any direct interactions with cross-process resource sharing,
> since it's achieved by using whatever are the latest, most up-to-date
> eye images that have already been sent by the client application,
> which are already available to use without additional
> synchronization.
>
>
>
> > > 3) System compositor (we are looking at approaches to remove
> > > this overhead)
> >
> > Yes, IMHO the best is to run in "full screen mode".
>
> Yes, we are working on mechanisms to present directly to the headset
> display without any intermediaries as a separate effort.
>
>
> > > The latency is our main concern,
> >
> > I would assume that this is the known problem (at least for compute
> > usage). It looks like amdgpu / kernel submission is rather CPU
> > intensive (at least in the default configuration).
>
> As long as it's a consistent cost, it shouldn't be an issue. However,
> if there are high degrees of variance then that would be troublesome
> and we would need to account for the worst case.
>
> Hopefully the requirements and approach we described make sense;
> we're looking forward to your feedback and suggestions.
>
> Thanks!
>  - Pierre-Loup
>
>
> > Sincerely yours,
> > Serguei Sagalovitch
>
>
> From: Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 10:00 PM
> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hey Serguei,
>
> [Serguei] No. I mean pipe :-) as MEC defines it. As far as I
> understand (by simplifying) some scheduling is per pipe. I know about
> the current allocation scheme, but I do not think that it is ideal. I
> would assume that we need to switch to dynamic partitioning of
> resources based on the workload, otherwise we will have resource
> conflicts between Vulkan compute and OpenCL.
>
> I agree the partitioning isn't ideal. I'm hoping we can start with a
> solution that assumes that only pipe0 has any work and the other
> pipes are idle (no HSA/ROCm running on the system).
>
> This should be more or less the use case we expect from VR users.
>
> I agree the split is currently not ideal, but I'd like to consider
> that a separate task, because making it dynamic is not
> straightforward :P
>
> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
> not be involved. I would assume that in the case of VR we will have
> one main application ("console" mode(?)) so we could temporarily
> "ignore" OpenCL/ROCm needs when VR is running.
>
> Correct, this is why we want to enable the high priority compute
> queue through libdrm-amdgpu, so that we can expose it through Vulkan
> later.
>
> For current VR workloads we have 3 separate processes running
> actually:
>     1) Game process
>     2) VR Compositor (this is the process that will require the high
>        priority queue)
>     3) System compositor (we are looking at approaches to remove this
>        overhead)
>
> For now I think it is okay to assume no OpenCL/ROCm running
> simultaneously, but I would also like to be able to address this case
> in the future (cross-pipe priorities).
>
> [Serguei] The problem with pre-emption of graphics task:
> (a) it may take time so latency may suffer
>
> The latency is our main concern, we want something that is
> predictable. A good illustration of what the reprojection scheduling
> looks like can be found here:
> https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png
>
> (b) to preempt we need to have different "context" - we want to
> guarantee that submissions from the same context will be executed in
> order.
>
> This is okay, as the reprojection work doesn't have dependencies on
> the game context, and it even happens in a separate process.
>
> BTW: (a) Do you want "preempt" and later resume or do you want
> "preempt" and "cancel/abort"?
>
> Preempt the game with the compositor task and then resume it.
>
> (b) Vulkan is a generic API and could be used for graphics as well as
> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>
> Yeah, the plan is to use vulkan compute. But if you figure out a way
> for us to get a guaranteed execution time using vulkan graphics, then
> I'll take you out for a beer :)
>
> Regards,
> Andres
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
> Sent: Friday, December 16, 2016 9:13 PM
> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Andres,
>
> Please see inline (as [Serguei])
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
> From: Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 8:29 PM
> To: Sagalovitch, Serguei; amd-gfx at lists.freedesktop.org
> Subject: RE: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Serguei,
>
> Thanks for the feedback. Answers inline as [AR].
>
> Regards,
> Andres
>
> ________________________________________
> From: Sagalovitch, Serguei [Serguei.Sagalovitch at amd.com]
> Sent: Friday, December 16, 2016 8:15 PM
> To: Andres Rodriguez; amd-gfx at lists.freedesktop.org
> Subject: Re: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Andres,
>
> Quick comments:
>
> 1) To minimize "bubbles", etc. we need to "force" CU
> assignments/binding to the high-priority queue when it will be in use
> and "free" them later (we do not want to take CUs away from e.g. a
> graphics task forever and degrade graphics performance).
>
> Otherwise we could have a scenario where a long graphics task (or
> low-priority compute) takes all (extra) CUs and the high-priority
> work waits for the needed resources. It will not be visible with
> "NOP" packets but only when you submit a "real" compute task, so I
> would recommend not using "NOP" packets at all for testing.
>
> It (CU assignment) could be done relatively easily when everything is
> going via the kernel (e.g. as part of frame submission), but I must
> admit that I am not sure about the best way for user level
> submissions (amdkfd).
>
> [AR] I wasn't aware of this part of the programming sequence. Thanks
> for the heads up! Is this similar to the CU masking programming?
>
> [Serguei] Yes. To simplify: the problem is that the "scheduler", when
> deciding which queue to run, will check if there are enough
> resources, and if not then it will begin to check other queues with
> lower priority.
>
> 2) I would recommend dedicating the whole pipe to the high-priority
> queue and having nothing there except it.
>
> [AR] I'm guessing in this context you mean pipe = queue? (as opposed
> to the MEC definition of pipe, which is a grouping of queues). I say
> this because amdgpu only has access to 1 pipe, and the rest are
> statically partitioned for amdkfd usage.
>
> [Serguei] No. I mean pipe :-) as MEC defines it. As far as I
> understand (by simplifying) some scheduling is per pipe. I know about
> the current allocation scheme, but I do not think that it is ideal. I
> would assume that we need to switch to dynamic partitioning of
> resources based on the workload, otherwise we will have resource
> conflicts between Vulkan compute and OpenCL.
>
>
> BTW: Which user level API do you want to use for compute: Vulkan or
> OpenCL?
>
> [AR] Vulkan
>
> [Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd will
> not be involved. I would assume that in the case of VR we will have
> one main application ("console" mode(?)) so we could temporarily
> "ignore" OpenCL/ROCm needs when VR is running.
>
> > we will not be able to provide a solution compatible with GFX
> > workloads.
>
> I assume that you are talking about graphics? Am I right?
>
> [AR] Yeah, my understanding is that pre-empting the currently running
> graphics job and scheduling in something else using mid-buffer
> pre-emption has some cases where it doesn't work well. But if it
> starts working well with polaris10, it might be a better solution for
> us (because the whole reprojection work uses the vulkan graphics
> stack at the moment, and porting it to compute is not trivial).
>
> [Serguei] The problem with pre-emption of a graphics task:
> (a) it may take time so latency may suffer
> (b) to preempt we need to have a different "context" - we want to
> guarantee that submissions from the same context will be executed in
> order.
> BTW: (a) Do you want "preempt" and later resume or do you want
> "preempt" and "cancel/abort"?
> (b) Vulkan is a generic API and could be used for graphics as well as
> for plain compute tasks (VK_QUEUE_COMPUTE_BIT).
>
>
> Sincerely yours,
> Serguei Sagalovitch
>
>
>
> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
> Andres Rodriguez <andresr at valvesoftware.com>
> Sent: December 16, 2016 6:15 PM
> To: amd-gfx at lists.freedesktop.org
> Subject: [RFC] Mechanism for high priority scheduling in amdgpu
>
> Hi Everyone,
>
> This RFC is also available as a gist here:
> https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249
>
> We are interested in feedback for a mechanism to effectively schedule
> high priority VR reprojection tasks (also referred to as time-warping)
> for Polaris10 running on the amdgpu kernel driver.
>
> Brief context:
> --------------
>
> The main objective of reprojection is to avoid motion sickness for VR
> users in scenarios where the game or application would fail to finish
> rendering a new frame in time for the next VBLANK. When this happens,
> the user's head movements are not reflected on the Head Mounted
> Display (HMD) for the duration of an extra frame. This extended
> mismatch between the inner ear and the eyes may cause the user to
> experience motion sickness.
>
> The VR compositor deals with this problem by fabricating a new frame
> using the user's updated head position in combination with the
> previous frames. This avoids a prolonged mismatch between the HMD
> output and the inner ear.
>
> Because of the adverse effects on the user, we require high confidence
> that the reprojection task will complete before the VBLANK interval,
> even if the GFX pipe is currently full of work from the
> game/application (which is most likely the case).
>
> For more details and illustrations, please refer to the following
> document:
> https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved
>
> Requirements:
> -------------
>
> The mechanism must expose the following functionality:
>
>     * Job round trip time must be predictable, from submission to
>       fence signal
>
>     * The mechanism must support compute workloads
>
> Goals:
> ------
>
>     * The mechanism should provide low submission latencies
>
> Test: submitting a NOP packet through the mechanism on busy hardware
> should be equivalent to submitting a NOP on idle hardware.
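>
> As a rough illustration of that test, the round trip could be sampled
> from userspace via libdrm's amdgpu wrapper along these lines (sketch
> only; device/context/IB setup is elided, and 'req' is assumed to
> already describe an IB holding a single NOP packet):
>
>     /* Sketch: time submission -> fence signal for one NOP IB. */
>     struct timespec t0, t1;
>     struct amdgpu_cs_fence fence = {
>             .context = ctx,
>             .ip_type = AMDGPU_HW_IP_COMPUTE,
>     };
>     uint32_t expired;
>
>     clock_gettime(CLOCK_MONOTONIC, &t0);
>     amdgpu_cs_submit(ctx, 0, &req, 1);
>     fence.fence = req.seq_no;
>     amdgpu_cs_query_fence_status(&fence, AMDGPU_TIMEOUT_INFINITE,
>                                  0, &expired);
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>     /* (t1 - t0) is the round trip; compare busy vs. idle hardware. */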
>
> Nice to have:
> -------------
>
>     * The mechanism should also support GFX workloads
>
> My understanding is that with the current hardware capabilities in
> Polaris10 we will not be able to provide a solution compatible with
> GFX workloads.
>
> But I would love to hear otherwise. So if anyone has an idea, approach
> or suggestion that will also be compatible with the GFX ring, please
> let us know about it.
>
>     * The above guarantees should also be respected by amdkfd
>       workloads
>
> Would be good to have for consistency, but not strictly necessary as
> users running games are not traditionally running HPC workloads in
> the background.
>
> Proposed approach:
> ------------------
>
> Similar to the windows driver, we could expose a high priority compute
> queue to userspace.
>
> Submissions to this compute queue will be scheduled with high
> priority, and may acquire hardware resources previously in use by
> other queues.
>
> This can be achieved by taking advantage of the 'priority' field in
> the HQDs, and could be programmed by amdgpu or the amdgpu scheduler.
> The relevant register fields are:
>     * mmCP_HQD_PIPE_PRIORITY
>     * mmCP_HQD_QUEUE_PRIORITY
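>
> For illustration, programming these while the queue's HQD is selected
> might look roughly like this (a sketch; the priority values are made
> up and would need to come from the register spec):
>
>     /* Sketch: raise one compute queue's HQD priority. Assumes the
>      * HQD is already selected (e.g. via vi_srbm_select()). */
>     WREG32(mmCP_HQD_PIPE_PRIORITY, 0x2);  /* assumed "high" encoding */
>     WREG32(mmCP_HQD_QUEUE_PRIORITY, 0xf); /* assumed max queue level */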
>
> Implementation approach 1 - static partitioning:
> ------------------------------------------------
>
> The amdgpu driver currently controls 8 compute queues from pipe0. We
> can statically partition these as follows:
>     * 7x regular
>     * 1x high priority
>
> The relevant priorities can be set so that submissions to the high
> priority ring will starve the other compute rings and the GFX ring.
>
> The amdgpu scheduler will only place jobs into the high priority rings
> if the context is marked as high priority. A corresponding priority
> should be added to keep track of this information:
>     * AMD_SCHED_PRIORITY_KERNEL
>     * -> AMD_SCHED_PRIORITY_HIGH
>     * AMD_SCHED_PRIORITY_NORMAL
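>
> A minimal sketch of that enum change (the HIGH entry is the proposed
> addition; ordering/values are illustrative):
>
>     enum amd_sched_priority {
>             AMD_SCHED_PRIORITY_KERNEL = 0, /* highest: kernel jobs */
>             AMD_SCHED_PRIORITY_HIGH,       /* proposed: HP contexts */
>             AMD_SCHED_PRIORITY_NORMAL,     /* default for userspace */
>             AMD_SCHED_MAX_PRIORITY
>     };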
>
> The user will request a high priority context by setting an
> appropriate flag in drm_amdgpu_ctx_in (AMDGPU_CTX_HIGH_PRIORITY or
> similar):
> https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163
>
> The setting is at a per context level so that we can:
>     * Maintain a consistent FIFO ordering of all submissions to a
>       context
>     * Create high priority and non-high priority contexts in the same
>       process
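>
> In uapi terms the request could look roughly like this from userspace
> (hypothetical flag value; the final name/encoding would be settled
> during review):
>
>     #define AMDGPU_CTX_HIGH_PRIORITY (1 << 0) /* illustrative value */
>
>     union drm_amdgpu_ctx args = {
>             .in = {
>                     .op    = AMDGPU_CTX_OP_ALLOC_CTX,
>                     .flags = AMDGPU_CTX_HIGH_PRIORITY,
>             },
>     };
>     /* non-root callers would get -EPERM back, per the discussion */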
>
> Implementation approach 2 - dynamic priority programming:
> ---------------------------------------------------------
>
> Similar to the above, but instead of programming the priorities at
> amdgpu_init() time, the SW scheduler will reprogram the queue
> priorities dynamically when scheduling a task.
>
> This would involve having a hardware specific callback from the
> scheduler to set the appropriate queue priority:
>     set_priority(int ring, int index, int priority)
>
> During this callback we would have to grab the SRBM mutex to perform
> the appropriate HW programming, and I'm not really sure if that is
> something we should be doing from the scheduler.
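>
> To make the concern concrete, the callback might end up looking
> something like this sketch (gfx8-style names assumed, not verified
> against the actual code):
>
>     static void gfx_v8_0_ring_set_priority(struct amdgpu_device *adev,
>                                            struct amdgpu_ring *ring,
>                                            int priority)
>     {
>             mutex_lock(&adev->srbm_mutex);
>             vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);
>             WREG32(mmCP_HQD_QUEUE_PRIORITY, priority);
>             vi_srbm_select(adev, 0, 0, 0, 0);
>             mutex_unlock(&adev->srbm_mutex);
>     }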
>
> On the positive side, this approach would allow us to program a range
> of priorities for jobs instead of a single "high priority" value,
> achieving something similar to the niceness API available for CPU
> scheduling.
>
> I'm not sure if this flexibility is something that we would need for
> our use case, but it might be useful in other scenarios (multiple
> users sharing compute time on a server).
>
> This approach would require a new int field in drm_amdgpu_ctx_in, or
> repurposing of the flags field.
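>
> As a sketch of what a niceness-style mapping might look like (entirely
> illustrative; the input range and HQD encoding here are invented):
>
>     /* Map a nice-style value (-16 most urgent .. 15 least) onto an
>      * assumed 0..15 HQD priority scale, higher = more urgent. */
>     static u32 ctx_priority_to_hqd_priority(int priority)
>     {
>             int clamped = clamp(priority, -16, 15);
>
>             return (u32)(15 - (clamped + 16) / 2);
>     }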
>
> Known current obstacles:
> ------------------------
>
> The SQ is currently programmed to disregard the HQD priorities, and
> instead it picks jobs at random. Settings from the shader itself are
> also disregarded, as this is considered a privileged field.
>
> Effectively we can get our compute wavefront launched ASAP, but we
> might not get the time we need on the SQ.
>
> The current programming would have to be changed to allow priority
> propagation from the HQD into the SQ.
>
> Generic approach for all HW IPs:
> --------------------------------
>
> For consistency purposes, the high priority context can be enabled
> for all HW IPs with support of the SW scheduler. This will function
> similarly to the current AMD_SCHED_PRIORITY_KERNEL priority, where
> the job can jump ahead of anything not committed to the HW queue.
>
> The benefits of requesting a high priority context for a non-compute
> queue will be lesser (e.g. up to 10s of wait time if a GFX command is
> stuck in front of you), but having the API in place will allow us to
> easily improve the implementation in the future as new features
> become available in new hardware.
>
> Future steps:
> -------------
>
> Once we have an approach settled, I can take care of the
> implementation.
>
> Also, once the interface is mostly decided, we can start thinking
> about exposing the high priority queue through radv.
>
> Request for feedback:
> ---------------------
>
> We aren't married to any of the approaches outlined above. Our goal
> is to obtain a mechanism that will allow us to complete the
> reprojection job within a predictable amount of time. So if anyone
> has any suggestions for improvements or alternative strategies, we
> are more than happy to hear them.
>
> If any of the technical information above is also incorrect, feel
> free to point out my misunderstandings.
>
> Looking forward to hearing from you.
>
> Regards,
> Andres
>
> Sincerely yours,
> Serguei Sagalovitch
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx