<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Nice experiment; this is exactly what the SW scheduler can provide.<br>
And as you said "<span style="font-family:monospace,monospace">I.e.
your context can be scheduled into the<br>
HW queue ahead of any other context, but everything already
committed<br>
to the HW queue is executed in strict FIFO order.</span>"<br>
<br>
If you want to keep <span style="font-family:monospace,monospace">consistent</span>
latency, you will need to enable the HW priority queue feature.<br>
<br>
Regards,<br>
David Zhou<br>
<br>
<div class="moz-cite-prefix">On 2016年12月24日 06:20, Andres Rodriguez
wrote:<br>
</div>
<blockquote
cite="mid:CAFQ_0eHg=Kf5qV50cgm51m6bTcMYdkgRXkT-sykJnYNzu3Zzsg@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr">
<div>
<div><span style="font-family:monospace,monospace">Hey John,<br>
<br>
</span></div>
<span style="font-family:monospace,monospace">I've collected
a bit of data using high priority SW scheduler queues,<br>
</span></div>
<div><span style="font-family:monospace,monospace">thought you
might be interested.<br>
</span></div>
<div><span style="font-family:monospace,monospace"><br>
Implementation as per the patch above.<br>
<br>
Control test 1<br>
==============<br>
<br>
Sascha Willems mesh sample running on its own at regular
priority<br>
<br>
Results<br>
-------<br>
<br>
Mesh: ~0.14ms per-frame latency<br>
<br>
Control test 2<br>
==============<br>
<br>
Two Sascha Willems mesh samples running simultaneously at regular
priority<br>
<br>
Results<br>
-------<br>
<br>
Mesh 1: ~0.26ms per-frame latency<br>
Mesh 2: ~0.26ms per-frame latency<br>
<br>
Test 1<br>
======<br>
<br>
Two Sascha Willems mesh samples running simultaneously. One
at high<br>
priority and the other running in a regular priority
graphics context.<br>
<br>
Results<br>
-------<br>
<br>
Mesh High: 0.14 - 0.24ms per-frame latency<br>
Mesh Regular: 0.24 - 0.40ms per-frame latency<br>
<br>
Test 2<br>
======<br>
<br>
Ten Sascha Willems mesh samples running simultaneously. One
at high<br>
priority and the others running in a regular priority
graphics context.<br>
<br>
Results<br>
-------<br>
<br>
Mesh High: 0.14 - 0.8ms per-frame latency<br>
Mesh Regular: 1.10 - 2.05ms per-frame latency<br>
<br>
Test 3<br>
======<br>
<br>
Two Sascha Willems mesh samples running simultaneously. One
at high<br>
priority and the other running in a regular priority
graphics context.<br>
<br>
Also running Unigine Heaven at Extreme preset @ 2560x1600<br>
<br>
Results<br>
-------<br>
<br>
Mesh High: 7 - 100ms per-frame latency (Lots of fluctuation)<br>
Mesh Regular: 40 - 130ms per-frame latency (Lots of fluctuation)<br>
Unigine Heaven: 20-40 fps<br>
<br>
</span><br>
<span style="font-family:monospace,monospace"><span
style="font-family:monospace,monospace">Test 4<br>
======<br>
<br>
Two Sascha Willems mesh samples running simultaneously.
One at high<br>
priority and the other running in a regular priority
graphics context.<br>
<br>
Also running Talos Principle @ 4K<br>
<br>
Results<br>
-------<br>
<br>
Mesh High: 0.14 - 3.97ms per-frame latency (mostly
hovers at ~0.4ms)<br>
Mesh Regular: 0.43 - 8.11ms per-frame latency (Lots of
fluctuation)<br>
Talos: 24.8 fps AVG</span><br>
<br>
Observations<br>
============<br>
<br>
The high priority queue based on the SW scheduler provides
significant<br>
gains when paired with tasks that submit short-duration
commands into<br>
the queue. This can be observed in Tests 1 and 2.<br>
<br>
When the pipe is full of long-running commands, the effects
are dampened.<br>
As observed in Test 3, the per-frame latency suffers very
large spikes,<br>
and the latencies are very inconsistent.<br>
<br>
Talos seems to be a better-behaved game. It may be
submitting shorter<br>
draw commands, so the SW scheduler is able to interleave the
rest of<br>
the work.<br>
<br>
The results seem consistent with the hypothetical advantages
the SW<br>
scheduler should provide. I.e. your context can be scheduled
into the<br>
HW queue ahead of any other context, but everything already
committed<br>
to the HW queue is executed in strict FIFO order.<br>
<br>
</span></div>
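<div><span style="font-family:monospace,monospace">For illustration,
the run queue selection that produces this behaviour could look
roughly like the following (a sketch in the style of the
gpu_scheduler.c code of this era; names and details are illustrative,
not the exact upstream implementation):<br>
</span></div>
<pre>
/*
 * Sketch: entities are grouped into per-priority run queues and the
 * scheduler serves the highest priority queue first. A job selected
 * here is pushed to the HW ring, where it then executes in strict
 * FIFO order; priority only affects entry into the ring.
 */
static struct amd_sched_entity *
amd_sched_select_entity(struct amd_gpu_scheduler *sched)
{
	struct amd_sched_entity *entity;
	int i;

	/* Walk run queues from highest to lowest priority. */
	for (i = AMD_SCHED_PRIORITY_MAX - 1; i >= 0; i--) {
		entity = amd_sched_rq_select_entity(&amp;sched->sched_rq[i]);
		if (entity)
			return entity;
	}

	return NULL;
}
</pre>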
<div><span style="font-family:monospace,monospace">In order to
deal with cases similar to Test 3, we will need to take<br>
</span></div>
<div><span style="font-family:monospace,monospace">advantage of
further features.<br>
<br>
Notes<br>
=====<br>
<br>
- Tests were run multiple times, and reboots were performed
during tests.<br>
- The mesh sample isn't really designed for benchmarking,
but it should<br>
be decent for ballpark figures.<br>
- The high priority mesh app was run with default niceness
and also niceness<br>
at -20. This had no effect on the results, so it was not
added above.<br>
- CPU usage was not saturated while running the tests.<br>
<br>
</span></div>
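<div><span style="font-family:monospace,monospace">For reference, a
hedged sketch of one way such per-frame numbers can be gathered from
the CPU side (timing vkQueueSubmit() to fence signal; the mesh
sample's actual instrumentation may differ):<br>
</span></div>
<pre>
/* Measures submit-to-completion latency for one frame, in msec.
 * Assumes the queue, submit info and fence were created elsewhere;
 * purely illustrative. */
#include &lt;vulkan/vulkan.h&gt;
#include &lt;stdint.h&gt;
#include &lt;time.h&gt;

static double frame_latency_ms(VkDevice dev, VkQueue queue,
                               const VkSubmitInfo *submit, VkFence fence)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &amp;t0);
	vkQueueSubmit(queue, 1, submit, fence);
	vkWaitForFences(dev, 1, &amp;fence, VK_TRUE, UINT64_MAX);
	clock_gettime(CLOCK_MONOTONIC, &amp;t1);

	vkResetFences(dev, 1, &amp;fence);

	return (t1.tv_sec - t0.tv_sec) * 1000.0 +
	       (t1.tv_nsec - t0.tv_nsec) / 1.0e6;
}
</pre>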
<div><span style="font-family:monospace,monospace">Regards,<br>
</span></div>
<div><span style="font-family:monospace,monospace">Andres<br>
</span></div>
<br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Dec 23, 2016 at 1:18 PM,
Pierre-Loup A. Griffais <span dir="ltr"><<a
moz-do-not-send="true"
href="mailto:pgriffais@valvesoftware.com" target="_blank">pgriffais@valvesoftware.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">I hate to
keep bringing up display topics in an unrelated
conversation, but I'm not sure where you got "Application
-> X server -> compositor -> X server" from. As I
was saying before, we need to be presenting directly to the
HMD display, as no display server can be in the way, both for
latency and for quality-of-service reasons (a buggy
application cannot be allowed to accidentally display
undistorted rendering into the HMD); we intend to do the
necessary work for this, and the extent of X's (or a Wayland
implementation, or any other display server) involvement will
be to participate enough to know that the HMD display is
off-limits. If you have more questions on the display
aspect, or VR rendering in general, I'm happy to try to
address them out-of-band from this conversation.
<div class="HOEnZb">
<div class="h5"><br>
<br>
On 12/23/2016 02:54 AM, Christian König wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
But yes, in general you don't want another
compositor in the way, so<br>
we'll be acquiring the HMD display directly,
separate from any desktop<br>
or display server.<br>
</blockquote>
Assuming that the HMD is attached to the rendering
device in some<br>
way, you have the X server and the compositor, which
both try to be DRM<br>
master at the same time.<br>
<br>
Please correct me if that was fixed in the meantime,
but that sounds<br>
like it will simply not work. Or is this what Andres
mentioned below that Dave<br>
is working on?<br>
<br>
Additionally, a compositor in combination with X
is a bit<br>
counterproductive when you want to keep the latency low.<br>
<br>
E.g. the "normal" flow of a GL or Vulkan surface
filled with rendered<br>
data to be displayed is Application -> X
server -> compositor<br>
-> X server.<br>
<br>
The extra step between X server and compositor just
means extra latency<br>
and for this use case you probably don't want that.<br>
<br>
Targeting something like Wayland, with XWayland when you need X
compatibility,<br>
sounds like the much better idea.<br>
<br>
Regards,<br>
Christian.<br>
<br>
On 22.12.2016 at 20:54, Pierre-Loup A.
Griffais wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Display concerns are a separate issue, and as Andres
said we have<br>
other plans to address them. But yes, in general you
don't want another<br>
compositor in the way, so we'll be acquiring the HMD
display directly,<br>
separate from any desktop or display server. Same
with security, we<br>
can have a separate conversation about that when the
time comes.<br>
<br>
On 12/22/2016 08:41 AM, Serguei Sagalovitch wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Andres,<br>
<br>
Did you measure the latency, etc. impact of __any__
compositor?<br>
<br>
My understanding is that VR has pretty strict
requirements related to<br>
QoS.<br>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
<br>
On 2016-12-22 11:35 AM, Andres Rodriguez wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0
0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
Hey Christian,<br>
<br>
We are currently interested in X, but with some
distros switching to<br>
other compositors by default, we also need to
consider those.<br>
<br>
We agree, running the full vrcompositor as root
isn't something that<br>
we want to do. Too many security concerns.
Having a small root helper<br>
that does the privilege escalation for us is the
initial idea.<br>
<br>
For a long-term approach, Pierre-Loup and Dave
are working on dealing<br>
with the "two compositors" scenario a little
better in DRM+X.<br>
Fullscreen isn't really a sufficient approach,
since we don't want the<br>
HMD to be used as part of the Desktop
environment when a VR app is not<br>
in use (this is extremely annoying).<br>
<br>
When the above is settled, we should have an
auth mechanism besides<br>
DRM_MASTER or DRM_AUTH that allows the
vrcompositor to take over the<br>
HMD permanently away from X. Re-using that auth
method to gate this<br>
IOCTL is probably going to be the final
solution.<br>
<br>
I propose to start with ROOT_ONLY since it
should allow us to respect<br>
kernel IOCTL compatibility guidelines with the
most flexibility. Going<br>
from a restrictive to a more flexible permission
model would be<br>
inclusive, but going from a general to a
restrictive model may exclude<br>
some apps that used to work.<br>
<br>
Regards,<br>
Andres<br>
<br>
On 12/22/2016 6:42 AM, Christian König wrote:<br>
<blockquote class="gmail_quote" style="margin:0
0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">
Hi Andres,<br>
<br>
Well, using root might cause stability and
security problems as well.<br>
We worked quite hard to avoid exactly this for
X.<br>
<br>
We could make this feature depend on the
compositor being DRM master,<br>
but for example with X the X server is master
(and e.g. can change<br>
resolutions etc.) and not the compositor.<br>
<br>
So another question is also what windowing
system (if any) are you<br>
planning to use? X, Wayland, Flinger, or
something completely<br>
different?<br>
<br>
Regards,<br>
Christian.<br>
<br>
On 20.12.2016 at 16:51, Andres
Rodriguez wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
Hi Christian,<br>
<br>
That is definitely a concern. What we are
currently thinking is to<br>
make the high priority queues accessible to
root only.<br>
<br>
Therefore if a non-root user attempts to set
the high priority flag<br>
on context allocation, we would fail the
call and return -EPERM.<br>
<br>
Regards,<br>
Andres<br>
<br>
<br>
On 12/20/2016 7:56 AM, Christian König
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
BTW: If there is a non-VR application
which will use the high-priority<br>
h/w queue then the VR application will
suffer. Any ideas how<br>
to solve it?<br>
</blockquote>
Yeah, that problem came to my mind as
well.<br>
<br>
Basically we need to restrict those high
priority submissions to<br>
the VR compositor, or otherwise any
malfunctioning application could<br>
use them.<br>
<br>
Just think about some WebGL suddenly
taking all our rendering away<br>
and we won't get anything drawn any more.<br>
<br>
Alex or Michel, any ideas on that?<br>
<br>
Regards,<br>
Christian.<br>
<br>
On 19.12.2016 at 15:48, Serguei
Sagalovitch wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
> If the compute queue is occupied only
by you, the efficiency<br>
> is equal to setting the job queue to
high priority, I think.<br>
The only risk is the situation where
graphics will take all<br>
needed CUs. But in any case it should be
a very good test.<br>
<br>
Andres/Pierre-Loup,<br>
<br>
Did you try to do it or it is a lot of
work for you?<br>
<br>
<br>
BTW: If there is a non-VR application
which will use the high-priority<br>
h/w queue then the VR application will
suffer. Any ideas how<br>
to solve it?<br>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
On 2016-12-19 12:50 AM, zhoucm1 wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
Do you encounter the priority issue
for the compute queue with<br>
the current driver?<br>
<br>
If the compute queue is occupied only by
you, the efficiency is equal<br>
to setting the job queue to high
priority, I think.<br>
<br>
Regards,<br>
David Zhou<br>
<br>
On 2016-12-19 13:29, Andres Rodriguez
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
Yes, Vulkan is available on the all-open stack
through the Mesa radv UMD.<br>
<br>
I'm not sure if I'm asking for too
much, but if we can<br>
coordinate a similar interface in
radv and amdgpu-pro at the<br>
Vulkan level, that would be great.<br>
<br>
I'm not sure what that's going to be
yet.<br>
<br>
- Andres<br>
<br>
On 12/19/2016 12:11 AM, zhoucm1
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<br>
<br>
On 2016-12-19 11:33, Pierre-Loup
A. Griffais wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
We're currently working with the
open stack; I assume that a<br>
mechanism could be exposed by
both open and Pro Vulkan<br>
userspace drivers and that the
amdgpu kernel interface<br>
improvements we would pursue
following this discussion would<br>
let both drivers take advantage
of the feature, correct?<br>
</blockquote>
Of course.<br>
Does the open stack have Vulkan
support?<br>
<br>
Regards,<br>
David Zhou<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<br>
On 12/18/2016 07:26 PM, zhoucm1
wrote:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
By the way, are you using
the all-open driver or the amdgpu-pro<br>
driver?<br>
<br>
+David Mao, who is working on
our Vulkan driver.<br>
<br>
Regards,<br>
David Zhou<br>
<br>
On 2016-12-18 06:05,
Pierre-Loup A. Griffais wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
Hi Serguei,<br>
<br>
I'm also working on
bringing up our VR runtime
on top of<br>
amdgpu;<br>
see replies inline.<br>
<br>
On 12/16/2016 09:05 PM,
Sagalovitch, Serguei wrote:<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
Andres,<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
For current VR
workloads we have 3
separate processes<br>
running<br>
actually:<br>
</blockquote>
So we could have a potential memory overcommit case, or do<br>
you do partitioning<br>
on your own? I would think that there is a need to avoid<br>
overcommit in the VR case to<br>
prevent any BO migration.<br>
</blockquote>
<br>
You're entirely correct;
currently the VR runtime is<br>
setting up<br>
prioritized CPU scheduling
for its VR compositor, we're<br>
working on<br>
prioritized GPU scheduling
and pre-emption (e.g. this<br>
thread), and in<br>
the future it will make
sense to do work in order to
make<br>
sure that<br>
its memory allocations do
not get evicted, to prevent
any<br>
unwelcome<br>
additional latency in the
event of needing to perform<br>
just-in-time<br>
reprojection.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
BTW: Do you mean __real__
processes or threads?<br>
Based on my understanding
sharing BOs between
different<br>
processes<br>
could introduce additional synchronization constraints. BTW:<br>
I am not sure<br>
if we are able to share Vulkan sync objects across the process<br>
boundary.<br>
</blockquote>
<br>
They are different
processes; it is important
for the<br>
compositor that<br>
is responsible for
quality-of-service features
such as<br>
consistently<br>
presenting distorted frames
with the right latency,<br>
reprojection, etc.,<br>
to be separate from the main
application.<br>
<br>
Currently we are using
unreleased cross-process
memory and<br>
semaphore<br>
extensions to fetch updated
eye images from the client<br>
application,<br>
but the just-in-time
reprojection discussed here
does not<br>
actually<br>
have any direct interactions
with cross-process resource<br>
sharing,<br>
since it's achieved by using
whatever is the latest, most<br>
up-to-date<br>
eye images that have already
been sent by the client<br>
application,<br>
which are already available
to use without additional<br>
synchronization.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
3) System compositor
(we are looking at
approaches to<br>
remove this<br>
overhead)<br>
</blockquote>
Yes, IMHO the best is to
run in "full screen
mode".<br>
</blockquote>
<br>
Yes, we are working on
mechanisms to present
directly to the<br>
headset<br>
display without any
intermediaries as a separate
effort.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
The latency is our main
concern,<br>
</blockquote>
I would assume that this
is the known problem (at
least for<br>
compute<br>
usage).<br>
It looks like amdgpu
/ kernel submission is
rather CPU<br>
intensive<br>
(at least<br>
in the default
configuration).<br>
</blockquote>
<br>
As long as it's a consistent
cost, it shouldn't be an issue.<br>
However, if<br>
there's a high degree of
variance then that would be<br>
troublesome and we<br>
would need to account for
the worst case.<br>
<br>
Hopefully the requirements
and approach we described
make<br>
sense; we're<br>
looking forward to your
feedback and suggestions.<br>
<br>
Thanks!<br>
- Pierre-Loup<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px #ccc
solid;padding-left:1ex">
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
<br>
From: Andres Rodriguez
<<a
moz-do-not-send="true"
href="mailto:andresr@valvesoftware.com"
target="_blank">andresr@valvesoftware.com</a>><br>
Sent: December 16, 2016
10:00 PM<br>
To: Sagalovitch, Serguei;
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org" target="_blank">amd-gfx@lists.freedesktop.org</a><br>
Subject: RE: [RFC]
Mechanism for high
priority scheduling<br>
in amdgpu<br>
<br>
Hey Serguei,<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
[Serguei] No. I mean pipe :-) as the MEC defines it. As far as I<br>
understand (by simplifying)<br>
some scheduling is per pipe. I know about the current allocation<br>
scheme but I do not think<br>
that it is ideal. I would assume that we need to switch to<br>
dynamic partitioning<br>
of resources based on the workload, otherwise we will have a<br>
resource conflict<br>
between Vulkan compute and OpenCL.<br>
</blockquote>
<br>
I agree the partitioning
isn't ideal. I'm hoping we
can<br>
start with a<br>
solution that assumes that<br>
only pipe0 has any work
and the other pipes are
idle (no<br>
HSA/ROCm<br>
running on the system).<br>
<br>
This should be more or
less the use case we
expect from VR<br>
users.<br>
<br>
I agree the split is
currently not ideal, but
I'd like to<br>
consider<br>
that a separate task,
because<br>
making it dynamic is not
straightforward :P<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd<br>
will not be involved. I would assume that in the case of VR we<br>
will have one main<br>
application ("console" mode(?)) so we could temporarily<br>
"ignore"<br>
OpenCL/ROCm needs when VR is running.<br>
</blockquote>
<br>
Correct, this is why we
want to enable the high
priority<br>
compute<br>
queue through<br>
libdrm-amdgpu, so that we
can expose it through
Vulkan<br>
later.<br>
<br>
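For illustration, at the ioctl level requesting such a context
could look roughly as follows (AMDGPU_CTX_HIGH_PRIORITY is the
placeholder flag from this discussion, not an existing define):<br>
<pre>
#include &lt;string.h&gt;
#include &lt;stdint.h&gt;
#include &lt;xf86drm.h&gt;
#include &lt;amdgpu_drm.h&gt;

#define AMDGPU_CTX_HIGH_PRIORITY (1 &lt;&lt; 0) /* hypothetical flag */

static int create_high_prio_ctx(int fd, uint32_t *ctx_id)
{
	union drm_amdgpu_ctx args;
	int r;

	memset(&amp;args, 0, sizeof(args));
	args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
	args.in.flags = AMDGPU_CTX_HIGH_PRIORITY;

	r = drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &amp;args, sizeof(args));
	if (r)
		return r; /* e.g. -EPERM for non-root callers */

	*ctx_id = args.out.alloc.ctx_id;
	return 0;
}
</pre>
<br>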
For current VR workloads
we have 3 separate
processes<br>
running actually:<br>
1) Game process<br>
2) VR Compositor (this
is the process that will
require<br>
high<br>
priority queue)<br>
3) System compositor
(we are looking at
approaches to<br>
remove this<br>
overhead)<br>
<br>
For now I think it is okay
to assume no OpenCL/ROCm
running<br>
simultaneously, but<br>
I would also like to be
able to address this case
in the<br>
future<br>
(cross-pipe priorities).<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
[Serguei] The problem
with pre-emption of
a graphics task:<br>
(a) it<br>
may take time so<br>
latency may suffer<br>
</blockquote>
<br>
The latency is our main
concern; we want something
that is<br>
predictable. A good<br>
illustration of what the
reprojection scheduling
looks like<br>
can be<br>
found here:<br>
<a moz-do-not-send="true"
href="https://community.amd.com/servlet/JiveServlet/showImage/38-1310-104754/pastedImage_3.png"
rel="noreferrer"
target="_blank">https://community.amd.com/serv<wbr>let/JiveServlet/showImage/38-<wbr>1310-104754/pastedImage_3.png</a><br>
<br>
<br>
<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
(b) to preempt we need
to have different
"context" - we<br>
want<br>
to guarantee that
submissions from the
same context will<br>
be executed<br>
in order.<br>
</blockquote>
<br>
This is okay, as the
reprojection work doesn't
have<br>
dependencies on<br>
the game context, and it<br>
even happens in a separate
process.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
BTW: (a) Do you want
"preempt" and later
resume or do you<br>
want<br>
"preempt" and<br>
"cancel/abort"<br>
</blockquote>
<br>
Preempt the game with the
compositor task and then
resume<br>
it.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
(b) Vulkan is a generic
API and could be used
for graphics<br>
as well as<br>
for plain compute tasks
(VK_QUEUE_COMPUTE_BIT).<br>
</blockquote>
<br>
Yeah, the plan is to use
Vulkan compute. But if you figure
figure<br>
out a way<br>
for us to get<br>
a guaranteed execution
time using Vulkan
graphics, then<br>
I'll take you<br>
out for a beer :)<br>
<br>
Regards,<br>
Andres<br>
______________________________<wbr>__________<br>
From: Sagalovitch, Serguei
[<a moz-do-not-send="true"
href="mailto:Serguei.Sagalovitch@amd.com" target="_blank">Serguei.Sagalovitch@amd.com</a>]<br>
Sent: Friday, December 16,
2016 9:13 PM<br>
To: Andres Rodriguez; <a
moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
Subject: Re: [RFC]
Mechanism for high
priority scheduling<br>
in amdgpu<br>
<br>
Hi Andres,<br>
<br>
Please see inline (as
[Serguei])<br>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
<br>
From: Andres Rodriguez
<<a
moz-do-not-send="true"
href="mailto:andresr@valvesoftware.com"
target="_blank">andresr@valvesoftware.com</a>><br>
Sent: December 16, 2016
8:29 PM<br>
To: Sagalovitch, Serguei;
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org" target="_blank">amd-gfx@lists.freedesktop.org</a><br>
Subject: RE: [RFC]
Mechanism for high
priority scheduling<br>
in amdgpu<br>
<br>
Hi Serguei,<br>
<br>
Thanks for the feedback.
Answers inline as [AR].<br>
<br>
Regards,<br>
Andres<br>
<br>
______________________________<wbr>__________<br>
From: Sagalovitch, Serguei
[<a moz-do-not-send="true"
href="mailto:Serguei.Sagalovitch@amd.com" target="_blank">Serguei.Sagalovitch@amd.com</a>]<br>
Sent: Friday, December 16,
2016 8:15 PM<br>
To: Andres Rodriguez; <a
moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
Subject: Re: [RFC]
Mechanism for high
priority scheduling<br>
in amdgpu<br>
<br>
Andres,<br>
<br>
<br>
Quick comments:<br>
<br>
1) To minimize "bubbles",
etc. we need to "force" CU<br>
assignments/binding<br>
to high-priority queue
when it will be in use and
"free"<br>
them later<br>
(we do not want forever
take CUs from e.g. graphic
task to<br>
degrade<br>
graphics<br>
performance).<br>
<br>
Otherwise we could have a scenario where a long graphics task (or<br>
low-priority compute) will take all (extra) CUs and high-priority<br>
will wait for needed resources.<br>
It will not be visible with "NOP" but only when you submit a "real"<br>
compute task, so I would recommend not to use "NOP" packets at all<br>
for testing.<br>
<br>
It (CU assignment) could be relatively easily done when everything<br>
is going via the kernel (e.g. as part of frame submission) but I<br>
must admit that I am not sure about the best way for user level<br>
submissions (amdkfd).<br>
<br>
[AR] I wasn't aware of
this part of the
programming<br>
sequence. Thanks<br>
for the heads up!<br>
Is this similar to the CU
masking programming?<br>
[Serguei] Yes. To simplify: the problem is that the "scheduler",<br>
when deciding which<br>
queue to run, will check if there are enough resources and<br>
if not then it will begin<br>
to check other queues with lower priority.<br>
<br>
2) I would recommend to
dedicate the whole pipe to<br>
high-priority<br>
queue and have<br>
nothing there except it.<br>
<br>
[AR] I'm guessing in this
context you mean pipe =
queue?<br>
(as opposed<br>
to the MEC definition<br>
of pipe, which is a
grouping of queues). I say
this because<br>
amdgpu<br>
only has access to 1 pipe,<br>
and the rest are
statically partitioned for
amdkfd usage.<br>
<br>
[Serguei] No. I mean pipe :-) as the MEC defines it. As far as I<br>
understand (by simplifying)<br>
some scheduling is per pipe. I know about the current allocation<br>
scheme but I do not think<br>
that it is ideal. I would assume that we need to switch to<br>
dynamic partitioning<br>
of resources based on the workload, otherwise we will have a<br>
resource conflict<br>
between Vulkan compute and OpenCL.<br>
<br>
<br>
BTW: Which user level API
do you want to use for
compute:<br>
Vulkan or<br>
OpenCL?<br>
<br>
[AR] Vulkan<br>
<br>
[Serguei] Vulkan works via amdgpu (kernel submissions) so amdkfd<br>
will not be involved. I would assume that in the case of VR we will<br>
have one main<br>
application ("console" mode(?)) so we could temporarily<br>
"ignore"<br>
OpenCL/ROCm needs when VR is running.<br>
<br>
<blockquote
class="gmail_quote"
style="margin:0 0 0
.8ex;border-left:1px
#ccc
solid;padding-left:1ex">
we will not be able to
provide a solution
compatible with<br>
GFX<br>
workloads.<br>
</blockquote>
I assume that you are
talking about graphics? Am
I right?<br>
<br>
[AR] Yeah, my
understanding is that
pre-empting the<br>
currently running<br>
graphics job and
scheduling in<br>
something else using
mid-buffer pre-emption has
some cases<br>
where it<br>
doesn't work well. But if
with<br>
Polaris10 it starts
working well, it might be
a better<br>
solution for<br>
us (because the whole
reprojection<br>
work uses the Vulkan
graphics stack at the
moment, and<br>
porting it to<br>
compute is not trivial).<br>
<br>
[Serguei] The problem
with pre-emption of
a graphics task:<br>
(a) it may<br>
take time so<br>
latency may suffer; (b) to
preempt we need to have
different<br>
"context"<br>
- we want<br>
to guarantee that
submissions from the same
context will be<br>
executed<br>
in order.<br>
BTW: (a) Do you want
"preempt" and later resume
or do you<br>
want<br>
"preempt" and<br>
"cancel/abort"? (b)
Vulkan is a generic API and
could be used<br>
for graphics as well as
for plain compute tasks<br>
(VK_QUEUE_COMPUTE_BIT).<br>
<br>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
<br>
<br>
From: amd-gfx <<a
moz-do-not-send="true"
href="mailto:amd-gfx-bounces@lists.freedesktop.org"
target="_blank">amd-gfx-bounces@lists.freedes<wbr>ktop.org</a>>
on<br>
behalf of<br>
Andres Rodriguez <<a
moz-do-not-send="true"
href="mailto:andresr@valvesoftware.com"
target="_blank">andresr@valvesoftware.com</a>><br>
Sent: December 16, 2016
6:15 PM<br>
To: <a
moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
Subject: [RFC] Mechanism
for high priority
scheduling in<br>
amdgpu<br>
<br>
Hi Everyone,<br>
<br>
This RFC is also available
as a gist here:<br>
<a moz-do-not-send="true"
href="https://gist.github.com/lostgoat/7000432cd6864265dbc2c3ab93204249"
rel="noreferrer"
target="_blank">https://gist.github.com/lostgo<wbr>at/7000432cd6864265dbc2c3ab932<wbr>04249</a><br>
<br>
<br>
We are interested in
feedback for a mechanism
to<br>
effectively schedule<br>
high<br>
priority VR reprojection
tasks (also referred to as<br>
time-warping) for<br>
Polaris10<br>
running on the amdgpu
kernel driver.<br>
<br>
Brief context:<br>
--------------<br>
<br>
The main objective of
reprojection is to avoid
motion<br>
sickness for VR<br>
users in<br>
scenarios where the game
or application would fail
to finish<br>
rendering a new<br>
frame in time for the next
VBLANK. When this happens,
the<br>
user's head<br>
movements<br>
are not reflected on the
Head Mounted Display (HMD)
for the<br>
duration<br>
of an<br>
extra frame. This extended
mismatch between the inner
ear<br>
and the<br>
eyes may<br>
cause the user to
experience motion
sickness.<br>
<br>
The VR compositor deals
with this problem by
fabricating a<br>
new frame<br>
using the<br>
user's updated head
position in combination
with the<br>
previous frames.<br>
This<br>
avoids a prolonged
mismatch between the HMD
output and the<br>
inner ear.<br>
<br>
Because of the adverse
effects on the user, we
require high<br>
confidence that the<br>
reprojection task will
complete before the VBLANK
interval,<br>
even if<br>
the GFX pipe<br>
is currently full of work
from the game/application
(which<br>
is most<br>
likely the case).<br>
<br>
For more details and
illustrations, please
refer to the<br>
following<br>
document:<br>
<a moz-do-not-send="true"
href="https://community.amd.com/community/gaming/blog/2016/03/28/asynchronous-shaders-evolved"
rel="noreferrer"
target="_blank">https://community.amd.com/comm<wbr>unity/gaming/blog/2016/03/28/<wbr>asynchronous-shaders-evolved</a><br>
<br>
<br>
Requirements:<br>
-------------<br>
<br>
The mechanism must expose
the following
functionality:<br>
<br>
* Job round trip time
must be predictable, from<br>
submission to<br>
fence signal<br>
<br>
* The mechanism must
support compute workloads.<br>
<br>
Goals:<br>
------<br>
<br>
* The mechanism should
provide low submission
latencies<br>
<br>
Test: submitting a NOP
packet through the
mechanism on busy<br>
hardware<br>
should<br>
be equivalent to
submitting a NOP on idle
hardware.<br>
<br>
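For reference, "a NOP packet" here is a PM4 NOP; a minimal sketch
of filling an IB with them (0xffff1000 is the PACKET3 NOP header
used by the amdgpu tests; buffer and submission setup omitted):<br>
<pre>
#include &lt;stdint.h&gt;

#define GFX_COMPUTE_NOP 0xffff1000 /* PACKET3(PACKET3_NOP, 0x3fff) */

static void fill_nop_ib(uint32_t *ib_cpu, unsigned dw_count)
{
	unsigned i;

	for (i = 0; i &lt; dw_count; i++)
		ib_cpu[i] = GFX_COMPUTE_NOP;
}
</pre>
<br>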
Nice to have:<br>
-------------<br>
<br>
* The mechanism should
also support GFX
workloads.<br>
<br>
My understanding is that
with the current hardware<br>
capabilities in<br>
Polaris10 we<br>
will not be able to
provide a solution
compatible with GFX<br>
workloads.<br>
<br>
But I would love to hear
otherwise. So if anyone
has an<br>
idea,<br>
approach or<br>
suggestion that will also
be compatible with the GFX
ring,<br>
please let<br>
us know<br>
about it.<br>
<br>
* The above guarantees
should also be respected
by<br>
amdkfd workloads<br>
<br>
Would be good to have for
consistency, but not
strictly<br>
necessary as<br>
users running<br>
games are not
traditionally running HPC
workloads in the<br>
background.<br>
<br>
Proposed approach:<br>
------------------<br>
<br>
Similar to the Windows
driver, we could expose a
high<br>
priority<br>
compute queue to<br>
userspace.<br>
<br>
Submissions to this
compute queue will be
scheduled with<br>
high<br>
priority, and may<br>
acquire hardware resources
previously in use by other<br>
queues.<br>
<br>
This can be achieved by
taking advantage of the
'priority'<br>
field in<br>
the HQDs<br>
and could be programmed by
amdgpu or the amdgpu
scheduler.<br>
The relevant<br>
register fields are:<br>
*
mmCP_HQD_PIPE_PRIORITY<br>
*
mmCP_HQD_QUEUE_PRIORITY<br>
<br>
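A sketch of what programming these fields could look like for one
compute queue on VI (a sketch only: helper names follow gfx_v8_0.c,
and the priority values are assumptions, not tuned numbers):<br>
<pre>
/* Illustrative: select one compute HQD and raise its priority.
 * CP_HQD_PIPE_PRIORITY is a 2-bit field and CP_HQD_QUEUE_PRIORITY
 * a 4-bit field. */
static void set_hqd_high_priority(struct amdgpu_device *adev,
				  struct amdgpu_ring *ring)
{
	mutex_lock(&amp;adev->srbm_mutex);
	vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);

	WREG32(mmCP_HQD_PIPE_PRIORITY, 0x2);
	WREG32(mmCP_HQD_QUEUE_PRIORITY, 0xf);

	vi_srbm_select(adev, 0, 0, 0, 0);
	mutex_unlock(&amp;adev->srbm_mutex);
}
</pre>
<br>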
Implementation approach 1
- static partitioning:<br>
------------------------------<wbr>------------------<br>
<br>
The amdgpu driver
currently controls 8
compute queues from<br>
pipe0. We can<br>
statically partition these
as follows:<br>
* 7x regular<br>
* 1x high priority<br>
<br>
The relevant priorities
can be set so that
submissions to<br>
the high<br>
priority<br>
ring will starve the other
compute rings and the GFX
ring.<br>
<br>
The amdgpu scheduler will
only place jobs into the
high<br>
priority<br>
rings if the<br>
context is marked as high
priority. And a
corresponding<br>
priority<br>
should be<br>
added to keep track of
this information:<br>
*
AMD_SCHED_PRIORITY_KERNEL<br>
* ->
AMD_SCHED_PRIORITY_HIGH<br>
*
AMD_SCHED_PRIORITY_NORMAL<br>
<br>
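A sketch of the resulting enum (names as proposed above; the
ordering, with a higher numeric value meaning higher priority, is
illustrative):<br>
<pre>
enum amd_sched_priority {
	AMD_SCHED_PRIORITY_NORMAL = 0,
	AMD_SCHED_PRIORITY_HIGH,	/* proposed addition */
	AMD_SCHED_PRIORITY_KERNEL,	/* highest: kernel submissions */
	AMD_SCHED_PRIORITY_MAX
};
</pre>
<br>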
The user will request a
high priority context by
setting an<br>
appropriate flag<br>
in drm_amdgpu_ctx_in
(AMDGPU_CTX_HIGH_PRIORITY
or similar):<br>
<a moz-do-not-send="true"
href="https://github.com/torvalds/linux/blob/master/include/uapi/drm/amdgpu_drm.h#L163"
rel="noreferrer"
target="_blank">https://github.com/torvalds/li<wbr>nux/blob/master/include/uapi/<wbr>drm/amdgpu_drm.h#L163</a><br>
<br>
<br>
<br>
<br>
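The existing struct already carries a flags field, so the request
could be a single new bit (the flag value below is a placeholder,
not an upstream define):<br>
<pre>
struct drm_amdgpu_ctx_in {
	/** AMDGPU_CTX_OP_* */
	__u32	op;
	/** For future use, no flags defined so far */
	__u32	flags;
	__u32	ctx_id;
	__u32	_pad;
};

#define AMDGPU_CTX_HIGH_PRIORITY	(1 &lt;&lt; 0) /* proposed */
</pre>
<br>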
The setting is at a per-context
level so that we
can:<br>
* Maintain a
consistent FIFO ordering
of all<br>
submissions to a<br>
context<br>
* Create high priority
and non-high priority
contexts<br>
in the same<br>
process<br>
<br>
Implementation approach 2
- dynamic priority
programming:<br>
------------------------------<wbr>---------------------------<br>
<br>
Similar to the above, but
instead of programming the<br>
priorities at<br>
amdgpu_init() time, the SW
scheduler will reprogram
the<br>
queue priorities<br>
dynamically when
scheduling a task.<br>
<br>
This would involve having
a hardware-specific
callback from<br>
the<br>
scheduler to<br>
set the appropriate queue
priority: set_priority(int
ring,<br>
int index,<br>
int priority)<br>
<br>
During this callback we
would have to grab the
SRBM mutex<br>
to perform<br>
the appropriate<br>
HW programming, and I'm
not really sure if that is<br>
something we<br>
should be doing from<br>
the scheduler.<br>
<br>
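A sketch of where such a callback could live, extending the
scheduler backend ops (existing members shown for context; the
set_priority hook and its exact signature are the open proposal):<br>
<pre>
struct amd_sched_backend_ops {
	struct fence *(*dependency)(struct amd_sched_job *sched_job);
	struct fence *(*run_job)(struct amd_sched_job *sched_job);
	void (*timedout_job)(struct amd_sched_job *sched_job);
	void (*free_job)(struct amd_sched_job *sched_job);
	/* proposed: reprogram HW queue priority when scheduling */
	void (*set_priority)(int ring, int index, int priority);
};
</pre>
<br>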
On the positive side, this
approach would allow us to<br>
program a range of<br>
priorities for jobs
instead of a single "high
priority"<br>
value",<br>
achieving<br>
something similar to the
niceness API available for
CPU<br>
scheduling.<br>
<br>
I'm not sure if this
flexibility is something
that we would<br>
need for<br>
our use<br>
case, but it might be
useful in other scenarios
(multiple<br>
users<br>
sharing compute<br>
time on a server).<br>
<br>
This approach would
require a new int field in<br>
drm_amdgpu_ctx_in, or<br>
repurposing<br>
of the flags field.<br>
<br>
Known current obstacles:<br>
------------------------<br>
<br>
The SQ is currently
programmed to disregard
the HQD<br>
priorities, and<br>
instead it picks<br>
jobs at random. Settings
from the shader itself are
also<br>
disregarded<br>
as this is<br>
considered a privileged
field.<br>
<br>
Effectively we can get our
compute wavefront launched
ASAP,<br>
but we<br>
might not get the<br>
time we need on the SQ.<br>
<br>
The current programming
would have to be changed
to allow<br>
priority<br>
propagation<br>
from the HQD into the SQ.<br>
<br>
Generic approach for all
HW IPs:<br>
------------------------------<wbr>--<br>
<br>
For consistency purposes,
the high priority context
can be<br>
enabled<br>
for all HW IPs<br>
with support of the SW
scheduler. This will
function<br>
similarly to the<br>
current<br>
AMD_SCHED_PRIORITY_KERNEL
priority, where the job
can jump<br>
ahead of<br>
anything not<br>
committed to the HW queue.<br>
<br>
The benefits of requesting
a high priority context
for a<br>
non-compute<br>
queue will<br>
be lesser (e.g. up to 10s
of wait time if a GFX
command is<br>
stuck in<br>
front of<br>
you), but having the API
in place will allow us to
easily<br>
improve the<br>
implementation<br>
in the future as new
features become available
in new<br>
hardware.<br>
<br>
Future steps:<br>
-------------<br>
<br>
Once we have an approach
settled, I can take care
of the<br>
implementation.<br>
<br>
Also, once the interface
is mostly decided, we can
start<br>
thinking about<br>
exposing the high priority
queue through radv.<br>
<br>
Request for feedback:<br>
---------------------<br>
<br>
We aren't married to any
of the approaches outlined
above.<br>
Our goal<br>
is to<br>
obtain a mechanism that
will allow us to complete
the<br>
reprojection<br>
job within a<br>
predictable amount of
time. So if anyone
has any<br>
suggestions for<br>
improvements or
alternative strategies we
are more than<br>
happy to hear<br>
them.<br>
<br>
If any of the technical
information above is also<br>
incorrect, feel<br>
free to point<br>
out my misunderstandings.<br>
<br>
Looking forward to hearing
from you.<br>
<br>
Regards,<br>
Andres<br>
<br>
______________________________<wbr>_________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org" target="_blank">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx"
rel="noreferrer"
target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/amd-gfx</a><br>
<br>
</blockquote>
<br>
______________________________<wbr>_________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx"
rel="noreferrer"
target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/amd-gfx</a><br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
______________________________<wbr>_________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx"
rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/amd-gfx</a><br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
______________________________<wbr>_________________<br>
amd-gfx mailing list<br>
<a moz-do-not-send="true"
href="mailto:amd-gfx@lists.freedesktop.org"
target="_blank">amd-gfx@lists.freedesktop.org</a><br>
<a moz-do-not-send="true"
href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx"
rel="noreferrer" target="_blank">https://lists.freedesktop.org/<wbr>mailman/listinfo/amd-gfx</a><br>
</blockquote>
<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
Sincerely yours,<br>
Serguei Sagalovitch<br>
<br>
</blockquote>
<br>
</blockquote>
<br>
</blockquote>
<br>
</div>
</div>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>