Add support for high priority scheduling in amdgpu
Christian König
deathsimple at vodafone.de
Thu Mar 2 11:00:33 UTC 2017
On 01.03.2017 at 18:24, Andres Rodriguez wrote:
>
>
> On 2017-03-01 12:13 PM, Andres Rodriguez wrote:
>>
>>
>> On 3/1/2017 6:42 AM, Christian König wrote:
>>> Patches #1-#14 are Acked-by: Christian König
>>> <christian.koenig at amd.com>.
>>>
>>> Patch #15:
>>>
>>> Not sure if that is a good idea or not, need to take a closer look
>>> after digging through the rest.
>>>
>>> In general the HW IP is just for the IOCTL API and not for internal
>>> use inside the driver.
>> I'll drop this patch and use ring->funcs->type instead.
>>>
>>> Patch #16:
>>>
>>> Really nice :) I don't have time to look into it in detail, but you
>>> have one misconception I'd like to point out:
>>>> The queue manager maintains a per-file descriptor map of user ring ids
>>>> to amdgpu_ring pointers. Once a map is created it is permanent
>>>> (this is
>>>> required to maintain FIFO execution guarantees for a ring).
>>> Actually we don't have a FIFO execution guarantee per ring. We only
>>> have that per context.
>>
>> Agreed. I'm using pretty imprecise terminology here which can be
>> confusing. I wanted to be more precise than "context", because two
>> amdgpu_cs_request submissions to the same context but with a
>> different ring field can execute out of order.
>>
>> I think s/ring/context's ring/ should be enough to clarify here if
>> you think so as well.
Yeah, just fix the description a bit and we are good to go.
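For reference, a minimal sketch of the per-fd mapping being described
(names and layout are illustrative, not the exact patch contents):

    /* one mapper per HW IP type, owned by the file's fpriv */
    struct amdgpu_queue_mapper {
            int                hw_ip;
            struct mutex       lock;
            /* protected by lock; an entry is written once and never
             * changed afterwards, so submissions against one user ring
             * id always reach the same amdgpu_ring (i.e. FIFO per
             * context ring, as clarified above) */
            struct amdgpu_ring *queue_map[AMDGPU_MAX_RINGS];
    };

    struct amdgpu_queue_mgr {
            struct amdgpu_queue_mapper mapper[AMDGPU_HW_IP_NUM];
    };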
>>
>>>
>>> E.g. commands from different contexts can execute at the same time
>>> and out of order.
>>>
>>> Making this per file is ok for now, but you should keep in mind that
>>> we might want to change that sooner or later.
>>>
>>> Patch #17 & #18: I need to take a closer look when I have more time,
>>> but the comments from others sounded valid to me as well.
>>>
>>> Patch #19: Raising and lowering the priority of a ring during
>>> command submission doesn't sound like a good idea to me.
>> I'm not really sure what would be a better time than at command
>> submission.
>>
>> If it was just SPI priorities we could have static partitioning of
>> rings, some high priority and some regular, etc. But that approach
>> reduces the number of rings
> Sorry, I finished typing something else and forgot this section was
> incomplete. Full reply:
>
> I'm not really sure what would be a better time than at command
> submission.
>
> If it was just SPI priorities we could have static partitioning of
> rings, some high priority and some regular, etc. But that approach
> reduces the number of rings available. It would also require a
> callback at command submission time for CU reservation.
Ok, as Alex wrote I'm also not 100% sure if that really works on all
hardware. But we could give it a try, because I don't see much of a
better alternative either.
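For illustration, the submission-time hook could look roughly like the
sketch below, assuming a set_priority() entry in amdgpu_ring_funcs
(helper name and locking are indicative only):

    void amdgpu_ring_priority_get(struct amdgpu_ring *ring,
                                  enum amd_sched_priority priority)
    {
            if (!ring->funcs->set_priority)
                    return;

            mutex_lock(&ring->priority_mutex);
            /* only ever raise the priority here; lowering happens once
             * the last high priority job on the ring has completed */
            if (priority > ring->priority) {
                    ring->priority = priority;
                    ring->funcs->set_priority(ring, priority);
            }
            mutex_unlock(&ring->priority_mutex);
    }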
>>>
>>> The way you currently have it implemented would also raise the
>>> priority of already running jobs on the same ring. Keep in mind that
>>> everything is pipelined here.
>> That is actually intentional. If there is work already on the ring
>> with lower priority we don't want the high priority work to have to
>> wait for it to finish executing at regular priority. Therefore the
>> work that has already been committed to the ring inherits the higher
>> priority level.
>>
>> I agree this isn't ideal, which is why the LRU ring mapping policy is
>> there to make sure this doesn't happen often.
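For illustration, such an LRU policy could look roughly like this
(assuming a per-device list of rings ordered by last use; names are
indicative only):

    struct amdgpu_ring *amdgpu_ring_lru_get(struct amdgpu_device *adev,
                                            int type)
    {
            struct amdgpu_ring *ring, *res = NULL;

            spin_lock(&adev->ring_lru_list_lock);
            /* the head of the list is the least recently used ring,
             * i.e. the one most likely to be idle */
            list_for_each_entry(ring, &adev->ring_lru_list, lru_list) {
                    if (ring->funcs->type != type)
                            continue;
                    res = ring;
                    /* move to the tail: it is now most recently used */
                    list_move_tail(&res->lru_list, &adev->ring_lru_list);
                    break;
            }
            spin_unlock(&adev->ring_lru_list_lock);

            return res;
    }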
>>>
>>> In addition to that you can't have a fence callback in the job
>>> structure, because the job structure is freed by the same fence as
>>> well. So it can happen that you access freed memory (but only for
>>> a very short period of time).
>> Any strong preference for either 1) refcounting the job structure, or
>> 2) allocating a new piece of memory to store the callback parameters?
How about option #3: just tie it to the job lifetime.
See drivers/gpu/drm/amd/amdgpu/amdgpu_job.c. amdgpu_job_run() is called
when the job is ready to run.
amdgpu_job_free_cb() is called from a work item after a scheduler job
has finished executing. If that is too late we could also add another
callback to the scheduler for this.
amdgpu_job_free() is called when we directly submitted the job without
going through the scheduler. Don't touch it; that is just for GPU reset
handling.
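To illustrate option #3, the priority drop could simply ride on the
existing free callback; a sketch (the s_priority field and the
amdgpu_ring_priority_put() helper are hypothetical here):

    static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
    {
            struct amdgpu_job *job = container_of(s_job,
                                                  struct amdgpu_job,
                                                  base);

            /* the scheduler fence has signaled and we run from a work
             * item, so the job memory is still valid at this point */
            amdgpu_ring_priority_put(job->ring, s_job->s_priority);

            dma_fence_put(job->fence);
            amdgpu_sync_free(&job->sync);
            kfree(job);
    }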
Regards,
Christian.
>>
>>> Patches #20-#22 are Acked-by: Christian König
>>> <christian.koenig at amd.com>.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 28.02.2017 at 23:14, Andres Rodriguez wrote:
>>>> This patch series introduces a mechanism that allows users with
>>>> sufficient
>>>> privileges to categorize their work as "high priority". A userspace
>>>> app can
>>>> create a high priority amdgpu context, where any work submitted to
>>>> this context
>>>> will receive preferential treatment over any other work.
>>>>
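From userspace this would look roughly as follows (a sketch assuming
the priority field this series proposes for the context-create ioctl;
the constants are from the patches under review, not a released API):

    union drm_amdgpu_ctx args;
    uint32_t ctx_id = 0;

    memset(&args, 0, sizeof(args));
    args.in.op = AMDGPU_CTX_OP_ALLOC_CTX;
    /* only honored for sufficiently privileged processes */
    args.in.priority = AMDGPU_CTX_PRIORITY_HIGH;

    if (drmCommandWriteRead(fd, DRM_AMDGPU_CTX, &args,
                            sizeof(args)) == 0)
            ctx_id = args.out.alloc.ctx_id;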
>>>> High priority contexts will be scheduled ahead of other contexts by
>>>> the sw gpu
>>>> scheduler. This functionality is generic for all HW blocks.
>>>>
>>>> Optionally, a ring can implement a set_priority() function that allows
>>>> programming HW specific features to elevate a ring's priority.
>>>>
>>>> This patch series implements set_priority() for gfx8 compute rings.
>>>> It takes
>>>> advantage of SPI scheduling and CU reservation to provide improved
>>>> frame
>>>> latencies for high priority contexts.
>>>>
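As an illustration of the SPI side, the per-queue programming boils
down to something like the sketch below (the register values are
indicative only; see the actual patches for the exact sequence):

    static void gfx_v8_0_hqd_set_priority(struct amdgpu_device *adev,
                                          struct amdgpu_ring *ring,
                                          bool acquire)
    {
            /* indicative values: raised pipe priority and maximum
             * queue priority while acquired, defaults otherwise */
            uint32_t pipe_priority = acquire ? 0x2 : 0x0;
            uint32_t queue_priority = acquire ? 0xf : 0x0;

            mutex_lock(&adev->srbm_mutex);
            vi_srbm_select(adev, ring->me, ring->pipe, ring->queue, 0);

            WREG32(mmCP_HQD_PIPE_PRIORITY, pipe_priority);
            WREG32(mmCP_HQD_QUEUE_PRIORITY, queue_priority);

            vi_srbm_select(adev, 0, 0, 0, 0);
            mutex_unlock(&adev->srbm_mutex);
    }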
>>>> For compute + compute scenarios we get near perfect scheduling
>>>> latency. E.g.
>>>> one high priority ComputeParticles + one low priority
>>>> ComputeParticles:
>>>> - High priority ComputeParticles: 2.0-2.6 ms/frame
>>>> - Regular ComputeParticles: 35.2-68.5 ms/frame
>>>>
>>>> For compute + gfx scenarios the high priority compute application does
>>>> experience some latency variance. However, the variance has smaller
>>>> bounds and a smaller deviation than without high priority scheduling.
>>>>
>>>> Following is a graph of the frame time experienced by a high
>>>> priority compute
>>>> app in 4 different scenarios to exemplify the compute + gfx latency
>>>> variance:
>>>> - ComputeParticles: this scenario involves running the compute
>>>> particles
>>>> sample on its own.
>>>> - +SSAO: Previous scenario with the addition of running the
>>>> ssao sample
>>>> application that clogs the GFX ring with constant work.
>>>> - +SPI Priority: Previous scenario with the addition of SPI
>>>> priority
>>>> programming for compute rings.
>>>> - +CU Reserve: Previous scenario with the addition of dynamic CU
>>>> reservation for compute rings.
>>>>
>>>> Graph link:
>>>> https://plot.ly/~lostgoat/9/
>>>>
>>>> As seen above, high priority contexts for compute allow us to
>>>> schedule work
>>>> with enhanced confidence of completion latency under high GPU
>>>> loads. This
>>>> property will be important for VR reprojection workloads.
>>>>
>>>> Note: The first part of this series is a resend of "Change
>>>> queue/pipe split
>>>> between amdkfd and amdgpu" with the following changes:
>>>> - Fixed kfdtest on Kaveri due to shift overflow. Refer to:
>>>> "drm/amdkfdallow
>>>> split HQD on per-queue granularity v3"
>>>> - Used Felix's suggestions for a simplified HQD programming
>>>> sequence
>>>> - Added a workaround for a Tonga HW bug during HQD programming
>>>>
>>>> This series is also available at:
>>>> https://github.com/lostgoat/linux/tree/wip-high-priority
>>>>