Add support for high priority scheduling in amdgpu

Tue Feb 28 22:14:27 UTC 2017

This patch series introduces a mechanism that allows users with sufficient
privileges to categorize their work as "high priority". A userspace app can
create a high priority amdgpu context, where any work submitted to this context
will receive preferential treatment over any other work.

High priority contexts will be scheduled ahead of other contexts by the software
GPU scheduler. This functionality is generic for all HW blocks.
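
To illustrate the ordering described above, here is a minimal standalone sketch
of a scheduler that always drains a high priority run queue before a normal
one. It is not the code in this series: every name (sketch_sched, sketch_push,
sketch_pop, the two priority levels) is hypothetical and chosen only for
illustration.

```c
#include <stddef.h>

/* Hypothetical priority levels; lower value is served first. */
enum sketch_priority {
	SKETCH_PRIORITY_HIGH = 0,
	SKETCH_PRIORITY_NORMAL,
	SKETCH_PRIORITY_COUNT
};

/* A queued unit of work, linked into a FIFO run queue. */
struct sketch_job {
	int id;
	struct sketch_job *next;
};

/* One FIFO run queue per priority level. */
struct sketch_rq {
	struct sketch_job *head;
	struct sketch_job *tail;
};

struct sketch_sched {
	struct sketch_rq rq[SKETCH_PRIORITY_COUNT];
};

/* Append a job to the tail of the run queue for its priority. */
static void sketch_push(struct sketch_sched *s, enum sketch_priority prio,
			struct sketch_job *job)
{
	struct sketch_rq *rq = &s->rq[prio];

	job->next = NULL;
	if (rq->tail)
		rq->tail->next = job;
	else
		rq->head = job;
	rq->tail = job;
}

/*
 * Pick the next job: scan run queues from highest to lowest priority and
 * return the oldest job from the first non-empty one, or NULL if idle.
 */
static struct sketch_job *sketch_pop(struct sketch_sched *s)
{
	int prio;

	for (prio = 0; prio < SKETCH_PRIORITY_COUNT; prio++) {
		struct sketch_rq *rq = &s->rq[prio];
		struct sketch_job *job = rq->head;

		if (!job)
			continue;
		rq->head = job->next;
		if (!rq->head)
			rq->tail = NULL;
		job->next = NULL;
		return job;
	}
	return NULL;
}
```

In this sketch, a job pushed at high priority is returned before any normal
priority job that was queued earlier, while jobs within one level keep their
submission order.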

Optionally, a ring can implement a set_priority() function that allows
programming HW specific features to elevate a ring's priority.
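
As a rough sketch of the optional callback pattern described above, the
standalone example below models a ring function table in which set_priority
may be left NULL. Only the set_priority name comes from this series; the
struct layouts, the sketch_ prefix, the priority enum and the hw_priority
field are assumptions made for illustration, and the real implementation
programs hardware state instead of storing a value.

```c
#include <stddef.h>

/* Hypothetical priority levels for the illustration. */
enum sketch_ring_priority {
	SKETCH_RING_PRIORITY_NORMAL = 0,
	SKETCH_RING_PRIORITY_HIGH
};

struct sketch_ring;

struct sketch_ring_funcs {
	/* Optional: left NULL when the ring has no HW priority feature. */
	void (*set_priority)(struct sketch_ring *ring,
			     enum sketch_ring_priority prio);
};

struct sketch_ring {
	const struct sketch_ring_funcs *funcs;
	/* Stand-in for HW state that a real callback would program. */
	enum sketch_ring_priority hw_priority;
};

/*
 * Elevate a ring's priority if the ring provides the optional hook.
 * Returns 1 if the hook ran, 0 if the ring relies on software
 * scheduling alone.
 */
static int sketch_ring_set_priority(struct sketch_ring *ring,
				    enum sketch_ring_priority prio)
{
	if (!ring->funcs || !ring->funcs->set_priority)
		return 0;

	ring->funcs->set_priority(ring, prio);
	return 1;
}

/* Example hook for a ring that supports HW priority. */
static void sketch_compute_set_priority(struct sketch_ring *ring,
					enum sketch_ring_priority prio)
{
	ring->hw_priority = prio;
}

/* A ring type that implements the hook, and one that does not. */
static const struct sketch_ring_funcs sketch_compute_funcs = {
	.set_priority = sketch_compute_set_priority,
};

static const struct sketch_ring_funcs sketch_plain_funcs = {
	.set_priority = NULL,
};
```

Keeping the hook optional means rings without a HW priority mechanism need no
stub; in this sketch the caller simply skips them and the software scheduler
ordering still applies.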

This patch series implements set_priority() for gfx8 compute rings. It takes
advantage of SPI scheduling and CU reservation to provide improved frame
latencies for high priority contexts.

For compute + compute scenarios we get near perfect scheduling latency. E.g.
one high priority ComputeParticles + one low priority ComputeParticles:
    - High priority ComputeParticles: 2.0-2.6 ms/frame
    - Regular ComputeParticles: 35.2-68.5 ms/frame

For compute + gfx scenarios the high priority compute application does
experience some latency variance. However, the variance has smaller bounds and
a smaller deviation than without high priority scheduling.

Following is a graph of the frame time experienced by a high priority compute
app in 4 different scenarios to exemplify the compute + gfx latency variance:
    - ComputeParticles: this scenario involves running the compute particles
      sample on its own.
    - +SSAO: Previous scenario with the addition of running the ssao sample
      application that clogs the GFX ring with constant work.
    - +SPI Priority: Previous scenario with the addition of SPI priority
      programming for compute rings.
    - +CU Reserve: Previous scenario with the addition of dynamic CU
      reservation for compute rings.

Graph link:
https://plot.ly/~lostgoat/9/

As seen above, high priority contexts for compute allow us to schedule work
with enhanced confidence of completion latency under high GPU loads. This
property will be important for VR reprojection workloads.

Note: The first part of this series is a resend of "Change queue/pipe split
between amdkfd and amdgpu" with the following changes:
    - Fixed kfdtest on Kaveri due to shift overflow. Refer to:
      "drm/amdkfd: allow split HQD on per-queue granularity v3"
    - Used Felix's suggestions for a simplified HQD programming sequence
    - Added a workaround for a Tonga HW bug during HQD programming

This series is also available at:
https://github.com/lostgoat/linux/tree/wip-high-priority