[RFC] VRAM allocations with amdkfd+ttm for HSA processes

Tue Feb 24 06:56:12 PST 2015

In a nutshell:

This RFC proposes a control mechanism for VRAM (GPU local memory) pinning 
that is initiated by HSA processes. The mechanism is meant to prevent 
starvation of graphics applications due to high VRAM usage by HSA 
processes.

TOC:
----------------------------------------------------------------
1.  amdkfd's VRAM-related IOCTLs overview
2.  TTM BOs migration overview
3.  The why
4.  Analyzing the use-cases
5.  Proposed mechanism
6.  Conclusion
----------------------------------------------------------------

1. amdkfd's VRAM-related IOCTLs overview:

amdkfd provides four IOCTLs for VRAM allocation & mapping (the names below are 
presented just for convenience and may change before the final 
implementation):

  - Allocate memory on VRAM	-> AMDKFD_IOC_ALLOC_MEMORY_ON_GPU
  - Free memory on VRAM		-> AMDKFD_IOC_FREE_MEMORY_ON_GPU
  - Map memory to GPU		-> AMDKFD_IOC_MAP_MEMORY_TO_GPU
  - Unmap memory from GPU	-> AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU

An HSA process that needs to use VRAM first calls the 
AMDKFD_IOC_ALLOC_MEMORY_ON_GPU IOCTL. This IOCTL allocates a list of BOs 
(Buffer Objects) that represent the amount of memory the HSA process wanted to 
allocate. 
e.g. If a single BO represents 1MB of VRAM, then amdkfd will allocate a list 
of 100 BOs for an allocation request of 100MB of VRAM.

Before the memory can be used, the HSA process needs to call the 
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL. This IOCTL pins the relevant BOs (part or 
all of the BOs that were created in the alloc IOCTL) and updates the PT/PD of 
the GPUVM. 
e.g. In regard to the previous example, if the HSA process wants to 
dispatch a kernel that will use the last 10MB (of the 100MB it allocated), then 
amdkfd will pin the last ten BOs in the list.
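
The two examples above boil down to simple index arithmetic. Below is a 
minimal sketch of it, assuming a fixed BO size of 1MB (the value used in the 
examples, not something taken from the amdkfd code). BO_SIZE, bo_count() and 
bo_range() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed BO granularity: 1MB, as in the examples above (hypothetical). */
#define BO_SIZE (1024ULL * 1024ULL)

/* Number of BOs needed to back an allocation of `size` bytes (rounded up). */
static uint64_t bo_count(uint64_t size)
{
	return (size + BO_SIZE - 1) / BO_SIZE;
}

/*
 * Compute the inclusive index range [*first, *last] of the BOs that cover
 * the byte range [offset, offset + size) inside an allocation.
 */
static void bo_range(uint64_t offset, uint64_t size,
		     uint64_t *first, uint64_t *last)
{
	assert(size > 0);
	*first = offset / BO_SIZE;
	*last = (offset + size - 1) / BO_SIZE;
}
```

With these helpers, a 100MB allocation yields 100 BOs, and mapping the last 
10MB (offset 90MB, size 10MB) selects BOs 90 through 99.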

After the GPU kernel has finished using the memory, the HSA process needs to 
call the AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL. This IOCTL unpins the BOs and 
updates the PT/PD of the GPUVM.

If the HSA process wants to dispatch another GPU kernel which will use the 
same memory, then it can again call the AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL. 
After the kernel finishes, the HSA process needs to call the 
AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL.

Finally, when the memory is no longer needed, the HSA process needs to call 
the AMDKFD_IOC_FREE_MEMORY_ON_GPU IOCTL. This IOCTL destroys the BOs. This 
action will also be performed on process tear-down.

The important point to remember is that once the HSA process calls the 
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL and amdkfd pins a list of BOs, then from 
amdkfd's POV, those BOs are in use and must not be unpinned & moved, even if 
they are currently idle (not used by a GPU kernel).
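
The life cycle described in this section can be summarized as a small state 
machine. The sketch below is only an illustration: the state and operation 
names are hypothetical, and rejecting a free while the memory is still mapped 
is an assumption of the sketch, not a decision made in this RFC:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-allocation states; names are illustrative only. */
enum bo_state { BO_NONE, BO_ALLOCATED, BO_MAPPED };

/* One operation per IOCTL listed above. */
enum bo_op { OP_ALLOC, OP_MAP, OP_UNMAP, OP_FREE };

/*
 * Apply one operation. Returns true and updates *state if the transition
 * is allowed, false (state unchanged) otherwise.
 */
static bool bo_apply(enum bo_state *state, enum bo_op op)
{
	assert(state != NULL);
	switch (op) {
	case OP_ALLOC:
		if (*state != BO_NONE)
			return false;
		*state = BO_ALLOCATED;	/* BOs created, not pinned */
		return true;
	case OP_MAP:
		if (*state != BO_ALLOCATED)
			return false;
		*state = BO_MAPPED;	/* BOs pinned, PT/PD updated */
		return true;
	case OP_UNMAP:
		if (*state != BO_MAPPED)
			return false;
		*state = BO_ALLOCATED;	/* BOs unpinned, PT/PD updated */
		return true;
	case OP_FREE:
		/* Assumption: freeing while mapped is rejected. */
		if (*state != BO_ALLOCATED)
			return false;
		*state = BO_NONE;	/* BOs destroyed */
		return true;
	}
	return false;
}

/* From amdkfd's POV, only unmapped (unpinned) BOs may be moved. */
static bool bo_may_move(enum bo_state state)
{
	return state == BO_ALLOCATED;
}
```

The point of bo_may_move() is that it looks only at the mapped state and 
knows nothing about whether a GPU kernel is actually running.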

2. TTM BOs migration overview:

For those unfamiliar with TTM, here is a short overview regarding migration of 
BOs in TTM (Note, this is a simplistic overview):

Every BO has a reservation point (fence) attached to it. When the GPU 
has finished working with that BO, it writes to its resv. point to signal that 
the work is done and the BO is now idle. To enable this mechanism, the 
graphics driver (radeon) dispatches a fence packet after each CS.

TTM maintains an LRU list of BOs. All the BOs are on that list, regardless of 
whether they are in use or idle, pinned or unpinned. When TTM encounters a 
memory pressure situation (e.g. it tries to pin a BO on VRAM but does not have 
enough space), it walks over the LRU list and tries to evict BOs that are 
placed in VRAM *and* are idle (meaning that they can be migrated to GART or 
system memory) until it has enough space for the new request.

How does TTM find out whether a BO is idle? It checks its reservation point. 
If it is signaled, then the BO is idle and can be migrated. If not, that BO is 
still in use. The check is done in two stages. First, TTM does a simple check 
that asks whether the fence is signaled. This check is called in atomic 
context, so the device driver can't block. The second check is the 
wait_until_signaled, and that function can block, but there is a timeout 
enforced by TTM.

What is a reservation point? It is a generic Linux kernel mechanism to 
allow sharing of fences between different device drivers. In our case, TTM 
assigns a reservation point to every BO. When TTM checks the BO's reservation 
point, it actually calls a callback function of that resv. point that tells it 
if the resv. point's fence has been signaled.

The callback function is implemented by the entity using the BO, e.g. the 
radeon driver. When that callback is called, radeon needs to respond whether 
that BO 
is idle or not. radeon has that information because it dispatches a fence 
packet after each CS. That way, when the GPU kernel has finished, the GPU 
handles the fence packet and writes to that fence. When radeon checks if a BO 
is idle, it actually checks if its fence has been written to by the GPU.
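
To make the callback idea concrete, here is a minimal sketch of a fence check. 
It assumes a sequence-number scheme (each CS gets an increasing number and the 
GPU writes back the last number it completed). The text above only says that 
the GPU writes to the fence, so the scheme itself and all the names 
(fence_ctx, fence, gpu_write_fence, fence_is_signaled) are assumptions made 
for illustration, not the actual radeon or kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory location that the (simulated) GPU writes when it hits a fence. */
struct fence_ctx {
	uint64_t last_completed;
};

/* A fence emitted after a CS: "signaled once seqno has been reached". */
struct fence {
	const struct fence_ctx *ctx;
	uint64_t seqno;
};

/* Simulates the GPU handling a fence packet: record the completed seqno. */
static void gpu_write_fence(struct fence_ctx *ctx, uint64_t seqno)
{
	assert(ctx != NULL);
	if (seqno > ctx->last_completed)
		ctx->last_completed = seqno;
}

/*
 * The "is it signaled?" callback. It only reads memory, so it never
 * blocks and is safe to call from atomic context.
 */
static bool fence_is_signaled(const struct fence *f)
{
	assert(f != NULL && f->ctx != NULL);
	return f->ctx->last_completed >= f->seqno;
}
```

A BO whose fence has seqno 3 is reported as in use until the GPU has written 
3 (or a later number) to the context.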

Now, back to the migration process. If the BO is in use, TTM just moves to the 
next BO on the LRU list. If the BO is idle, TTM migrates it to GART or system 
memory to clear space for the new BO. 

If there is not enough memory for the new request after passing over the entire 
LRU list, TTM fails the new BO validation request.
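
The walk over the LRU list can be sketched as follows. This is a deliberately 
simplified model and not TTM code: the LRU list is a plain array ordered from 
least to most recently used, the idle flag stands in for the reservation 
point check, and fragmentation is ignored. All names (sim_bo, make_room, 
PL_VRAM, PL_GTT) are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum placement { PL_VRAM, PL_GTT };

struct sim_bo {
	uint64_t size;
	enum placement place;
	bool idle;	/* stands in for the reservation point check */
};

/*
 * Walk the LRU array and evict idle VRAM BOs until at least `needed`
 * bytes are free. Returns true if the new request can be satisfied,
 * false if the whole list was walked without freeing enough space.
 */
static bool make_room(struct sim_bo *lru, size_t n,
		      uint64_t *free_vram, uint64_t needed)
{
	assert(free_vram != NULL);
	assert(lru != NULL || n == 0);

	for (size_t i = 0; i < n && *free_vram < needed; i++) {
		struct sim_bo *bo = &lru[i];

		/* In use, or not in VRAM at all: move to the next BO. */
		if (bo->place != PL_VRAM || !bo->idle)
			continue;

		/* Idle: migrate it out of VRAM to clear space. */
		bo->place = PL_GTT;
		*free_vram += bo->size;
	}
	return *free_vram >= needed;
}
```

Note that a BO which always reports "in use" (as a pinned HSA BO effectively 
does today) is skipped on every walk, which is exactly the starvation problem 
described in the next section.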

3. The why:

HSA userspace applications sometimes need to use VRAM (GPU local memory) for 
their operation. This is especially true when running on discrete GPUs, 
which have high-bandwidth local memory. 

Because current AMD GPUs don't support page faults in VRAM, the HSA application 
needs to pin its allocated memory in VRAM before dispatching the GPU kernel. 

To allocate and pin the VRAM, HSA applications call amdkfd's IOCTLs that use 
the TTM subsystem to allocate and pin BOs on VRAM.

Up until now, this is similar to a graphics application allocating memory on 
VRAM through radeon. However, in radeon, the CS is done through the driver's 
IOCTL. Therefore, the radeon driver can put a fence packet after every CS, 
which lets TTM know whether a BO is currently in use by a CS. 

In contrast, in HSA the CS is done through usermode queues. Because of that, 
amdkfd can *not* put a fence packet after each CS, and of course we 
can't trust the userspace to do it. Therefore, the Linux kernel has *no* 
visibility into whether a BO is currently in use. 

This creates a problem when dealing with memory pressure on a system that 
runs both HSA applications and VRAM-consuming graphics applications. When 
memory pressure occurs due to VRAM allocation requests from graphics 
applications, the graphics CS can fail because HSA BOs are pinned in VRAM and 
can't be swapped out to GART/system memory, even if the BOs are currently 
idle. In addition, an HSA-only system can also experience memory pressure due 
to fragmentation in the VRAM.

4.  Analyzing the use-cases

The following describes different scenarios of system behavior regarding 
VRAM usage:

- Graphics needs a buffer in a specific range (several cases for that). This 
  means that *all* VRAM allocations must be evicted, no matter what (including 
  HSA).

- Graphics is to be prioritized over HSA (e.g. desktop computer case). All 
  graphics allocations take precedence over HSA. i.e. HSA must always yield 
  to TTM asking to evict BOs.

- Graphics is not important or not even existent (e.g. server). Then, HSA 
  eviction can fail. However, even in this case there might still be a VRAM 
  fragmentation problem that will prevent HSA pinning.

5.  Proposed mechanism

The proposed mechanism is composed of two parts:

- Policy set by the system admin
- Allowing the TTM to evict HSA BOs

5.a. Policy

Because we need to support different scenarios as described above, I suggest 
to give the system admin the ability to select the VRAM usage policy. This 
selection will dictate the behavior of amdkfd in this regard.

The policy could be one of the following options:
- VRAM usage: prefer graphics applications
- VRAM usage: prefer HSA applications

When the first option is chosen (prefer graphics), upon *each* request to 
evict a BO from VRAM, amdkfd will respond as if the BO is idle.

When the second option is chosen (prefer HSA), upon *each* request to 
evict a BO from VRAM, amdkfd will respond as if the BO is in use.

Because this is a new policy that we might want to tweak in the future, I think 
that it should currently be accessed only through debugfs. Once things are 
mature enough and people feel confident in it, this policy can be turned 
into either a kernel parameter or a sysfs attribute or both.

The default policy, IMO, should be "prefer graphics applications".

Note that even with the policy set to "prefer graphics", we must not evict 
the BOs of the PT/PD.
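
Put together, the answer amdkfd gives to TTM depends only on the policy and on 
whether the BO belongs to the PT/PD. A minimal sketch of that decision, with 
hypothetical enum and function names (nothing here is an existing amdkfd 
interface):

```c
#include <assert.h>
#include <stdbool.h>

/* The two policies proposed above. */
enum vram_policy { VRAM_PREFER_GRAPHICS, VRAM_PREFER_HSA };

/* Regular HSA data BOs vs. BOs that hold the GPUVM PT/PD. */
enum hsa_bo_kind { HSA_BO_DATA, HSA_BO_PAGE_TABLE };

/*
 * What amdkfd would answer when TTM asks whether an HSA BO is idle.
 * true  -> "idle", TTM may evict the BO
 * false -> "in use", TTM moves on to the next BO on the LRU list
 */
static bool hsa_bo_report_idle(enum vram_policy policy,
			       enum hsa_bo_kind kind)
{
	assert(policy == VRAM_PREFER_GRAPHICS || policy == VRAM_PREFER_HSA);

	/* PT/PD BOs must never be evicted, whatever the policy is. */
	if (kind == HSA_BO_PAGE_TABLE)
		return false;

	return policy == VRAM_PREFER_GRAPHICS;
}
```

In the sketch the answer does not depend on what the GPU is doing, which 
mirrors the "as if" wording above: the policy decides, not the actual state 
of the BO.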

5.b. Eviction process

To allow TTM to evict a BO from VRAM, amdkfd effectively needs to preempt 
a running usermode queue. On Carrizo we can preempt a queue whenever we want.
However, when using Kaveri we could run into problems when trying to preempt 
a queue. 
The problems can appear in the case where a shader takes a very long 
time to complete (hundreds of ms), or in the rare case where someone wrote 
an infinite shader (bug or otherwise). In those cases, Kaveri will fail to 
preempt the queue, amdkfd will indicate a failure (dmesg) and the CP 
will probably be stuck.

In those cases, the only option left for the driver is to perform an operation
called "kill all waves". This would terminate all the running waves and allow 
the CP to preempt the queues. 

In addition, the BOs that are created need to have the callback function of 
their resv. point set to amdkfd's. However, for the BOs of the PT/PD, we need 
to set a different callback function so that we can prevent the eviction of 
those BOs.

The suggested algorithm for eviction is (in case policy is to prefer graphics):

  - TTM calls amdkfd callback, asking if a BO is idle
  - amdkfd preempts user space queue and removes it from run-list
	- in case the preemption is stuck, amdkfd kills all waves.
  - amdkfd tells TTM that the BO is idle
  - TTM evicts the buffer to GART
  - amdkfd updates GPUVM page table and does all necessary TLB flushing
  - amdkfd restores user space queue
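
The steps above can be sketched as a small simulation. It is only meant to 
show the ordering and the "kill all waves" fallback. The structs and function 
names are hypothetical, the hardware is reduced to a few flags, and the 
assumption that preemption always succeeds after the waves are killed comes 
from the description above, not from testing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sim_queue {
	bool on_runlist;
	bool shader_stuck;	/* very long or infinite shader */
	bool waves_killed;
};

struct sim_hsa_bo {
	bool in_vram;
	bool pt_up_to_date;	/* GPUVM page table matches placement */
};

/* Normal preemption: fails while a shader refuses to finish. */
static bool preempt_queue(struct sim_queue *q)
{
	if (q->shader_stuck)
		return false;
	q->on_runlist = false;
	return true;
}

/* Last resort: terminate all running waves so the CP can preempt. */
static void kill_all_waves(struct sim_queue *q)
{
	q->shader_stuck = false;
	q->waves_killed = true;
}

/* TTM asked "is the BO idle?": preempt first, then answer. */
static bool hsa_bo_make_idle(struct sim_queue *q)
{
	assert(q != NULL);
	if (!preempt_queue(q)) {
		kill_all_waves(q);
		return preempt_queue(q);
	}
	return true;
}

/* The remaining steps, in order, once the BO was reported idle. */
static void evict_and_resume(struct sim_queue *q, struct sim_hsa_bo *bo)
{
	assert(q != NULL && bo != NULL);
	assert(!q->on_runlist);		/* queue must be preempted */

	bo->in_vram = false;		/* TTM evicts the buffer to GART */
	bo->pt_up_to_date = false;	/* mapping is stale after the move */
	bo->pt_up_to_date = true;	/* amdkfd updates PT, flushes TLB */
	q->on_runlist = true;		/* amdkfd restores the queue */
}
```

In a real driver the eviction itself is done by TTM and the other steps by 
amdkfd. They are folded into one function here only to keep the ordering 
visible.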

6.  Conclusion

The current status of the code is that the four IOCTLs mentioned in 
section 1 are partially implemented. The mechanism described here is not 
implemented yet as I first wanted to get some response.

So although part of the code is ready, I would like to publish the patches 
as a single patch-set.

I would like to thank RH's Jerome Glisse for helping me with this RFC.

Comments and flames are welcome.

Thanks,
	Oded