[RFC] VRAM allocations with amdkfd+ttm for HSA processes
Oded Gabbay
oded.gabbay at amd.com
Tue Feb 24 06:56:12 PST 2015
In a nutshell:
This RFC proposes a control mechanism for the pinning of VRAM (GPU local
memory) that is initiated by HSA processes. The mechanism is intended to
prevent starvation of graphics applications due to high VRAM usage by HSA
processes.
TOC:
----------------------------------------------------------------
1. amdkfd's VRAM-related IOCTLs overview
2. TTM BOs migration overview
3. The why
4. Analyzing the use-cases
5. Proposed mechanism
6. Conclusion
----------------------------------------------------------------
1. amdkfd's VRAM-related IOCTLs overview:
amdkfd provides four IOCTLs for VRAM allocation & mapping (the names below are
presented just for convenience and may change before the final
implementation):
- Allocate memory on VRAM -> AMDKFD_IOC_ALLOC_MEMORY_ON_GPU
- Free memory on VRAM -> AMDKFD_IOC_FREE_MEMORY_ON_GPU
- Map memory to GPU -> AMDKFD_IOC_MAP_MEMORY_TO_GPU
- Unmap memory from GPU -> AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU
An HSA process that needs to use VRAM first calls the
AMDKFD_IOC_ALLOC_MEMORY_ON_GPU IOCTL. This IOCTL allocates a list of BOs
(Buffer Objects) that represent the amount of memory the HSA process wanted to
allocate.
e.g. If a single BO represents 1MB of VRAM, then amdkfd will allocate a list
of 100 BOs for an allocation request of 100MB of VRAM.
Before the memory can be used, the HSA process needs to call the
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL. This IOCTL pins the relevant BOs (part or
all of the BOs that were created in the alloc IOCTL) and updates the PT/PD of
the GPUVM.
e.g. In regard to the previous example, if the HSA process wants to
dispatch a kernel that will use the last 10MB (of the 100MB it allocated), then
amdkfd will pin the last ten BOs in the list.
After the GPU kernel has finished using the memory, the HSA process needs to
call the AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL. This IOCTL unpins the BOs and
updates the PT/PD of the GPUVM.
If the HSA process wants to dispatch another GPU kernel which will use the
same memory, then it can again call the AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL.
After the kernel finishes, the HSA process needs to call the
AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL.
Finally, when the memory has no more use, the HSA process needs to call the
AMDKFD_IOC_FREE_MEMORY_ON_GPU IOCTL. This IOCTL destroys the BOs. This action
will also be performed on process tear-down.
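The lifecycle above can be illustrated with a toy user-space model. This is
only a sketch in Python, not amdkfd code: the names BO_SIZE_MB, Bo and
VramAllocation are hypothetical, and the 1MB-per-BO size is just the
assumption used in the example above.

```python
from dataclasses import dataclass, field

BO_SIZE_MB = 1  # assumption: one BO represents 1MB of VRAM, as in the example


@dataclass
class Bo:
    index: int
    pinned: bool = False


@dataclass
class VramAllocation:
    bos: list = field(default_factory=list)

    @classmethod
    def alloc(cls, size_mb):
        """Models AMDKFD_IOC_ALLOC_MEMORY_ON_GPU: one BO per BO_SIZE_MB."""
        count = -(-size_mb // BO_SIZE_MB)  # ceiling division
        return cls(bos=[Bo(i) for i in range(count)])

    def _covering(self, offset_mb, size_mb):
        """Return the BOs that back the range [offset_mb, offset_mb + size_mb)."""
        first = offset_mb // BO_SIZE_MB
        last = -(-(offset_mb + size_mb) // BO_SIZE_MB)
        return self.bos[first:last]

    def map_to_gpu(self, offset_mb, size_mb):
        """Models AMDKFD_IOC_MAP_MEMORY_TO_GPU: pin only the BOs in the range."""
        for bo in self._covering(offset_mb, size_mb):
            bo.pinned = True

    def unmap_from_gpu(self, offset_mb, size_mb):
        """Models AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU: unpin the BOs in the range."""
        for bo in self._covering(offset_mb, size_mb):
            bo.pinned = False

    def free(self):
        """Models AMDKFD_IOC_FREE_MEMORY_ON_GPU: destroy all BOs."""
        self.bos.clear()

    def pinned_indices(self):
        return [bo.index for bo in self.bos if bo.pinned]
```

With this model, allocating 100MB and mapping the last 10MB (offset 90,
size 10) pins only the BOs with indices 90 to 99. The PT/PD update of the
GPUVM is deliberately left out of the sketch.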
The important point to remember is that once the HSA process calls the
AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL and amdkfd pins a list of BOs, then from
amdkfd's POV, those BOs are in use and must not be unpinned & moved, even if
they are currently idle (not used by a GPU kernel).
2. TTM BOs migration overview:
For those unfamiliar with TTM, here is a short overview regarding migration of
BOs in TTM (Note, this is a simplistic overview):
Every BO has a reservation point (fence) attached to it. When the GPU
has finished working with that BO, it writes to its resv. point to signal the
work has been done and the BO is now idle. To enable this mechanism, the
graphic driver (radeon) dispatches a fence packet after each CS.
TTM maintains an LRU list of BOs. All the BOs are on that list, regardless of
whether they are in use or idle, pinned or unpinned. When TTM encounters a
memory pressure situation (e.g. it tries to pin a BO on VRAM but does not have
enough space), it walks over the LRU list and tries to evict BOs that are
placed in VRAM *and* are idle (meaning that they can be migrated to GART or
system memory) until it has enough space for the new request.
How does TTM find out whether a BO is idle? It checks the BO's reservation
point. If it is signaled, then the BO is idle and can be migrated. If not, the
BO is still in use. The check is done in two stages. First, TTM does a simple
check of whether the fence is signaled. This check is called in atomic
context, so the device driver can't block in it. The second check is
wait_until_signaled. That function can block, but there is a timeout
enforced by TTM.
What is a reservation point? It is a generic Linux kernel mechanism to
allow sharing of fences between different device drivers. In our case, TTM
assigns a reservation point to every BO. When TTM checks the BO's reservation
point, it actually calls a callback function of that resv. point that tells it
if the resv. point's fence has been signaled.
The callback function is implemented by the entity using the BO, e.g. the
radeon driver. When that callback is called, radeon needs to respond whether
that BO
is idle or not. radeon has that information because it dispatches a fence
packet after each CS. That way, when the GPU kernel has finished, the GPU
handles the fence packet and writes to that fence. When radeon checks if a BO
is idle, it actually checks if its fence has been written to by the GPU.
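The two-stage idle check can be sketched with a toy fence. This is a
user-space Python illustration under stated assumptions, not the kernel's
fence API: the Fence class and bo_is_idle() are hypothetical names, and a
threading.Event stands in for the memory location the GPU writes to.

```python
import threading


class Fence:
    """Toy stand-in for the fence behind a BO's reservation point."""

    def __init__(self):
        self._event = threading.Event()

    def signal(self):
        # Models the GPU handling the fence packet and writing to the fence.
        self._event.set()

    def is_signaled(self):
        # Stage 1: cheap, non-blocking check (the atomic-context check).
        return self._event.is_set()

    def wait(self, timeout_s):
        # Stage 2: may block, but only up to the timeout enforced by the caller.
        return self._event.wait(timeout_s)


def bo_is_idle(fence, timeout_s):
    """Return True if the BO guarded by `fence` can be treated as idle."""
    if fence.is_signaled():
        return True
    return fence.wait(timeout_s)
```

A fence that has never been signaled makes bo_is_idle() return False once the
timeout expires. After signal() is called, the first stage already returns
True and nothing blocks.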
Now, back to the migration process. If the BO is in use, TTM just moves to the
next BO on the LRU list. If the BO is idle, TTM migrates it to GART or system
memory to clear space for the new BO.
If there is not enough memory for the new request after passing over the entire
LRU list, TTM fails the new BO validation request.
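The LRU walk described above can be modeled as follows. This is a simplified
Python sketch, not TTM code: Bo, make_room() and the field names are
hypothetical, sizes are in MB, and the model assumes that pinned BOs are
skipped in the same way as BOs that are in use.

```python
from dataclasses import dataclass


@dataclass
class Bo:
    name: str
    size_mb: int
    placement: str = "VRAM"  # "VRAM" or "GART" in this toy model
    idle: bool = True
    pinned: bool = False


def make_room(lru, vram_total_mb, request_mb):
    """Walk the LRU list, evicting idle, unpinned VRAM BOs until the request fits.

    `lru` is ordered from least to most recently used.
    Returns True if the request fits afterwards, False if validation would fail.
    """

    def free_mb():
        used = sum(bo.size_mb for bo in lru if bo.placement == "VRAM")
        return vram_total_mb - used

    for bo in lru:
        if free_mb() >= request_mb:
            return True
        if bo.placement != "VRAM" or bo.pinned or not bo.idle:
            continue  # in use, pinned, or not in VRAM: move to the next BO
        bo.placement = "GART"  # migrate the idle BO out of VRAM
    return free_mb() >= request_mb
```

In this model a pinned HSA BO is never migrated, so a request can fail even
when most of VRAM is held by BOs that no GPU kernel is currently using.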
3. The why:
HSA userspace applications sometimes need to use VRAM (GPU local memory) for
their operation. This is especially true when running on discrete GPUs,
which have high-bandwidth on-board memory.
Because current AMD GPUs don't support page faults in VRAM, the HSA application
needs to pin its allocated memory in VRAM before dispatching the GPU kernel.
To allocate and pin the VRAM, HSA applications call amdkfd's IOCTLs that use
the TTM subsystem to allocate and pin BOs on VRAM.
Up until now, this is similar to a graphic application allocating memory on
VRAM through radeon. However, in radeon, the CS is done through the driver's
IOCTL. Therefore, the radeon driver can put a fence packet after every CS to
enable the TTM to know if a BO is currently in use by a CS.
In contrast, in HSA the CS is done through usermode queues. For that
reason amdkfd can *not* put a fence packet after each CS, and of course we
can't trust the userspace to do it. Therefore, the Linux kernel does *not*
have visibility into whether a BO is currently in use or not.
This creates a problem when dealing with memory pressure on a system that
runs both HSA applications and VRAM-consuming graphic applications. When memory
pressure occurs due to VRAM allocation requests from graphics applications,
the graphic CS can fail because HSA BOs are pinned in VRAM and can't be swapped
out to GART/System memory, even if the BOs are currently idle. In addition,
there can also be a situation where an HSA-only system has memory pressure due
to fragmentation in the VRAM.
4. Analyzing the use-cases
The following describes different scenarios of system behavior regarding
VRAM usage:
- Graphics needs a buffer in a specific range (several cases for that). This
means that *all* VRAM allocations must be evicted, no matter what (including
HSA).
- Graphics is to be prioritized over HSA (e.g. desktop computer case). All
graphics allocations take precedence over HSA. i.e. HSA must always yield
to TTM asking to evict BOs.
- Graphics is not important or not even existent (e.g. server). Then, HSA
eviction can fail. However, even in this case there might still be a VRAM
fragmentation problem that will prevent HSA pinning.
5. Proposed mechanism
The proposed mechanism is composed of two parts:
- Policy set by the system admin
- Allowing the TTM to evict HSA BOs
5.a. Policy
Because we need to support different scenarios as described above, I suggest
giving the system admin the ability to select the VRAM usage policy. This
selection will dictate the behavior of amdkfd in this regard.
The policy could be one of the following options:
- VRAM usage: prefer graphics applications
- VRAM usage: prefer HSA applications
When the first option is chosen (prefer graphics), upon *each* request to
evict a BO from VRAM, amdkfd will respond as if the BO is idle.
When the second option is chosen (prefer HSA), upon *each* request to
evict a BO from VRAM, amdkfd will respond as if the BO is in use.
Because this is a new policy that we might want to tweak in the future, I think
that it should currently be accessed only through debugfs. Once things are
mature enough and people feel confident in it, this policy can be turned
into either a kernel parameter or a sysfs attribute, or both.
The default policy, IMO, should be "prefer graphics applications".
Note that even with the policy set to "prefer graphics", we must not evict
the BOs of the PT/PD.
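The policy decision can be summarized in a few lines. The following is a
hedged Python sketch and not a proposed interface: VramPolicy,
DEFAULT_POLICY and report_idle() are hypothetical names chosen for
illustration only.

```python
from enum import Enum


class VramPolicy(Enum):
    PREFER_GRAPHICS = "prefer graphics applications"
    PREFER_HSA = "prefer HSA applications"


# The default suggested in this RFC.
DEFAULT_POLICY = VramPolicy.PREFER_GRAPHICS


def report_idle(policy, is_page_table_bo):
    """Answer an eviction request for an HSA BO.

    True means "respond as if the BO is idle" (it may be evicted),
    False means "respond as if the BO is in use".
    """
    if is_page_table_bo:
        # PT/PD BOs must never be evicted, whatever the policy is.
        return False
    return policy is VramPolicy.PREFER_GRAPHICS
```

Under the default policy, report_idle() returns True for an ordinary HSA BO
and False for a PT/PD BO. Under "prefer HSA" it returns False for both.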
5.b. Eviction process
To allow TTM to evict a BO from VRAM, amdkfd effectively needs to preempt
a running usermode queue. On Carrizo we can preempt a queue whenever we want.
However, when using Kaveri we could run into problems when trying to preempt
a queue.
The problems can appear in the case where a shader takes a very long
time to complete (hundreds of ms), or in the rare case where someone wrote
an infinite shader (bug or otherwise). In those cases, Kaveri will fail to
preempt the queue, amdkfd will indicate a failure (dmesg) and the CP
will probably be stuck.
In those cases, the only option left for the driver is to perform an operation
called "kill all waves". This would terminate all the running waves and allow
the CP to preempt the queues.
In addition, the BOs that are created need to set the callback function of
the resv. point to amdkfd. However, for the BOs of the PT/PD, we need to set
a different callback function so that we can prevent the eviction of those BOs.
The suggested algorithm for eviction is (in case policy is to prefer graphics):
- TTM calls amdkfd callback, asking if a BO is idle
- amdkfd preempts user space queue and removes it from run-list
- in case the preemption is stuck, amdkfd kills all waves
- amdkfd tells TTM that the BO is idle
- TTM evicts the buffer to GART
- amdkfd updates GPUVM page table and does all necessary TLB flushing
- amdkfd restores user space queue
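The steps of the suggested algorithm can be traced with a toy model. This is
a Python sketch under stated assumptions and not driver code: UserQueue, Bo
and evict_hsa_bo() are hypothetical names, a queue that cannot be preempted
stands for a long-running or infinite shader, and the page table update and
TLB flush are only recorded in a log.

```python
from dataclasses import dataclass


@dataclass
class UserQueue:
    preemptible: bool = True  # False models a long-running or infinite shader
    on_run_list: bool = True
    waves_killed: bool = False

    def preempt(self):
        """Try to preempt the queue; returns False if the preemption is stuck."""
        return self.preemptible

    def kill_all_waves(self):
        """Terminate all running waves so that preemption can succeed."""
        self.waves_killed = True
        self.preemptible = True


@dataclass
class Bo:
    placement: str = "VRAM"


def evict_hsa_bo(bo, queue, log):
    """Trace the eviction steps for the 'prefer graphics' policy."""
    log.append("ttm: asks amdkfd whether the BO is idle")
    if not queue.preempt():
        queue.kill_all_waves()
        log.append("amdkfd: kill all waves")
        if not queue.preempt():
            raise RuntimeError("queue still stuck after killing all waves")
    queue.on_run_list = False
    log.append("amdkfd: queue preempted and removed from run-list")
    log.append("amdkfd: tells TTM the BO is idle")
    bo.placement = "GART"
    log.append("ttm: BO evicted to GART")
    log.append("amdkfd: GPUVM page table updated, TLB flushed")
    queue.on_run_list = True
    log.append("amdkfd: queue restored")
    return log
```

In the normal case the log contains no "kill all waves" entry. When the queue
starts out as not preemptible, the waves are killed first and the rest of the
sequence is unchanged.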
6. Conclusion
The current status of the code is that the four IOCTLs mentioned in
point 1 are partially implemented. The mechanism described here is not
implemented yet as I first wanted to get some response.
So although part of the code is ready, I would like to publish the patches
as a single patch-set.
I would like to thank RH's Jerome Glisse for helping me with this RFC.
Comments and flames are welcome.
Thanks,
Oded