[RFC] VRAM allocations with amdkfd+ttm for HSA processes

Oded Gabbay oded.gabbay at amd.com
Tue Mar 3 01:38:59 PST 2015



On 02/24/2015 04:56 PM, Oded Gabbay wrote:
> In a nutshell:
>
> This RFC proposes a control mechanism for VRAM (GPU local memory) memory
> pinning that is initiated by HSA processes. This control mechanism is proposed
> in order to prevent starvation of graphic applications due to high VRAM usage
> by HSA processes.
>
> TOC:
> ----------------------------------------------------------------
> 1.  amdkfd's VRAM-related IOCTLs overview
> 2.  TTM BOs migration overview
> 3.  The why
> 4.  Analyzing the use-cases
> 5.  Proposed mechanism
> 6.  Conclusion
> ----------------------------------------------------------------
>
> 1. amdkfd's VRAM-related IOCTLs overview:
>
> amdkfd provides four IOCTLs for VRAM allocation & mapping (the names below are
> shown just for convenience and may change before the final implementation):
>
>    - Allocate memory on VRAM   -> AMDKFD_IOC_ALLOC_MEMORY_ON_GPU
>    - Free memory on VRAM       -> AMDKFD_IOC_FREE_MEMORY_ON_GPU
>    - Map memory to GPU         -> AMDKFD_IOC_MAP_MEMORY_TO_GPU
>    - Unmap memory from GPU     -> AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU
>
> An HSA process that needs to use VRAM first calls the
> AMDKFD_IOC_ALLOC_MEMORY_ON_GPU IOCTL. This IOCTL allocates a list of BOs
> (Buffer Objects) that represents the amount of memory the HSA process
> requested.
> e.g. If a single BO represents 1MB of VRAM, then amdkfd will allocate a list
> of 100 BOs for an allocation request of 100MB of VRAM.
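>
> To make that concrete, the chunking step could look roughly like the sketch
> below. The 1MB granularity, struct kfd_bo and kfd_create_vram_bo() are all
> illustrative names, not the actual amdkfd code:
>
> #include <linux/kernel.h>
> #include <linux/list.h>
>
> /* Illustrative sketch only: split a VRAM allocation request into a list
>  * of fixed-size BOs.  Error unwinding is omitted for brevity. */
> static int alloc_vram_bos(struct kfd_process *p, u64 size,
>                           struct list_head *bo_list)
> {
>         const u64 chunk = 1ULL << 20;           /* 1MB per BO */
>         u64 i, n = DIV_ROUND_UP(size, chunk);
>
>         for (i = 0; i < n; i++) {
>                 struct kfd_bo *bo = kfd_create_vram_bo(p, chunk);
>
>                 if (!bo)
>                         return -ENOMEM;
>                 list_add_tail(&bo->node, bo_list);
>         }
>         return 0;               /* e.g. 100 BOs for a 100MB request */
> }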
>
> Before the memory can be used, the HSA process needs to call the
> AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL. This IOCTL pins the relevant BOs (part or
> all of the BOs that were created in the alloc IOCTL) and updates the PT/PD of
> the GPUVM.
> e.g. In regard to the previous example, if the HSA process wants to
> dispatch a kernel that will use the last 10MB (of the 100MB it allocated), then
> amdkfd will pin the last ten BOs in the list.
>
> After the GPU kernel has finished using the memory, the HSA process needs to
> call the AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL. This IOCTL unpins the BOs and
> updates the PT/PD of the GPUVM.
>
> If the HSA process wants to dispatch another GPU kernel that will use the
> same memory, then it can call the AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL again.
> After the kernel finishes, the HSA process needs to call the
> AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU IOCTL.
>
> Finally, when the memory is no longer needed, the HSA process needs to call
> the AMDKFD_IOC_FREE_MEMORY_ON_GPU IOCTL. This IOCTL destroys the BOs. This
> action is also performed on process tear-down.
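>
> Putting the four IOCTLs together, a typical userspace flow would look roughly
> like the sketch below. The argument structs, field names and IOCTL numbers
> are placeholders (the interface is not final), kfd_fd is an open fd on
> /dev/kfd, and error checking is omitted:
>
> /* Hypothetical userspace flow; not the final interface. */
> struct kfd_ioctl_alloc_memory_args alloc = {
>         .gpu_id = gpu_id,
>         .size   = 100ULL << 20,                 /* 100MB of VRAM */
> };
> ioctl(kfd_fd, AMDKFD_IOC_ALLOC_MEMORY_ON_GPU, &alloc);
>
> /* Pin and map only what the next dispatch needs (the last 10MB). */
> struct kfd_ioctl_map_memory_args map = {
>         .handle = alloc.handle,
>         .offset = 90ULL << 20,
>         .size   = 10ULL << 20,
> };
> ioctl(kfd_fd, AMDKFD_IOC_MAP_MEMORY_TO_GPU, &map);
>
> dispatch_kernel_via_usermode_queue();           /* no kernel driver involved */
>
> ioctl(kfd_fd, AMDKFD_IOC_UNMAP_MEMORY_FROM_GPU, &map);
>
> /* Map/unmap can be repeated; when done with the memory, free it. */
> ioctl(kfd_fd, AMDKFD_IOC_FREE_MEMORY_ON_GPU, &alloc);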
>
> The important point to remember is that once the HSA process calls the
> AMDKFD_IOC_MAP_MEMORY_TO_GPU IOCTL and amdkfd pins a list of BOs, then from
> amdkfd's POV, those BOs are in use and must not be unpinned & moved, even if
> they are currently idle (not used by a GPU kernel).
>
> 2. TTM BOs migration overview:
>
> For those unfamiliar with TTM, here is a short overview of how BOs are
> migrated in TTM (note that this is a simplified overview):
>
> Every BO has a reservation point (fence) attached to it. When the GPU
> has finished working with that BO, it writes to its resv. point to signal the
> work has been done and the BO is now idle. To enable this mechanism, the
> graphics driver (radeon) dispatches a fence packet after each CS.
>
> TTM maintains an LRU list of BOs. All the BOs are on that list, regardless of
> whether they are in use or idle, pinned or unpinned. When TTM encounters a
> memory pressure situation (e.g. it tries to pin a BO in VRAM but does not
> have enough space), it walks over the LRU list and tries to evict BOs that
> are placed in VRAM *and* are idle (meaning that they can be migrated to GART
> or system memory) until it has enough space for the new request.
>
> How does TTM find out whether a BO is idle? It checks its reservation point.
> If it is signaled, then the BO is idle and can be migrated. If not, the BO is
> still in use. The check is done in two stages. First, TTM does a simple check
> that asks whether the fence is signaled; this check is called in atomic
> context, so the device driver can't block. The second check is the
> wait_until_signaled one; that function can block, but there is a timeout
> enforced by TTM.
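>
> In code, the two stages look roughly like the sketch below. This is not the
> actual TTM code, locking is omitted, and the generic kernel fence calls are
> used purely for illustration:
>
> #include <linux/fence.h>
> #include <linux/jiffies.h>
>
> /* Rough sketch of the two-stage idleness check. */
> static bool bo_is_idle(struct fence *f, unsigned int timeout_ms)
> {
>         long ret;
>
>         /* Stage 1: non-blocking poll, safe to call from atomic context. */
>         if (fence_is_signaled(f))
>                 return true;
>
>         /* Stage 2: blocking wait, bounded by a timeout enforced by TTM. */
>         ret = fence_wait_timeout(f, true, msecs_to_jiffies(timeout_ms));
>         return ret > 0;         /* > 0: signaled before the timeout expired */
> }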
>
> What is a reservation point? It is a generic Linux kernel mechanism to
> allow sharing of fences between different device drivers. In our case, TTM
> assigns a reservation point to every BO. When TTM checks the BO's reservation
> point, it actually calls a callback function of that resv. point that tells it
> if the resv. point's fence has been signaled.
>
> The callback function is implemented by the entity using the BO, e.g. the
> radeon driver. When that callback is called, radeon needs to respond whether
> that BO is idle or not. radeon has that information because it dispatches a
> fence packet after each CS. That way, when the GPU has finished the submitted
> work, it processes the fence packet and writes to that fence. So when radeon
> checks if a BO is idle, it actually checks whether its fence has been written
> to by the GPU.
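>
> As a rough illustration (not radeon's actual code), such a callback boils
> down to comparing the sequence number the GPU last wrote back with the
> sequence number of the fence attached to the BO. struct example_fence and
> gpu_read_last_completed_seq() are made-up names:
>
> #include <linux/fence.h>
> #include <linux/kernel.h>
>
> /* Illustration only: the GPU writes the sequence number of the last
>  * completed fence packet to a CPU-visible location; the fence is signaled
>  * once the GPU has passed this fence's own sequence number. */
> static bool example_fence_signaled(struct fence *f)
> {
>         struct example_fence *ef = container_of(f, struct example_fence, base);
>
>         return gpu_read_last_completed_seq(ef->ring) >= f->seqno;
> }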
>
> Now, back to the migration process. If the BO is in use, TTM just moves to the
> next BO on the LRU list. If the BO is idle, TTM migrates it to GART or system
> memory to clear space for the new BO.
>
> If there is not enough memory for the new request after passing over the entire
> LRU list, TTM fails the new BO validation request.
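>
> A greatly simplified pseudocode rendering of that walk (the struct and the
> helpers are illustrative, not TTM's actual types and functions):
>
> #include <linux/errno.h>
> #include <linux/list.h>
> #include <linux/types.h>
>
> /* Simplified sketch of the LRU eviction walk; real TTM also deals with
>  * locking, reservations and placement flags. */
> struct vram_bo {
>         struct list_head lru;
>         bool pinned;
> };
>
> static int evict_until_enough_room(struct list_head *vram_lru, u64 needed)
> {
>         struct vram_bo *bo, *tmp;
>
>         list_for_each_entry_safe(bo, tmp, vram_lru, lru) {
>                 if (bo->pinned)
>                         continue;       /* pinned BOs are never moved */
>                 if (!vram_bo_is_idle(bo))
>                         continue;       /* resv. point not signaled: in use */
>
>                 migrate_to_gart_or_system(bo);
>                 if (vram_free_space() >= needed)
>                         return 0;       /* enough room for the new request */
>         }
>         return -ENOMEM;         /* walked the whole LRU, validation fails */
> }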
>
> 3. The why:
>
> HSA userspace applications sometimes need to use VRAM (GPU local memory) for
> their operation. This is especially true when running on discrete GPUs,
> which have high-bandwidth local memory.
>
> Because current AMD GPUs don't support page faults in VRAM, the HSA application
> needs to pin its allocated memory in VRAM before dispatching the GPU kernel.
>
> To allocate and pin the VRAM, HSA applications call amdkfd's IOCTLs that use
> the TTM subsystem to allocate and pin BOs on VRAM.
>
> Up until now, this is similar to a graphics application allocating memory on
> VRAM through radeon. However, in radeon, the CS is done through the driver's
> IOCTL. Therefore, the radeon driver can put a fence packet after every CS to
> let TTM know whether a BO is currently in use by a CS.
>
> In contrast, in HSA the CS is done through usermode queues. Because of that,
> amdkfd can *not* put a fence packet after each CS, and of course we can't
> trust userspace to do it. Therefore, the Linux kernel has *no* visibility
> into whether a BO is currently in use or not.
>
> This creates a problem when dealing with memory pressure on a system that
> runs both HSA applications and VRAM-consuming graphics applications. When
> memory pressure occurs due to VRAM allocation requests from graphics
> applications, the graphics CS can fail because HSA BOs are pinned in VRAM and
> can't be swapped out to GART/system memory, even if the BOs are currently
> idle. In addition, there can also be a situation where an HSA-only system has
> memory pressure due to fragmentation of the VRAM.
>
> 4.  Analyzing the use-cases
>
> The following describes different scenarios of system behavior regarding
> VRAM usage:
>
> - Graphics needs a buffer in a specific range (several cases for that). This
>    means that *all* VRAM allocations must be evicted, no matter what (including
>    HSA).
>
> - Graphics is to be prioritized over HSA (e.g. desktop computer case). All
>    graphics allocations take precedence over HSA. i.e. HSA must always yield
>    to TTM asking to evict BOs.
>
> - Graphics is not important or not even existent (e.g. a server). Then, HSA
>    eviction can fail. However, even in this case there might still be a VRAM
>    fragmentation problem that will prevent HSA pinning.
>
> 5.  Proposed mechanism
>
> The proposed mechanism is composed of two parts:
>
> - Policy set by the system admin
> - Allowing the TTM to evict HSA BOs
>
> 5.a. Policy
>
> Because we need to support the different scenarios described above, I suggest
> giving the system admin the ability to select the VRAM usage policy. This
> selection will dictate the behavior of amdkfd in this regard.
>
> The policy could be one of the following options:
> - VRAM usage: prefer graphics applications
> - VRAM usage: prefer HSA applications
>
> When the first option is chosen (prefer graphics), upon *each* request to
> evict a BO from VRAM, amdkfd will respond as if the BO is idle.
>
> When the second option is chosen (prefer HSA), upon *each* request to
> evict a BO from VRAM, amdkfd will respond as if the BO is in use.
>
> Because this is a new policy that we might want to tweak in the future, I
> think that it should currently be accessible only through debugfs. Once
> things are mature enough and people feel confident in it, this policy can be
> turned into a kernel parameter, a sysfs attribute, or both.
>
> The default policy, IMO, should be "prefer graphics applications".
>
> Note that even with the policy set to "prefer graphics", we must not evict
> the BOs of the PT/PD.
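>
> For reference, exposing such a knob through debugfs can be as small as the
> sketch below, assuming the policy is kept in a single integer (0 = prefer
> graphics, 1 = prefer HSA); the variable and file names are illustrative:
>
> #include <linux/debugfs.h>
>
> /* 0 = prefer graphics applications (default), 1 = prefer HSA applications */
> static u32 kfd_vram_policy;
>
> static void kfd_debugfs_add_vram_policy(struct dentry *kfd_debugfs_root)
> {
>         /* Root-writable file, e.g.
>          *   echo 1 > /sys/kernel/debug/kfd/vram_policy */
>         debugfs_create_u32("vram_policy", 0600, kfd_debugfs_root,
>                            &kfd_vram_policy);
> }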
>
> 5.b. Eviction process
>
> To allow TTM to evict a BO from VRAM, amdkfd effectively needs to preempt
> a running usermode queue. On Carrizo we can preempt a queue whenever we want.
> However, when using Kaveri we could run into problems when trying to preempt
> a queue.
> The problems can appear in the case where a shader takes a very long
> time to complete (hundreds of ms), or in the rare case where someone wrote
> an infinite shader (bug or otherwise). In those cases, Kaveri will fail to
> preempt the queue, amdkfd will indicate a failure (dmesg) and the CP
> will probably be stuck.
>
> In those cases, the only option left for the driver is to perform an operation
> called "kill all waves". This would terminate all the running waves and allow
> the CP to preempt the queues.
>
> In addition, the BOs that are created need to have the callback function of
> the resv. point set to amdkfd. However, for the BOs of the PT/PD, we need to
> set a different callback function so we can prevent those BOs from being
> evicted.
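>
> To make the PT/PD exception concrete, the two callbacks could look roughly
> like the sketch below (the names are made up, the policy variable is the one
> from the debugfs sketch above, and the other fence ops are omitted):
>
> #include <linux/fence.h>
>
> /* Illustration only.  Regular HSA BOs get a resv. point whose answer
>  * follows the admin policy; before answering "idle" amdkfd would first
>  * preempt the user queue, as described in the eviction flow below. */
> static bool kfd_bo_fence_signaled(struct fence *f)
> {
>         return kfd_vram_policy == 0;    /* 0 = prefer graphics: evictable */
> }
>
> /* PT/PD BOs must never be reported as idle, so TTM never evicts them. */
> static bool kfd_pt_pd_fence_signaled(struct fence *f)
> {
>         return false;
> }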
>
> The suggested eviction algorithm, assuming the policy is to prefer graphics,
> is as follows (a rough sketch in code follows the list):
>
>    - TTM calls the amdkfd callback, asking if a BO is idle
>    - amdkfd preempts the user space queue and removes it from the run list
>        - in case the preemption gets stuck, amdkfd kills the waves
>    - amdkfd tells TTM that the BO is idle
>    - TTM evicts the buffer to GART
>    - amdkfd updates the GPUVM page table and does all necessary TLB flushing
>    - amdkfd restores the user space queue
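>
> In pseudocode, the flow above could look roughly like this. Every function
> name is a placeholder; in practice the "BO is idle" answer would be the
> return value of the fence callback, and the page table fix-up would run from
> a TTM move notification:
>
> /* Answer to TTM's "is this BO idle?" question. */
> static bool kfd_is_bo_idle(struct kfd_bo *bo)
> {
>         /* Preempt the user space queue and remove it from the run list;
>          * if the preemption gets stuck, kill all the running waves. */
>         if (kfd_preempt_user_queue(bo->process))
>                 kfd_kill_all_waves(bo->process);
>
>         return true;    /* report idle; TTM then evicts the BO to GART */
> }
>
> /* Called once TTM has moved the buffer out of VRAM. */
> static void kfd_bo_moved(struct kfd_bo *bo)
> {
>         kfd_update_gpuvm_page_tables(bo->process);
>         kfd_flush_gpu_tlbs(bo->process);
>         kfd_restore_user_queue(bo->process);
> }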
>
> 6.  Conclusion
>
> The current status of the code is that the four IOCTLs mentioned in
> point 1 are partially implemented. The mechanism described here is not
> implemented yet, as I first wanted to get some feedback.
>
> So although part of the code is ready, I would like to publish the patches
> as a single patch set.
>
> I would like to thank RH's Jerome Glisse for helping me with this RFC.
>
> Comments and flames are welcome.
>
> Thanks,
> 	Oded
> _______________________________________________
> dri-devel mailing list
> dri-devel at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Hi,
I had a chat with Daniel Vetter on IRC about my RFC. Daniel suggested that I
post the transcript here to help start a discussion. To summarize, Daniel talks
about how he thinks of controlling both compute and gfx workloads with regard
to memory pinning, which is different from what I suggested.
So here is the (somewhat edited) transcript:

danvet: On a high level, the algo you describe is pretty much what we have in
mind for svm without hw pagefault support on i915, i.e. userspace creates a list
of BOs it needs, assigned to a hw context. As long as we have that hw context
running we must keep all the BOs on that list mapped. For unmapping we need to
stop the hw context before we can touch any of the BOs. The eviction policy
seems a bit heavy-handed though.

gabbayo: So what I want is some kind of policy mechanism to let the user decide 
if he wants to prefer gfx or compute.

danvet: either mode will be too severe I think

gabbayo: yeah, maybe a slider :) ? like some kind of threshold ?

danvet: what I have in mind for i915 is that when there's no memory pressure we 
don't do anything at all, keep the overhead low. If there is pressure we need to 
auto-balance between svm jobs and normal gpu workloads. I think that should be 
done by regularly updating the position of svm buffers on the lru, so that the 
aging matches between hsa and gpu workloads, or svm and gpu on i915.

danvet: And then when ttm/gem decides to evict something, and it's a svm/hsa 
buffer then we stop the context so that we can evict that single buffer. There's 
lots of interactions though, you also need scheduling fairness between gpu and 
hsa. On i915 that part is easy since it will be the same hw engine, so we can 
just do a normal scheduler (fifo to begin with).

danvet: Generally for memory management you want a unified lru. Otherwise 
fairness is impossible. Giving users a tunable means it will always suck since 
no one will touch it.

gabbayo: Although, if we are talking about dedicated machines for compute 
(HSA/svm), I think there is a chance users will indeed tune it. For regular 
users, I agree that they will probably keep the default.

danvet: For a dedicated hsa machine the default should allow everything for hsa,
and the same default should allow everything for gpu workloads. As long as your
solution is reasonably fair that should be possible by just giving everyone a
share proportional to their needs.

