Documentation about AMD's HSA implementation?

Tue Feb 13 23:42:24 UTC 2018

>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of
>Ming Yang
>Sent: Tuesday, February 13, 2018 4:59 PM
>To: Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>Subject: Re: Documentation about AMD's HSA implementation?
>
>That's very helpful, thanks!
>
>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling at amd.com>
>wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  But I might ask several specific
>>> questions, as I can't find the answer in those documents, to give
>>> myself a quick start if that's okay. Pointing me to the
>>> files/functions would be good enough.  Any explanations are
>>> appreciated.   My purpose is to hack it with different scheduling
>>> policy with real-time and predictability consideration.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets
>>> from multiple queues scheduled?  What about scheduling packets from
>>> queues in different address spaces?
>>
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
>
>Please correct me if I'm wrong.  CP is computing processor, like the Execution
>Engine in NVIDIA GPU. Pipe is like wavefront (warp) scheduler multiplexing
>queues, in order to hide memory latency.

CP is one step back from that - it's a "command processor" which reads command packets from driver (PM4 format) or application (AQL format) then manages the execution of each command on the GPU. A typical packet might be "dispatch", which initiates a compute operation on an N-dimensional array, or "draw" which initiates the rendering of an array of triangles. Those compute and render commands then generate a (typically) large number of wavefronts which are multiplexed on the shader core (by SQ IIRC). Most of our recent GPUs have one micro engine for graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe on an MEC block as an "ACE".
>
>>
>> If we need more than 24 queues, or if we have more than 8 processes,
>> the hardware scheduler (HWS) adds another layer scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario your performance and GPU
>> utilization can suffers quite badly.
>
>HWS is also implemented in the firmware that's closed-source?

Correct - HWS is implemented in the MEC microcode. We also include a simple SW scheduler in the open source driver code, however. 
>
>>
>>>
>>> - I noticed the new support of concurrency of multi-processes in the
>>> archive of this mailing list.  Could you point me to the code that
>>> implements this?
>>
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It
>> needs to assign a unique VMID to each process (each VMID representing
>> a separate address space, page table, etc.). If there are more
>> processes than VMIDs, the HWS has to time-multiplex.
>
>HWS dispatch packets in their order of becoming the head of the queue, i.e.,
>being pointed by the read_index? So in this way it's FIFO.  Or round-robin
>between queues? You mentioned round-robin over batches in the over-
>subscribed scenario.

Round robin between sets of queues. The HWS logic generates sets as follows:

1. "set resources" packet from driver tells scheduler how many VMIDs and HW queues it can use

2. "runlist" packet from driver provides list of processes and list of queues for each process

3. if multi-process switch not set, HWS schedules as many queues from the first process in the runlist as it has HW queues (see #1)

4. at the end of process quantum (set by driver) either switch to next process (if all queues from first process have been scheduled) or schedule next set of queues from the same process

5. when all queues from all processes have been scheduled and run for a process quantum, go back to the start of the runlist and repeat

If the multi-process switch is set, and the number of queues for a process is less than the number of HW queues available, then in step #3 above HWS will start scheduling queues for additional processes, using a different VMID for each process, and continue until it either runs out of VMIDs or HW queues (or reaches the end of the runlist). All of the queues and processes would then run together for a process quantum before switching to the next queue set.

>
>This might not be a big deal for performance, but it matters for predictability
>and real-time analysis.

Agreed. In general you would not want to overcommit either VMIDs or HW queues in a real-time scenario, and for hard real time you would probably want to limit to a single queue per pipe since the MEC also multiplexes between HW queues on a pipe even without HWS. 

>
>>
>>>
>>> - Also another related question -- where/how is the
>>> preemption/context switch between packets/queues implemented?
>>
>> As long as you don't oversubscribe the available VMIDs, there is no
>> real context switching. Everything can run concurrently. When you
>> start oversubscribing HW queues or VMIDs, the HWS firmware will start
>> multiplexing. This is all handled inside the firmware and is quite
>> transparent even to KFD.
>
>I see.  So the preemption in at least AMD's implementation is not switching
>out the executing kernel, but just letting new kernels to run concurrently with
>the existing ones.  This means the performance is degraded when too many
>workloads are submitted.  The running kernels leave the GPU only when they
>are done.

Both - you can have multiple kernels executing concurrently (each generating multiple threads in the shader core) AND switch out the currently executing set of kernels via preemption. 

>
>Is there any reason for not preempting/switching out the existing kernel,
>besides context switch overheads?  NVIDIA is not providing this option either.
>Non-preemption hurts the real-time property in terms of priority inversion.  I
>understand preemption should not be massively used but having such an
>option may help a lot for real-time systems.

If I understand you correctly, you can have it either way depending on the number of queues you enable simultaneously. At any given time you are typically only going to be running the kernels from one queue on each pipe, ie with 3 pipes and 24 queues you would typically only be running 3 kernels at a time. This seemed like a good compromise between scalability and efficiency. 

>
>>
>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>> queue). It supports packets for unmapping queues, we can send it a new
>> runlist (basically a bunch of map-process and map-queue packets). The
>> interesting files to look at are kfd_packet_manager.c,
>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>
>
>So in this way, if we want to implement different scheduling policy, we should
>control the submission of packets to the queues in runtime/KFD, before
>getting to the firmware.  Because it's out of access once it's submitted to the
>HWS in the firmware.

Correct - there is a tradeoff between "easily scheduling lots of work" and fine-grained control. Limiting the number of queues you run simultaneously is another way of taking back control. 

You're probably past this, but you might find the original introduction to KFD useful in some way:

https://lwn.net/Articles/605153/

>
>Best,
>Mark
>
>> Regards,
>>   Felix
>>
>>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Mark
>>>
>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com>
>wrote:
>>>> There is also this: https://gpuopen.com/professional-compute/, which
>>>> give pointer to several libraries and tools that built on top of ROCm.
>>>>
>>>> Another thing to keep in mind is, that ROCm is diverging from the
>>>> strict HSA standard in some important ways. For example the HSA
>>>> standard includes HSAIL as an intermediate representation that gets
>>>> finalized on the target system, whereas ROCm compiles directly to native
>GPU ISA.
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander
><Alexander.Deucher at amd.com> wrote:
>>>>> The ROCm documentation is probably a good place to start:
>>>>>
>>>>> https://rocm.github.io/documentation.html
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>> ________________________________
>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of
>>>>> Ming Yang <minos.future at gmail.com>
>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm interested in HSA and excited when I found AMD's fully
>>>>> open-stack ROCm supporting it. Before digging into the code, I
>>>>> wonder if there's any documentation available about AMD's HSA
>>>>> implementation, either book, whitepaper, paper, or documentation.
>>>>>
>>>>> I did find helpful materials about HSA, including HSA standards on
>>>>> this page
>>>>> (http://www.hsafoundation.com/standards/) and a nice book about
>HSA
>>>>> (Heterogeneous System Architecture A New Compute Platform
>Infrastructure).
>>>>> But regarding the documentation about AMD's implementation, I
>>>>> haven't found anything yet.
>>>>>
>>>>> Please let me know if there are ones publicly accessible. If no,
>>>>> any suggestions on learning the implementation of specific system
>>>>> components, e.g., queue scheduling.
>>>>>
>>>>> Best,
>>>>> Mark
>>
>_______________________________________________
>amd-gfx mailing list
>amd-gfx at lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/amd-gfx