Documentation about AMD's HSA implementation?
Bridgman, John
John.Bridgman at amd.com
Tue Feb 13 23:45:07 UTC 2018
>-----Original Message-----
>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of
>Bridgman, John
>Sent: Tuesday, February 13, 2018 6:42 PM
>To: Ming Yang; Kuehling, Felix
>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>Subject: RE: Documentation about AMD's HSA implementation?
>
>
>
>>-----Original Message-----
>>From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf
>>Of Ming Yang
>>Sent: Tuesday, February 13, 2018 4:59 PM
>>To: Kuehling, Felix
>>Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>That's very helpful, thanks!
>>
>>On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling
>><felix.kuehling at amd.com>
>>wrote:
>>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>>> Thanks for the suggestions! But I might ask several specific
>>>> questions, as I can't find the answers in those documents, to give
>>>> myself a quick start, if that's okay. Pointing me to the
>>>> files/functions would be good enough. Any explanations are
>>>> appreciated. My purpose is to hack it with a different scheduling
>>>> policy, with real-time and predictability considerations.
>>>>
>>>> - Where/How is the packet scheduler implemented? How are packets
>>>> from multiple queues scheduled? What about scheduling packets from
>>>> queues in different address spaces?
>>>
>>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>>> micro engine. Within each pipe the queues are time-multiplexed.
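To make the pipe/queue split concrete, here is a rough sketch of how 24 flat queue slots could map onto (pipe, queue) pairs - illustrative only, with made-up names, not the actual KFD code:

    /* Hypothetical illustration: 24 KFD queue slots spread over
     * 4 compute pipes, 6 queues per pipe. Within each pipe the
     * queues are time-multiplexed by the CP micro engine. */
    #define KFD_PIPES           4
    #define KFD_QUEUES_PER_PIPE 6

    struct hqd_slot { int pipe; int queue; };

    static struct hqd_slot slot_from_index(int i)
    {
        struct hqd_slot s;
        s.pipe  = i % KFD_PIPES;  /* spread consecutive slots across pipes */
        s.queue = i / KFD_PIPES;  /* then move down within each pipe */
        return s;
    }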
>>
>>Please correct me if I'm wrong. CP is the computing processor, like the
>>Execution Engine in an NVIDIA GPU. A pipe is like a wavefront (warp)
>>scheduler multiplexing queues, in order to hide memory latency.
>
>CP is one step back from that - it's a "command processor" which reads
>command packets from the driver (PM4 format) or the application (AQL
>format), then manages the execution of each command on the GPU. A typical
>packet might be "dispatch", which initiates a compute operation on an
>N-dimensional array, or "draw", which initiates the rendering of an array
>of triangles. Those compute and render commands then generate a
>(typically) large number of wavefronts which are multiplexed on the shader
>core (by the SQ, IIRC). Most of our recent GPUs have one micro engine for
>graphics ("ME") and two for compute ("MEC"). Marketing refers to each pipe
>on an MEC block as an "ACE".
I missed one important point - "CP" refers to the combination of ME, MEC(s) and a few other related blocks.
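To give a feel for what a PM4 command packet looks like, here is a simplified sketch of a type-3 "dispatch" packet along the lines of what the open amdgpu driver emits (the helper and constants below are illustrative; see the PM4 definitions in the driver for the real layout):

    #include <stdint.h>

    /* PM4 type-3 header: bits [31:30] = packet type, [29:16] = body
     * dword count minus one, [15:8] = opcode. */
    #define PM4_TYPE3              3u
    #define PM4_OP_DISPATCH_DIRECT 0x15u

    static inline uint32_t pm4_hdr(uint32_t op, uint32_t body_dw)
    {
        return (PM4_TYPE3 << 30) | (((body_dw - 1) & 0x3FFF) << 16) |
               ((op & 0xFF) << 8);
    }

    /* Emit a dispatch of a 16x8x1 grid of workgroups. */
    static void emit_dispatch(uint32_t *ring)
    {
        ring[0] = pm4_hdr(PM4_OP_DISPATCH_DIRECT, 4);
        ring[1] = 16;  /* DIM_X */
        ring[2] = 8;   /* DIM_Y */
        ring[3] = 1;   /* DIM_Z */
        ring[4] = 1;   /* DISPATCH_INITIATOR: COMPUTE_SHADER_EN */
    }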
>>
>>>
>>> If we need more than 24 queues, or if we have more than 8 processes,
>>> the hardware scheduler (HWS) adds another layer of scheduling,
>>> basically round-robin between batches of 24 queues or 8 processes.
>>> Once you get into such an over-subscribed scenario your performance
>>> and GPU utilization can suffer quite badly.
>>
>>HWS is also implemented in the firmware that's closed-source?
>
>Correct - HWS is implemented in the MEC microcode. We also include a simple
>SW scheduler in the open source driver code, however.
>>
>>>
>>>>
>>>> - I noticed the new support for multi-process concurrency in the
>>>> archive of this mailing list. Could you point me to the code that
>>>> implements this?
>>>
>>> That's basically just a switch that tells the firmware that it is
>>> allowed to schedule queues from different processes at the same time.
>>> The upper limit is the number of VMIDs that HWS can work with. It
>>> needs to assign a unique VMID to each process (each VMID representing
>>> a separate address space, page table, etc.). If there are more
>>> processes than VMIDs, the HWS has to time-multiplex.
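As a sketch of that constraint (purely illustrative - the real logic lives in firmware, and these names are invented): the HWS effectively draws from a fixed pool of VMIDs, and any process beyond the pool has to wait for one to be recycled.

    #include <stdint.h>

    #define NUM_VMIDS 16u        /* fixed HW pool; some reserved for gfx */
    static uint16_t vmid_busy;   /* one bit per VMID */

    static int vmid_alloc(void)
    {
        for (unsigned int i = 0; i < NUM_VMIDS; i++) {
            if (!(vmid_busy & (1u << i))) {
                vmid_busy |= (uint16_t)(1u << i);
                return (int)i;
            }
        }
        return -1;  /* oversubscribed: process must be time-multiplexed */
    }

    static void vmid_free(int id)
    {
        vmid_busy &= (uint16_t)~(1u << (unsigned int)id);
    }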
>>
>>HWS dispatches packets in the order they become the head of the queue,
>>i.e., the packet pointed to by the read_index? So in this way it's FIFO.
>>Or is it round-robin between queues? You mentioned round-robin over
>>batches in the over-subscribed scenario.
>
>Round robin between sets of queues. The HWS logic generates sets as
>follows:
>
>1. "set resources" packet from driver tells scheduler how many VMIDs and
>HW queues it can use
>
>2. "runlist" packet from driver provides list of processes and list of queues for
>each process
>
>3. if multi-process switch not set, HWS schedules as many queues from the
>first process in the runlist as it has HW queues (see #1)
>
>4. at the end of process quantum (set by driver) either switch to next process
>(if all queues from first process have been scheduled) or schedule next set of
>queues from the same process
>
>5. when all queues from all processes have been scheduled and run for a
>process quantum, go back to the start of the runlist and repeat
>
>If the multi-process switch is set, and the number of queues for a process is
>less than the number of HW queues available, then in step #3 above HWS will
>start scheduling queues for additional processes, using a different VMID for
>each process, and continue until it either runs out of VMIDs or HW queues (or
>reaches the end of the runlist). All of the queues and processes would then
>run together for a process quantum before switching to the next queue set.
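In rough pseudocode the loop above looks something like this (an illustrative model only - the real implementation is in the MEC microcode, and all names here are invented):

    #include <stdbool.h>

    struct proc { int nqueues; };

    /* Stub: run one set of queues for a process quantum. */
    static void run_set_for_quantum(struct proc *procs, int first, int last)
    {
        (void)procs; (void)first; (void)last;
    }

    static void hws_loop(struct proc *procs, int nprocs,
                         int hw_queues, int vmids, bool multi_process)
    {
        for (;;) {                           /* step 5: repeat the runlist */
            int p = 0;
            while (p < nprocs) {
                int first = p, q = 0, v = 0;
                /* Steps 3/4: fill HW queues starting with procs[p]; if
                 * the multi-process switch is set, keep pulling in the
                 * following processes (one VMID each) until queues or
                 * VMIDs run out. Simplified to all-or-none per process. */
                do {
                    q += procs[p].nqueues;
                    v++;
                    p++;
                } while (multi_process && p < nprocs &&
                         q < hw_queues && v < vmids);
                run_set_for_quantum(procs, first, p);
            }
        }
    }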
>
>>
>>This might not be a big deal for performance, but it matters for
>>predictability and real-time analysis.
>
>Agreed. In general you would not want to overcommit either VMIDs or HW
>queues in a real-time scenario, and for hard real time you would probably
>want to limit to a single queue per pipe since the MEC also multiplexes
>between HW queues on a pipe even without HWS.
>
>>
>>>
>>>>
>>>> - Another related question -- where/how is the preemption/context
>>>> switch between packets/queues implemented?
>>>
>>> As long as you don't oversubscribe the available VMIDs, there is no
>>> real context switching. Everything can run concurrently. When you
>>> start oversubscribing HW queues or VMIDs, the HWS firmware will start
>>> multiplexing. This is all handled inside the firmware and is quite
>>> transparent even to KFD.
>>
>>I see. So preemption, at least in AMD's implementation, is not
>>switching out the executing kernel, but just letting new kernels run
>>concurrently with the existing ones. This means performance degrades
>>when too many workloads are submitted. The running kernels leave the
>>GPU only when they are done.
>
>Both - you can have multiple kernels executing concurrently (each generating
>multiple threads in the shader core) AND switch out the currently executing
>set of kernels via preemption.
>
>>
>>Is there any reason for not preempting/switching out the existing
>>kernel, besides context switch overheads? NVIDIA does not provide this
>>option either.
>>Non-preemption hurts the real-time property in terms of priority
>>inversion. I understand preemption should not be used heavily, but
>>having such an option may help a lot for real-time systems.
>
>If I understand you correctly, you can have it either way depending on the
>number of queues you enable simultaneously. At any given time you are
>typically only going to be running the kernels from one queue on each pipe, ie
>with 3 pipes and 24 queues you would typically only be running 3 kernels at a
>time. This seemed like a good compromise between scalability and efficiency.
>
>>
>>>
>>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>>> queue). It supports packets for unmapping queues, and we can send it
>>> a new runlist (basically a bunch of map-process and map-queue
>>> packets). The interesting files to look at are kfd_packet_manager.c,
>>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
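Conceptually a runlist is just a buffer of such packets: a map-process packet for each process, followed by map-queue packets for its queues. The struct shapes below are simplified and hypothetical - the real layouts are in the kfd_pm4_headers*.h files:

    #include <stdint.h>

    struct map_process_pkt {
        uint32_t header;           /* PM4-style packet header */
        uint32_t pasid;            /* identifies the process address space */
        uint64_t page_table_base;
    };

    struct map_queue_pkt {
        uint32_t header;
        uint64_t mqd_addr;         /* memory queue descriptor address */
        uint64_t wptr_addr;        /* where HW reads the write pointer */
    };

    /* A runlist for one process with two queues is laid out roughly as:
     *   [ map_process(pasid A) ][ map_queue(q0) ][ map_queue(q1) ] ... */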
>>>
>>
>>So in this way, if we want to implement a different scheduling policy,
>>we should control the submission of packets to the queues in the
>>runtime/KFD, before they get to the firmware, because they are out of
>>our control once submitted to the HWS in the firmware.
>
>Correct - there is a tradeoff between "easily scheduling lots of work" and fine-
>grained control. Limiting the number of queues you run simultaneously is
>another way of taking back control.
>
>You're probably past this, but you might find the original introduction to KFD
>useful in some way:
>
>https://lwn.net/Articles/605153/
>
>>
>>Best,
>>Mark
>>
>>> Regards,
>>> Felix
>>>
>>>>
>>>> Thanks in advance!
>>>>
>>>> Best,
>>>> Mark
>>>>
>>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com>
>>>>> wrote:
>>>>> There is also this: https://gpuopen.com/professional-compute/,
>>>>> which gives pointers to several libraries and tools built on top
>>>>> of ROCm.
>>>>>
>>>>> Another thing to keep in mind is that ROCm is diverging from the
>>>>> strict HSA standard in some important ways. For example, the HSA
>>>>> standard includes HSAIL as an intermediate representation that gets
>>>>> finalized on the target system, whereas ROCm compiles directly to
>>>>> native GPU ISA.
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander
>>>>> <Alexander.Deucher at amd.com> wrote:
>>>>>> The ROCm documentation is probably a good place to start:
>>>>>>
>>>>>> https://rocm.github.io/documentation.html
>>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ________________________________
>>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf
>>>>>> of Ming Yang <minos.future at gmail.com>
>>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>>> To: amd-gfx at lists.freedesktop.org
>>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm interested in HSA and was excited when I found AMD's fully
>>>>>> open ROCm stack supporting it. Before digging into the code, I
>>>>>> wonder if there's any documentation available about AMD's HSA
>>>>>> implementation - a book, whitepaper, paper, or other documentation.
>>>>>>
>>>>>> I did find helpful materials about HSA, including the HSA
>>>>>> standards on this page
>>>>>> (http://www.hsafoundation.com/standards/) and a nice book about
>>>>>> HSA (Heterogeneous System Architecture: A New Compute Platform
>>>>>> Infrastructure). But regarding documentation about AMD's
>>>>>> implementation, I haven't found anything yet.
>>>>>>
>>>>>> Please let me know if there are any that are publicly accessible.
>>>>>> If not, I'd appreciate any suggestions on learning about the
>>>>>> implementation of specific system components, e.g., queue
>>>>>> scheduling.
>>>>>>
>>>>>> Best,
>>>>>> Mark
>>>