Documentation about AMD's HSA implementation?

Tue Feb 13 22:31:55 UTC 2018

On 2018-02-13 04:58 PM, Ming Yang wrote:
> That's very helpful, thanks!
>
> On Tue, Feb 13, 2018 at 4:17 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>> On 2018-02-13 04:06 PM, Ming Yang wrote:
>>> Thanks for the suggestions!  But I might ask several specific
>>> questions, as I can't find the answer in those documents, to give
>>> myself a quick start if that's okay. Pointing me to the
>>> files/functions would be good enough.  Any explanations are
>>> appreciated.   My purpose is to hack it with different scheduling
>>> policy with real-time and predictability consideration.
>>>
>>> - Where/How is the packet scheduler implemented?  How are packets from
>>> multiple queues scheduled?  What about scheduling packets from queues
>>> in different address spaces?
>> This is done mostly in firmware. The CP engine supports up to 32 queues.
>> We share those between KFD and AMDGPU. KFD gets 24 queues to use.
>> Usually that is 6 queues times 4 pipes. Pipes are threads in the CP
>> micro engine. Within each pipe the queues are time-multiplexed.
> Please correct me if I'm wrong.  CP is computing processor, like the
> Execution Engine in NVIDIA GPU. Pipe is like wavefront (warp)
> scheduler multiplexing queues, in order to hide memory latency.

CP stands for "command processor". There are multiple CP micro engines.
CPG for graphics, CPC for compute. This is not related to warps or wave
fronts. Those execute on the CUs (compute units). There are many CUs,
but only one CPC. The scheduling or dispatching of wavefronts to CUs is
yet another level of scheduling that I didn't talk about. Application
submit AQL dispatch packets to user mode queues. The CP processes those
dispatch packets and schedules the resulting wavefronts on CUs.

>
>> If we need more than 24 queues, or if we have more than 8 processes, the
>> hardware scheduler (HWS) adds another layer scheduling, basically
>> round-robin between batches of 24 queues or 8 processes. Once you get
>> into such an over-subscribed scenario your performance and GPU
>> utilization can suffers quite badly.
> HWS is also implemented in the firmware that's closed-source?

Yes.

>
>>> - I noticed the new support of concurrency of multi-processes in the
>>> archive of this mailing list.  Could you point me to the code that
>>> implements this?
>> That's basically just a switch that tells the firmware that it is
>> allowed to schedule queues from different processes at the same time.
>> The upper limit is the number of VMIDs that HWS can work with. It needs
>> to assign a unique VMID to each process (each VMID representing a
>> separate address space, page table, etc.). If there are more processes
>> than VMIDs, the HWS has to time-multiplex.
> HWS dispatch packets in their order of becoming the head of the queue,
> i.e., being pointed by the read_index? So in this way it's FIFO.  Or
> round-robin between queues? You mentioned round-robin over batches in
> the over-subscribed scenario.

Commands within a queue are handled in FIFO order. Commands on different
queues are not ordered with respect to each other.

When I talk about round robin of batches of queues, it means that the
pipes of the CP are executing up to 24 user mode queues. After the time
slice is up, the HWS preempts those queues, and loads another batch of
queues to the CP pipes. This goes on until all the queues in the runlist
have had some time on the GPU. Then the whole process starts over.

>
> This might not be a big deal for performance, but it matters for
> predictability and real-time analysis.
>
>>> - Also another related question -- where/how is the preemption/context
>>> switch between packets/queues implemented?
>> As long as you don't oversubscribe the available VMIDs, there is no real
>> context switching. Everything can run concurrently. When you start
>> oversubscribing HW queues or VMIDs, the HWS firmware will start
>> multiplexing. This is all handled inside the firmware and is quite
>> transparent even to KFD.
> I see.  So the preemption in at least AMD's implementation is not
> switching out the executing kernel, but just letting new kernels to
> run concurrently with the existing ones.  This means the performance
> is degraded when too many workloads are submitted.  The running
> kernels leave the GPU only when they are done.

No, that's not what I meant. As long as nothing is oversubscribed, you
don't have preemptions. But as soon as hardware queues or VMIDs are
oversubscribed, the HWS will need to preempt queues in order to let
other queues have some time on the hardware. Preempting a queue includes
preempting all the wavefronts that were dispatched by that queue. The
state of all the CUs is saved and later restored. We call this CWSR
(compute wave save/restore).

>
> Is there any reason for not preempting/switching out the existing
> kernel, besides context switch overheads?  NVIDIA is not providing
> this option either.  Non-preemption hurts the real-time property in
> terms of priority inversion.  I understand preemption should not be
> massively used but having such an option may help a lot for real-time
> systems.
>
>> KFD interacts with the HWS firmware through the HIQ (HSA interface
>> queue). It supports packets for unmapping queues, we can send it a new
>> runlist (basically a bunch of map-process and map-queue packets). The
>> interesting files to look at are kfd_packet_manager.c,
>> kfd_kernel_queue_<hw>.c and kfd_device_queue_manager.c.
>>
> So in this way, if we want to implement different scheduling policy,
> we should control the submission of packets to the queues in
> runtime/KFD, before getting to the firmware.  Because it's out of
> access once it's submitted to the HWS in the firmware.

Right. If you need more control over the scheduling, there is an option
to disable the HWS. This is currently more a debugging option and comes
with some side effects. For example we currently only support CWSR when
the HWS is enabled. And this currently disables queue and VMID
oversubscription. If you create too many queues or processes, queue or
KFD process creation just fails.

If you can control what's going on in user mode, you can achieve the
same by just not creating too many processes and queues.

Regards,
  Felix

>
> Best,
> Mark
>
>> Regards,
>>   Felix
>>
>>> Thanks in advance!
>>>
>>> Best,
>>> Mark
>>>
>>>> On 13 Feb 2018, at 2:56 PM, Felix Kuehling <felix.kuehling at amd.com> wrote:
>>>> There is also this: https://gpuopen.com/professional-compute/, which
>>>> give pointer to several libraries and tools that built on top of ROCm.
>>>>
>>>> Another thing to keep in mind is, that ROCm is diverging from the strict
>>>> HSA standard in some important ways. For example the HSA standard
>>>> includes HSAIL as an intermediate representation that gets finalized on
>>>> the target system, whereas ROCm compiles directly to native GPU ISA.
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> On Tue, Feb 13, 2018 at 9:40 AM, Deucher, Alexander <Alexander.Deucher at amd.com> wrote:
>>>>> The ROCm documentation is probably a good place to start:
>>>>>
>>>>> https://rocm.github.io/documentation.html
>>>>>
>>>>>
>>>>> Alex
>>>>>
>>>>> ________________________________
>>>>> From: amd-gfx <amd-gfx-bounces at lists.freedesktop.org> on behalf of Ming Yang
>>>>> <minos.future at gmail.com>
>>>>> Sent: Tuesday, February 13, 2018 12:00 AM
>>>>> To: amd-gfx at lists.freedesktop.org
>>>>> Subject: Documentation about AMD's HSA implementation?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm interested in HSA and excited when I found AMD's fully open-stack ROCm
>>>>> supporting it. Before digging into the code, I wonder if there's any
>>>>> documentation available about AMD's HSA implementation, either book,
>>>>> whitepaper, paper, or documentation.
>>>>>
>>>>> I did find helpful materials about HSA, including HSA standards on this page
>>>>> (http://www.hsafoundation.com/standards/) and a nice book about HSA
>>>>> (Heterogeneous System Architecture A New Compute Platform Infrastructure).
>>>>> But regarding the documentation about AMD's implementation, I haven't found
>>>>> anything yet.
>>>>>
>>>>> Please let me know if there are ones publicly accessible. If no, any
>>>>> suggestions on learning the implementation of specific system components,
>>>>> e.g., queue scheduling.
>>>>>
>>>>> Best,
>>>>> Mark