Documentation about AMD's HSA implementation?

Ming Yang minos.future at gmail.com
Tue Mar 20 01:06:59 UTC 2018


Thanks, John!

On Sat, Mar 17, 2018 at 4:17 PM, Bridgman, John <John.Bridgman at amd.com> wrote:
>
>>-----Original Message-----
>>From: Ming Yang [mailto:minos.future at gmail.com]
>>Sent: Saturday, March 17, 2018 12:35 PM
>>To: Kuehling, Felix; Bridgman, John
>>Cc: amd-gfx at lists.freedesktop.org
>>Subject: Re: Documentation about AMD's HSA implementation?
>>
>>Hi,
>>
>>After digging into the documents and code, our previous discussion about GPU
>>workload scheduling (mainly HWS and ACE scheduling) makes a lot more sense
>>to me now.  Thanks a lot!  I'm writing this email to ask a few more
>>questions.  Before asking, let me first share links to the documents I
>>found most helpful.
>>
>>GCN (1st gen.?) architecture whitepaper
>>https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf
>>Notes: ACE scheduling.
>>
>>Polaris architecture whitepaper (4th gen. GCN)
>>http://radeon.com/_downloads/polaris-whitepaper-4.8.16.pdf
>>Notes: ACE scheduling; HWS; quick response queue (priority assignment);
>>compute units reservation.
>>
>>AMDKFD patch cover letters:
>>v5: https://lwn.net/Articles/619581/
>>v1: https://lwn.net/Articles/605153/
>>
>>A comprehensive performance analysis of HSA and OpenCL 2.0:
>>http://ieeexplore.ieee.org/document/7482093/
>>
>>Partitioning resources of a processor (AMD patent)
>>https://patents.google.com/patent/US8933942B2/
>>Notes: Compute resources are allocated according to the resource
>>requirement percentage of the command.
>>
>>Here are my questions about ACE scheduling.  Most of them concern ACE
>>scheduling because the firmware is closed-source, and how the ACEs schedule
>>commands (queues) is not detailed enough in these documents.  I'm not able
>>to run experiments on Raven Ridge yet.
>>
>>1. Can wavefronts of one command scheduled by an ACE be spread across
>>multiple compute engines (shader arrays)?  The cu_mask setting seems to
>>confirm this, since the cu_mask for one queue can cover CUs across multiple
>>compute engines.
>
> Correct, assuming the work associated with the command is not trivially small
> and so generates enough wavefronts to require multiple CUs.
>
>>
>>2.  If so, how is the competition between commands scheduled by different
>>ACEs resolved?  What's the scheduling scheme?  For example, when each ACE
>>has a command ready to occupy 50% of the compute resources, do these 4
>>commands each get 25%, do they execute round-robin with 50% of the
>>resources at a time, or do only the first two scheduled commands execute
>>while the later two wait?
>
> Depends on how you measure compute resources, since each SIMD in a CU can
> have up to 10 separate wavefronts running on it as long as total register usage
> for all the threads does not exceed the number available in HW.
>
> If each ACE (let's say pipe for clarity) has enough work to put a single wavefront
> on 50% of the SIMDs then all of the work would get scheduled to the SIMDs (4
> SIMDs per CU) and run in a round-robin-ish manner as each wavefront was
> blocked waiting for memory access.
>
> If each pipe has enough work to fill 50% of the CUs and all pipes/queues were
> assigned the same priority (see below) then the behaviour would be more like
> "each one would get 25% and each time a wavefront finished another one would
> be started".
>

This makes sense to me.  I will try some experiments once Raven Ridge is ready.
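
To make the register math concrete: if I read the GCN whitepaper
right, each SIMD can give a single wavefront at most 256 VGPRs, so a
kernel using 64 VGPRs per work-item can keep 256/64 = 4 wavefronts
resident per SIMD, below the 10-wavefront cap.

For the experiments, I expect to partition CUs between queues through
the thunk's per-queue CU mask.  A minimal sketch, assuming the
hsaKmtSetQueueCUMask() interface from libhsakmt's hsakmt.h and a bit
layout of one bit per CU grouped by shader engine (both assumptions
on my part):

    /* Sketch: restrict a user-mode queue to CUs spanning two shader
     * engines via the libhsakmt thunk.  The call signature and the
     * mask layout are assumptions from reading hsakmt.h. */
    #include "hsakmt.h"

    static HSAKMT_STATUS limit_queue_cus(HSA_QUEUEID queue_id)
    {
        /* 64-bit mask as two 32-bit words; hypothetically bits 0..31
         * cover shader engine 0 and bits 32..63 cover shader engine 1.
         * Enable half of the CUs on each engine. */
        HSAuint32 cu_mask[2] = { 0x0000FFFF, 0x0000FFFF };
        return hsaKmtSetQueueCUMask(queue_id, 64, cu_mask);
    }

If a mask like this really spans engines, wavefronts from a single
dispatch should show up on both, which would confirm #1 directly.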

>>
>>3. If the barrier bit of the AQL packet is not set, does ACE schedule the
>>following command using the same scheduling scheme in #2?
>
> Not sure, barrier behaviour has paged so far out of my head that I'll have to skip
> this one.
>

This barrier bit is defined in HSA: if it is set on a packet, the
packet processor must wait for all preceding packets in the queue to
complete before launching it.  It's probably key to implementing
OpenCL's out-of-order execution, but I'm not sure.  I should be able
to use the profiler to find out once I can run OpenCL on Raven Ridge.
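
For reference, the barrier bit lives in the 16-bit AQL packet header.
A minimal sketch of building a dispatch header with the standard
definitions from the HSA runtime's hsa.h (the enums are from the HSA
1.x spec; whether the ACE firmware reorders around it is exactly what
I'm asking):

    /* Build an AQL kernel-dispatch packet header, optionally setting
     * the barrier bit.  With the bit set, the packet processor must
     * complete all preceding packets in the queue before launching
     * this one. */
    #include <stdint.h>
    #include <hsa/hsa.h>

    static uint16_t dispatch_header(int barrier)
    {
        uint16_t header = 0;
        header |= HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
        header |= (uint16_t)(barrier ? 1 : 0) << HSA_PACKET_HEADER_BARRIER;
        header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
        header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
        return header;
    }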

>>
>>4. ACE takes 3 pipe priorities (low, medium, and high), even though AQL
>>queues have 7 priority levels, right?
>
> Yes-ish. Remember that there are multiple levels of scheduling going on here. At
> any given time a pipe is only processing work from one of the queues; queue
> priorities affect the pipe's round-robin-ing between queues in a way that I have
> managed to forget (but will try to find). There is a separate pipe priority, which
> IIRC is actually programmed per queue and takes effect when the pipe is active
> on that queue. There is also a global (IIRC) setting which adjusts how compute
> work and graphics work are prioritized against each other, giving options like
> making all compute lower priority than graphics or making only high priority
> compute get ahead of graphics.
>
> I believe the pipe priority is also referred to as SPI priority, since it affects
> the way SPI decides which pipe (graphics/compute) to accept work from
> next.
>
> This is all a bit complicated by a separate (global IIRC) option which randomizes
> priority settings in order to avoid deadlock in certain conditions. We used to
> have that enabled by default (believe it was needed for specific OpenCL
> programs) but not sure if it is still enabled - if so then most of the above gets
> murky because of the randomization.
>
> At first glance we do not enable randomization for Polaris or Vega but do for
> all of the older parts. Haven't looked at Raven yet.

Thanks for providing these details!
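
For completeness, the user-visible knob I've found is the AMD
extension in the ROCm runtime.  A minimal sketch, assuming
hsa_amd_queue_set_priority() from hsa_ext_amd.h is what ends up
programming the per-queue pipe priority described above (the mapping
of LOW/NORMAL/HIGH onto the three pipe priorities is my assumption):

    /* Request a high pipe priority for an existing HSA queue. */
    #include <hsa/hsa.h>
    #include <hsa/hsa_ext_amd.h>

    static hsa_status_t raise_queue_priority(hsa_queue_t *queue)
    {
        return hsa_amd_queue_set_priority(queue, HSA_AMD_QUEUE_PRIORITY_HIGH);
    }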

>
>>
>>5. Is this patent (https://patents.google.com/patent/US8933942B2/)
>>implemented?  How to set resource allocation percentage for
>>commands/queues?
>
> I don't remember seeing that being implemented in the drivers.
>
>>
>>If these features work well, I'm confident AMD GPUs can provide very good
>>real-time predictability.
>>
>>
>>Thanks,
>>Ming

Thanks,
Ming

