[PATCH] drm/panfrost: Add "compute shader only" hint

Steven Price steven.price at arm.com
Wed Aug 7 11:11:44 UTC 2019


On 06/08/2019 21:25, Alyssa Rosenzweig wrote:
>>> It's not obvious to me when it actually needs to be enabled. Besides the
>>> errata, it's only when... device_nr=1 for a compute-only job in kbase?
>>>
>>> I'm afraid I don't know nearly enough about how kbase plumbs CL to grok
>>> the significance...
>>
>> Figuring out the nr_core_groups was the complicated part of this as I
>> recall. Seems like we should at least figure out if we (or will need)
>> PANFROST_JD_REQ_CORE_GRP_MASK added to the UAPI as well.
> 
> I suspect this is something OpenCL/Vulkan specific. Hopefully Stephen
> can shine some light here :)

*switches torch on*...

Ok, this is actually a lot more complex than it first appears, so I'll
have to start with a bit of background:

Mali Midgard GPUs have 2 "thread creators" per core. There is one for
fragment threads and one for 'compute' threads (vertex work is
considered compute in this context).

Each core has a fixed number of threads (e.g. 256 for the early GPUs)
which get divided between fragment and compute threads - allocation is
effectively round-robin between the two thread creators, but I think
there's some extra 'magic' in the hardware.

The idea is that for graphics you can run fragment and vertex workloads
at the same time on the core and make better use of the hardware (i.e.
fragment threads are using the texturing hardware while vertex threads
are using the ALUs).

However two things stand in the way of this working nicely:

1. Core groups - this is a lovely design feature for hardware engineers,
but a pain for software. Basically you can have multiple sets of cores.
The cores in a set are coherent with each other, but they are not
coherent between sets. This is because each core group has its own L2
cache.

To complicate things even further, the tiler notionally exists within
core group 0, so is only coherent with that core group. This means that
if you have a vertex/tiler job chain it has to be run entirely within
core group 0 - or you will need to insert appropriate cache flushes. For
fragment work you generally don't need coherency between threads so this
isn't a problem and you can run over all the cores in all groups.

For compute (i.e. OpenCL) you probably care about coherency in a work
group, but you may have several independent jobs that can run in
parallel. In this case you can run some (coherent) work on core group 0,
and some other (independent but coherent) work on core group 1.
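
To make that concrete, here's a rough sketch (plain C; the even-split
rule is an assumption on my part - the real driver derives the groups
from the GPU's coherency properties rather than guessing like this) of
how SHADER_PRESENT might be divided into one affinity mask per core
group:

#include <stdint.h>

struct core_groups {
	uint64_t mask[2];	/* shader cores in group 0 / group 1 */
	unsigned int count;
};

static void split_core_groups(uint64_t shader_present, uint64_t l2_present,
			      struct core_groups *out)
{
	if (__builtin_popcountll(l2_present) == 1) {
		/* One logical L2: everything is coherent, single group. */
		out->mask[0] = shader_present;
		out->mask[1] = 0;
		out->count = 1;
		return;
	}

	/* Two core groups: assume the first half of the cores (plus the
	 * tiler) sit behind L2 0 and the rest behind L2 1. */
	int remaining = __builtin_popcountll(shader_present) / 2;
	uint64_t group0 = 0;

	for (unsigned int bit = 0; remaining > 0; bit++) {
		if (shader_present & (1ULL << bit)) {
			group0 |= 1ULL << bit;
			remaining--;
		}
	}

	out->mask[0] = group0;			/* coherent with the tiler */
	out->mask[1] = shader_present & ~group0;
	out->count = 2;
}

A vertex/tiler chain would then have to stay within mask[0], while two
independent compute chains could be given mask[0] and mask[1]
respectively.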

2. Starvation. For compute work it's common to insert barriers requiring
all threads to reach the same point in the shader before any thread can
progress. If your workgroup size (i.e. the number of threads which
synchronise on the barrier) is the same as the number of threads in the
core this means that all threads have to be allocated to compute before
the barrier can complete.

However if the compute thread creator is competing with the fragment
thread creator this can lead to the situation where compute threads are
idle waiting for fragment threads to complete.

This implies that running compute workloads with barriers at the same
time as fragment work on the same cores is far from optimal.
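
As a toy illustration of the arithmetic (the 256 figure is the per-core
thread count mentioned above; everything else is invented):

#define THREADS_PER_CORE 256

/*
 * A barrier in a workgroup of workgroup_size threads can only complete
 * once every thread of the workgroup is resident on the core. If
 * fragment threads already hold some of the slots, a full-size
 * workgroup can never assemble and the compute job stalls until the
 * fragment work drains.
 */
static int barrier_can_complete(int workgroup_size,
				int slots_held_by_fragment)
{
	int slots_free_for_compute =
		THREADS_PER_CORE - slots_held_by_fragment;

	return workgroup_size <= slots_free_for_compute;
}

/* barrier_can_complete(256, 0) -> 1: compute owns the whole core */
/* barrier_can_complete(256, 1) -> 0: one resident fragment thread is
 * enough to block a 256-thread workgroup */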

</end of background>

kbase has several flags:

 * BASE_JD_REQ_COHERENT_GROUP - the job chain must be run on a coherent
set of cores. I.e. must be restricted to a single core group.

 * BASE_JD_REQ_ONLY_COMPUTE - the job chain is compute jobs and may
contain barriers.

 * BASE_JD_REQ_SPECIFIC_COHERENT_GROUP - we care about being on a
particular core group. device_nr is used to select which one (device_nr
is otherwise ignored). See the sketch below for how these combine.
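
For example a client wanting to run an independent, coherent compute
chain on core group 1 would submit something like the following (the
atom field names are from the kbase UAPI as I remember it, the exact
flag combination may not be precisely what the driver expects, and
compute_chain_gpu_va is just a placeholder for the job chain address):

struct base_jd_atom_v2 atom = { 0 };

atom.jc = compute_chain_gpu_va;		/* GPU VA of the job chain */
atom.core_req = BASE_JD_REQ_ONLY_COMPUTE |
		BASE_JD_REQ_COHERENT_GROUP |
		BASE_JD_REQ_SPECIFIC_COHERENT_GROUP;
atom.device_nr = 1;			/* core group 1, not group 0 */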

In practice all this only really matters on the T62x GPU. All other GPUs
have only one core group[1]. So it only really makes sense to use JS2 on
the T62x where you want to use both JS1 and JS2 to run two independent
jobs: one on each core group.

Of course kbase makes all this into a maze of twisty little passages,
all alike! :)

Oh, and there is one hardware workaround (BASE_HW_ISSUE_8987) that uses
JS2. This is to avoid vertex and compute jobs landing on the same slot.
This affects T604 "dev15" only and is because some state was not
properly cleared between jobs.
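
Putting the two JS2 cases together, the slot selection ends up looking
roughly like this (a loose paraphrase of kbase's logic, not the actual
code):

static int select_job_slot(struct kbase_device *kbdev,
			   struct kbase_jd_atom *katom)
{
	if (katom->core_req & BASE_JD_REQ_FS)
		return 0;	/* fragment work always goes to JS0 */

	if (katom->core_req & BASE_JD_REQ_ONLY_COMPUTE) {
		if (kbdev->gpu_props.num_core_groups >= 2 &&
		    (katom->core_req & BASE_JD_REQ_SPECIFIC_COHERENT_GROUP) &&
		    katom->device_nr == 1)
			return 2;	/* second core group -> JS2 */

		if (kbase_hw_has_issue(kbdev, BASE_HW_ISSUE_8987))
			return 2;	/* keep compute off the vertex slot */
	}

	return 1;	/* vertex/tiler and ordinary compute share JS1 */
}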

Steve

[1] There might be multiple L2 caches in hardware, but they are coherent
and are logically a single L2 (only 1 bit set in L2_PRESENT).

