[Mesa-dev] [PATCH v3 24/25] panfrost: Support batch pipelining

Steven Price Steven.Price at arm.com
Fri Sep 6 14:11:26 UTC 2019


On Fri, 2019-09-06 at 07:40 -0400, Alyssa Rosenzweig wrote:
> I think we can simplify `panfrost_flush_draw_deps`. We need to flush
> any batches that write BOs we read or write, and any batches that
> read BOs we write. Since we collect this information via add_bo, we can
> implement this logic generically, without requiring a special case
> for every kind of BO we might need to flush, which is verbose and easy
> to forget when adding new BOs later. You might need some extra tables in
> panfrost_batch.
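
Something along these lines would cover both directions generically (an
untested sketch with invented names, using Mesa's util_dynarray; not the
actual panfrost code, and the readers array is assumed to be initialised
with util_dynarray_init elsewhere):

#include "util/u_dynarray.h"

struct panfrost_batch;

void panfrost_batch_flush(struct panfrost_batch *batch);

/* Per-BO access tracking: each BO records the last batch writing it
 * and the batches currently reading it. */
struct panfrost_bo_access {
        struct panfrost_batch *writer;   /* last batch writing this BO */
        struct util_dynarray readers;    /* struct panfrost_batch *    */
};

/* Called from add_bo(): flush conflicting batches, then record this
 * access (bookkeeping for flushed batches elided). */
static void
batch_track_bo_access(struct panfrost_batch *batch,
                      struct panfrost_bo_access *access,
                      bool writes)
{
        /* A batch writing a BO we touch must land first (RAW/WAW). */
        if (access->writer && access->writer != batch)
                panfrost_batch_flush(access->writer);

        if (writes) {
                /* Batches reading a BO we write must land first (WAR). */
                util_dynarray_foreach(&access->readers,
                                      struct panfrost_batch *, reader) {
                        if (*reader != batch)
                                panfrost_batch_flush(*reader);
                }
                util_dynarray_clear(&access->readers);
                access->writer = batch;
        } else {
                util_dynarray_append(&access->readers,
                                     struct panfrost_batch *, batch);
        }
}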
> 
> ----
> 
> On design more generally:
> 
> I don't think we want to trigger any flushes at draw time. Rather, we
> want to trigger at flush time. Conceptually, right before we send a
> batch to the GPU, we ensure all of the other batches it needs have been
> sent first and there is a barrier between them (via wait_bo).
> 
> The first consequence of delaying is that CPU-side logic can proceed
> without being stalled on results.
> 
> The second is that different batches can be _totally_ independent.
> Consider an app that does the following passes:
> 
> [FBO 1: Render a depth map of an object ]
> [FBO 2: Render a normal map of that object ]
> [Scanout: Render using the depth/normal maps as textures ]
> 
> In this case, the app should generate CPU-side batches for all three
> render targets at once. Then, when flush() is called, fbo #1 and fbo #2
> should be submitted and waited upon so they execute concurrently, then
> scanout is submitted and waited. This should be a little faster,
> especially paired with _NEXT changes in the kernel. CC'ing Steven to
> ensure the principle is sound.

Yes, this is how the hardware was designed to be used. The idea is that
the vertex processing can be submitted into the hardware back-to-back
(using the _NEXT registers) and then the fragment shading of e.g. FBO 1
can overlap with the vertex processing of FBO 2.
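
On the driver side the _NEXT mechanism is just a shadowed second set of
job slot registers. Roughly like this (a sketch only: the offsets are
illustrative rather than the real values from the kbase register
headers, and write_reg() stands in for the driver's MMIO accessor):

#include <stdint.h>

#define JOB_SLOT_STRIDE   0x80  /* illustrative */
#define JS_HEAD_NEXT_LO   0x40  /* illustrative */
#define JS_HEAD_NEXT_HI   0x44  /* illustrative */
#define JS_CONFIG_NEXT    0x58  /* illustrative */
#define JS_COMMAND_NEXT   0x60  /* illustrative */
#define JS_COMMAND_START  0x01

void write_reg(uint32_t offset, uint32_t value); /* MMIO stub */

/* Queue a second job chain on a slot while the current one is still
 * executing from the JS_HEAD registers, so the slot can move straight
 * on to the next chain without going idle. */
static void
job_slot_queue_next(int slot, uint64_t chain, uint32_t config)
{
        uint32_t base = slot * JOB_SLOT_STRIDE;

        write_reg(base + JS_HEAD_NEXT_LO, (uint32_t)chain);
        write_reg(base + JS_HEAD_NEXT_HI, (uint32_t)(chain >> 32));
        write_reg(base + JS_CONFIG_NEXT, config);

        /* The queued chain starts as soon as the current one retires
         * (or immediately if the slot is idle). */
        write_reg(base + JS_COMMAND_NEXT, JS_COMMAND_START);
}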

> We can model this with a dependency graph, where batches are nodes and
> the dependency of a batch X on a batch Y is represented as an edge from
> Y to X. So this is a directed graph. For well-behaved apps, the
> graph must be acyclic (why?).

Again, this is how kbase is designed. kbase refers to the hardware job
chains as "atoms" (because we also have "soft-jobs", which are
software-only equivalents executed in the kernel). The base_jd_atom_v2
structure has two dependency slots, and we have a "dependency-only
atom" to allow greater fan-out.

The submission mechanism ensures the graph is acyclic by submitting the
atoms one-by-one and not allowing a dependency on an atom which hasn't
been submitted yet (aside: this is then completely broken by the
existence of other synchronisation mechanisms which can introduce
cycles).
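
For illustration, the dependency-related parts of an atom look roughly
like this (a stripped-down sketch; the real base_jd_atom_v2 in the
kbase UAPI headers has more fields, different constant values and a
fixed ABI layout):

#include <stdint.h>

enum base_jd_dep_type {
        BASE_JD_DEP_TYPE_INVALID = 0,  /* slot unused                 */
        BASE_JD_DEP_TYPE_DATA,         /* consumes the other's output */
        BASE_JD_DEP_TYPE_ORDER,        /* "run after", nothing more   */
};

struct base_dependency {
        uint8_t atom_id;               /* must already be submitted   */
        uint8_t dependency_type;       /* enum base_jd_dep_type       */
};

struct base_jd_atom_sketch {
        uint64_t jc;                   /* GPU address of the job chain;
                                        * none for a dep-only atom    */
        struct base_dependency pre_dep[2]; /* at most two direct deps */
        uint8_t atom_number;           /* id others may depend on     */
        uint32_t core_req;             /* BASE_JD_REQ_DEP marks a
                                        * dependency-only atom, used
                                        * to fan in/out past two edges */
};

/* Atoms are submitted one at a time and pre_dep[] may only name atom
 * ids that have already been submitted, which is what keeps the
 * dependency graph acyclic. */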

> This touches on the idea of topological sorting: a topological sort of
> the dependency graph is a valid order to submit the batches in. So
> hypothetically, you could delay everything to flush time, construct the
> dependency graph, do a topological sort, and then submit/wait in the
> order of the sort.

Be aware that the topological sort needs to be intelligent. For
instance, frames are (usually) logically independent in the dependency
graph, but you really don't want the GPU to do the work for frame N+1
before doing frame N.
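
Concretely, it is enough for the sort to be stable with respect to the
order in which the batches were created: whenever several batches are
ready, pick the oldest first. A sketch, with hypothetical types rather
than the panfrost ones:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct batch {
        unsigned num_deps;          /* unresolved dependency count */
        size_t num_dependents;
        struct batch **dependents;  /* batches that depend on us   */
        bool submitted;
};

void submit_to_gpu(struct batch *b);

/* batches[] is sorted by creation order, oldest first. */
static void
submit_in_topological_order(struct batch **batches, size_t n)
{
        for (size_t done = 0; done < n; done++) {
                /* Pick the *oldest* ready batch, so logically
                 * independent frames still reach the GPU in the order
                 * the app produced them. */
                size_t i;
                for (i = 0; i < n; i++)
                        if (!batches[i]->submitted && !batches[i]->num_deps)
                                break;
                assert(i < n && "cycle in the batch dependency graph");

                struct batch *b = batches[i];
                submit_to_gpu(b);
                b->submitted = true;
                for (size_t j = 0; j < b->num_dependents; j++)
                        b->dependents[j]->num_deps--;
        }
}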

kbase actually has two different types of dependency: "data dependency"
and "order dependency". A "data dependency" is where the job uses the
output of a previous job (the 'real' dependencies). An "order
dependency" is where there is a meaningful order but the jobs are
otherwise independent.

The main benefit of having the two types is that they enable better
recovery in case of errors (e.g. if frame N fails to render, we can
still render frame N+1 even though we have an ordering dependency
between them).
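
In scheduler terms: when an atom completes, its failure propagates only
along data edges. Something like this illustrative sketch (not the
actual kbase implementation):

#include <stdbool.h>

enum dep_type { DEP_DATA, DEP_ORDER };

struct atom {
        bool failed;
        unsigned num_unresolved_deps;
};

void schedule_atom(struct atom *a); /* hand the atom to the hardware */

/* Called when one dependency of 'dependent' completes. */
static void
resolve_dep(struct atom *dependent, enum dep_type type, bool dep_failed)
{
        /* An order dependency only means "run after", so a failed
         * frame N doesn't poison frame N+1; a data dependency means
         * the dependent's input really is garbage. */
        if (type == DEP_DATA && dep_failed)
                dependent->failed = true;

        if (--dependent->num_unresolved_deps == 0 &&
            !dependent->failed)
                schedule_atom(dependent);
}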

Steve

