[Mesa-dev] [PATCH v3 24/25] panfrost: Support batch pipelining

Fri Sep 6 11:40:17 UTC 2019

I think we can simplify `panfrost_flush_draw_deps`. We need to flush
any BOs that write where we read/write and any BOs that read where we
write. Since we collect this information via add_bo, we can
implement this logic generically, without requiring a special case
for every kind of BO we might need to flush, which is verbose and easy
to forget when adding new BOs later. You might need some extra tables in
panfrost_batch.

----

On design more generally:

I don't think we want to trigger any flushes at draw time. Rather, we
want to trigger at flush time. Conceptually, right before we send a
batch to the GPU, we ensure all of the other batches it needs have been
sent first and there is a barrier between them (via wait_bo).

The first consequence of delaying is that CPU-side logic can proceed
without being stalled on results.

The second is that different batches can be _totally_ independent.
Consider an app that does the following passes:

[FBO 1: Render a depth map of an object ]
[FBO 2: Render a normal map of that object ]
[Scanout: Render using the depth/normal maps as textures ]

In this case, the app should generate CPU-side batches for all three
render targets at once. Then, when flush() is called, fbo #1 and fbo #2
should be submitted and waited upon so they execute concurrently, then
scanout is submitted and waited. This should be a little faster,
especially paired with _NEXT changes in the kernel. CC'ing Steven to
ensure the principle is sound.

We can model this with a dependency graph, where batches are nodes and
the dependency of a batch X on a batch Y is represented as an edge from
Y to X. So this is a directed arrow graph. For well-behaved apps, the
graph must be acyclic (why?).

This touches on the idea of topological sorting: a topological sort of
the dependency graph is a valid order to submit the batches in. So
hypothetically, you could delay everything to flush time, construct the
dependency graph, do a topological sort, and then submit/wait in the
order of the sort.

But more interesting will be to extend to the concurrent FBO case, an
algorithm for which follows simply from topological sorting:

---

0. Create the dependency graph. Cull nodes that are not connected to the
node we're trying to flush (the scanout batch). In other words, reduce
the graph to its component containing the flushed node. See also
https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms

1. For each node with no incoming edges (=batch with no dependencies),
submit this batch. Remove it from the dependency graph, removing all
outgoing edges. Add it to a set of submitted batches.

3. For each submitted batch, wait on that batch.

4. Jump back to step #1 until there are no more nodes with no incoming
edges.

---

Intuitively, the idea is "submit as much as we can all at once, then
wait for it. Keep doing that until we submitted everything we need."

A bit more formally, nodes with no edges have no unsatisfied
dependencies by definition, so we can submit them in any order. We
choose to submit these first. We are allowed to submit a wait at any
time. Once we wait on a batch, it is complete, so any batches that
depend on it have that dependency satisfied, represented by removing the
edge from the dependency graph.

Do note that the subtlety of the termination condition: no more nodes
with no incoming edges. This makes proving that the algorithm halts
easy, since every iteration either removes a node or halts, and there
are a finite integral non-negative number of nodes.

* Whether this is a useful optimization is greatly dependent on the
  hardware. The Arm guys can chime in here, but I do know the GPU has
  some parallel execution capabilities so this shouldn't be a total
  waste.