[Mesa-dev] [PATCH v3 24/25] panfrost: Support batch pipelining

Boris Brezillon boris.brezillon at collabora.com
Fri Sep 6 12:03:31 UTC 2019


On Fri, 6 Sep 2019 07:40:17 -0400
Alyssa Rosenzweig <alyssa at rosenzweig.io> wrote:

> I think we can simplify `panfrost_flush_draw_deps`. We need to flush
> any BOs that write where we read/write and any BOs that read where we
> write. Since we collect this information via add_bo, we can
> implement this logic generically, without requiring a special case
> for every kind of BO we might need to flush, which is verbose and easy
> to forget when adding new BOs later. You might need some extra tables in
> panfrost_batch.

With the current design, where deps are flushed before issuing a
draw/clear job, the existing add_bo() calls happen too late. That being
said, we could add BOs earlier and store the type of access in
batch->bos (turning it into a hash table whose key is the BO and whose
data contains the access flags). With that in place, we'd be able to
automatically add BOs to the ctx->{write,read}_bos hash tables.
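
Roughly what I have in mind, just as a sketch (the PAN_BO_ACCESS_* flags
and the exact helper shape are made up, refcounting elided):

#include <stdint.h>

#include "util/hash_table.h"

/* Illustrative access flags, not existing driver code. */
#define PAN_BO_ACCESS_READ  (1 << 0)
#define PAN_BO_ACCESS_WRITE (1 << 1)

static void
panfrost_batch_add_bo(struct panfrost_batch *batch,
                      struct panfrost_bo *bo,
                      uint32_t access)
{
        /* batch->bos would be created with
         * _mesa_hash_table_create(batch, _mesa_hash_pointer,
         *                         _mesa_key_pointer_equal). */
        struct hash_entry *entry = _mesa_hash_table_search(batch->bos, bo);

        if (entry) {
                /* Accumulate the flags so flush can tell readers from
                 * writers without per-BO-type special cases. */
                entry->data = (void *)((uintptr_t)entry->data | access);
        } else {
                _mesa_hash_table_insert(batch->bos, bo,
                                        (void *)(uintptr_t)access);
        }
}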

Now, if we go for the dep graph solution, that's probably a non-issue,
since deps can be added at any point as long as they are described
before the flush happens.
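
In that case the tracking could look something like this (again just a
sketch: ctx->writers and panfrost_batch_add_dep() are hypothetical, and
it reuses the made-up PAN_BO_ACCESS_* flags from above):

static void
panfrost_batch_track_access(struct panfrost_context *ctx,
                            struct panfrost_batch *batch,
                            struct panfrost_bo *bo,
                            uint32_t access)
{
        /* Whoever last wrote this BO must execute before us, whether we
         * read or write it (read-after-write / write-after-write). */
        struct hash_entry *w = _mesa_hash_table_search(ctx->writers, bo);

        if (w && w->data != batch)
                panfrost_batch_add_dep(batch, w->data);

        /* If we write the BO, later readers/writers will pick us up as
         * their dependency through ctx->writers. A write-after-read
         * hazard would additionally need a per-BO reader list. */
        if (access & PAN_BO_ACCESS_WRITE)
                _mesa_hash_table_insert(ctx->writers, bo, batch);
}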

> 
> ----
> 
> On design more generally:
> 
> I don't think we want to trigger any flushes at draw time. Rather, we
> want to trigger at flush time. Conceptually, right before we send a
> batch to the GPU, we ensure all of the other batches it needs have been
> sent first and there is a barrier between them (via wait_bo).

I agree, and actually had this rework on my TODO list.

> 
> The first consequence of delaying is that CPU-side logic can proceed
> without being stalled on results.
> 
> The second is that different batches can be _totally_ independent.
> Consider an app that does the following passes:
> 
> [FBO 1: Render a depth map of an object ]
> [FBO 2: Render a normal map of that object ]
> [Scanout: Render using the depth/normal maps as textures ]
> 
> In this case, the app should generate CPU-side batches for all three
> render targets at once. Then, when flush() is called, fbo #1 and fbo #2
> should be submitted and waited upon so they execute concurrently, then
> scanout is submitted and waited.

Yes, I also thought about that. We'd need to move the out_sync object
to the batch to make that possible, but that's definitely an
improvement I had in mind.
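
Something along these lines, assuming batch->out_sync becomes a plain
syncobj handle (names are illustrative):

#include <stdint.h>
#include <xf86drm.h>

static void
panfrost_batch_init_out_sync(int drm_fd, struct panfrost_batch *batch)
{
        /* One syncobj per batch, signalled when the batch's jobs
         * complete, instead of a single context-wide out_sync. */
        drmSyncobjCreate(drm_fd, 0, &batch->out_sync);
}

static void
panfrost_batch_wait(int drm_fd, struct panfrost_batch *batch)
{
        /* Waiting on one batch no longer serializes on every other
         * batch submitted before it. */
        drmSyncobjWait(drm_fd, &batch->out_sync, 1, INT64_MAX, 0, NULL);
}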

> This should be a little faster,
> especially paired with _NEXT changes in the kernel. CC'ing Steven to
> ensure the principle is sound.

Haven't looked at that patch yet.

> 
> We can model this with a dependency graph, where batches are nodes and
> the dependency of a batch X on a batch Y is represented as an edge from
> Y to X. So this is a directed graph. For well-behaved apps, the
> graph must be acyclic (a cycle would mean a batch depends, directly or
> indirectly, on its own output).
> 
> This touches on the idea of topological sorting: a topological sort of
> the dependency graph is a valid order to submit the batches in. So
> hypothetically, you could delay everything to flush time, construct the
> dependency graph, do a topological sort, and then submit/wait in the
> order of the sort.
> 
> But more interesting will be to extend to the concurrent FBO case, an
> algorithm for which follows simply from topological sorting:
> 
> ---
> 
> 0. Create the dependency graph. Cull nodes that are not connected to the
> node we're trying to flush (the scanout batch). In other words, reduce
> the graph to its component containing the flushed node. See also
> https://en.wikipedia.org/wiki/Connected_component_(graph_theory)#Algorithms
> 
> 1. For each node with no incoming edges (=batch with no dependencies),
> submit this batch. Remove it from the dependency graph, removing all
> outgoing edges. Add it to a set of submitted batches.
> 
> 2. For each submitted batch, wait on that batch.
> 
> 3. Jump back to step #1 until there are no more nodes with no incoming
> edges.
> 
> ---
> 
> Intuitively, the idea is "submit as much as we can all at once, then
> wait for it. Keep doing that until we submitted everything we need."
> 
> A bit more formally, nodes with no incoming edges have no unsatisfied
> dependencies by definition, so we can submit them in any order. We
> choose to submit these first. We are allowed to submit a wait at any
> time. Once we wait on a batch, it is complete, so any batches that
> depend on it have that dependency satisfied, represented by removing the
> edge from the dependency graph.
> 
> Do note the subtlety of the termination condition: no more nodes with
> no incoming edges. This makes proving that the algorithm halts easy,
> since every iteration either removes a node or halts, and there is
> only a finite number of nodes.
> 
> * Whether this is a useful optimization is greatly dependent on the
>   hardware. The Arm guys can chime in here, but I do know the GPU has
>   some parallel execution capabilities so this shouldn't be a total
>   waste.

Thanks for the detailed explanation. I'll look into that. This being
said, I was wondering if we shouldn't merge this patch (after
addressing your first comment, maybe) before getting into the more
advanced solution (which I agree is what we should aim for).
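
Just to make sure I get the idea right, the flush path would then look
roughly like this (pure sketch; the extra panfrost_batch fields and the
submit/wait helpers are made up, and the step-0 culling of batches
unrelated to the flushed one is left out):

#include <stdbool.h>

struct panfrost_batch {
        /* Batches this one depends on (incoming edges). */
        struct panfrost_batch **deps;
        unsigned ndeps;
        bool submitted;
        bool done;
};

/* e.g. the SUBMIT ioctl, and a syncobj wait on the batch's out_sync. */
void batch_submit(struct panfrost_batch *batch);
void batch_wait(struct panfrost_batch *batch);

static bool
batch_deps_done(const struct panfrost_batch *batch)
{
        for (unsigned i = 0; i < batch->ndeps; i++) {
                if (!batch->deps[i]->done)
                        return false;
        }

        return true;
}

static void
flush_batches(struct panfrost_batch **batches, unsigned count)
{
        bool progress = true;

        while (progress) {
                progress = false;

                /* 1. Submit every batch whose deps are all done, i.e.
                 *    every node with no remaining incoming edges. */
                for (unsigned i = 0; i < count; i++) {
                        struct panfrost_batch *b = batches[i];

                        if (b->submitted || !batch_deps_done(b))
                                continue;

                        batch_submit(b);
                        b->submitted = true;
                        progress = true;
                }

                /* 2. Wait on the wave we just submitted; marking a
                 *    batch done is the "remove its outgoing edges"
                 *    step, so its dependents become submittable on the
                 *    next iteration. */
                for (unsigned i = 0; i < count; i++) {
                        struct panfrost_batch *b = batches[i];

                        if (b->submitted && !b->done) {
                                batch_wait(b);
                                b->done = true;
                        }
                }
        }
}

One nice property of writing it as a "no progress" loop is that it also
terminates cleanly if an app ever manages to create a cycle: we simply
stop instead of deadlocking.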

