[PATCH] drm/vc4: improve throughput by pipelining binning and rendering jobs

Varad Gautam varadgautam at gmail.com
Sun Mar 6 10:20:23 UTC 2016


Hi Eric,

On Sat, Mar 5, 2016 at 7:17 AM, Eric Anholt <eric at anholt.net> wrote:
 > Varad Gautam <varadgautam at gmail.com> writes:
 >
 >>  The hardware provides us with separate threads for binning and
 >>  rendering, and the existing model waits for them both to complete
 >>  before submitting the next job.
 >>
 >>  Splitting the binning and rendering submissions reduces idle time
 >>  and gives us approx 20-30% speedup with several x11perf tests.
 >
 > This patch is:
 >
 > Reviewed-by: Eric Anholt <eric at anholt.net>
 >
 > Which tests did you find improved, specifically?  I'm seeing openarena
 > improved by 1.01897% +/- 0.247857% (n=16).  x11perf -aa24text and
 > -copypixwin looked like they had about the same level of improvement.

Here's a sample of the speedups I've noticed with x11perf:

without queue  with queue    % delta  test
-(reps/sec)-   -(reps/sec)-  ---      ---
1840000        2360000       28.26%   10x10 tiled rectangle (17x15 tile)
1920000        2440000       27.08%   10x10 tiled rectangle (4x4 tile)
1340000        1620000       20.90%   10x10 tiled rectangle (216x208 tile)
9900000        11900000      20.20%   10-pixel line
1310000        1570000       19.85%   10x10 tiled rectangle (161x145 tile)
2800000        3270000       16.79%   10x10 rectangle
2720000        3140000       15.44%   100-pixel vertical line segment
876000         1010000       15.30%   100-pixel line segment (2 kids)
199000         229000        15.08%   Circulate Unmapped window (200 kids)
1190000        1350000       13.45%   100-pixel line segment (1 kid)
176000         199000        13.07%   500-pixel line segment
172000         194000        12.79%   500-pixel line
116000         129000        11.21%   Destroy window via parent (100 kids)
2030000        2250000       10.84%   100-pixel horizontal line segment
635000         697000         9.76%   100-pixel line segment (3 kids)

 >
 > This conflicts with a change in -fixes.  I think this means that it
 > should go in -next once -fixes gets pulled in that.
 >
 > Peter Brown had suggested to me at one point that we could queue up
 > multiple jobs at once by patching the last few bytes of the current job
 > to jump to the next one.  I haven't fully thought through how you'd
 > interlock to make sure that the CL wasn't going to execute the old
 > contents before you go to sleep, but it has the promise of being able to
 > mask out the flush/frame done interrupts.

A rough idea is to keep track of the current job's start address (which
may be the previous job's jump destination) and resubmit from there if we
come back from sleep. I'll see if I can build on this.
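To make the idea concrete, here is a minimal userspace sketch of the
bookkeeping I have in mind. All of the names (struct job, resume_address,
the start/end fields) are hypothetical and do not correspond to actual
vc4 driver symbols; it only illustrates picking the restart point from
the last program counter the hardware reported:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch, not real vc4 code: each queued job records the
 * address range of its control list, and jobs are chained in submission
 * order (the chain link models the patched-in jump to the next job). */
struct job {
	uint32_t start;    /* control list start address */
	uint32_t end;      /* one past the last byte of the list */
	struct job *next;  /* next job we patched a jump to, or NULL */
};

/* After coming back from sleep, find the address to resubmit from:
 * the start of whichever job the hardware was executing when it
 * stopped, which may be a previous job's jump destination. Falls back
 * to the head of the chain if last_pc matches no job. */
static uint32_t resume_address(struct job *chain, uint32_t last_pc)
{
	for (struct job *j = chain; j; j = j->next) {
		if (last_pc >= j->start && last_pc < j->end)
			return j->start;  /* restart the interrupted job */
	}
	return chain ? chain->start : 0;
}
```

The interlock Eric mentions would still be needed on top of this, so
that the patched jump is visible to the hardware before we decide the
old contents can no longer be fetched.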

Thanks,
Varad
