[Intel-gfx] igt/gem_exec_nop parallel test: why it isn't useful

Thu Sep 1 16:51:09 UTC 2016

The gem_exec_nop test generally works by submitting batches to an engine 
as fast as possible for a fixed time, then finally calling gem_sync() to 
wait for the last submitted batch to complete. The time-per-batch is 
then calculated as the total elapsed time, divided by the total number 
of batches submitted.
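
As a rough illustration of that measurement scheme, the loop below submits as fast as it can for a fixed time, waits once at the end, and divides. It is a sketch in Python rather than the test's own C code; submit() and sync() are hypothetical stand-ins for the execbuf call and gem_sync(), not real IGT bindings.

```python
import time

def time_per_batch(submit, sync, duration_s):
    """Submit as fast as possible for duration_s, then wait once at the end.

    submit and sync are hypothetical callables standing in for the
    execbuf call and gem_sync(); they are assumptions of this sketch.
    """
    count = 0
    start = time.monotonic()
    while time.monotonic() - start < duration_s:
        submit()              # queue one more batch, no waiting
        count += 1
    sync()                    # wait only for the last submitted batch
    elapsed = time.monotonic() - start
    # Total elapsed time divided by total batches: per-submission
    # costs are averaged together, whatever path each one took.
    return elapsed / count, count
```

Note that the result is a single average, so it cannot distinguish a cheap submission from an expensive one.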

The problem with this approach as a measurement of driver overhead, or 
latency (or anything else) is that the amount of work involved in 
submitting a batch is not a simple constant; in particular, it depends 
on the state of the various queues in the execution path. And it has the 
rather strange characteristic that if the GPU runs slightly faster, the 
driver may go much slower!

The main reason here is the lite-restore mechanism, although it 
interacts with dual-submission and the details of handling the 
completion interrupt. In particular, lite-restore means that it can be 
much cheaper to add a request to an engine that's already (or still) 
busy with a previous request than to send a new request to an idle engine.

For example, imagine that it takes the test/CPU/driver combination 2us to prepare 
a request up to the point of submission, but another 4us to push it into 
the submission port. Also assume that once started, this batch takes 3us 
to execute on the GPU, and handling the completion takes the driver 
another 2us of CPU time. Then the stream of requests will produce a 
pattern like this:

t0:      batch 1: 6us from user to h/w (idle->busy)
t0+6us:  GPU now running batch 1
t0+8us:  batch 2: 2us from user to queue (not submitted)
t0+9us:  GPU finished; IRQ handler samples queue (batch 2)
t0+10us: batch 3: 2us from user to queue (not submitted)
t0+11us: IRQ handler submits tail of batch 2
t0+12us: batch 4: 2us from user to queue (not submitted)
t0+14us: batch 5: 2us from user to queue (not submitted)
t0+15us: GPU now running batch 2
t0+16us: batch 6: 2us from user to queue (not submitted)
t0+18us: GPU finished; IRQ handler samples queue (batch 6)
t0+18us: batch 7: 2us from user to queue (not submitted)
t0+20us: batch 8: 2us from user to queue (not submitted)
t0+20us: IRQ handler coalesces requests, submits tail of batch 6
t0+20us: batch 9: 2us from user to queue (not submitted)
t0+22us: batch 10: 2us from user to queue (not submitted)
t0+24us: GPU now running batches 3-6
t0+24us: batch 11: 2us from user to queue (not submitted)
t0+26us: batch 12: 2us from user to queue (not submitted)
t0+28us: batch 13: 2us from user to queue (not submitted)
t0+30us: batch 14: 2us from user to queue (not submitted)
t0+32us: batch 15: 2us from user to queue (not submitted)
t0+34us: batch 16: 2us from user to queue (not submitted)
t0+36us: GPU finished; IRQ handler samples queue (batch 16)
t0+36us: batch 17: 2us from user to queue (not submitted)
t0+38us: batch 18: 2us from user to queue (not submitted)
t0+38us: IRQ handler coalesces requests, submits tail of batch 16
t0+40us: batch 19: 2us from user to queue (not submitted)
t0+42us: batch 20: 2us from user to queue (not submitted)
t0+42us: GPU now running batches 7-16

Thus, after the first few, *all* requests will be coalesced, and only a 
few of them will incur the overhead of writing to the ELSP or handling a 
context-complete interrupt. With the CPU generating a new batch every 
2us and the GPU taking 3us/batch to execute them, the queue of 
outstanding requests will get longer and longer until the ringbuffer is 
nearly full, but the write to the ELSP will happen ever more rarely.

When we measure the overall time for the process, we will find the 
result is 3us/batch, i.e. the GPU batch execution time. The coalescing 
means that all the driver *and hardware* overheads are *completely* hidden.
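
The pattern above can be reproduced with a small toy model. This is only a sketch under stated assumptions: it uses the 2us/4us/3us/2us figures from the example, assumes the IRQ handler runs on another CPU so it never stalls the submitter, and treats the engine as busy while the IRQ handler is still preparing the next submission. The function and parameter names are invented for illustration. Because it keeps a strict 2us cadence, its later groups can differ by a batch from the hand-written timeline.

```python
def simulate(n_batches, prep=2.0, push=4.0, gpu=3.0, irq=2.0):
    """Toy model of one engine; returns (us per batch, number of ELSP writes)."""
    t = 0.0              # clock of the submitting thread
    busy_until = None    # when the engine next completes; None means idle
    pending = 0          # requests queued but not yet written to the ELSP
    elsp_writes = 0
    for _ in range(n_batches):
        t += prep        # request prepared and ready to queue
        # Process completions that happened up to now (IRQ handler,
        # assumed to run concurrently with the submitter).
        while busy_until is not None and busy_until <= t:
            if pending:
                start = busy_until + irq + push   # one coalesced submission
                busy_until = start + gpu * pending
                pending = 0
                elsp_writes += 1
            else:
                busy_until = None                 # nothing queued: engine idles
        if busy_until is None:
            t += push                             # idle->busy: submitter writes ELSP
            busy_until = t + gpu
            elsp_writes += 1
        else:
            pending += 1                          # cheap path: just queue it
    # Equivalent of the final gem_sync(): drain whatever is still queued.
    if pending:
        busy_until = busy_until + irq + push + gpu * pending
        elsp_writes += 1
    return busy_until / n_batches, elsp_writes

per_batch, writes = simulate(1000)
print(f"{per_batch:.3f}us/batch, {writes} ELSP writes")
```

With these numbers the model settles just above 3us/batch for 1000 batches, with only a dozen ELSP writes in total.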

Now consider what happens if the batches are generated and submitted 
slightly slower, only one every 4us:

t1:      batch 1: 6us from user to h/w (idle->busy)
t1+6us:  GPU now running batch 1
t1+9us:  GPU finished; IRQ handler samples queue (empty)
t1+10us: batch 2: 6us from user to h/w (idle->busy)
t1+16us: GPU now running batch 2
t1+19us: GPU finished; IRQ handler samples queue (empty)
t1+20us: batch 3: 6us from user to h/w (idle->busy)
etc

This hits the worst case, where *every* batch submission needs to go 
through the most expensive path (and in doing so, delays the creation of 
the next workload, so we will never get out of this pattern). Our 
measurement will therefore show 10us/batch.
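
The 10us figure can be checked by regenerating the timeline. The sketch below assumes, as the timeline suggests, that each new submission starts 4us after the previous one reached the hardware, and that every submission takes the full 6us idle->busy path; the names are hypothetical.

```python
def worst_case(n_batches, gap=4.0, to_hw=6.0, gpu=3.0):
    """Timeline in which every batch finds the engine idle.

    Returns (events, period) where events is a list of (time_us, text).
    Only valid when the GPU finishes before the next submission starts.
    """
    if gap <= gpu:
        raise ValueError("queue would not be empty at completion")
    events = []
    t = 0.0
    for k in range(1, n_batches + 1):
        events.append((t, f"batch {k}: {to_hw:g}us from user to h/w (idle->busy)"))
        t += to_hw
        events.append((t, f"GPU now running batch {k}"))
        events.append((t + gpu, "GPU finished; IRQ handler samples queue (empty)"))
        t += gap         # next submission starts `gap` after this one hit the h/w
    period = to_hw + gap # time from one submission to the next
    return events, period

events, period = worst_case(3)
for when, what in events:
    print(f"t1+{when:g}us: {what}")
print(f"{period:g}us/batch")
```

The GPU's 3us of execution is hidden inside the gap, so the measured time per batch is set entirely by the CPU-side costs.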

*IF* we didn't have a BKL (one big lock serialising all submission), it 
would be reasonable to expect that a suitable multi-threaded program on 
a CPU with more h/w threads than GPU engines could submit batches on any 
set of engines in parallel, and for each thread and engine, the 
execution time would be essentially independent of which engines were 
running concurrently.

Unfortunately, though, that lock-free scenario is not what we have 
today. The BKL means that only one thread can submit at a time (and in 
any case, the test program isn't multi-threaded). Therefore, if the test 
can generate and submit batches at a rate of one every 2us (as in the 
first scenario above, the "GOOD" one where requests coalesce), but those 
batches are being split across two different engines, it results in an 
effective submission rate of one per 4us on each engine, and flips into 
the second, "BAD" scenario as a result.
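
The arithmetic behind that flip can be sketched as follows, assuming a single submitting thread that alternates strictly between engines. The function names are made up for the illustration; the engine names are borrowed from the test output.

```python
def per_engine_arrivals(n_batches, interval_us, engines):
    """Round-robin one batch per engine; return submit times per engine."""
    arrivals = {name: [] for name in engines}
    for i in range(n_batches):
        # One thread, one submission at a time: batch i is submitted at
        # i * interval_us and goes to the next engine in turn.
        arrivals[engines[i % len(engines)]].append(i * interval_us)
    return arrivals

def mean_gap(times):
    """Average spacing between consecutive submissions to one engine."""
    return (times[-1] - times[0]) / (len(times) - 1)

arrivals = per_engine_arrivals(8, 2.0, ["render", "bsd"])
for name, times in arrivals.items():
    print(f"{name}: one batch every {mean_gap(times):g}us")
```

Each engine sees a new batch only every 4us even though the thread as a whole submits every 2us, which is the rate at which the second timeline applies.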

The conclusion, then, is that the parallel execution part of this test 
as written today isn't really measuring a meaningful quantity, and the 
pass-fail criterion in particular isn't telling us anything useful about 
the overhead (or latency) of various parts of the submission path.

I've written another test variant, which explores the NO-OP execution 
time as a function of both batch buffer size and the number of 
consecutive submissions to the same engine before switching to the next 
(burst size). Typical results look something like this:

IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc4-dsg-00786-g9a8bc43-dsg-test-32 x86_64)
Time to exec 8-byte batch:	  3.136µs (ring=render)
Time to exec 8-byte batch:	  1.294µs (ring=bsd)
Time to exec 8-byte batch:	  1.263µs (ring=blt)
Time to exec 8-byte batch:	  1.276µs (ring=vebox)
Time to exec 8-byte batch:	  1.745µs (ring=all, sequential)
Time to exec 8-byte batch:	  5.605µs (ring=all, parallel/1)
Time to exec 8-byte batch:	  5.583µs (ring=all, parallel/2)
Time to exec 8-byte batch:	  4.780µs (ring=all, parallel/4)
Time to exec 8-byte batch:	  3.870µs (ring=all, parallel/8)
Time to exec 8-byte batch:	  2.883µs (ring=all, parallel/16)
Time to exec 8-byte batch:	  2.155µs (ring=all, parallel/32)
Time to exec 8-byte batch:	  1.560µs (ring=all, parallel/64)
Time to exec 8-byte batch:	  1.221µs (ring=all, parallel/128)
Time to exec 8-byte batch:	  1.302µs (ring=all, parallel/256)
Time to exec 8-byte batch:	  1.417µs (ring=all, parallel/512)
Time to exec 8-byte batch:	  1.624µs (ring=all, parallel/1024)
Time to exec 8-byte batch:	  1.680µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:	 12.588µs (ring=render)
Time to exec 4Kbyte batch:	 11.291µs (ring=bsd)
Time to exec 4Kbyte batch:	 11.837µs (ring=blt)
Time to exec 4Kbyte batch:	 11.355µs (ring=vebox)
Time to exec 4Kbyte batch:	 11.770µs (ring=all, sequential)
Time to exec 4Kbyte batch:	 11.109µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:	 11.094µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:	 11.087µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:	 11.046µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:	 10.984µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:	 10.957µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:	 10.942µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:	 10.928µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:	 11.118µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:	 11.359µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:	 11.562µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:	 11.663µs (ring=all, parallel/2048)

which clearly shows the effect of failing to coalesce (small) requests. 
But even this doesn't really reveal the numbers that would be of most 
interest, i.e. minimum/typical/maximum values for:
1. overhead from execbuf call to submission queue
2. latency from execbuf to h/w execution start (if queue empty)
3. latency from h/w completion to ELSP update
4. overhead of completion processing
5. etc
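
Returning to the parallel figures quoted above, the size of the coalescing effect can be put into numbers by parsing the output and comparing burst sizes. This is a sketch only: the regular expression is written against the lines as they appear in this mail, and the helper name is invented.

```python
import re
from collections import defaultdict

RESULTS = """\
Time to exec 8-byte batch:   5.605µs (ring=all, parallel/1)
Time to exec 8-byte batch:   5.583µs (ring=all, parallel/2)
Time to exec 8-byte batch:   4.780µs (ring=all, parallel/4)
Time to exec 8-byte batch:   3.870µs (ring=all, parallel/8)
Time to exec 8-byte batch:   2.883µs (ring=all, parallel/16)
Time to exec 8-byte batch:   2.155µs (ring=all, parallel/32)
Time to exec 8-byte batch:   1.560µs (ring=all, parallel/64)
Time to exec 8-byte batch:   1.221µs (ring=all, parallel/128)
Time to exec 8-byte batch:   1.302µs (ring=all, parallel/256)
Time to exec 8-byte batch:   1.417µs (ring=all, parallel/512)
Time to exec 8-byte batch:   1.624µs (ring=all, parallel/1024)
Time to exec 8-byte batch:   1.680µs (ring=all, parallel/2048)
Time to exec 4Kbyte batch:  11.109µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:  11.094µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:  11.087µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:  11.046µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:  10.984µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:  10.957µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:  10.942µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:  10.928µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:  11.118µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:  11.359µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:  11.562µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:  11.663µs (ring=all, parallel/2048)
"""

LINE = re.compile(
    r"Time to exec (\S+) batch:\s+([\d.]+)µs \(ring=all, parallel/(\d+)\)")

def summarise(text):
    """Map batch size -> (fastest burst, slowest burst, slowest/fastest ratio)."""
    by_size = defaultdict(dict)
    for m in LINE.finditer(text):
        by_size[m.group(1)][int(m.group(3))] = float(m.group(2))
    summary = {}
    for size, times in by_size.items():
        best = min(times, key=times.get)     # burst size with the lowest time
        worst = max(times, key=times.get)    # burst size with the highest time
        summary[size] = (best, worst, times[worst] / times[best])
    return summary

SUMMARY = summarise(RESULTS)
for size, (best, worst, ratio) in SUMMARY.items():
    print(f"{size}: fastest at burst {best}, slowest at burst {worst}, "
          f"ratio {ratio:.2f}")
```

For the 8-byte batches the slowest burst size is more than four times slower than the fastest, while the 4Kbyte batches vary by well under 10%, consistent with GPU execution time dominating once the batch is large enough.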

.Dave.