[Intel-gfx] igt/gem_exec_nop parallel test: why it isn't useful
Dave Gordon
david.s.gordon at intel.com
Thu Sep 1 16:51:09 UTC 2016
The gem_exec_nop test generally works by submitting batches to an engine
as fast as possible for a fixed time, then finally calling gem_sync() to
wait for the last submitted batch to complete. The time-per-batch is
then calculated as the total elapsed time, divided by the total number
of batches submitted.
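As a rough sketch of that measurement loop (not the IGT source: `submit` and `sync` are hypothetical stand-ins for the execbuf call and gem_sync(), and the `clock` parameter exists only so the loop can be exercised without real hardware):

```python
import time


def time_per_batch(submit, sync, duration_s, clock=time.monotonic):
    """Submit as fast as possible for duration_s, sync once, return seconds per batch."""
    start = clock()
    count = 0
    while True:
        submit()                      # hypothetical: queue one no-op batch
        count += 1
        if clock() - start >= duration_s:
            break
    sync()                            # hypothetical: wait for the last batch to complete
    elapsed = clock() - start         # includes the final wait
    return elapsed / count
```

Note that the result is a single average: it cannot distinguish time spent in the driver from time spent waiting for the GPU.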
The problem with this approach as a measurement of driver overhead or
latency (or anything else) is that the amount of work involved in
submitting a batch is not a simple constant; in particular, it depends
on the state of the various queues in the execution path. It also has
the rather strange characteristic that if the GPU runs slightly faster,
the driver may go much slower!
The main reason here is the lite-restore mechanism, although it
interacts with dual-submission and the details of handling the
completion interrupt. In particular, lite-restore means that it can be
much cheaper to add a request to an engine that's already (or still)
busy with a previous request than to send a new request to an idle engine.
For example, imagine that it takes the test/CPU/driver 2us to prepare
a request up to the point of submission, but another 4us to push it into
the submission port. Also assume that, once started, this batch takes 3us
to execute on the GPU, and that handling the completion takes the driver
another 2us of CPU time. Then the stream of requests will produce a
pattern like this:
t0: batch 1: 6us from user to h/w (idle->busy)
t0+6us: GPU now running batch 1
t0+8us: batch 2: 2us from user to queue (not submitted)
t0+9us: GPU finished; IRQ handler samples queue (batch 2)
t0+10us: batch 3: 2us from user to queue (not submitted)
t0+11us: IRQ handler submits tail of batch 2
t0+12us: batch 4: 2us from user to queue (not submitted)
t0+14us: batch 5: 2us from user to queue (not submitted)
t0+15us: GPU now running batch 2
t0+16us: batch 6: 2us from user to queue (not submitted)
t0+18us: GPU finished; IRQ handler samples queue (batch 6)
t0+18us: batch 7: 2us from user to queue (not submitted)
t0+20us: batch 8: 2us from user to queue (not submitted)
t0+20us: IRQ handler coalesces requests, submits tail of batch 6
t0+20us: batch 9: 2us from user to queue (not submitted)
t0+22us: batch 10: 2us from user to queue (not submitted)
t0+24us: GPU now running batches 3-6
t0+24us: batch 11: 2us from user to queue (not submitted)
t0+26us: batch 12: 2us from user to queue (not submitted)
t0+28us: batch 13: 2us from user to queue (not submitted)
t0+30us: batch 14: 2us from user to queue (not submitted)
t0+32us: batch 15: 2us from user to queue (not submitted)
t0+34us: batch 16: 2us from user to queue (not submitted)
t0+36us: GPU finished; IRQ handler samples queue (batch 16)
t0+36us: batch 17: 2us from user to queue (not submitted)
t0+38us: batch 18: 2us from user to queue (not submitted)
t0+38us: IRQ handler coalesces requests, submits tail of batch 16
t0+40us: batch 19: 2us from user to queue (not submitted)
t0+42us: batch 20: 2us from user to queue (not submitted)
t0+42us: GPU now running batches 7-16
Thus, after the first few, *all* requests will be coalesced, and only a
few of them will incur the overhead of writing to the ELSP or handling a
context-complete interrupt. With the CPU generating a new batch every
2us and the GPU taking 3us/batch to execute them, the queue of
outstanding requests will get longer and longer until the ringbuffer is
nearly full, but the write to the ELSP will happen ever more rarely.
When we measure the overall time for the process, we will find the
result is 3us/batch, i.e. the GPU batch execution time. The coalescing
means that all the driver *and hardware* overheads are *completely* hidden.
Now consider what happens if the batches are generated and submitted
slightly slower, only one every 4us:
t1: batch 1: 6us from user to h/w (idle->busy)
t1+6us: GPU now running batch 1
t1+9us: GPU finished; IRQ handler samples queue (empty)
t1+10us: batch 2: 6us from user to h/w (idle->busy)
t1+16us: GPU now running batch 2
t1+19us: GPU finished; IRQ handler samples queue (empty)
t1+20us: batch 3: 6us from user to h/w (idle->busy)
etc
This hits the worst case, where *every* batch submission needs to go
through the most expensive path (and in doing so, delays the creation of
the next workload, so we will never get out of this pattern). Our
measurement will therefore show 10us/batch.
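The two timelines above can be reproduced with a small toy model. This is only a sketch under the assumptions of the example (2us prepare, 4us port write, 3us execution, 2us completion handling); `gap_us`, the idle time between one submission and the next, is a parameter invented here, and ring-buffer limits and dual-submission are ignored:

```python
def simulate(n_batches, gap_us, prep_us=2.0, push_us=4.0, exec_us=3.0, irq_us=2.0):
    """Toy model of one engine; returns (elapsed us per batch, port writes)."""
    cpu = 0.0          # time at which the submitting thread starts the next request
    gpu_done = None    # time the work currently on the GPU completes; None when idle
    queued = 0         # requests in the software queue, not yet on hardware
    port_writes = 0    # number of (expensive) writes to the submission port

    for _ in range(n_batches):
        t = cpu + prep_us                    # request is ready to submit at time t
        # Handle every completion up to time t: the handler coalesces whatever
        # is queued into one port write, or leaves the engine idle.
        while gpu_done is not None and gpu_done <= t:
            if queued:
                gpu_done += irq_us + push_us + queued * exec_us
                port_writes += 1
                queued = 0
            else:
                gpu_done = None
        if gpu_done is None:
            t += push_us                     # idle engine: submitter pays for the port write
            gpu_done = t + exec_us
            port_writes += 1
        else:
            queued += 1                      # busy engine: cheap, just queue it
        cpu = t + gap_us

    if queued:                               # final sync: drain what is still queued
        gpu_done += irq_us + push_us + queued * exec_us
        port_writes += 1
    return gpu_done / n_batches, port_writes


print(simulate(10000, gap_us=0.0))   # back-to-back submission
print(simulate(1000, gap_us=4.0))    # 4us pause after each submission
```

With these made-up numbers, gap_us=0 comes out a little over 3us/batch with only a handful of port writes, while gap_us=4 gives 10us/batch with one port write per batch.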
*IF* we didn't have a BKL, it would be reasonable to expect that a
suitable multi-threaded program on a CPU with more h/w threads than GPU
engines could submit batches on any set of engines in parallel, and for
each thread and engine, the execution time would be essentially
independent of which engines were running concurrently.
Unfortunately, though, that lock-free scenario is not what we have
today. The BKL means that only one thread can submit at a time (and in
any case, the test program isn't multi-threaded). Therefore, if the test
can generate and submit batches at a rate of one every 2us (the first
scenario above; call it "GOOD"), but those batches are being split
across two different engines, each engine sees an effective submission
rate of one per 4us, and the test flips into the second ("BAD")
scenario as a result.
The conclusion, then, is that the parallel execution part of this test
as written today isn't really measuring a meaningful quantity, and the
pass-fail criterion in particular isn't telling us anything useful about
the overhead (or latency) of various parts of the submission path.
I've written another test variant, which explores the NO-OP execution
time as a function of both batch buffer size and the number of
consecutive submissions to the same engine before switching to the next
(burst size). Typical results look something like this:
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc4-dsg-00786-g9a8bc43-dsg-test-32 x86_64)
Time to exec 8-byte batch: 3.136µs (ring=render)
Time to exec 8-byte batch: 1.294µs (ring=bsd)
Time to exec 8-byte batch: 1.263µs (ring=blt)
Time to exec 8-byte batch: 1.276µs (ring=vebox)
Time to exec 8-byte batch: 1.745µs (ring=all, sequential)
Time to exec 8-byte batch: 5.605µs (ring=all, parallel/1)
Time to exec 8-byte batch: 5.583µs (ring=all, parallel/2)
Time to exec 8-byte batch: 4.780µs (ring=all, parallel/4)
Time to exec 8-byte batch: 3.870µs (ring=all, parallel/8)
Time to exec 8-byte batch: 2.883µs (ring=all, parallel/16)
Time to exec 8-byte batch: 2.155µs (ring=all, parallel/32)
Time to exec 8-byte batch: 1.560µs (ring=all, parallel/64)
Time to exec 8-byte batch: 1.221µs (ring=all, parallel/128)
Time to exec 8-byte batch: 1.302µs (ring=all, parallel/256)
Time to exec 8-byte batch: 1.417µs (ring=all, parallel/512)
Time to exec 8-byte batch: 1.624µs (ring=all, parallel/1024)
Time to exec 8-byte batch: 1.680µs (ring=all, parallel/2048)
Time to exec 4Kbyte batch: 12.588µs (ring=render)
Time to exec 4Kbyte batch: 11.291µs (ring=bsd)
Time to exec 4Kbyte batch: 11.837µs (ring=blt)
Time to exec 4Kbyte batch: 11.355µs (ring=vebox)
Time to exec 4Kbyte batch: 11.770µs (ring=all, sequential)
Time to exec 4Kbyte batch: 11.109µs (ring=all, parallel/1)
Time to exec 4Kbyte batch: 11.094µs (ring=all, parallel/2)
Time to exec 4Kbyte batch: 11.087µs (ring=all, parallel/4)
Time to exec 4Kbyte batch: 11.046µs (ring=all, parallel/8)
Time to exec 4Kbyte batch: 10.984µs (ring=all, parallel/16)
Time to exec 4Kbyte batch: 10.957µs (ring=all, parallel/32)
Time to exec 4Kbyte batch: 10.942µs (ring=all, parallel/64)
Time to exec 4Kbyte batch: 10.928µs (ring=all, parallel/128)
Time to exec 4Kbyte batch: 11.118µs (ring=all, parallel/256)
Time to exec 4Kbyte batch: 11.359µs (ring=all, parallel/512)
Time to exec 4Kbyte batch: 11.562µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch: 11.663µs (ring=all, parallel/2048)
which clearly shows the effect of failing to coalesce (small) requests.
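The trend for small batches can be illustrated qualitatively with a toy model of one thread feeding several engines in turn. This is a sketch only: the timings are the made-up ones from the earlier example (2us prepare, 4us port write, 3us execution, 2us completion handling), not measurements, and `simulate_round_robin` and its parameters are hypothetical names:

```python
def simulate_round_robin(n_batches, engines, burst,
                         prep_us=2.0, push_us=4.0, exec_us=3.0, irq_us=2.0):
    """Average elapsed us per batch when a single thread feeds several engines."""
    done = [None] * engines    # per engine: time its in-flight work completes; None if idle
    queued = [0] * engines     # per engine: requests waiting in the software queue
    cpu = 0.0                  # clock of the single submitting thread

    for i in range(n_batches):
        e = (i // burst) % engines           # move to the next engine every `burst` batches
        t = cpu + prep_us
        # Handle completions on this engine up to time t; the handler coalesces
        # everything queued into a single port write.
        while done[e] is not None and done[e] <= t:
            if queued[e]:
                done[e] += irq_us + push_us + queued[e] * exec_us
                queued[e] = 0
            else:
                done[e] = None
        if done[e] is None:
            t += push_us                     # idle engine: pay for the port write now
            done[e] = t + exec_us
        else:
            queued[e] += 1                   # busy engine: just queue the request
        cpu = t

    end = cpu
    for e in range(engines):                 # final sync on every engine
        if done[e] is not None:
            if queued[e]:
                done[e] += irq_us + push_us + queued[e] * exec_us
            end = max(end, done[e])
    return end / n_batches


for burst in (1, 16, 128):
    print(burst, simulate_round_robin(4096, engines=4, burst=burst))
```

In this model a burst size of 1 makes every submission take the expensive idle->busy path, and the per-batch time falls as the burst size grows. It does not attempt to explain the slight rise seen in the real figures beyond parallel/128.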
But even this doesn't really reveal the numbers that would be of most
interest, i.e. minimum/typical/maximum values for
1. overhead from execbuf call to submission queue
2. latency from execbuf to h/w execution start (if queue empty)
3. latency from h/w completion to ELSP update
4. overhead of completion processing
5. etc
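If per-request timestamps for these stages were available, each figure could be summarized along the following lines. This is a minimal sketch: the record layout and key names are invented for illustration, and taking the median as the "typical" value is an assumption:

```python
import statistics


def summarize(records, start_key, end_key):
    """Return (min, median, max) of end_key - start_key over the records that have both."""
    deltas = [r[end_key] - r[start_key]
              for r in records
              if start_key in r and end_key in r]
    if not deltas:
        raise ValueError("no samples with both timestamps")
    return min(deltas), statistics.median(deltas), max(deltas)


# Hypothetical timestamps in microseconds, one dict per request.
records = [
    {"execbuf": 0.0, "queued": 2.0},
    {"execbuf": 10.0, "queued": 11.0},
    {"execbuf": 20.0, "queued": 24.0},
]
print(summarize(records, "execbuf", "queued"))
```

Reporting the three values separately keeps the cheap (coalesced) and expensive (idle->busy) paths visible, rather than folding them into one average.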
.Dave.