[Intel-gfx] igt/gem_exec_nop parallel test: why it isn't useful

Chris Wilson chris at chris-wilson.co.uk
Thu Sep 1 20:00:32 UTC 2016


On Thu, Sep 01, 2016 at 05:51:09PM +0100, Dave Gordon wrote:
> The gem_exec_nop test generally works by submitting batches to an
> engine as fast as possible for a fixed time, then finally calling
> gem_sync() to wait for the last submitted batch to complete. The
> time-per-batch is then calculated as the total elapsed time, divided
> by the total number of batches submitted.
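> 
> (The measurement boils down to something like the loop below -- a
> stripped-down, standalone sketch against the raw i915 UAPI rather
> than the igt_* helpers the real test is built on. Error handling and
> the per-engine iteration are omitted, and the device node, engine
> and 2-second interval are arbitrary choices for illustration:)
> 
>     /* Hammer one engine with no-op batches for a fixed interval,
>      * sync once at the end, report elapsed time / batches. */
>     #include <fcntl.h>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <sys/ioctl.h>
>     #include <time.h>
> 
>     #include <drm/i915_drm.h>
> 
>     #define MI_BATCH_BUFFER_END (0xa << 23)
> 
>     static double elapsed(const struct timespec *a,
>                           const struct timespec *b)
>     {
>             return (b->tv_sec - a->tv_sec) +
>                    1e-9 * (b->tv_nsec - a->tv_nsec);
>     }
> 
>     int main(void)
>     {
>             const uint32_t bbe = MI_BATCH_BUFFER_END;
>             struct drm_i915_gem_create create = { .size = 4096 };
>             struct drm_i915_gem_pwrite pwrite = { 0 };
>             struct drm_i915_gem_exec_object2 obj = { 0 };
>             struct drm_i915_gem_execbuffer2 execbuf = { 0 };
>             struct drm_i915_gem_wait wait = { 0 };
>             struct timespec start, now;
>             unsigned long count = 0;
>             int fd = open("/dev/dri/renderD128", O_RDWR);
> 
>             /* One page holding just MI_BATCH_BUFFER_END: a no-op batch. */
>             ioctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create);
>             pwrite.handle = create.handle;
>             pwrite.size = sizeof(bbe);
>             pwrite.data_ptr = (uintptr_t)&bbe;
>             ioctl(fd, DRM_IOCTL_I915_GEM_PWRITE, &pwrite);
> 
>             obj.handle = create.handle;
>             execbuf.buffers_ptr = (uintptr_t)&obj;
>             execbuf.buffer_count = 1;
>             execbuf.flags = I915_EXEC_RENDER;   /* one engine at a time */
> 
>             /* Submit as fast as possible for a fixed interval... */
>             clock_gettime(CLOCK_MONOTONIC, &start);
>             do {
>                     ioctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
>                     count++;
>                     clock_gettime(CLOCK_MONOTONIC, &now);
>             } while (elapsed(&start, &now) < 2.0);
> 
>             /* ...then a single sync on the last batch. */
>             wait.bo_handle = create.handle;
>             wait.timeout_ns = -1;
>             ioctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait);
>             clock_gettime(CLOCK_MONOTONIC, &now);
> 
>             printf("%lu batches, %.3f us/batch\n",
>                    count, 1e6 * elapsed(&start, &now) / count);
>             return 0;
>     }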
> 
> The problem with this approach as a measurement of driver overhead
> or latency (or anything else) is that the amount of work involved in
> submitting a batch is not a simple constant; in particular, it
> depends on the state of the various queues in the execution path.
> It also has the rather strange characteristic that if the GPU runs
> slightly faster, the driver may go much slower!
> 
> The main reason here is the lite-restore mechanism, although it
> interacts with dual-submission and the details of handling the
> completion interrupt. In particular, lite-restore means that it can
> be much cheaper to add a request to an engine that's already (or
> still) busy with a previous request than to send a new request to an
> idle engine.
> 
> For example, imagine that it takes the (test/CPU/driver) 2us to
> prepare a request up to the point of submission, but another 4us to
> push it into the submission port. Also assume that once started,
> this batch takes 3us to execute on the GPU, and handling the
> completion takes the driver another 2us of CPU time. Then the stream
> of requests will produce a pattern like this:
> 
> t0:      batch 1: 6us from user to h/w (idle->busy)
> t0+6us:  GPU now running batch 1
> t0+8us:  batch 2: 2us from user to queue (not submitted)
> t0+9us:  GPU finished; IRQ handler samples queue (batch 2)
> t0+10us: batch 3: 2us from user to queue (not submitted)
> t0+11us: IRQ handler submits tail of batch 2
> t0+12us: batch 4: 2us from user to queue (not submitted)
> t0+14us: batch 5: 2us from user to queue (not submitted)
> t0+15us: GPU now running batch 2
> t0+16us: batch 6: 2us from user to queue (not submitted)
> t0+18us: GPU finished; IRQ handler samples queue (batch 6)
> t0+18us: batch 7: 2us from user to queue (not submitted)
> t0+20us: batch 8: 2us from user to queue (not submitted)
> t0+20us: IRQ handler coalesces requests, submits tail of batch 6
> t0+20us: batch 9: 2us from user to queue (not submitted)
> t0+22us: batch 10: 2us from user to queue (not submitted)
> t0+24us: GPU now running batches 3-6
> t0+24us: batch 11: 2us from user to queue (not submitted)
> t0+26us: batch 12: 2us from user to queue (not submitted)
> t0+28us: batch 13: 2us from user to queue (not submitted)
> t0+30us: batch 14: 2us from user to queue (not submitted)
> t0+32us: batch 15: 2us from user to queue (not submitted)
> t0+34us: batch 16: 2us from user to queue (not submitted)
> t0+36us: GPU finished; IRQ handler samples queue (batch 16)
> t0+36us: batch 17: 2us from user to queue (not submitted)
> t0+38us: batch 18: 2us from user to queue (not submitted)
> t0+38us: IRQ handler coalesces requests, submits tail of batch 16
> t0+40us: batch 19: 2us from user to queue (not submitted)
> t0+42us: batch 20: 2us from user to queue (not submitted)
> t0+42us: GPU now running batches 7-16
> 
> Thus, after the first few, *all* requests will be coalesced, and
> only a few of them will incur the overhead of writing to the ELSP or
> handling a context-complete interrupt. With the CPU generating a new
> batch every 2us and the GPU taking 3us/batch to execute them, the
> queue of outstanding requests will get longer and longer until the
> ringbuffer is nearly full, but the write to the ELSP will happen
> ever more rarely.
> 
> When we measure the overall time for the process, we will find the
> result is 3us/batch, i.e. the GPU batch execution time. The
> coalescing means that all the driver *and hardware* overheads are
> *completely* hidden.
> 
> Now consider what happens if the batches are generated and submitted
> slightly slower, only one every 4us:
> 
> t1:      batch 1: 6us from user to h/w (idle->busy)
> t1+6us:  GPU now running batch 1
> t1+9us:  GPU finished; IRQ handler samples queue (empty)
> t1+10us: batch 2: 6us from user to h/w (idle->busy)
> t1+16us: GPU now running batch 2
> t1+19us: GPU finished; IRQ handler samples queue (empty)
> t1+20us: batch 3: 6us from user to h/w (idle->busy)
> etc
> 
> This hits the worst case, where *every* batch submission needs to go
> through the most expensive path (and in doing so, delays the
> creation of the next workload, so we will never get out of this
> pattern). Our measurement will therefore show 10us/batch.
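> 
> The flip between the two regimes is easy to reproduce with a toy
> model of the above: one engine, a producer that needs a fixed amount
> of CPU time per request, and a completion interrupt that runs
> concurrently with the producer; the producer only pays for the ELSP
> write itself when the engine has gone idle. The PORT/EXEC/IRQ knobs
> below are just the numbers from the example, and exactly how the
> producer's per-request time is charged is a matter of accounting --
> the 2us and 6us values are chosen so that the model reproduces the
> 3us/batch and 10us/batch figures above, nothing more:
> 
>     /* Toy model of execlists-style request coalescing. */
>     #include <stdio.h>
>     #include <stdlib.h>
> 
>     #define PORT 4.0    /* us to write the submission port (ELSP)    */
>     #define EXEC 3.0    /* us for the GPU to run one no-op batch     */
>     #define IRQ  2.0    /* us from context-complete to resubmission  */
> 
>     static double simulate(double gen, int nbatches)
>     {
>             double *ready = malloc(nbatches * sizeof(*ready));
>             double cpu = 0.0;      /* producer clock                  */
>             double gpu_idle = 0.0; /* GPU has drained the port by now */
>             int queued = 0, ported = 0;
>             double result;
> 
>             /* Producer: one request per 'gen' us of CPU.  If the
>              * engine is idle when the request reaches the queue, the
>              * producer itself pays for the port write before it can
>              * start on the next request. */
>             while (queued < nbatches) {
>                     cpu += gen;
>                     ready[queued++] = cpu;
>                     if (ported == queued - 1 && gpu_idle <= cpu) {
>                             cpu += PORT;
>                             gpu_idle = cpu + EXEC;
>                             ported = queued;
>                     }
>             }
> 
>             /* Interrupt path: on each completion, coalesce everything
>              * queued by then and write the port once for the lot. */
>             while (ported < nbatches) {
>                     double kick = gpu_idle + IRQ;
>                     int n = 0;
> 
>                     while (ported + n < nbatches &&
>                            ready[ported + n] <= kick)
>                             n++;
>                     if (n == 0) { /* engine idle; next request kicks */
>                             kick = ready[ported];
>                             n = 1;
>                     }
>                     gpu_idle = kick + PORT + n * EXEC;
>                     ported += n;
>             }
> 
>             /* Total elapsed / batches, as the final gem_sync() sees. */
>             result = gpu_idle / nbatches;
>             free(ready);
>             return result;
>     }
> 
>     int main(void)
>     {
>             const int N = 100000;
> 
>             printf("2us/request producer: %.2f us/batch\n",
>                    simulate(2.0, N));
>             printf("6us/request producer: %.2f us/batch\n",
>                    simulate(6.0, N));
>             return 0;
>     }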
> 
> *IF* we didn't have a BKL, it would be reasonable to expect that a
> suitable multi-threaded program on a CPU with more h/w threads than
> GPU engines could submit batches on any set of engines in parallel,
> and for each thread and engine, the execution time would be
> essentially independent of which engines were running concurrently.
> 
> Unfortunately, though, that lock-free scenario is not what we have
> today. The BKL means that only one thread can submit at a time (and
> in any case, the test program isn't multi-threaded). Therefore, if
> the test can generate and submit batches at a rate of one every 2us
> (as in the first "GOOD" scenario above), but those batches are being
> split across two different engines, it results in an effective
> submission rate of one per 4us, and flips into the second "BAD"
> scenario as a result.
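> 
> (In terms of the toy model above: with the lock serialising
> submission, round-robining the 2us stream across two engines means
> each engine sees a new request only every ~4us, which is already
> past the ~3us the GPU needs per no-op batch. simulate(4.0, N) comes
> out at roughly 8us/batch -- the exact figure again depends on the
> accounting -- so both engines drop into the serial regime of the
> second scenario, paying the full submission cost for every batch.)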
> 
> The conclusion, then, is that the parallel execution part of this
> test as written today isn't really measuring a meaningful quantity,
> and the pass-fail criterion in particular isn't telling us anything
> useful about the overhead (or latency) of various parts of the
> submission path.
> 
> I've written another test variant, which explores the NO-OP
> execution time as a function of both batch buffer size and the
> number of consecutive submissions to the same engine before
> switching to the next (burst size). Typical results look something
> like this:

They already exist as well.

Do please look again at what test you are complaining about.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

