[Intel-gfx] igt/gem_exec_nop parallel test: why it isn't useful

Fri Sep 2 11:02:46 UTC 2016

On 01/09/16 21:00, Chris Wilson wrote:
> On Thu, Sep 01, 2016 at 05:51:09PM +0100, Dave Gordon wrote:
>> The gem_exec_nop test generally works by submitting batches to an
>> engine as fast as possible for a fixed time, then finally calling
>> gem_sync() to wait for the last submitted batch to complete. The
>> time-per-batch is then calculated as the total elapsed time, divided
>> by the total number of batches submitted.
>>
>> The problem with this approach as a measurement of driver overhead,
>> or latency (or anything else) is that the amount of work involved in
>> submitting a batch is not a simple constant; in particular, it
>> depends on the state of the various queues in the execution path.
>> And it has the rather strange characteristic that if the GPU runs
>> slightly faster, the driver may go much slower!
>>
>> The main reason here is the lite-restore mechanism, although it
>> interacts with dual-submission and the details of handling the
>> completion interrupt. In particular, lite-restore means that it can
>> be much cheaper to add a request to an engine that's already (or
>> still) busy with a previous request than to send a new request to an
>> idle engine.
>>
>> For example, imagine that it takes the (test/CPU/driver) 2us to
>> prepare a request up to the point of submission, but another 4us to
>> push it into the submission port. Also assume that once started,
>> this batch takes 3us to execute on the GPU, and handling the
>> completion takes the driver another 2us of CPU time. Then the stream
>> of requests will produce a pattern like this:
>>
>> t0:      batch 1: 6us from user to h/w (idle->busy)
>> t0+6us:  GPU now running batch 1
>> t0+8us:  batch 2: 2us from user to queue (not submitted)
>> t0+9us:  GPU finished; IRQ handler samples queue (batch 2)
>> t0+10us: batch 3: 2us from user to queue (not submitted)
>> t0+11us: IRQ handler submits tail of batch 2
>> t0+12us: batch 4: 2us from user to queue (not submitted)
>> t0+14us: batch 5: 2us from user to queue (not submitted)
>> t0+15us: GPU now running batch 2
>> t0+16us: batch 6: 2us from user to queue (not submitted)
>> t0+18us: GPU finished; IRQ handler samples queue (batch 6)
>> t0+18us: batch 7: 2us from user to queue (not submitted)
>> t0+20us: batch 8: 2us from user to queue (not submitted)
>> t0+20us: IRQ handler coalesces requests, submits tail of batch 6
>> t0+20us: batch 9: 2us from user to queue (not submitted)
>> t0+22us: batch 10: 2us from user to queue (not submitted)
>> t0+24us: GPU now running batches 3-6
>> t0+24us: batch 11: 2us from user to queue (not submitted)
>> t0+26us: batch 12: 2us from user to queue (not submitted)
>> t0+28us: batch 13: 2us from user to queue (not submitted)
>> t0+30us: batch 14: 2us from user to queue (not submitted)
>> t0+32us: batch 15: 2us from user to queue (not submitted)
>> t0+34us: batch 16: 2us from user to queue (not submitted)
>> t0+36us: GPU finished; IRQ handler samples queue (batch 16)
>> t0+36us: batch 17: 2us from user to queue (not submitted)
>> t0+38us: batch 18: 2us from user to queue (not submitted)
>> t0+38us: IRQ handler coalesces requests, submits tail of batch 16
>> t0+40us: batch 19: 2us from user to queue (not submitted)
>> t0+42us: batch 20: 2us from user to queue (not submitted)
>> t0+42us: GPU now running batches 7-16
>>
>> Thus, after the first few, *all* requests will be coalesced, and
>> only a few of them will incur the overhead of writing to the ELSP or
>> handling a context-complete interrupt. With the CPU generating a new
>> batch every 2us and the GPU taking 3us/batch to execute them, the
>> queue of outstanding requests will get longer and longer until the
>> ringbuffer is nearly full, but the write to the ELSP will happen
>> ever more rarely.
>>
>> When we measure the overall time for the process, we will find the
>> result is 3us/batch, i.e. the GPU batch execution time. The
>> coalescing means that all the driver *and hardware* overheads are
>> *completely* hidden.
>>
>> Now consider what happens if the batches are generated and submitted
>> slightly slower, only one every 4us:
>>
>> t1:      batch 1: 6us from user to h/w (idle->busy)
>> t1+6us:  GPU now running batch 1
>> t1+9us:  GPU finished; IRQ handler samples queue (empty)
>> t1+10us: batch 2: 6us from user to h/w (idle->busy)
>> t1+16us: GPU now running batch 2
>> t1+19us: GPU finished; IRQ handler samples queue (empty)
>> t1+20us: batch 3: 6us from user to h/w (idle->busy)
>> etc
>>
>> This hits the worst case, where *every* batch submission needs to go
>> through the most expensive path (and in doing so, delays the
>> creation of the next workload, so we will never get out of this
>> pattern). Our measurement will therefore show 10us/batch.
>>
>> *IF* we didn't have a BKL, it would be reasonable to expect that a
>> suitable multi-threaded program on a CPU with more h/w threads than
>> GPU engines could submit batches on any set of engines in parallel,
>> and for each thread and engine, the execution time would be
>> essentially independent of which engines were running concurrently.
>>
>> Unfortunately, though, that lock-free scenario is not what we have
>> today. The BKL means that only one thread can submit at a time (and
>> in any case, the test program isn't multi-threaded). Therefore, if
>> the test can generate and submit batches at a rate of one every 2us
>> (as in the first "GOOD" scenario above), but those batches are being
>> split across two different engines, it results in an effective
>> submission rate of one per 4us, and flips into the second "BAD"
>> scenario as a result.
>>
>> The conclusion, then, is that the parallel execution part of this
>> test as written today isn't really measuring a meaningful quantity,
>> and the pass-fail criterion in particular isn't telling us anything
>> useful about the overhead (or latency) of various parts of the
>> submission path.
>>
>> I've written another test variant, which explores the NO-OP
>> execution time as a function of both batch buffer size and the
>> number of consecutive submissions to the same engine before
>> switching to the next (burst size). Typical results look something
>> like this:
>
> They already exist as well.

I expect so, but they are not being used to gate upstreaming of patches 
to the submission paths. I wanted a test that would show how the 
positive feedback loop in submission timing causes the driver to 
abruptly flip between a best-case pattern (when workloads are generated 
faster than they are completed) and a worst-case pattern (when it takes 
longer to submit one batch to *each* engine sequentially than it takes 
*one* engine to complete one batch).
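
As a rough sketch of that flip condition (a toy model only, not driver
code; the function name and the integer-microsecond parameters are
hypothetical, and the tie case where the two times are equal is
arbitrarily counted as worst-case):

```c
#include <stdbool.h>

/*
 * Toy model, not driver code. With a single serial submitter, each
 * engine only receives a new batch once every (nengines * submit_us).
 * If that is shorter than the GPU's time per batch, the engine is still
 * busy when the next request arrives and the cheap (coalescing) path is
 * taken; otherwise the engine has gone idle and every submission pays
 * the full idle->busy cost.
 *
 * Assumption: a tie is counted as the engine having gone idle.
 */
static inline bool engine_stays_busy(unsigned int submit_us,
				     unsigned int exec_us,
				     unsigned int nengines)
{
	return nengines * submit_us < exec_us;
}
```

With the numbers from the example, engine_stays_busy(2, 3, 1) is true
(one engine, a batch every 2us), while engine_stays_busy(2, 3, 2) and
engine_stays_busy(4, 3, 1) are both false.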

> Do please look again at what test you are complaining about.
> -Chris

The one that contains this unjustified assertion:

/* The rate limiting step is how fast the slowest engine can
  * its queue of requests, if we wait upon a full ring all dispatch
  * is frozen. So in general we cannot go faster than the slowest
  * engine, but we should equally not go any slower.
  */
igt_assert_f(time < max + 10*min/9, /* ensure parallel execution */
    "Average time (%.3fus) exceeds expecation for parallel execution (min %.3fus, max %.3fus; limit set at %.3fus)\n",
    1e6*time, 1e6*min, 1e6*max, 1e6*(max + 10*min/9));

because, as explained above, there is no reasonable expectation that 
dispatching batches to multiple engines in parallel will result in more 
batches being executed in the same time; and with a purely serial test 
process, there is every expectation that the average time per batch will 
increase.
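
To put numbers on that, here is a minimal sketch of the same limit
calculation (the helper names are hypothetical, not the test's, and it
assumes purely for illustration that both engines measure 3us/batch on
their own, so min == max == 3us, using the figures from the example
above):

```c
#include <stdbool.h>

/* Limit used by the assertion: max + 10*min/9, all times in seconds. */
static inline double parallel_limit(double min, double max)
{
	return max + 10 * min / 9;
}

/* Mirrors the test's condition: time < max + 10*min/9. */
static inline bool within_parallel_limit(double time, double min, double max)
{
	return time < parallel_limit(min, max);
}
```

With min = max = 3e-6, the limit comes out at about 6.33us. The
3us/batch of the coalesced case passes, but the 10us/batch of the
worst case fails, even though the per-step costs assumed in the two
scenarios are identical.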

The rate-limiting step *would* be how fast the slowest engine could 
process its queue of requests *if* the batches took long enough that 
the CPU could always queue more work for every engine before the 
previous workload completed; but with tiny workloads the CPU does not 
keep up, and submission overhead increases because the driver must then 
do *more* work to *restart* the engine if it has become idle.

(And that's without even mentioning how the engine may have decided, 
after a certain period of idleness, to initiate a context save which 
must be completed before a new command can be accepted, even if the new 
command uses the same context. At least it doesn't then reload the same 
context, AFAICT).

.Dave.