[Intel-gfx] [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU

Thu Jun 25 06:02:41 PDT 2015

On Thu, Jun 25, 2015 at 7:02 AM, Gupta, Sourab <sourab.gupta at intel.com> wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta at intel.com wrote:
>> > > From: Sourab Gupta <sourab.gupta at intel.com>
>> > >
>> > > To collect timestamps around any GPU workload, we need to insert
>> > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > call, we need to wait for GPU to complete processing the last request for
>> > > which these commands were inserted.
>> > > We need to ensure this processing is done before event_destroy callback which
>> > > deallocates the buffer for holding the data.
>> >
>> > There's no reason for this to be synchronous. Just that you need an
>> > active reference on the output buffer to be only released after the
>> > final request.
>>
>> Yeah I think the interaction between OA sampling and GEM is the critical
>> piece here for both patch series. Step one is to have a per-pmu lock to
>> keep track of data private to OA and mmio based sampling as a starting
>> point. Then we need to figure out how to structure the connection without
>> OA/PMU and gem trampling over each another's feet.

>From the initial i915_oa code I think the notable locks are:

the struct perf_event_context::lock spin lock taken by events/core.c
before calling any of our pmu methods (with the exception of
event_init and event->destroy()) so we at least don't need to worry
about pmu methods trampling each other.

I think we should only take struct_mutex to meet the existing
assumptions of gem code we're using. Since most pmu methods are in
interrupt context our calling into gem is generally constrained to
event_init or event->destroy() without having to use workers.

we have a pmu spin lock to protect OA register state, notably because
besides the pmu methods we also have hooks into gem context switching
or pinning/unpinning that may trigger updates to the OACONTROL state
vs e.g. disabling OACONTROL in pmu->stop() which may concurrently on
different cpus.

we have an oa_buffer.flush_lock spin lock since in addition to a
hrtimer periodically forwarding samples from OABUFFER to perf,
userspace may also explicitly request a flush via fsync() or we might
flush at pmu->stop(). As a further detail here, if the hrtimer
contends with userspace flushing the hrtimer won't ever actually wait
on flush_lock, assuming it's not necessary to then have a back-to-back
flush while the buffer will probably be empty.

I have to admit I haven't really scrutinized the locking details of
Sourab's work yet, but it may make sense to introduce some additional
exclusion scheme for managing access to to the fifo of pending RPC
commands. I need to double check with Sourab, but I think we'd already
discussed some changes to how forwarding of these RPC based reports
will be managed where I thought we might be able to avoid the need for
a worker and struct_mutex locking while forwarding. I think Sourab may
have just preferred to send something out before this refactor to try
and solicit broader, high level feedback on his work sooner. I imagine
if we do refactor to avoid the worker for forwarding though that will
affect the locking details regarding the fifo structure.

Regards,
- Robert

>>
>> Wrt requests those are refcounted already and just waiting on them doesn't
>> need dev->struct_mutex. That should be all you need, assuming you do
>> correctly refcount them ...
>> -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing
> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
> of data private to OA in my next version of patches.
>
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having active reference on output buffer. This will let us
> have a delayed buffer destroy (controlled by decrement of output buffer
> refcount as and when gpu requests complete). This will be decoupled from
> the event stop here. But that is not the only thing at play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable OA unit, till the time GPU
> processes the MI_RPC requests scheduled already, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for requests to complete before disabling
> OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with event. (Right now, in current design, event_destroy
> callback waits for the above operations to be completed before
> proceeding to destroy the buffer).
>
> If I try to create a function which destroys the output buffer according
> to all references being released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> refcount drops to 0, i.e. requests have completed, OA unit will be
> disabled, and then the buffer destroyed). This would lead to these
> operations being decoupled from the event_stop and event_destroy
> functions. This may be non-intentional behaviour w.r.t. these callbacks
> from userspace perspective.
>
> Robert, any thoughts?
>

I think a few pertinent details here that inform how we need to handle this are:

1) Almost all the pmu methods are called in atomic context (except
event_init) as they are invoked via events/core.c via an
inter-processor interrupt so waiting for a completion in event_destroy
isn't really an option for us.