[Intel-gfx] [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU

Robert Bragg robert at sixbynine.org
Thu Jun 25 04:47:17 PDT 2015


On Thu, Jun 25, 2015 at 9:27 AM, Gupta, Sourab <sourab.gupta at intel.com> wrote:
> On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
>> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta at intel.com wrote:
>> > > > > From: Sourab Gupta <sourab.gupta at intel.com>
>> > > > >
>> > > > > To collect timestamps around any GPU workload, we need to insert
>> > > > > commands to capture them into the ringbuffer. Therefore, during the event
>> > > > > stop call, we need to wait for the GPU to complete processing the last
>> > > > > request for which these commands were inserted.
>> > > > > We need to ensure this processing is done before the event_destroy
>> > > > > callback, which deallocates the buffer holding the data.
>> > > >
>> > > > There's no reason for this to be synchronous. Just that you need an
>> > > > active reference on the output buffer to be only released after the
>> > > > final request.
>> > >
>> > > Yeah I think the interaction between OA sampling and GEM is the critical
>> > > piece here for both patch series. Step one is to have a per-pmu lock to
>> > > keep track of data private to OA and mmio based sampling as a starting
>> > > point. Then we need to figure out how to structure the connection without
>> > > OA/PMU and GEM trampling over each other's feet.
>> > >
>> > > Wrt requests those are refcounted already and just waiting on them doesn't
>> > > need dev->struct_mutex. That should be all you need, assuming you do
>> > > correctly refcount them ...
>> > > -Daniel
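As an aside, a minimal sketch of what Daniel describes might look roughly
like the following; the type, field and wait-helper names are placeholders
invented for illustration (only the take-a-reference-then-wait-unlocked
pattern is the point):

/* Illustrative only: hold a reference across an unlocked wait. */
static int oa_wait_last_request(struct oa_event_data *oa)      /* hypothetical type */
{
	struct drm_i915_gem_request *req = oa->last_request;  /* hypothetical field */
	int ret;

	i915_gem_request_reference(req);     /* assumed request refcount helper */

	/* The wait itself shouldn't need dev->struct_mutex; the exact
	 * unlocked wait helper here is a placeholder, not a real signature. */
	ret = wait_for_request_completion(req);

	i915_gem_request_unreference(req);   /* drop our reference */
	return ret;
}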
>> >
>> > Hi Daniel/Chris,
>> >
>> > Thanks for your feedback. I acknowledge the issue with increasing
>> > coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> > of data private to OA in my next version of patches.
>>
>> If you create new locking schemes, please coordinate with Rob about them.
>> Better if we come up with a consistent locking scheme for all of OA. That
>> ofc doesn't exclude that we'll have per-pmu locking, just that the locking
>> design should match if it makes sense. The example I'm thinking of is
>> drrs & psr, which both use the frontbuffer tracking code and both have
>> their private mutex. But the overall approach to locking and cleaning up
>> the async work is the same.
>>
> Ok, I'll coordinate with Robert about a consistent locking scheme here.
>
>> > W.r.t. having the synchronous wait during event stop, I thought through
>> > the method of having an active reference on the output buffer. This will
>> > let us have a delayed buffer destroy (controlled by decrementing the
>> > output buffer refcount as and when GPU requests complete), decoupled from
>> > the event stop. But that is not the only thing at play here.
>> >
>> > The event stop is expected to disable the OA unit (which it is doing
>> > right now). In my usecase, I can't disable the OA unit until the GPU has
>> > processed the MI_RPC requests already scheduled, which otherwise may
>> > lead to hangs.
>> > So, I'm waiting synchronously for the requests to complete before
>> > disabling the OA unit.
>> >
>> > Also, the event_destroy callback should be destroying the buffers
>> > associated with the event. (Right now, in the current design, the
>> > event_destroy callback waits for the above operations to be completed
>> > before proceeding to destroy the buffer.)
>> >
>> > If I try to create a function which destroys the output buffer once all
>> > references have been released, all these operations will have to be
>> > performed together in that function, in this serial order (so that when
>> > the refcount drops to 0, i.e. the requests have completed, the OA unit is
>> > disabled first and then the buffer destroyed). This would decouple these
>> > operations from the event_stop and event_destroy callbacks, which may be
>> > unintended behaviour from a userspace perspective.
>> >
>> > Robert, any thoughts?
>>
>> Yeah, no idea whether those perf callbacks are allowed to stall for a
>> longer time. If not we could simply push the cleanup/OA disabling into an
>> async work.
>> -Daniel
>
> Hi Daniel,
> Actually, right now the stop_event callback is not stalled. Instead I'm
> scheduling an async work fn to do the job. The event destroy then
> waits for the work fn to finish processing before proceeding to
> destroy the buffer.
> I'm also thinking through Chris' suggestions here and will try to come
> up with a solution based on them.
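To make that shape concrete, a rough sketch of a stop callback that only
schedules work, plus a destroy that waits for it, could look like the
following; the OA-specific names are invented for illustration, and only
the perf callback signatures and workqueue calls are real API:

static void i915_oa_event_stop(struct perf_event *event, int flags)
{
	struct oa_event_data *oa = event_to_oa_data(event);   /* hypothetical */

	/* May be called in atomic context, so just kick the work item
	 * that waits for the GPU and disables the OA unit. */
	schedule_work(&oa->stop_work);
}

static void i915_oa_event_destroy(struct perf_event *event)
{
	struct oa_event_data *oa = event_to_oa_data(event);   /* hypothetical */

	/* Process context: safe to wait for the stop work to finish
	 * before the output buffer is freed. */
	flush_work(&oa->stop_work);
	oa_free_output_buffer(oa);                             /* hypothetical */
}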

A notable detail here is that most pmu methods are called in atomic
context by events/core.c via inter-processor interrupts (as a somewhat
unintended side effect of userspace opening i915_oa events as
specific-cpu events, perf will make sure all methods are invoked on
that specific cpu). This means waiting in pmu->event_stop() isn't even
an option for us. event_init() and event->destroy() are the two
exceptions that are called in process context, and as noted it's
currently only the destroy that waits for the completion of the stop
work fn. That said, it does seem worth considering whether we can avoid
waiting in event_destroy() if it's not strictly necessary, and
Chris's suggestion to follow a refcounting scheme sounds workable.
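Just to sketch the kind of thing I mean (the names below are made up for
illustration, not actual code): the output buffer could carry a kref, each
in-flight MI_RPC command would hold a reference that's dropped when its
request retires, and the final put would defer the OA disable and free to
a work item so the release path never has to sleep:

struct oa_output_buffer {                       /* hypothetical */
	struct kref refcount;
	struct work_struct free_work;
	void *vaddr;
};

static void oa_buffer_free_worker(struct work_struct *work)
{
	struct oa_output_buffer *buf =
		container_of(work, struct oa_output_buffer, free_work);

	oa_disable();                           /* placeholder for OA disable */
	vfree(buf->vaddr);
	kfree(buf);
}

static void oa_buffer_release(struct kref *kref)
{
	struct oa_output_buffer *buf =
		container_of(kref, struct oa_output_buffer, refcount);

	/* The final put may come from a request-retire callback, so
	 * push the disable + free out to process context. */
	schedule_work(&buf->free_work);
}

/* Each MI_RPC command emitted does kref_get(&buf->refcount) and the
 * matching request-retire hook does kref_put(&buf->refcount,
 * oa_buffer_release); event->destroy() then just drops its own
 * reference instead of waiting. */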

Regards,
- Robert

>
> Thanks,
> Sourab
>

