[Intel-gfx] [RFC 5/7] drm/i915: Wait for GPU to finish before event stop in Gen Perf PMU

Thu Jun 25 10:31:52 PDT 2015

On Thu, Jun 25, 2015 at 9:02 AM, Chris Wilson <chris at chris-wilson.co.uk> wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta at intel.com wrote:
>> > > > From: Sourab Gupta <sourab.gupta at intel.com>
>> > > >
>> > > > To collect timestamps around any GPU workload, we need to insert
>> > > > commands to capture them into the ringbuffer. Therefore, during the stop event
>> > > > call, we need to wait for GPU to complete processing the last request for
>> > > > which these commands were inserted.
>> > > > We need to ensure this processing is done before event_destroy callback which
>> > > > deallocates the buffer for holding the data.
>> > >
>> > > There's no reason for this to be synchronous. Just that you need an
>> > > active reference on the output buffer to be only released after the
>> > > final request.
>> >
>> > Yeah I think the interaction between OA sampling and GEM is the critical
>> > piece here for both patch series. Step one is to have a per-pmu lock to
>> > keep track of data private to OA and mmio based sampling as a starting
>> > point. Then we need to figure out how to structure the connection without
>> > OA/PMU and gem trampling over each another's feet.
>> >
>> > Wrt requests those are refcounted already and just waiting on them doesn't
>> > need dev->struct_mutex. That should be all you need, assuming you do
>> > correctly refcount them ...
>> > -Daniel
>>
>> Hi Daniel/Chris,
>>
>> Thanks for your feedback. I acknowledge the issue with increasing
>> coverage scope of i915 dev mutex. I'll have a per-pmu lock to keep track
>> of data private to OA in my next version of patches.
>>
>> W.r.t. having the synchronous wait during event stop, I thought through
>> the method of having active reference on output buffer. This will let us
>> have a delayed buffer destroy (controlled by decrement of output buffer
>> refcount as and when gpu requests complete). This will be decoupled from
>> the event stop here. But that is not the only thing at play here.
>>
>> The event stop is expected to disable the OA unit (which it is doing
>> right now). In my usecase, I can't disable OA unit, till the time GPU
>> processes the MI_RPC requests scheduled already, which otherwise may
>> lead to hangs.
>> So, I'm waiting synchronously for requests to complete before disabling
>> OA unit.
>
> There's no issue here. Just hook into the retire-notification callback
> for the last active request of the oa buffer. If you are using LRI to
> disable the OA Context writing to the buffer, you can simply create a
> request for the LRI and use the active reference as the retire
> notification. (Otoh, if the oa buffer is inactive you can simply do the
> decoupling immediately.) You shouldn't need a object-free-notification
> callback aiui.
>
>> Also, the event_destroy callback should be destroying the buffers
>> associated with event. (Right now, in current design, event_destroy
>> callback waits for the above operations to be completed before
>> proceeding to destroy the buffer).
>>
>> If I try to create a function which destroys the output buffer according
>> to all references being released, all these operations will have to be
>> performed together in that function, in this serial order (so that when
>> refcount drops to 0, i.e. requests have completed, OA unit will be
>> disabled, and then the buffer destroyed). This would lead to these
>> operations being decoupled from the event_stop and event_destroy
>> functions. This may be non-intentional behaviour w.r.t. these callbacks
>> from userspace perspective.
>
> When the perf event is stopped, you stop generating perf events. But the
> buffer may still be alive, and may be resurrected if you a new event is
> started, but you will want to be sure to ensure that pending oa writes
> are ignored.
> -Chris

Just to summarise some things I know Sourab discussed with Chris on
IRC and I then also discussed with Sourab offline.

It does look like we can avoid any waiting in event->destroy(),
decoupling tracking of outstanding RPC commands+oacontrol disable and
the event active status. There's no strict reason OACONTROL needs to
be disabled when stopping or destroying an event if there are still
pending RPC commands, we just need to stop forwarding samples while
disabled and discard reports that land after an event is destroyed. We
can also update the hw.state and active state before OACONTROL is
disabled, whereas currently the code is updating this state outside of
core perf's context lock which isn't good and we might also end up
continuing to forward periodic samples from the oabuffer to perf when
the user expects the event to be disabled.

There was a discussion of updating oacontrol via LRIs, but my concern
here is with wanting to update oacontrol state for periodic sampling
use cases where it would seem awkward to handle that via LRIs.
Enabling/disabling periodic sampling can be considered orthogonal from
fully disabling oacontrol and it will be useful to use periodic
sampling in conjunction with RPC sampling. When we stop an event we
should be able to immediately stop periodic sampling even if we must
leave OA enabled for a pending RPC command. If the event is started
again we can quickly re-enable periodic sampling but it might be
awkward to consider an in-flight LRI we know is going to disable
oacontrol soon. Related to using LRIs though was the idea of using the
LRI request to help manage the lifetime of the destination buffer, by
adding the vmas of the dest buffer to the request's active list. It
seems like we should be able to do this kind of thing with the
requests associated with the MI_RPC commands instead.

Thinking about the details of waiting for the last RPC command before
destroying the dest buffer and disabling OACONTROL these are the
requirements I see:
- we want free the dest buffer in a finite time, since it's large
(i.e. don't want to assume it's ok to keep around if allocated once)
- must wait for last RPC commands to complete before disabling
oacontrol (and can free the dest buffer at the same point)
- in any subsequent event_init() we need to be able to verify we don't
/still/ have pending RPC commands, because an incompatible oacontrol
requirement at this point is a corner case where we really do have to
block and wait for those RPC commands to complete or say return
-EBUSY.

Without the retire-notification api that Chris has been experimenting
with (that could provide us a convenient callback when our RPC
commands retire), the current plan is to still schedule some work in
event->destroy() to wait for the last outstanding RPC command to
retire before disabling oacontrol as well as free the RPC destination
buffer, but notably there's no need to block on its completion. I
imagine this could later easily be adapted to work with a
retire-notification api, and maybe Sourab could even experiment with
Chris's wip code for this to compare.

In the end, I think the only time we'll really need to block waiting
for outstanding requests is in event_init() in cases where there are
still pending RPC commands and the previous oacontrol report-format is
smaller than the newly requested format (which should basically be
never I guess if in practice, most users will be requesting the
largest report format... and technically I suppose there are even lots
of cases where you could safely allow the report size to increase if
you know there's adequate space in the old dest buffer; though
probably not worth fussing about.).

Besides avoiding blocking in event->destroy(); I'd been thinking we
wouldn't need the worker for forwarding the RPC reports to perf and
could instead just use a hrtimer like we do for periodic samples.
Talking this through with Sourab though, he tried this, and it looks
like it would be awkward due to needing to drop the references on
request structs which may in-turn need to take struct_mutex to finally
free.

In terms of locking while forwarding samples and in the insert_cmd
hooks, for the fifo structures tracking pending commands we should
stop using struct_mutex (except if necessary for calling into gem) and
we should also avoid locking around all of the forwarding work with
any mutex that (if anything) just needs to protect updates to the fifo
itself. I guess there's some reasonable lock free way to handle adding
and removing from these fifos, but haven't considered that carefully
and don't have a strong opinion on particular details here.

Ok I think that pretty much summarises what was discussed offline,

Regards,
- Robert

>
> --
> Chris Wilson, Intel Open Source Technology Centre