[Intel-gfx] [PATCH 03/12] drm/i915: Framework for capturing command stream based OA reports and ctx id info.

Wed Aug 2 04:25:57 UTC 2017

From: sourab gupta [mailto:sourabgupta at gmail.com]
Sent: Wednesday, August 2, 2017 8:17 AM
To: Landwerlin, Lionel G <lionel.g.landwerlin at intel.com>
Cc: Kamble, Sagar A <sagar.a.kamble at intel.com>; intel-gfx at lists.freedesktop.org; Sourab Gupta <sourab.gupta at intel.com>
Subject: Re: [Intel-gfx] [PATCH 03/12] drm/i915: Framework for capturing command stream based OA reports and ctx id info.

On Wed, Aug 2, 2017 at 2:28 AM, Lionel Landwerlin <lionel.g.landwerlin at intel.com<mailto:lionel.g.landwerlin at intel.com>> wrote:
On 01/08/17 19:05, sourab gupta wrote:

On Tue, Aug 1, 2017 at 2:59 PM, Kamble, Sagar A <sagar.a.kamble at intel.com<mailto:sagar.a.kamble at intel.com>> wrote:

-----Original Message-----
From: Landwerlin, Lionel G
Sent: Monday, July 31, 2017 9:16 PM
To: Kamble, Sagar A <sagar.a.kamble at intel.com<mailto:sagar.a.kamble at intel.com>>; intel-gfx at lists.freedesktop.org<mailto:intel-gfx at lists.freedesktop.org>
Cc: Sourab Gupta <sourab.gupta at intel.com<mailto:sourab.gupta at intel.com>>
Subject: Re: [Intel-gfx] [PATCH 03/12] drm/i915: Framework for capturing command stream based OA reports and ctx id info.

On 31/07/17 08:59, Sagar Arun Kamble wrote:
> From: Sourab Gupta <sourab.gupta at intel.com<mailto:sourab.gupta at intel.com>>
>
> This patch introduces a framework to capture OA counter reports associated
> with Render command stream. We can then associate the reports captured
> through this mechanism with their corresponding context id's. This can be
> further extended to associate any other metadata information with the
> corresponding samples (since the association with Render command stream
> gives us the ability to capture these information while inserting the
> corresponding capture commands into the command stream).
>
> The OA reports generated in this way are associated with a corresponding
> workload, and thus can be used the delimit the workload (i.e. sample the
> counters at the workload boundaries), within an ongoing stream of periodic
> counter snapshots.
>
> There may be usecases wherein we need more than periodic OA capture mode
> which is supported currently. This mode is primarily used for two usecases:
>      - Ability to capture system wide metrics, alongwith the ability to map
>        the reports back to individual contexts (particularly for HSW).
>      - Ability to inject tags for work, into the reports. This provides
>        visibility into the multiple stages of work within single context.
>
> The userspace will be able to distinguish between the periodic and CS based
> OA reports by the virtue of source_info sample field.
>
> The command MI_REPORT_PERF_COUNT can be used to capture snapshots of OA
> counters, and is inserted at BB boundaries.
> The data thus captured will be stored in a separate buffer, which will
> be different from the buffer used otherwise for periodic OA capture mode.
> The metadata information pertaining to snapshot is maintained in a list,
> which also has offsets into the gem buffer object per captured snapshot.
> In order to track whether the gpu has completed processing the node,
> a field pertaining to corresponding gem request is added, which is tracked
> for completion of the command.
>
> Both periodic and CS based reports are associated with a single stream
> (corresponding to render engine), and it is expected to have the samples
> in the sequential order according to their timestamps. Now, since these
> reports are collected in separate buffers, these are merge sorted at the
> time of forwarding to userspace during the read call.
>
> v2: Aligning with the non-perf interface (custom drm ioctl based). Also,
> few related patches are squashed together for better readability
>
> v3: Updated perf sample capture emit hook name. Reserving space upfront
> in the ring for emitting sample capture commands and using
> req->fence.seqno for tracking samples. Added SRCU protection for streams.
> Changed the stream last_request tracking to resv object. (Chris)
> Updated perf.sample_lock spin_lock usage to avoid softlockups. Moved
> stream to global per-engine structure. (Sagar)
> Update unpin and put in the free routines to i915_vma_unpin_and_release.
> Making use of perf stream cs_buffer vma resv instead of separate resv obj.
> Pruned perf stream vma resv during gem_idle. (Chris)
> Changed payload field ctx_id to u64 to keep all sample data aligned at 8
> bytes. (Lionel)
> stall/flush prior to sample capture is not added. Do we need to give this
> control to user to select whether to stall/flush at each sample?
>
> Signed-off-by: Sourab Gupta <sourab.gupta at intel.com<mailto:sourab.gupta at intel.com>>
> Signed-off-by: Robert Bragg <robert at sixbynine.org<mailto:robert at sixbynine.org>>
> Signed-off-by: Sagar Arun Kamble <sagar.a.kamble at intel.com<mailto:sagar.a.kamble at intel.com>>
> ---
>   drivers/gpu/drm/i915/i915_drv.h            |  101 ++-
>   drivers/gpu/drm/i915/i915_gem.c            |    1 +
>   drivers/gpu/drm/i915/i915_gem_execbuffer.c |    8 +
>   drivers/gpu/drm/i915/i915_perf.c           | 1185 ++++++++++++++++++++++------
>   drivers/gpu/drm/i915/intel_engine_cs.c     |    4 +
>   drivers/gpu/drm/i915/intel_ringbuffer.c    |    2 +
>   drivers/gpu/drm/i915/intel_ringbuffer.h    |    5 +
>   include/uapi/drm/i915_drm.h                |   15 +
>   8 files changed, 1073 insertions(+), 248 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 2c7456f..8b1cecf 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1985,6 +1985,24 @@ struct i915_perf_stream_ops {
>        * The stream will always be disabled before this is called.
>        */
>       void (*destroy)(struct i915_perf_stream *stream);
> +
> +     /*
> +      * @emit_sample_capture: Emit the commands in the command streamer
> +      * for a particular gpu engine.
> +      *
> +      * The commands are inserted to capture the perf sample data at
> +      * specific points during workload execution, such as before and after
> +      * the batch buffer.
> +      */
> +     void (*emit_sample_capture)(struct i915_perf_stream *stream,
> +                                 struct drm_i915_gem_request *request,
> +                                 bool preallocate);
> +};
> +

It seems the motivation for this following enum is mostly to deal with
the fact that engine->perf_srcu is set before the OA unit is configured.
Would it possible to set it later so that we get rid of the enum?
<Sagar> I will try to make this as just binary state. This enum is defining the state of the stream. I too got confused with purpose of IN_PROGRESS.
SRCU is used for synchronizing stream state check.
IN_PROGRESS will enable us to not advertently try to access the stream vma for inserting the samples, but I guess depending on disabled/enabled should
suffice.

Hi Sagar/Lionel,

Hi Sourab,

Thanks again for your input on this.

The purpose of the tristate was to workaround a particular kludge of
working with just enabled/disabled boolean state. I'll explain below.

Let's say we have only boolean state.
i915_perf_emit_sample_capture() function would depend on
stream->enabled in order to insert the MI_RPC command in RCS.
If you see i915_perf_enable_locked(), stream->enabled is set before
stream->ops->enable(). The stream->ops->enable() function actually
enables the OA hardware to capture reports, and if MI_RPC commands
are submitted before OA hw is enabled, it may hang the gpu.

Do you remember if this is documented anywhere?
I couldn't find anything in the MI_RPC instruction.
Sorry, I don't happen to remember any documentation. Probably, you can check this out by submitting MI_RPC without enabling OA.
Thing is gen7_oa_enable is depending on enabled to be true.
Lionel, can I change it to :

        if (!stream->enabled) {
                I915_WRITE(GEN7_OACONTROL,
                           (ctx_id & GEN7_OACONTROL_CTX_MASK) |
                           (period_exponent <<

With stream->enabled set to true at the end of i915_perf_enable_locked.

Also, we can't change the order of calling these operations inside
i915_perf_enable_locked() since gen7_update_oacontrol_locked()
function depends on stream->enabled flag to enable the OA
hw unit (i.e. it needs the flag to be true).

We can probably work around that by passing some arguments.

To workaround this problem, I introduced a tristate here.
If you can suggest some alternate solution to this problem,
we can remove this tristate kludge here.
Regards,
Sourab

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20170802/53db9180/attachment-0001.html>