[PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Daniel Vetter daniel at ffwll.ch
Wed May 4 12:24:31 UTC 2016


On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> On Wed, May 4, 2016 at 10:09 AM, Martin Peres <martin.peres at linux.intel.com>
> wrote:
> 
> > On 03/05/16 23:03, Robert Bragg wrote:
> >
> >>
> >>
> >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg
> >> <robert at sixbynine.org> wrote:
> >>
> >>     Sorry for the delay replying to this, I missed it.
> >>
> >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres
> >>     <martin.peres at free.fr> wrote:
> >>
> >>         On 20/04/16 17:23, Robert Bragg wrote:
> >>
> >>             Gen graphics hardware can be set up to periodically write
> >>             snapshots of
> >>             performance counters into a circular buffer via its
> >> Observation
> >>             Architecture and this patch exposes that capability to
> >>             userspace via the
> >>             i915 perf interface.
> >>
> >>             Cc: Chris Wilson <chris at chris-wilson.co.uk>
> >>             Signed-off-by: Robert Bragg <robert at sixbynine.org>
> >>             Signed-off-by: Zhenyu Wang <zhenyuw at linux.intel.com>
> >>
> >>             ---
> >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> >>               drivers/gpu/drm/i915/i915_perf.c        | 940 +++++++++++++++++++++++++++++++-
> >>               drivers/gpu/drm/i915/i915_reg.h         | 338 ++++++++++++
> >>               include/uapi/drm/i915_drm.h             |  70 ++-
> >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> >>
> >>             +
> >>             +
> >>             +       /* It takes a fairly long time for a new MUX
> >>             +        * configuration to be applied after these register
> >>             +        * writes. This delay duration was derived empirically
> >>             +        * based on the render_basic config but hopefully it
> >>             +        * covers the maximum configuration latency...
> >>             +        */
> >>             +       mdelay(100);
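For context, a minimal sketch of the configuration path this delay sits
in, using illustrative names (struct i915_oa_reg and config_oa_mux here
are assumptions, not the patch's actual code):

    /* Sketch only: a MUX config is a list of register writes, and the
     * settle delay has to come after the last of them.
     */
    struct i915_oa_reg {
            i915_reg_t addr;
            u32 value;
    };

    static void config_oa_mux(struct drm_i915_private *dev_priv,
                              const struct i915_oa_reg *regs, int n_regs)
    {
            int i;

            for (i = 0; i < n_regs; i++)
                    I915_WRITE(regs[i].addr, regs[i].value);

            /* Let the new MUX selection settle before the OA unit starts
             * sampling; the 100ms above was derived empirically.
             */
            mdelay(100);
    }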
> >>
> >>
> >>         With such a HW and SW design, how can we ever hope to get any
> >>         kind of performance when we are trying to monitor different
> >>         metrics on each draw call? This may be acceptable for system
> >>         monitoring, but it is problematic for the GL extensions :s
> >>
> >>
> >>         Since it seems like we are going for a perf API, it means that
> >>         for every change of metrics, we need to flush the commands, wait
> >>         for the GPU to be done, then program the new set of metrics via
> >>         an IOCTL, wait 100 ms, and then we may resume rendering ...
> >>         until the next change. We are talking about a latency of 6-7
> >>         frames at 60 Hz here... this is non-negligible...
> >>
> >>
> >>         I understand that we have a ton of counters and we may hide
> >>         latency by not allowing more than half of the counters to be
> >>         used for any given draw call or frame, but even then, this
> >>         100ms delay is killing this approach altogether.
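To make the cost concrete, the round trip described above would look
roughly like this from userspace (the ioctl request name and helpers
here are placeholders, not a settled uAPI); at 60 Hz one frame is
~16.7 ms, so the 100 ms settle alone spans about six frames:

    /* Placeholder names: the perf uAPI was still under review. */
    glFlush();                                 /* submit outstanding work */
    wait_for_gpu_idle(fd);                     /* e.g. wait on a fence    */
    ioctl(fd, I915_PERF_SET_METRICS, &config); /* reprogram MUX/OA config */
    /* kernel side then waits ~100 ms before reports are trustworthy */
    /* ... only now can rendering resume with the new metrics ... */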
> >>
> >>
> >>
> >>
> >> So revisiting this to double check how things fail with my latest
> >> driver/tests without the delay, I apparently can't reproduce test
> >> failures without the delay any more...
> >>
> >> I think the explanation is that since first adding the delay to the
> >> driver I also made the driver a bit more careful to not forward
> >> spurious reports that look invalid due to a zeroed report id field, and
> >> that mechanism keeps the unit tests happy, even though there are still
> >> some number of invalid reports generated if we don't wait.
> >>
> >> One problem with simply having no delay is that the driver prints an
> >> error if it sees an invalid report, so I get a lot of 'Skipping
> >> spurious, invalid OA report' dmesg spam. Also this was intended more as
> >> a last resort mechanism, and I wouldn't feel too happy about squashing
> >> the error message and potentially sweeping other error cases under the
> >> carpet.
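A rough sketch of the filtering described above, with illustrative names
and log level (the check itself - a zeroed report-id dword marking a
spurious report - is what the driver relies on; the helper is
hypothetical):

    /* Sketch: OA reports start with a report-id dword; a zeroed one
     * indicates a spurious report generated while the MUX config was
     * still settling.
     */
    static bool oa_report_valid(const u32 *report)
    {
            return report[0] != 0;
    }

    ...
            if (!oa_report_valid(report)) {
                    DRM_ERROR("Skipping spurious, invalid OA report\n");
                    continue;
            }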
> >>
> >> Experimenting to see if the delay can at least be reduced, I brought the
> >> delay up in millisecond increments and found that although I still see a
> >> lot of spurious reports when waiting only 1 or 5 milliseconds, at 10
> >> milliseconds it's reduced quite a bit, and at 15 milliseconds I don't seem
> >> to have any errors.
> >>
> >> 15 milliseconds is still a long time, but at least not as long as 100.
> >>
> >
> > OK, so the issue does not come from the HW after all, great!
> >
> 
> Erm, I'm not sure that's a conclusion we can make here...
> 
> The upshot here was really just reducing the delay from 100ms to 15ms.
> Previously I arrived at a workable delay by jumping the delay in orders of
> magnitude, with 10ms failing and 100ms passing, and I didn't try to refine it
> further. Here I've looked at delays between 10 and 100ms.
> 
> The other thing is observing that, because the kernel is checking for
> invalid reports (generated by the hardware) before forwarding them to
> userspace, the lack of a delay no longer triggers i-g-t failures - the
> invalid data won't reach i-g-t any more - though the invalid reports are
> still a thing to avoid.
> 
> 
> > Now, my main question is, why are spurious events generated when changing
> > the MUX's value? I can understand that we would need to ignore the reading
> > that came right after the change, but other than this, I am a bit at a
> > loss.
> >
> 
> The MUX selects 16 signals that the OA unit can turn into 16 separate
> counters by basically counting the signal changes. (There's some fancy
> fixed-function logic that can affect this, but that's the general idea.)
> 
> If the MUX is in the middle of being re-programmed then some subset of
> those 16 signals is for who knows what.
> 
> After programming the MUX we will go on to configure the OA unit and the
> tests will enable periodic sampling which (if we have no delay) will sample
> the OA counters that are currently being fed by undefined signals.
> 
> So as far as that goes it makes sense to me to expect bad data if we don't
> wait for the MUX config to land properly. Something I don't really know is
> why we're seemingly lucky enough to have the reports be cleanly invalid,
> with a zero report-id, instead of just having junk data that would be
> harder to recognise.
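Putting that ordering together, the enable path is roughly the following
(all names illustrative), which is why sampling too soon after the MUX
writes captures counters fed by undefined signals:

    config_oa_mux(dev_priv, mux_regs, n_mux_regs); /* select 16 signals     */
    mdelay(100);                                   /* let the MUX settle    */
    config_oa_unit(dev_priv);                      /* fixed-function setup  */
    enable_periodic_sampling(dev_priv, exponent);  /* reports start landing */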

Yeah, this mdelay story sounds really scary. A few random comments:
- msleep instead of mdelay please
- no dmesg noise above debug level for stuff that we know can happen -
  dmesg noise counts as igt failures
- reading 0 sounds more like bad synchronization. Have you tried quiescing
  the entire GPU (to make sure nothing is happening) and disabling OA, then
  updating, and then restarting (roughly as in the sketch below)? At least
  on a very quick look I didn't spot that. Random delays freak me out a
  bit, but I wouldn't be surprised if one is really needed.
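A minimal sketch of that suggested sequence, assuming hypothetical
helpers (gpu_quiesce, oa_disable, and oa_enable are not existing i915
functions):

    gpu_quiesce(dev_priv);              /* ensure nothing is in flight */
    oa_disable(dev_priv);               /* stop OA sampling entirely   */
    config_oa_mux(dev_priv, regs, n);   /* reprogram the MUX           */
    msleep(15);                         /* sleep rather than mdelay()  */
    oa_enable(dev_priv);                /* restart periodic sampling   */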

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

