[Intel-gfx] [PATCH 5/9] drm/i915: Enable i915 perf stream for Haswell OA unit

Wed May 4 13:24:14 UTC 2016

On Wed, May 4, 2016 at 1:24 PM, Daniel Vetter <daniel at ffwll.ch> wrote:

> On Wed, May 04, 2016 at 10:49:53AM +0100, Robert Bragg wrote:
> > On Wed, May 4, 2016 at 10:09 AM, Martin Peres <
> martin.peres at linux.intel.com>
> > wrote:
> >
> > > On 03/05/16 23:03, Robert Bragg wrote:
> > >
> > >>
> > >>
> > >> On Tue, May 3, 2016 at 8:34 PM, Robert Bragg <robert at sixbynine.org
> > >> <mailto:robert at sixbynine.org>> wrote:
> > >>
> > >>     Sorry for the delay replying to this, I missed it.
> > >>
> > >>     On Sat, Apr 23, 2016 at 11:34 AM, Martin Peres <
> martin.peres at free.fr
> > >>     <mailto:martin.peres at free.fr>> wrote:
> > >>
> > >>         On 20/04/16 17:23, Robert Bragg wrote:
> > >>
> > >>             Gen graphics hardware can be set up to periodically write
> > >>             snapshots of
> > >>             performance counters into a circular buffer via its
> > >> Observation
> > >>             Architecture and this patch exposes that capability to
> > >>             userspace via the
> > >>             i915 perf interface.
> > >>
> > >>             Cc: Chris Wilson <chris at chris-wilson.co.uk
> > >>             <mailto:chris at chris-wilson.co.uk>>
> > >>             Signed-off-by: Robert Bragg <robert at sixbynine.org
> > >>             <mailto:robert at sixbynine.org>>
> > >>             Signed-off-by: Zhenyu Wang <zhenyuw at linux.intel.com
> > >>             <mailto:zhenyuw at linux.intel.com>>
> > >>
> > >>             ---
> > >>               drivers/gpu/drm/i915/i915_drv.h         |  56 +-
> > >>               drivers/gpu/drm/i915/i915_gem_context.c |  24 +-
> > >>               drivers/gpu/drm/i915/i915_perf.c        | 940
> > >>             +++++++++++++++++++++++++++++++-
> > >>               drivers/gpu/drm/i915/i915_reg.h         | 338
> ++++++++++++
> > >>               include/uapi/drm/i915_drm.h             |  70 ++-
> > >>               5 files changed, 1408 insertions(+), 20 deletions(-)
> > >>
> > >>             +
> > >>             +
> > >>             +       /* It takes a fairly long time for a new MUX
> > >>             configuration to
> > >>             +        * be be applied after these register writes. This
> > >> delay
> > >>             +        * duration was derived empirically based on the
> > >>             render_basic
> > >>             +        * config but hopefully it covers the maximum
> > >>             configuration
> > >>             +        * latency...
> > >>             +        */
> > >>             +       mdelay(100);
> > >>
> > >>
> > >>         With such a HW and SW design, how can we ever expose hope to
> get
> > >> any
> > >>         kind of performance when we are trying to monitor different
> > >>         metrics on each
> > >>         draw call? This may be acceptable for system monitoring, but
> it
> > >>         is problematic
> > >>         for the GL extensions :s
> > >>
> > >>
> > >>         Since it seems like we are going for a perf API, it means that
> > >>         for every change
> > >>         of metrics, we need to flush the commands, wait for the GPU to
> > >>         be done, then
> > >>         program the new set of metrics via an IOCTL, wait 100 ms, and
> > >>         then we may
> > >>         resume rendering ... until the next change. We are talking
> about
> > >>         a latency of
> > >>         6-7 frames at 60 Hz here... this is non-negligeable...
> > >>
> > >>
> > >>         I understand that we have a ton of counters and we may hide
> > >>         latency by not
> > >>         allowing using more than half of the counters for every draw
> > >>         call or frame, but
> > >>         even then, this 100ms delay is killing this approach
> altogether.
> > >>
> > >>
> > >>
> > >>
> > >> So revisiting this to double check how things fail with my latest
> > >> driver/tests without the delay, I apparently can't reproduce test
> > >> failures without the delay any more...
> > >>
> > >> I think the explanation is that since first adding the delay to the
> > >> driver I also made the the driver a bit more careful to not forward
> > >> spurious reports that look invalid due to a zeroed report id field,
> and
> > >> that mechanism keeps the unit tests happy, even though there are still
> > >> some number of invalid reports generated if we don't wait.
> > >>
> > >> One problem with simply having no delay is that the driver prints an
> > >> error if it sees an invalid reports so I get a lot of 'Skipping
> > >> spurious, invalid OA report' dmesg spam. Also this was intended more
> as
> > >> a last resort mechanism, and I wouldn't feel too happy about squashing
> > >> the error message and potentially sweeping other error cases under the
> > >> carpet.
> > >>
> > >> Experimenting to see if the delay can at least be reduced, I brought
> the
> > >> delay up in millisecond increments and found that although I still
> see a
> > >> lot of spurious reports only waiting 1 or 5 milliseconds, at 10
> > >> milliseconds its reduced quite a bit and at 15 milliseconds I don't
> seem
> > >> to have any errors.
> > >>
> > >> 15 milliseconds is still a long time, but at least not as long as 100.
> > >>
> > >
> > > OK, so the issue does not come from the HW after all, great!
> > >
> >
> > Erm, I'm not sure that's a conclusion we can make here...
> >
> > The upshot here was really just reducing the delay from 100ms to 15ms.
> > Previously I arrived at a workable delay by jumping the delay in orders
> of
> > magnitude with 10ms failing, 100ms passing and I didn't try and refine it
> > further. Here I've looked at delays between 10 and 100ms.
> >
> > The other thing is observing that because the kernel is checking for
> > invalid reports (generated by the hardware) before forwarding to
> userspace
> > the lack of a delay no longer triggers i-g-t failures because the invalid
> > data won't reach i-g-t any more - though the invalid reports are still a
> > thing to avoid.
> >
> >
> > > Now, my main question is, why are spurious events generated when
> changing
> > > the MUX's value? I can understand that we would need to ignore the
> reading
> > > that came right after the change, but other than this,  I am a bit at a
> > > loss.
> > >
> >
> > The MUX selects 16 signals that the OA unit can turn into 16 separate
> > counters by basically counting the signal changes. (there's some fancy
> > fixed function logic that can affect this but that's the general idea).
> >
> > If the MUX is in the middle of being re-programmed then some subset of
> > those 16 signals are for who knows what.
> >
> > After programming the MUX we will go on to configure the OA unit and the
> > tests will enable periodic sampling which (if we have no delay) will
> sample
> > the OA counters that are currently being fed by undefined signals.
> >
> > So as far as that goes it makes sense to me to expect bad data if we
> don't
> > wait for the MUX config to land properly. Something I don't really know
> is
> > how come we're seemingly lucky to have the reports be cleanly invalid
> with
> > a zero report-id, instead of just having junk data that would be harder
> to
> > recognise.
>
> Yeah this mdelay story sounds realy scary. Few random comments:
> - msleep instead of mdelay please
>

yup this was a mistake I'd forgotten to fix in this series, but is fixed in
the last series I sent after chris noticed too.

actually in my latest (yesterday after experimenting further with the delay
requirements) I'm using usleep_range for a delay between 15 and 20
milliseconds which seems to be enough.

> - no dmesg noise above debug level for stuff that we know can happen -
>   dmesg noise counts as igt failures
>

okey. I don't think I have anything above debug level, unless things are
going badly wrong.

Just double checking though has made me think twice about a WARN_ON in
gen7_oa_read for !oa_buffer_addr, which would be a bad situation but should
either be removed (never expected), be a BUG_ON (since the code would deref
NULL anyway) or more gracefully return an error if seen.

I currently have some DRM_DRIVER_DEBUG errors for cases where userspace
messes up what properties it gives to open a stream - hopefully that sound
ok? I've found it quite helpful to have a readable error for otherwise
vague EINVAL type errors.

I have a debug message I print if we see an invalid HW report, which
*shouldn't* happen but can happen (e.g. if the MUX delay or tail margin
aren't well tuned) and it's helpful to have the feedback, in case we end up
in a situation where we see this kind of message too frequently which might
indicate an issue to investigate.

> - reading 0 sounds more like bad synchronization.

I suppose I haven't thoroughly considered if we should return zero in any
case  - normally that would imply EOF so we get to choose what that implies
here. I don't think the driver should ever return 0 from read() currently.

A few concious choices re: read() return values have been:

- never ever return partial records (or rather even if a partial record
were literally copied into the userspace buffer, but an error were hit in
the middle of copying a full sample then that record wouldn't be accounted
for in the byte count returned.)

- Don't throw away records successfully copied, due to a later error. This
simplifies error handling paths internally and reporting
EAGAIN/ENOSPC/EFAULT errors and means data isn't lost. The precedence for
what we return is 1) did we successfully copy some reports? report bytes
copied for complete records. 2) did we encounter an error? report that if
so. 3) return -EAGAIN. (though for a blocking fd this will be handled
internally).

> Have you tried quiescing

the entire gpu (to make sure nothing is happen) and disabling OA, then
>   updating, and then restarting? At least on a very quick look I didn't
>   spot that. Random delays freak me out a bit, but wouldn't be surprised
>   if really needed.
>

Experimenting yesterday, it seems I can at least reduce the delay to around
15ms (granted that's still pretty huge), and it's also workable to have
userspace sleep for this time (despite the mdelay I originally went with)

Haven't tried this, but yeah could be interesting to experiment if the MUX
config lands faster in different situation such as when the HW is idle.

Thanks,
- Robert

>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20160504/c0a76005/attachment-0001.html>