[Intel-gfx] [PATCH 1/4] drm/i915/perf: rework aging tail workaround

Thu Mar 19 07:44:40 UTC 2020

Discussed with Umesh today. Below is what we came up with.

On Tue, 17 Mar 2020 17:03:30 -0700, Lionel Landwerlin wrote:
> On 16/03/2020 21:23, Dixit, Ashutosh wrote:
> > On Thu, 12 Mar 2020 16:04:59 -0700, Umesh Nerlige Ramappa wrote:
> >> From: Lionel Landwerlin <lionel.g.landwerlin at intel.com>
> >>
> >> We're about to introduce an options to open the perf stream, giving
> >> the user ability to configure how often it wants the kernel to poll
> >> the OA registers for available data.
> >>
> >> Right now the workaround against the OA tail pointer race condition
> >> requires at least twice the internal kernel polling timer to make any
> >> data available.
> >>
> >> This changes introduce checks on the OA data written into the circular
> >> buffer to make as much data as possible available on the first
> >> iteration of the polling timer.
> > Excellent, this absolutely needs to be done, I was thinking it may be
> > possible even with the approach taken in the original code but the approach
> > here looks better. It is also a nice cleanup.
> >
> >> @@ -507,64 +487,76 @@ static bool oa_buffer_check_unlocked(struct i915_perf_stream *stream)
> >>
> >>	now = ktime_get_mono_fast_ns();
> >>
> >> -	/* Update the aged tail
> >> -	 *
> >> -	 * Flip the tail pointer available for read()s once the aging tail is
> >> -	 * old enough to trust that the corresponding data will be visible to
> >> -	 * the CPU...
> >> -	 *
> >> -	 * Do this before updating the aging pointer in case we may be able to
> >> -	 * immediately start aging a new pointer too (if new data has become
> >> -	 * available) without needing to wait for a later hrtimer callback.
> >> -	 */
> >> -	if (aging_tail != INVALID_TAIL_PTR &&
> >> -	    ((now - stream->oa_buffer.aging_timestamp) >
> >> -	     OA_TAIL_MARGIN_NSEC)) {
> >> -
> >> -		aged_idx ^= 1;
> >> -		stream->oa_buffer.aged_tail_idx = aged_idx;
> >> +	if (hw_tail == stream->oa_buffer.aging_tail) {
> >> +		/* If the HW tail hasn't move since the last check and the HW
> >> +		 * tail has been aging for long enough, declare it the new
> >> +		 * tail.
> >> +		 */
> > Is this really needed? True we will never return the last report but maybe
> > it's ok, save some code?
>
> It doesn't look like a lot of code :)

OK, leave as is in the patch.

>
> >
> >> +		if ((now - stream->oa_buffer.aging_timestamp) >
> > Do we need to initialize 'aging_timestamp = ktime_get_mono_fast_ns()' when
> > the stream is enabled?
>
>
> Aye... I think aging_tail should probably be initialized to
> INVALID_TAIL_PTR in init_oa_buffer() vfuncs.
>
> So that on the first call it never matches : if (hw_tail ==
> stream->oa_buffer.aging_tail)
>
> And that aging_timestamp is set properly.

OK, let's do this.

> >
> >> +		    OA_TAIL_MARGIN_NSEC) {
> >> +			stream->oa_buffer.tail =
> >> +				stream->oa_buffer.aging_tail;
> >> +		}
> >> +	} else {
> >> +		u32 head, tail, landed_report_heads;
> >>
> >> -		aged_tail = aging_tail;
> >> +		/* NB: The head we observe here might effectively be a little out of
> >> +		 * date (between head and tails[aged_idx].offset if there is currently
> >> +		 * a read() in progress.
> >> +		 */
> > Is this comment correct? Aren't we are taking the same lock when updating
> > head after the read?
>
>
> It seems the mention of tails[aged_idx].offset is definitely out of date.
>
>
> That being said, the read is concurrent to this function (only the update
> of head is locked).
>
> That means the read code can be going through the reports while we're going
> through a different set of reports in the same OA buffer.

OK, reword the comment so that it doesn't look like it's a locking problem.

>
>
> >
> >> +		head = stream->oa_buffer.head - gtt_offset;
> >>
> >> -		/* Mark that we need a new pointer to start aging... */
> >> -		stream->oa_buffer.tails[!aged_idx].offset = INVALID_TAIL_PTR;
> >> -		aging_tail = INVALID_TAIL_PTR;
> >> -	}
> >> +		hw_tail -= gtt_offset;
> >>
> >> -	/* Update the aging tail
> >> -	 *
> >> -	 * We throttle aging tail updates until we have a new tail that
> >> -	 * represents >= one report more data than is already available for
> >> -	 * reading. This ensures there will be enough data for a successful
> >> -	 * read once this new pointer has aged and ensures we will give the new
> >> -	 * pointer time to age.
> >> -	 */
> >> -	if (aging_tail == INVALID_TAIL_PTR &&
> >> -	    (aged_tail == INVALID_TAIL_PTR ||
> >> -	     OA_TAKEN(hw_tail, aged_tail) >= report_size)) {
> >> -		struct i915_vma *vma = stream->oa_buffer.vma;
> >> -		u32 gtt_offset = i915_ggtt_offset(vma);
> >> -
> >> -		/* Be paranoid and do a bounds check on the pointer read back
> >> -		 * from hardware, just in case some spurious hardware condition
> >> -		 * could put the tail out of bounds...
> >> +		/* Walk the stream backward until we find at least 2 reports
> >> +		 * with dword 0 & 1 not at 0. Since the circular buffer
> >> +		 * pointers progress by increments of 64 bytes and that
> >> +		 * reports can be up to 256 bytes long, we can't tell whether
> >> +		 * a report has fully landed in memory before the first 2
> >> +		 * dwords of the following report have effectively landed.
> >> +		 *
> >> +		 * This is assuming that the writes of the OA unit land in
> >> +		 * memory in the order they were written to.
> >> +		 * If not : (╯°□°）╯︵ ┻━┻
> >>		 */
> >> -		if (hw_tail >= gtt_offset &&
> >> -		    hw_tail < (gtt_offset + OA_BUFFER_SIZE)) {
> >> -			stream->oa_buffer.tails[!aged_idx].offset =
> >> -				aging_tail = hw_tail;
> >> -			stream->oa_buffer.aging_timestamp = now;
> >> -		} else {
> >> -			drm_err(&stream->perf->i915->drm,
> >> -				"Ignoring spurious out of range OA buffer tail pointer = %x\n",
> >> -				hw_tail);
> >> +		landed_report_heads = 0;
> >> +		while (OA_TAKEN(tail, head) >= report_size) {

Rather than OA_TAKEN(tail, head) this should probably be
OA_TAKEN(tail, old_tail), investigating.

> >> +			u32 previous_tail = (tail - report_size) & (OA_BUFFER_SIZE - 1);
> >> +			u8 *report = stream->oa_buffer.vaddr + previous_tail;
> >> +			u32 *report32 = (void *) report;
> > nit: instead of the 2 statements above we can just have a single statement,
> > because stream->oa_buffer.vaddr is u8*:
> >
> >	u32 *report32 = (void *)(stream->oa_buffer.vaddr + previous_tail);
>
>
> Looks right :)
>
>
> >
> >> +
> >> +			/* Head of the report indicated by the HW tail register has
> >> +			 * indeed landed into memory.
> >> +			 */
> >> +			if (report32[0] != 0 || report32[1] != 0) {
> > Just wondering if it is sufficient to check just one of these two (report
> > reason and timestamp)?
>
>
> We could be really unlucky and get on the timestamp = 0.
>
> But if that's the case, we might hit the timeout from above soon enough.
>
> Or we'll see the tail updated and probably find another report that
> matches.

OK, leave as is.

>
>
> >
> >> +				landed_report_heads++;
> > Theoretically we should be checking for "two consecutive" reports, not just
> > any two as the code is doing? Or maybe it's ok to just check once, i.e. if
> > data at previous tail has landed we can break out of the loop?
>
>
> Yeah, I think the code isn't quite "two consecutive".
>
>
> Now that you mention it. It should be "find a report with dwords[0,1] != 0
> and consider the report before that fully landed".
>
> Again assuming the memory controller makes things visible in order they
> were written.

Ok, just need to look at one not at "two consecutive".

>
> >
> >> +
> >> +				if (landed_report_heads >= 2)
> >> +					break;
> >> +			}
> >> +
> >> +			tail = previous_tail;
> >>		}
> >> +
> >> +		if (abs(tail - hw_tail) >= (2 * report_size)) {
> > Shouldn't it be accounted that this is a circular not a linear buffer?
>
>
> I think you're right.

Will fix.

>
>
> >
> >> +			if (__ratelimit(&stream->perf->tail_pointer_race)) {
> >> +				DRM_NOTE("unlanded report(s) head=0x%x "
> >> +					 "tail=0x%x hw_tail=0x%x\n",
> >> +					 head, tail, hw_tail);
> > Because this is a ratelimited output, should we also include a count here
> > to indicate how often this is happening?
>
>
> I think ratelimit() will print some counts when it happens too often.

OK.

>
>
> >
> >> +			}
> >> +		}
> >> +
> >> +		stream->oa_buffer.tail = gtt_offset + tail;
> >> +		stream->oa_buffer.aging_tail = gtt_offset + hw_tail;
> >> +		stream->oa_buffer.aging_timestamp = now;
> >>	}
> >>
> >>	spin_unlock_irqrestore(&stream->oa_buffer.ptr_lock, flags);
> > Hmm, I think I would compute the value to return below and then drop the
> > lock, just in case read moves the head?
>
>
> Correct! Thanks for spotting this.

Actually now I am thinking leaving as is. Note that the update of head
after read is atomic so there is no problem of picking up a inconsistent
head. This way we pick up an actual head, not the un-updated one, and will
probably report mostly correctly if there actually are new reports to be
read.

In reality the circular buffer should be lockless, but that we can look at
that in the future.

>
>
> >
> >> -	return aged_tail == INVALID_TAIL_PTR ?
> >> -		false : OA_TAKEN(aged_tail, head) >= report_size;
> >> +	return OA_TAKEN(stream->oa_buffer.tail - gtt_offset,
> >> +			stream->oa_buffer.head - gtt_offset) >= report_size;
> >> +
> > Delete extra line, run checkpatch?
> >
> >> @@ -838,13 +821,10 @@ static int gen8_append_oa_reports(struct i915_perf_stream *stream,
> >>		}
> >>
> >>		/*
> >> -		 * The above reason field sanity check is based on
> >> -		 * the assumption that the OA buffer is initially
> >> -		 * zeroed and we reset the field after copying so the
> >> -		 * check is still meaningful once old reports start
> >> -		 * being overwritten.
> >> +		 * Clear out the first 2 dword as a mean to detect unlanded
> >> +		 * reports.
> >>		 */
> >> -		report32[0] = 0;
> >> +		report32[0] = report32[1] = 0;
> > Again, how about initialization here? That is, since we are doing this when
> > copying to user buffer, is the gem object allocated in alloc_oa_buffer()
> > zero'd out to begin with?
>
>
> memset to 0 in init_oa_buffer vfuncs.

OK.

>
>
> >
> >> @@ -1559,8 +1528,8 @@ static void gen12_init_oa_buffer(struct i915_perf_stream *stream)
> >>			   gtt_offset & GEN12_OAG_OATAILPTR_MASK);
> >>
> >>	/* Mark that we need updated tail pointers to read from... */
> >> -	stream->oa_buffer.tails[0].offset = INVALID_TAIL_PTR;
> >> -	stream->oa_buffer.tails[1].offset = INVALID_TAIL_PTR;
> >> +	stream->oa_buffer.aging_tail =
> >> +		stream->oa_buffer.tail = gtt_offset;
> > What is the reason for initializing the head and tail first to gtt_offset
> > and later subtracting it out? If hw_tail read from the HW has this offset
> > it seems simpler to just subtract out gtt_offset from hw_tail but leave all
> > these software head/tail not have gtt_offset (initialized to 0)? Though I
> > think this is all over the place so it is probably better left to a
> > separate later patch. So maybe leave as is for now.
>
>
> Quite possibly why I left the gtt_offset in there :)
>
> No recollection though...

Leave as is for this series.

>
>
> >
> > Thanks!
> > --
> > Ashutosh
>
>