[Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.

Chris Wilson chris.p.wilson at intel.com
Fri Mar 20 10:06:32 UTC 2020


Quoting Francisco Jerez (2020-03-20 02:46:19)
> Francisco Jerez <currojerez at riseup.net> writes:
> 
> > Francisco Jerez <currojerez at riseup.net> writes:
> >
> >> Chris Wilson <chris at chris-wilson.co.uk> writes:
> >>
> >>> Quoting Francisco Jerez (2020-03-10 21:41:55)
> >>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> index b9b3f78f1324..a5d7a80b826d 100644
> >>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
> >>>>         /* we need to manually load the submit queue */
> >>>>         if (execlists->ctrl_reg)
> >>>>                 writel(EL_CTRL_LOAD, execlists->ctrl_reg);
> >>>> +
> >>>> +       if (execlists_num_ports(execlists) > 1 &&
> >>> pending[1] is always defined; the minimum submission is one slot, with
> >>> pending[1] as the sentinel NULL.
> >>>
> >>>> +           execlists->pending[1] &&
> >>>> +           !atomic_xchg(&execlists->overload, 1))
> >>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
> >>>
> >>> engine->gt
> >>>
> >>
> >> Applied your suggestions above locally, will probably wait to have a few
> >> more changes batched up before sending a v2.
> >>
> >>>>  }
> >>>>  
> >>>>  static bool ctx_single_port_submission(const struct intel_context *ce)
> >>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
> >>>>         clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
> >>>>  
> >>>>         WRITE_ONCE(execlists->active, execlists->inflight);
> >>>> +
> >>>> +       if (atomic_xchg(&execlists->overload, 0)) {
> >>>> +               struct intel_engine_cs *engine =
> >>>> +                       container_of(execlists, typeof(*engine), execlists);
> >>>> +               intel_gt_pm_active_end(&engine->i915->gt);
> >>>> +       }
> >>>>  }
> >>>>  
> >>>>  static inline void
> >>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
> >>>>                         /* port0 completed, advanced to port1 */
> >>>>                         trace_ports(execlists, "completed", execlists->active);
> >>>>  
> >>>> +                       if (atomic_xchg(&execlists->overload, 0))
> >>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
> >>>
> >>> So this loses track if we preempt a dual-ELSP submission with a
> >>> single-ELSP submission (and never go back to dual).
> >>>
> >>
> >> Yes, good point.  You're right that if a dual-ELSP submission gets
> >> preempted by a single-ELSP submission "overload" will remain signaled
> >> until the first completion interrupt arrives (e.g. from the preempting
> >> submission).
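> >>
> >> Concretely (hypothetical sequence of events):
> >>
> >> |   submit {A, B}       -> overload = 1, intel_gt_pm_active_begin()
> >> |   preempt with {C}    -> promotion CSB only; the xchg sits in the
> >> |                          completion path, so overload stays 1
> >> |                          until C's completion CSB arrives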
> >>
> >>> If you move this to the end of the loop and check
> >>>
> >>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
> >>>     intel_gt_pm_active_end(engine->gt);
> >>>
> >>> it covers both preemption/promotion and completion.
> >>>
> >>
> >> That sounds reasonable.
> >>
> >>> However, that will fluctuate quite rapidly. (And runs the risk of
> >>> exceeding the sentinel.)
> >>>
> >>> An alternative approach would be to couple along
> >>> schedule_in/schedule_out
> >>>
> >>> atomic_set(overload, -1);
> >>>
> >>> __execlists_schedule_in:
> >>>     if (!atomic_fetch_inc(overload))
> >>>             intel_gt_pm_active_begin(engine->gt);
> >>> __execlists_schedule_out:
> >>>     if (!atomic_dec_return(overload))
> >>>             intel_gt_pm_active_end(engine->gt);
> >>>
> >>> which would mean we are overloaded as soon as we try to submit an
> >>> overlapping ELSP.
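> >>>
> >>> I.e. the counter transitions (starting from -1) would be:
> >>>
> >>>   in  #1: atomic_fetch_inc() returns -1 -> no notification
> >>>   in  #2: atomic_fetch_inc() returns  0 -> intel_gt_pm_active_begin()
> >>>   out #1: atomic_dec_return() returns  0 -> intel_gt_pm_active_end()
> >>>   out #2: atomic_dec_return() returns -1 -> no notification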
> >>>
> >>
> >> That sounds good to me too, and AFAICT would have roughly the same
> >> behavior as this metric except for the preemption corner case you
> >> mention above.  I'll try this and verify that I get approximately the
> >> same performance numbers.
> >>
> >
> > This suggestion seems to lead to some minor regressions; I'm
> > investigating the issue.  Will send a v2 as soon as I have something
> > along the lines of what you suggested running with equivalent
> > performance to v1.
> 
> I think I've figured out why both of the alternatives we were talking
> about above lead to a couple percent regressions in latency-sensitive
> workloads: In some scenarios it's possible for execlists_dequeue() to
> execute after the GPU has gone idle, but before we've processed the
> corresponding CSB entries, particularly when called from the
> submit_queue() path.  In that case __execlists_schedule_in() will think
> that the next request is overlapping, and tell CPU power management to
> relax, even though the GPU is starving intermittently.
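> 
> Schematically (hypothetical interleaving):
> 
> |   GPU: context completes, CSB entry written
> |   CPU: execlists_dequeue() from the submit_queue() path
> |          -> __execlists_schedule_in() still sees the old context
> |             in flight, counts the new one as overlapping and
> |             relaxes CPU power management
> |   CPU: process_csb() finally consumes the completion entry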
> 
> How about we do the same:
> 
> |       if (atomic_xchg(&execlists->overload, 0))
> |               intel_gt_pm_active_end(engine->gt);
> 
> as in this patch from process_csb() in response to each completion CSB
> entry, which ensures that the system is considered non-GPU-bound as soon
> as the first context completes.  Subsequently if another CSB entry
> signals a dual-ELSP active-to-idle transition or a dual-ELSP preemption
> we call intel_gt_pm_active_begin() directly from process_csb().  If we
> hit a single-ELSP preemption CSB entry we call intel_gt_pm_active_end()
> instead, in order to avoid the problem you pointed out in your previous
> email.
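> 
> Roughly, in the promote/completion branches of process_csb() (untested
> sketch, exact placement assumed):
> 
> |	if (promote) {
> |		/* pending[] -> active[]: preemption or promotion */
> |		if (execlists->active[1]) {
> |			/* dual-ELSP: we are GPU-bound again */
> |			if (!atomic_xchg(&execlists->overload, 1))
> |				intel_gt_pm_active_begin(engine->gt);
> |		} else if (atomic_xchg(&execlists->overload, 0)) {
> |			/* single-ELSP preemption */
> |			intel_gt_pm_active_end(engine->gt);
> |		}
> |	} else {
> |		/* completion: port0 retired, no longer GPU-bound */
> |		if (atomic_xchg(&execlists->overload, 0))
> |			intel_gt_pm_active_end(engine->gt);
> |	}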
> 
> How does that sound to you?  [Still need to verify that it has
> comparable performance to this patch overall.]

Sounds like we're trying to compensate for ksoftirqd latency, which is a
killer overall. How about something as simple as

execlists_submit_ports:
	tasklet_hi_schedule(&execlists->tasklet);

which will then be run immediately from local context at the end of the
direct submission... unless it's already queued on another CPU, in which
case, instead of waiting for that, we may manually try to kick it locally.
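
In context, that's just something like (untested; mirroring the hunk
quoted above):

	static void execlists_submit_ports(struct intel_engine_cs *engine)
	{
		...

		/* we need to manually load the submit queue */
		if (execlists->ctrl_reg)
			writel(EL_CTRL_LOAD, execlists->ctrl_reg);

		/*
		 * process_csb() runs as a HI tasklet, so it should be
		 * executed on this CPU as soon as we leave the
		 * irq/bh-disabled section of direct submission, rather
		 * than waiting for ksoftirqd.
		 */
		tasklet_hi_schedule(&execlists->tasklet);
	}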

As your latency governor is kicked from a worker, iirc, we should still
be executing before it has a chance to process a partial update. I hope.

Anyway, if it is the ksoftirqd latency hurting here, it's not a problem
unique to the governor and I would like to improve it :)
-Chris

