[Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost
Belgaumkar, Vinay
vinay.belgaumkar at intel.com
Sun May 15 05:46:25 UTC 2022
On 5/6/2022 9:43 AM, John Harrison wrote:
> On 5/6/2022 00:18, Tvrtko Ursulin wrote:
>> On 05/05/2022 19:36, John Harrison wrote:
>>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>>> timeouts when the GuC channel is backed up and unable to respond
>>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>>
>>>>> Is it the "Unable to force min freq" error? Do you have a link to
>>>>> the GitLab issue to add to the commit message?
>>>> We don't have a specific error for this one, but we have seen
>>>> similar issues with other blocking H2G calls.
>>>>>
>>>>>> This is seen when waitboosting happens during a stress test.
>>>>>> This patch updates the waitboost path to use a non-blocking
>>>>>> H2G call instead, which returns as soon as the message is
>>>>>> successfully transmitted.
>>>>>
>>>>> AFAIU with this approach, when the CT channel is congested, you
>>>>> instead achieve silent dropping of the waitboost request, right?
>>>> We are hoping it makes it, but just not waiting for it to complete.
>>> We are not 'hoping it makes it'. We know for a fact that it will
>>> make it. We just don't know when. The issue is not about whether the
>>> waitboost request itself gets dropped/lost; it is about the ack that
>>> comes back. The GuC will process the message and it will send an
>>> ack. It's just a question of whether the i915 driver has given up
>>> waiting for it yet. And if it has, then you get the initial 'timed
>>> out waiting for ack' followed by a later 'got unexpected ack' message.
>>>
>>> Whereas, if we make the call asynchronous, there is no ack. i915
>>> doesn't bother waiting and it won't get surprised later.
>>>
>>> Also, note that this is only an issue when GuC itself is backed up.
>>> Normally that requires the creation/destruction of large numbers of
>>> contexts in rapid succession (context management is about the
>>> slowest thing we do with GuC). Some of the IGTs and selftests do
>>> that with thousands of contexts all at once. Those are generally
>>> where we see this kind of problem. It would be highly unlikely (but
>>> not impossible) to hit it in real world usage.
>>
>> Goto ->
>>
>>> The general design philosophy of H2G messages is that asynchronous
>>> mode should be used for everything if at all possible. It is fire
>>> and forget and will all get processed in the order sent (same as
>>> batch buffer execution, really). Synchronous messages should only be
>>> used when an ack/status is absolutely required. E.g. start of day
>>> initialisation or things like TLB invalidation where we need to know
>>> that a cache has been cleared/flushed before updating memory from
>>> the CPU.
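>>>
>>> To make that concrete, a rough sketch using the intel_guc.h helpers
>>> (illustrative only; exact signatures may differ across kernel
>>> versions):
>>>
>>>     /* Synchronous: queues the message, then spins waiting for the
>>>      * GuC's ack/reply before returning. */
>>>     ret = intel_guc_send(guc, action, ARRAY_SIZE(action));
>>>
>>>     /* Asynchronous (fire and forget): returns as soon as the
>>>      * message is queued in the CT buffer; the GuC still processes
>>>      * messages in the order they were sent. */
>>>     ret = intel_guc_send_nb(guc, action, ARRAY_SIZE(action), 0);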
>>>
>>> John.
>>>
>>>
>>>>>
>>>>> It sounds like potentially important feedback from the field to
>>>>> lose so easily. How about adding a drm_notice to the worker when
>>>>> it fails?
>>>>>
>>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with
>>>>> drm_notice and keep the blocking behavior. (I have no idea what the
>>>>> typical time to drain the CT buffer is, and so whether waiting or
>>>>> dropping makes more sense for the effectiveness of waitboosting.)
>>>>>
>>>>> Or, since the congestion /should not/ happen in production, the
>>>>> argument is why complicate things with more code, in which case
>>>>> going with the one-line patch is an easy way forward?
>>
>> Here. Where I did hint I understood the "should not happen in
>> production angle".
>>
>> So the statement is that the GuC is congested processing requests,
>> but the H2G buffer is not congested, so there is no chance
>> intel_guc_send_nb() will fail with no space in that buffer? Sounds a
>> bit unintuitive.
> That's two different things. The problem of no space in the H2G buffer
> is the same whether the call is sent blocking or non-blocking. The
> wait-for-space version is intel_guc_send_busy_loop() rather than
> intel_guc_send_nb(). NB: _busy_loop is a wrapper around _nb, so the
> wait-for-space version is also non-blocking ;). If a non-looping
> version is used (blocking or otherwise) it will return -EBUSY if there
> is no space. So both the original SLPC call and this non-blocking
> version will still get an immediate EBUSY return code if the H2G
> channel is backed up completely.
>
> Whether the code should be handling EBUSY or not is another matter.
> Vinay, does anything higher up do a loop on EBUSY? If not, maybe it
> should be using the _busy_loop() call instead?
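>
> For reference, a rough sketch of what a looping variant of the new
> helper could look like (hypothetical function name; assumes the
> intel_guc_send_busy_loop() signature from intel_guc.h):
>
>     static int guc_action_slpc_set_param_loop(struct intel_guc *guc,
>                                               u8 id, u32 value)
>     {
>         u32 request[] = {
>             GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>             SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>             id,
>             value,
>         };
>         int ret;
>
>         /* Still non-blocking for the ack, but sleeps and retries
>          * internally on -EBUSY until the CT buffer has space. */
>         ret = intel_guc_send_busy_loop(guc, request, ARRAY_SIZE(request),
>                                        0, true);
>
>         return ret > 0 ? -EPROTO : ret;
>     }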
>
> The blocking vs non-blocking distinction is about waiting for a
> response once the command has been successfully sent. The blocking
> case will sit and spin for a reply; the non-blocking case assumes
> success and expects an asynchronous error report on failure. The
> assumption is that the call can't fail unless something is already
> broken - i915 sending invalid data to GuC, for example. And thus any
> failure is in the BUG_ON category rather than the "try again with a
> different approach" and/or "try again later" category.
>
> This is the point of the change. We are currently getting timeout
> errors when the H2G channel has space so the command can be sent, but
> the channel already contains a lot of slow operations. The command has
> been sent and will be processed successfully, it just takes longer
> than the i915 timeout. Given that we don't actually care about the
> completion response for this command, there is no point in either a)
> sitting in a loop waiting for it or b) complaining that it doesn't
> happen in a timely fashion. Hence the plan to make it non-blocking.
>
>>
>> Anyway, it sounds okay to me to use the non-blocking, but I would
>> like to see some logging if the unexpected does happen. Hence I was
>> suggesting the option of adding drm_notice logging if the send fails
>> from the worker. (Because I think other callers would already
>> propagate the error, like sysfs.)
>>
>> err = slpc_force_min_freq(slpc, slpc->boost_freq);
>> if (!err)
>>         slpc->num_boosts++;
>> else
>>         drm_notice(... "Failed to send waitboost request (%d)", err);
> The only error this should ever report would be EBUSY when the H2G
> channel is full. Anything else (ENODEV, EPIPE, etc.) means the system
> is already toast and bigger errors will likely already have been
> reported.
>
> As above, maybe this should be looping on the EBUSY case. Presumably
> it is safe to do so if it was already looping waiting for the
> response. And then printing a notice level warning on more
> catastrophic errors seems reasonable.
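>
> Something along these lines in the worker, perhaps (a sketch; assumes
> slpc_force_min_freq() propagates the send error, and uses the
> slpc_to_i915() helper from intel_guc_slpc.c):
>
>     err = slpc_force_min_freq(slpc, slpc->boost_freq);
>     if (!err)
>         slpc->num_boosts++;
>     else if (err != -EBUSY)
>         /* -EBUSY just means the CT buffer was full and we can try
>          * again on the next boost; anything else is unexpected. */
>         drm_notice(&slpc_to_i915(slpc)->drm,
>                    "Failed to send waitboost request: %d\n", err);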
Not sure if we need an extensive busy loop here. All we are trying to do
(through a tasklet) is bump the min freq. If that fails due to EBUSY, we
will just end up retrying next time around.
Thanks,
Vinay.
>
> John.
>
>>
>> Something like that.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>>> Even if we soften the blow here, the actual timeout error occurs in
>>>> the intel_guc_ct.c code, so we cannot hide that error anyway.
>>>> Making this call non-blocking will achieve both things.
>>>>
>>>> Thanks,
>>>>
>>>> Vinay.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
>>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> index 1db833da42df..c852f73cf521 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc *slpc)
>>>>>>          return data->header.global_state;
>>>>>>  }
>>>>>>
>>>>>> +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
>>>>>> +{
>>>>>> +        u32 request[] = {
>>>>>> +                GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>>> +                SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>>> +                id,
>>>>>> +                value,
>>>>>> +        };
>>>>>> +        int ret;
>>>>>> +
>>>>>> +        ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>>> +
>>>>>> +        return ret > 0 ? -EPROTO : ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 value)
>>>>>> +{
>>>>>> +        struct intel_guc *guc = slpc_to_guc(slpc);
>>>>>> +
>>>>>> +        GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>>> +
>>>>>> +        return guc_action_slpc_set_param_nb(guc, id, value);
>>>>>> +}
>>>>>> +
>>>>>>  static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, u32 value)
>>>>>>  {
>>>>>>          u32 request[] = {
>>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct intel_guc_slpc *slpc, u32 freq)
>>>>>>           */
>>>>>>          with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>>> -                ret = slpc_set_param(slpc,
>>>>>> -                                     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> -                                     freq);
>>>>>> -                if (ret)
>>>>>> -                        i915_probe_error(i915, "Unable to force min freq to %u: %d",
>>>>>> -                                         freq, ret);
>>>>>> +                /* Non-blocking request will avoid stalls */
>>>>>> +                ret = slpc_set_param_nb(slpc,
>>>>>> +                                        SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> +                                        freq);
>>>>>>          }
>>>>>>
>>>>>>          return ret;
>>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct *work)
>>>>>>           */
>>>>>>          mutex_lock(&slpc->lock);
>>>>>>          if (atomic_read(&slpc->num_waiters)) {
>>>>>> -                slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>>> -                slpc->num_boosts++;
>>>>>> +                if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>>> +                        slpc->num_boosts++;
>>>>>>          }
>>>>>>          mutex_unlock(&slpc->lock);
>>>>>>  }
>>>
>