[Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost
Belgaumkar, Vinay
vinay.belgaumkar at intel.com
Sun May 15 05:46:25 UTC 2022
On 5/6/2022 9:43 AM, John Harrison wrote:
> On 5/6/2022 00:18, Tvrtko Ursulin wrote:
>> On 05/05/2022 19:36, John Harrison wrote:
>>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>>> timeouts when the GuC channel is backed up and unable to respond
>>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>>
>>>>> Is it the "Unable to force min freq" error? Do you have a link to
>>>>> the GitLab issue to add to the commit message?
>>>> We don't have a specific error for this one, but we have seen
>>>> similar issues with other blocking H2G calls.
>>>>>
>>>>>> This is seen when waitboosting happens during a stress test.
>>>>>> This patch updates the waitboost path to use a non-blocking
>>>>>> H2G call instead, which returns as soon as the message is
>>>>>> successfully transmitted.
>>>>>
>>>>> AFAIU with this approach, when the CT channel is congested, you
>>>>> instead achieve silent dropping of the waitboost request, right?
>>>> We are hoping it makes it, but just not waiting for it to complete.
>>> We are not 'hoping it makes it'. We know for a fact that it will
>>> make it. We just don't know when. The issue is not about whether the
>>> waitboost request itself gets dropped/lost; it is about the ack that
>>> comes back. The GuC will process the message and it will send an
>>> ack. It's just a question of whether the i915 driver has given up
>>> waiting for it yet. And if it has, then you get the initial 'timed
>>> out waiting for ack' followed by a later 'got unexpected ack' message.
>>>
>>> Whereas, if we make the call asynchronous, there is no ack. i915
>>> doesn't bother waiting and it won't get surprised later.
>>>
>>> Also, note that this is only an issue when GuC itself is backed up.
>>> Normally that requires the creation/destruction of large numbers of
>>> contexts in rapid succession (context management is about the
>>> slowest thing we do with GuC). Some of the IGTs and selftests do
>>> that with thousands of contexts all at once. Those are generally
>>> where we see this kind of problem. It would be highly unlikely (but
>>> not impossible) to hit it in real world usage.
>>
>> Goto ->
>>
>>> The general design philosophy of H2G messages is that asynchronous
>>> mode should be used for everything if at all possible. It is fire
>>> and forget and will all get processed in the order sent (same as
>>> batch buffer execution, really). Synchronous messages should only be
>>> used when an ack/status is absolutely required. E.g. start of day
>>> initialisation or things like TLB invalidation where we need to know
>>> that a cache has been cleared/flushed before updating memory from
>>> the CPU.
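>>>
>>> To make that concrete, a rough sketch using the intel_guc.h helpers
>>> (illustrative only; exact signatures may differ across kernel
>>> versions):
>>>
>>>     /* Synchronous: queues the message, then spins waiting for the
>>>      * GuC's ack/reply before returning. */
>>>     ret = intel_guc_send(guc, action, ARRAY_SIZE(action));
>>>
>>>     /* Asynchronous (fire and forget): returns as soon as the
>>>      * message is queued in the CT buffer; the GuC still processes
>>>      * messages in the order they were sent. */
>>>     ret = intel_guc_send_nb(guc, action, ARRAY_SIZE(action), 0);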
>>>
>>> John.
>>>
>>>
>>>>>
>>>>> It sounds like potentially important feedback from the field to
>>>>> lose so easily. How about adding a drm_notice to the worker when
>>>>> it fails?
>>>>>
>>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with
>>>>> drm_notice and keep the blocking behavior. (I have no idea what the
>>>>> typical time to drain the CT buffer is, and so whether waiting or
>>>>> dropping makes more sense for the effectiveness of waitboosting.)
>>>>>
>>>>> Or, since the congestion /should not/ happen in production, the
>>>>> argument is why complicate things with more code, in which case
>>>>> going with the one-line patch is an easy way forward?
>>
>> Here. Where I did hint I understood the "should not happen in
>> production angle".
>>
>> So the statement is that the GuC is congested processing requests,
>> but the H2G buffer is not congested, so there is no chance
>> intel_guc_send_nb() will fail with no space in that buffer? Sounds a
>> bit unintuitive.
> That's two different things. The problem of no space in the H2G buffer
> is the same whether the call is sent blocking or non-blocking. The
> wait-for-space version is intel_guc_send_busy_loop() rather than
> intel_guc_send_nb(). NB: _busy_loop is a wrapper around _nb, so the
> wait-for-space version is also non-blocking ;). If a non-looping
> version is used (blocking or otherwise) it will return -EBUSY if there
> is no space. So both the original SLPC call and this non-blocking
> version will still get an immediate EBUSY return code if the H2G
> channel is backed up completely.
>
> Whether the code should be handling EBUSY or not is another matter.
> Vinay, does anything higher up do a loop on EBUSY? If not, maybe it
> should be using the _busy_loop() call instead?
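>
> For reference, a rough sketch of what a looping variant of the new
> helper could look like (hypothetical function name; assumes the
> intel_guc_send_busy_loop() signature from intel_guc.h):
>
>     static int guc_action_slpc_set_param_loop(struct intel_guc *guc,
>                                               u8 id, u32 value)
>     {
>         u32 request[] = {
>             GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>             SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>             id,
>             value,
>         };
>         int ret;
>
>         /* Still non-blocking for the ack, but sleeps and retries
>          * internally on -EBUSY until the CT buffer has space. */
>         ret = intel_guc_send_busy_loop(guc, request, ARRAY_SIZE(request),
>                                        0, true);
>
>         return ret > 0 ? -EPROTO : ret;
>     }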
>
> The blocking vs non-blocking distinction is about waiting for a
> response once the command has been successfully sent. The blocking
> case will sit and spin for a reply; the non-blocking case assumes
> success and expects an asynchronous error report on failure. The
> assumption is that the call can't fail unless something is already
> broken - i915 sending invalid data to GuC, for example. And thus any
> failure is in the BUG_ON category rather than the "try again with a
> different approach" and/or "try again later" category.
>
> This is the point of the change. We are currently getting timeout
> errors when the H2G channel has space so the command can be sent, but
> the channel already contains a lot of slow operations. The command has
> been sent and will be processed successfully, it just takes longer
> than the i915 timeout. Given that we don't actually care about the
> completion response for this command, there is no point in either a)
> sitting in a loop waiting for it or b) complaining that it doesn't
> happen in a timely fashion. Hence the plan to make it non-blocking.
>
>>
>> Anyway, it sounds okay to me to use the non-blocking, but I would
>> like to see some logging if the unexpected does happen. Hence I was
>> suggesting the option of adding drm_notice logging if the send fails
>> from the worker. (Because I think other callers would already
>> propagate the error, like sysfs.)
>>
>> err = slpc_force_min_freq(slpc, slpc->boost_freq);
>> if (!err)
>>         slpc->num_boosts++;
>> else
>>         drm_notice(... "Failed to send waitboost request (%d)", err);
> The only error this should ever report would be EBUSY when the H2G
> channel is full. Anything else (ENODEV, EPIPE, etc.) means the system
> is already toast and bigger errors will likely already have been
> reported.
>
> As above, maybe this should be looping on the EBUSY case. Presumably
> it is safe to do so if it was already looping waiting for the
> response. And then printing a notice level warning on more
> catastrophic errors seems reasonable.
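>
> Something along these lines in the worker, perhaps (a sketch; assumes
> slpc_force_min_freq() propagates the send error, and uses the
> slpc_to_i915() helper from intel_guc_slpc.c):
>
>     err = slpc_force_min_freq(slpc, slpc->boost_freq);
>     if (!err)
>         slpc->num_boosts++;
>     else if (err != -EBUSY)
>         /* -EBUSY just means the CT buffer was full and we can try
>          * again on the next boost; anything else is unexpected. */
>         drm_notice(&slpc_to_i915(slpc)->drm,
>                    "Failed to send waitboost request: %d\n", err);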
Not sure if we need an extensive busy loop here. All we are trying to do
(through a tasklet) is bump the min freq. If that fails due to EBUSY, we
will just end up retrying next time around.
Thanks,
Vinay.
>
> John.
>
>>
>> Something like that.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>>> Even if we soften the blow here, the actual timeout error occurs in
>>>> the intel_guc_ct.c code, so we cannot hide that error anyway.
>>>> Making this call non-blocking will achieve both things.
>>>>
>>>> Thanks,
>>>>
>>>> Vinay.
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
>>>>>>   1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> index 1db833da42df..c852f73cf521 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc *slpc)
>>>>>>          return data->header.global_state;
>>>>>>  }
>>>>>>
>>>>>> +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
>>>>>> +{
>>>>>> +        u32 request[] = {
>>>>>> +                GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>>> +                SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>>> +                id,
>>>>>> +                value,
>>>>>> +        };
>>>>>> +        int ret;
>>>>>> +
>>>>>> +        ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>>> +
>>>>>> +        return ret > 0 ? -EPROTO : ret;
>>>>>> +}
>>>>>> +
>>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 value)
>>>>>> +{
>>>>>> +        struct intel_guc *guc = slpc_to_guc(slpc);
>>>>>> +
>>>>>> +        GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>>> +
>>>>>> +        return guc_action_slpc_set_param_nb(guc, id, value);
>>>>>> +}
>>>>>> +
>>>>>>  static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, u32 value)
>>>>>>  {
>>>>>>          u32 request[] = {
>>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct intel_guc_slpc *slpc, u32 freq)
>>>>>>           */
>>>>>>          with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>>> -                ret = slpc_set_param(slpc,
>>>>>> -                                     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> -                                     freq);
>>>>>> -                if (ret)
>>>>>> -                        i915_probe_error(i915, "Unable to force min freq to %u: %d",
>>>>>> -                                         freq, ret);
>>>>>> +                /* Non-blocking request will avoid stalls */
>>>>>> +                ret = slpc_set_param_nb(slpc,
>>>>>> +                                        SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>>> +                                        freq);
>>>>>>          }
>>>>>>
>>>>>>          return ret;
>>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct *work)
>>>>>>           */
>>>>>>          mutex_lock(&slpc->lock);
>>>>>>          if (atomic_read(&slpc->num_waiters)) {
>>>>>> -                slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>>> -                slpc->num_boosts++;
>>>>>> +                if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>>> +                        slpc->num_boosts++;
>>>>>>          }
>>>>>>          mutex_unlock(&slpc->lock);
>>>>>>  }
>>>
>