[Intel-gfx] [PATCH] drm/i915/guc/slpc: Use non-blocking H2G for waitboost
Belgaumkar, Vinay
vinay.belgaumkar at intel.com
Fri May 6 16:21:27 UTC 2022
On 5/6/2022 12:18 AM, Tvrtko Ursulin wrote:
>
> On 05/05/2022 19:36, John Harrison wrote:
>> On 5/5/2022 10:21, Belgaumkar, Vinay wrote:
>>> On 5/5/2022 5:13 AM, Tvrtko Ursulin wrote:
>>>> On 05/05/2022 06:40, Vinay Belgaumkar wrote:
>>>>> SLPC min/max frequency updates require H2G calls. We are seeing
>>>>> timeouts when the GuC channel is backed up and unable to respond
>>>>> in a timely fashion, causing warnings and affecting CI.
>>>>
>>>> Is it the "Unable to force min freq" error? Do you have a link to
>>>> the GitLab issue to add to the commit message?
>>> We don't have a specific error for this one, but we have seen similar
>>> issues with other H2G calls which are blocking.
>>>>
>>>>> This is seen when waitboosting happens during a stress test.
>>>>> This patch updates the waitboost path to use a non-blocking
>>>>> H2G call instead, which returns as soon as the message is
>>>>> successfully transmitted.
>>>>
>>>> AFAIU with this approach, when the CT channel is congested, you
>>>> instead achieve silent dropping of the waitboost request, right?
>>> We are hoping it makes it, but just not waiting for it to complete.
>> We are not 'hoping it makes it'. We know for a fact that it will make
>> it. We just don't know when. The issue is not about whether the
>> waitboost request itself gets dropped/lost; it is about the ack that
>> comes back. The GuC will process the message and it will send an ack.
>> It's just a question of whether the i915 driver has given up waiting
>> for it yet. And if it has, then you get the initial 'timed out
>> waiting for ack' followed by a later 'got unexpected ack' message.
>>
>> Whereas, if we make the call asynchronous, there is no ack. i915
>> doesn't bother waiting and it won't get surprised later.
>>
>> Also, note that this is only an issue when GuC itself is backed up.
>> Normally that requires the creation/destruction of large numbers of
>> contexts in rapid succession (context management is about the slowest
>> thing we do with GuC). Some of the IGTs and selftests do that with
>> thousands of contexts all at once. Those are generally where we see
>> this kind of problem. It would be highly unlikely (but not
>> impossible) to hit it in real world usage.
>
> Goto ->
>
>> The general design philosophy of H2G messages is that asynchronous
>> mode should be used for everything if at all possible. It is fire and
>> forget, and messages are all processed in the order they were sent
>> (same as batch buffer execution, really). Synchronous messages should
>> only be used
>> when an ack/status is absolutely required. E.g. start of day
>> initialisation or things like TLB invalidation where we need to know
>> that a cache has been cleared/flushed before updating memory from the
>> CPU.
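>>
>> As a rough sketch of the difference from the caller's side (simplified;
>> see intel_guc.h for the real prototypes):
>>
>>     /* blocking: does not return until the GuC has processed the
>>      * request and sent back its ack (or the wait times out)
>>      */
>>     ret = intel_guc_send(guc, action, ARRAY_SIZE(action));
>>
>>     /* non-blocking: returns as soon as the message is queued in the
>>      * CT buffer; no ack is waited for
>>      */
>>     ret = intel_guc_send_nb(guc, action, ARRAY_SIZE(action), 0);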
>>
>> John.
>>
>>
>>>>
>>>> It sounds like potentially important feedback from the field to
>>>> lose so easily. How about you add a drm_notice to the worker when
>>>> it fails?
>>>>
>>>> Or simply a "one line patch" to replace i915_probe_error (!?) with
>>>> drm_notice and keep the blocking behavior. (I have no idea what the
>>>> typical time to drain the CT buffer is, and so whether waiting or
>>>> dropping makes more sense for the effectiveness of waitboosting.)
>>>>
>>>> Or, since the congestion /should not/ happen in production, the
>>>> argument is why complicate things with more code, in which case
>>>> going with the one line patch is an easy way forward?
>
> Here. Where I did hint that I understood the "should not happen in
> production" angle.
>
> So the statement is that the GuC is congested in processing requests,
> but the H2G buffer is not congested, so there is no chance
> intel_guc_send_nb() will fail with no space in that buffer? Sounds a
> bit unintuitive.
>
> Anyway, it sounds okay to me to use the non-blocking send, but I would
> like to see some logging if the unexpected does happen. Hence I was
> suggesting the option of adding drm_notice logging if the send fails
> from the worker. (Because I think other callers, like sysfs, would
> already propagate the error.)
>
>     err = slpc_force_min_freq(slpc, slpc->boost_freq);
>     if (!err)
>             slpc->num_boosts++;
>     else
>             drm_notice(... "Failed to send waitboost request (%d)", err);
Ok, makes sense. Will send out another rev with this change.
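
Something like this in slpc_boost_work() (rough sketch of the next rev,
assuming the existing slpc_to_i915() helper; exact message wording TBD):

    err = slpc_force_min_freq(slpc, slpc->boost_freq);
    if (!err)
            slpc->num_boosts++;
    else
            drm_notice(&slpc_to_i915(slpc)->drm,
                       "Failed to send waitboost request (%d)\n", err);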
Thanks,
Vinay.
>
> Something like that.
>
> Regards,
>
> Tvrtko
>
>
>>> Even if we soften the blow here, the actual timeout error occurs in
>>> the intel_guc_ct.c code, so we cannot hide that error anyway.
>>> Making this call non-blocking will address both issues.
>>>
>>> Thanks,
>>>
>>> Vinay.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar at intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 38 ++++++++++++++++-----
>>>>>  1 file changed, 30 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> index 1db833da42df..c852f73cf521 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c
>>>>> @@ -98,6 +98,30 @@ static u32 slpc_get_state(struct intel_guc_slpc *slpc)
>>>>>          return data->header.global_state;
>>>>>  }
>>>>>  
>>>>> +static int guc_action_slpc_set_param_nb(struct intel_guc *guc, u8 id, u32 value)
>>>>> +{
>>>>> +        u32 request[] = {
>>>>> +                GUC_ACTION_HOST2GUC_PC_SLPC_REQUEST,
>>>>> +                SLPC_EVENT(SLPC_EVENT_PARAMETER_SET, 2),
>>>>> +                id,
>>>>> +                value,
>>>>> +        };
>>>>> +        int ret;
>>>>> +
>>>>> +        ret = intel_guc_send_nb(guc, request, ARRAY_SIZE(request), 0);
>>>>> +
>>>>> +        return ret > 0 ? -EPROTO : ret;
>>>>> +}
>>>>> +
>>>>> +static int slpc_set_param_nb(struct intel_guc_slpc *slpc, u8 id, u32 value)
>>>>> +{
>>>>> +        struct intel_guc *guc = slpc_to_guc(slpc);
>>>>> +
>>>>> +        GEM_BUG_ON(id >= SLPC_MAX_PARAM);
>>>>> +
>>>>> +        return guc_action_slpc_set_param_nb(guc, id, value);
>>>>> +}
>>>>> +
>>>>>  static int guc_action_slpc_set_param(struct intel_guc *guc, u8 id, u32 value)
>>>>>  {
>>>>>          u32 request[] = {
>>>>> @@ -208,12 +232,10 @@ static int slpc_force_min_freq(struct intel_guc_slpc *slpc, u32 freq)
>>>>>           */
>>>>>          with_intel_runtime_pm(&i915->runtime_pm, wakeref) {
>>>>> -                ret = slpc_set_param(slpc,
>>>>> -                                     SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> -                                     freq);
>>>>> -                if (ret)
>>>>> -                        i915_probe_error(i915, "Unable to force min freq to %u: %d",
>>>>> -                                         freq, ret);
>>>>> +                /* Non-blocking request will avoid stalls */
>>>>> +                ret = slpc_set_param_nb(slpc,
>>>>> +                                        SLPC_PARAM_GLOBAL_MIN_GT_UNSLICE_FREQ_MHZ,
>>>>> +                                        freq);
>>>>>          }
>>>>>  
>>>>>          return ret;
>>>>> @@ -231,8 +253,8 @@ static void slpc_boost_work(struct work_struct *work)
>>>>>           */
>>>>>          mutex_lock(&slpc->lock);
>>>>>          if (atomic_read(&slpc->num_waiters)) {
>>>>> -                slpc_force_min_freq(slpc, slpc->boost_freq);
>>>>> -                slpc->num_boosts++;
>>>>> +                if (!slpc_force_min_freq(slpc, slpc->boost_freq))
>>>>> +                        slpc->num_boosts++;
>>>>>          }
>>>>>          mutex_unlock(&slpc->lock);
>>>>>  }
>>