[Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Nov 23 09:15:10 UTC 2021
On 22/11/2021 23:39, Umesh Nerlige Ramappa wrote:
> On Mon, Nov 22, 2021 at 03:44:29PM +0000, Tvrtko Ursulin wrote:
>>
>> On 11/11/2021 16:48, Umesh Nerlige Ramappa wrote:
>>> On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>>>> On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>>>> Since the PMU callback runs in irq context, it synchronizes with gt
>>>>>>> reset using the reset count. We could run into a case where the PMU
>>>>>>> callback could read the reset count before it is updated. This has
>>>>>>> the potential of corrupting the busyness stats.
>>>>>>>
>>>>>>> In addition to the reset count, check if the reset bit is set before
>>>>>>> capturing busyness.
>>>>>>>
>>>>>>> In addition, save the previous stats only if we intend to update
>>>>>>> them.
>>>>>>>
>>>>>>> Signed-off-by: Umesh Nerlige Ramappa
>>>>>>> <umesh.nerlige.ramappa at intel.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> index 5cc49c0b3889..d83ade77ca07 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>>>>>>  	u64 total, gt_stamp_saved;
>>>>>>>  	unsigned long flags;
>>>>>>>  	u32 reset_count;
>>>>>>> +	bool in_reset;
>>>>>>>
>>>>>>>  	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>>>>
>>>>>>> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>>>>>>  	 * engine busyness from GuC, so we just use the driver stored
>>>>>>>  	 * copy of busyness. Synchronize with gt reset using reset_count.
>>>>>>>  	 */
>>>>>>> -	reset_count = i915_reset_count(gpu_error);
>>>>>>> +	rcu_read_lock();
>>>>>>> +	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>>>> +	rcu_read_unlock();
>>>>>>
>>>>>> I don't really understand the point of rcu_read_lock around
>>>>>> test_bit, but I guess you copied it from the trylock loop.
>>>>>
>>>>> Yes, I don't see other parts of the code using the lock though. I
>>>>> can drop it.
>>>>>
>>>>>>
>>>>>>>  	*now = ktime_get();
>>>>>>>
>>>>>>> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>>>>>>  	 * start_gt_clk is derived from GuC state. To get a consistent
>>>>>>>  	 * view of activity, we query the GuC state only if gt is awake.
>>>>>>>  	 */
>>>>>>> -	stats_saved = *stats;
>>>>>>> -	gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>>>> -	if (intel_gt_pm_get_if_awake(gt)) {
>>>>>>> +	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>>>>
>>>>>> What is the point of looking at the old value of in_reset here?
>>>>>> Gut feeling says that if there is a race, this does not fix it.
>>>>>>
>>>>>> I did not figure out from the commit message what "could read the
>>>>>> reset count before it is updated" means. I thought the point of
>>>>>> reading the reset count twice was that you are sure there was no
>>>>>> reset while in here, in which case it is safe to update the
>>>>>> software copy. I don't easily see what test_bit does on top.
>>>>>
>>>>> This is what I see in the reset flow
>>>>> ---------------
>>>>>
>>>>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>>>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>>>> R3) reset prepare
>>>>> R4) do the HW reset
>>>>>
>>>>> The reset count is updated only once above and that's before an
>>>>> actual HW reset happens.
>>>>>
>>>>> PMU callback flow before this patch
>>>>> ---------------
>>>>>
>>>>> P1) read reset count
>>>>> P2) update stats
>>>>> P3) read reset count
>>>>> P4) if reset count changed, use old stats; if not, use updated stats.
>>>>>
>>>>> I am concerned that the whole PMU flow could run between steps (R2)
>>>>> and (R4). Both reads of the count then return the same value and we
>>>>> wrongly conclude that no HW reset happened while we were sampling.
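>>>>>
>>>>> To make the window concrete, here is a condensed sketch of the
>>>>> pre-patch callback logic (simplified from guc_engine_busyness(),
>>>>> not the exact driver code):
>>>>>
>>>>> 	/* P1: sample the reset count once */
>>>>> 	reset_count = i915_reset_count(gpu_error);
>>>>>
>>>>> 	/* P2: refresh the software busyness copy from GuC state */
>>>>> 	stats_saved = *stats;
>>>>> 	gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>> 	guc_update_engine_gt_clks(engine);
>>>>>
>>>>> 	/* P3/P4: discard the update if a reset raced with us */
>>>>> 	if (i915_reset_count(gpu_error) != reset_count) {
>>>>> 		*stats = stats_saved;
>>>>> 		guc->timestamp.gt_stamp = gt_stamp_saved;
>>>>> 	}
>>>>>
>>>>> If both samples land in the window between (R2) and (R4), the two
>>>>> counts match and we trust state that the HW reset is about to
>>>>> invalidate.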
>>>
>>> Here is the problematic sequence: Threads R and P.
>>> ------------
>>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>> P1) read reset count
>>> P2) update stats
>>> P3) read reset count
>>> P4) if reset count changed, use old stats; if not, use updated stats.
>>> R3) reset prepare
>>> R4) do the HW reset
>>>
>>> Do you agree that this is racy? In thread P we don't know if the
>>> reset flag was set or not when we captured the reset count in P1.
>>>
>>>>>
>>>>> PMU callback flow with this patch
>>>>> ---------------
>>>>> This would rely on the reset_count only if a reset is not in progress.
>>>>>
>>>>> P0) test_bit for I915_RESET_BACKOFF
>>>>> P1) read reset count if not in reset. if in reset, use old stats
>>>>> P2) update stats
>>>>> P3) read reset count
>>>>> P4) if reset count changed, use old stats; if not, use updated stats.
>>>>>
>>>>> Now that I think about it more, I do see one sequence that still
>>>>> needs fixing though - P0, R1, R2, P1 - P4. For that, I think I need
>>>>> to re-read the BACKOFF bit after reading the reset_count for the
>>>>> first time.
>>>>> The modified PMU callback sequence would be:
>>>>> ----------
>>>>>
>>>>> M0) test_bit for I915_RESET_BACKOFF
>>>>> M1) read reset count if not in reset, if in reset, use old stats
>>>>>
>>>>> M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if
>>>>> not, use reset_count to synchronize
>>>>>
>>>>> M2) update stats
>>>>> M3) read reset count
>>>>> M4) if reset count changed, use old stats; if not, use updated stats.
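>>>>>
>>>>> In pseudo-C, the modified sequence would look roughly like this
>>>>> (a sketch only, not a finished patch):
>>>>>
>>>>> 	/* M0: bail out early if a reset is already in progress */
>>>>> 	if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags))
>>>>> 		return stored_busyness;	/* hypothetical fallback */
>>>>>
>>>>> 	/* M1: sample the count */
>>>>> 	reset_count = i915_reset_count(gpu_error);
>>>>>
>>>>> 	/*
>>>>> 	 * M1.1: re-check the flag, so that a reset which started
>>>>> 	 * between M0 and M1 cannot hand us a count that was already
>>>>> 	 * bumped by (R2).
>>>>> 	 */
>>>>> 	if (test_bit(I915_RESET_BACKOFF, &gt->reset.flags))
>>>>> 		return stored_busyness;
>>>>>
>>>>> 	/* M2-M4: update stats, re-read the count, discard on mismatch */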
>>>>
>>>> You did not end up implementing this flow? Have you later changed
>>>> your mind about whether it is required? Or maybe I am not looking
>>>> at the latest patch.
>>>>
>>>> Is the below the latest?
>>>>
>>>> """
>>>> v2:
>>>> - The 2 reset counts captured in the PMU callback can end up being the
>>>> same if they were captured right after the count is incremented in the
>>>> reset flow. This can lead to a bad busyness state. Ensure that reset
>>>> is not in progress when the initial reset count is captured.
>>>> """
>>>
>>> Yes, v2 is the latest (maybe CI results re-ordered the patches).
>>> Instead of sampling the BACKOFF flag before and after the reset count
>>> (as in the modified sequence), I just sample it after. The order is
>>> critical: first sample the reset count and then the reset flag.
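>>>
>>> Condensed, the v2 flow is something like this (sketch; the flag
>>> read must come after the first count read):
>>>
>>> 	reset_count = i915_reset_count(gpu_error);
>>> 	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>
>>> 	if (!in_reset && intel_gt_pm_get_if_awake(gt)) {
>>> 		stats_saved = *stats;
>>> 		gt_stamp_saved = guc->timestamp.gt_stamp;
>>> 		guc_update_engine_gt_clks(engine);
>>> 		intel_gt_pm_put_async(gt);
>>>
>>> 		/* a reset started after our samples: roll back */
>>> 		if (i915_reset_count(gpu_error) != reset_count) {
>>> 			*stats = stats_saved;
>>> 			guc->timestamp.gt_stamp = gt_stamp_saved;
>>> 		}
>>> 	}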
>>>
>>>>
>>>> Is the key now that you rely on ordering of atomic_inc and set_bit
>>>> in the reset path?
>>>
>>> Yes
>>>
>>>> Frankly I still don't understand why you can get away with using
>>>> stale in_reset in v2. If you acknowledge it can change between
>>>> sampling and checking, then what is the point of having it at all?
>>>> You still solely rely on the reset count in that case, no?
>>>
>>> Correct, but now I know for sure that the first sample of reset_count
>>> was captured when reset flag was not set (since I am relying on the
>>> order of sampling).
>>>
>>> About solely using the reset_count, I have listed the problematic
>>> sequence above to highlight what the issue is.
>>
>> It was this:
>>
>> """
>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>> P1) read reset count
>> P2) update stats
>> P3) read reset count
>> P4) if reset count changed, use old stats; if not, use updated stats.
>> R3) reset prepare
>> R4) do the HW reset
>>
>> Do you agree that this is racy? In thread P we don't know if the
>> reset flag was set or not when we captured the reset count in P1.
>> """
>>
>> Why does it matter if the reset flag was set or not? After this
>> patch it ends up like this:
>>
>> P1) Read and store reset bit
>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>> P2) If reset bit was not set:
>> P2.1) read reset count
>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>> P2.2) update stats
>> P2.3) read reset count
>> P2.4) if reset count changed, use old stats; if not, use
>> updated stats.
>> R3) reset prepare
>> R4) do the HW reset
>>
>> So the reset bit got set between P1 and P2. How is that then not the
>> same as not looking at the reset bit at all?
>
> But the new sequence in this patch is this:
Oops, I was looking at v1. Okay, I can't come up with any further races,
acked.
Regards,
Tvrtko
>
> P0) read reset count
> P1) Read and store reset bit
> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
> P2) If reset bit was not set:
> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
> P2.2) update stats
> P2.3) read reset count
> P2.4) if reset count changed, use old stats; if not, use updated
> stats.
> R3) reset prepare
> R4) do the HW reset
>
> P2.1 moved to P0 when compared to the sequence you shared above.
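>
> In code terms, the two possible outcomes at P1 are (sketch, with a
> hypothetical live_busyness() helper):
>
> 	if (in_reset) {
> 		/* a reset is in flight: report the stored copy */
> 		total = stats->total;
> 	} else {
> 		/*
> 		 * The flag was clear when sampled after P0, so a reset
> 		 * that begins later increments reset_count after our
> 		 * first read; the P2.3 re-read then mismatches and we
> 		 * fall back to the stored copy anyway.
> 		 */
> 		total = live_busyness(engine);
> 	}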
>
> Thanks,
> Umesh
>
>>
>> Regards,
>>
>> Tvrtko