[Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Tue Nov 8 09:08:50 UTC 2022


On 07/11/2022 19:45, John Harrison wrote:
> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>> On 04/11/2022 17:45, John Harrison wrote:
>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison at Intel.com wrote:
>>>>>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>>>
>>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>>> sysfs with
>>>>>>>>>>
>>>>>>>>>> sysfs?
>>>>>>>> Sorry, that was meant to say debugfs. I've also been working on 
>>>>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>>>> with an
>>>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>>>> when CI
>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>>> system times
>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>>>> can't do
>>>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>>>> reboots.
>>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>>> sometimes not.
>>>>>>>>>>>
>>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>>> timeout to
>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>>             DROP_RCU)
>>>>>>>>>>> +
>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also 
>>>>>>>>>> only used
>>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> So move here, dropping i915 prefix, next to the newly proposed 
>>>>>>>>> one?
>>>>>>>> Sure, can do that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>
>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>
>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>
>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>
>>>>>>>> These two are totally unrelated and in code not being touched by 
>>>>>>>> this change. I would rather not conflate changing random other 
>>>>>>>> things with fixing this specific issue.
>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>
>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>
>>>>>>>>>> My head spins.
>>>>>>>>>
>>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>>> My original intention for the name was that it is the 'drop caches 
>>>>>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful 
>>>>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised 
>>>>>>>> later that name can be conflated with the DROP_IDLE flag. Will 
>>>>>>>> rename.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Things get refactored, code moves around, bits get left behind, 
>>>>>>>>> who knows. No reason to get too worked up. :) As long as people 
>>>>>>>>> are taking a wider view when touching the code base, and are 
>>>>>>>>> not afraid to send cleanups, things should be good.
>>>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>>>> because someone points out some completely unrelated piece of 
>>>>>>>> code could be a bit better then nothing ever gets fixed. If you 
>>>>>>>> spot something that you think should be improved, isn't the 
>>>>>>>> general idea that you should post a patch yourself to improve it?
>>>>>>>
>>>>>>> There are two maintainers per branch and an order of magnitude or 
>>>>>>> two more developers, so it'd be nice if cleanups would just be 
>>>>>>> incoming on a self-initiative basis. ;)
>>>>>>>
>>>>>>>>> For the actual functional change at hand - it would be nice if 
>>>>>>>>> the code paths in question could handle SIGINT and then we could 
>>>>>>>>> punt the decision on how long someone wants to wait purely to 
>>>>>>>>> userspace. But it's probably hard and it's only debugfs so 
>>>>>>>>> whatever.
>>>>>>>>>
>>>>>>>> The code paths in question will already abort on a signal, won't 
>>>>>>>> they? Both intel_gt_wait_for_idle() and 
>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>>>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>>>>>> asking for is a change in the IGT libraries and/or CI framework 
>>>>>>>> to start sending signals after some specific timeout. That seems 
>>>>>>>> like a significantly more complex change (in terms of the number 
>>>>>>>> of entities affected and number of groups involved) and 
>>>>>>>> unnecessary.
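>>>>>>>>
>>>>>>>> For reference, both waits follow the usual interruptible-wait 
>>>>>>>> shape. Schematically (a sketch only, not the exact i915 code; 
>>>>>>>> done() and timeout stand in for the per-path idle condition and 
>>>>>>>> wait budget):
>>>>>>>>
>>>>>>>>     set_current_state(TASK_INTERRUPTIBLE);
>>>>>>>>     while (!done() && timeout) {
>>>>>>>>         if (signal_pending(current))
>>>>>>>>             break;      /* caller returns -EINTR */
>>>>>>>>         timeout = schedule_timeout(timeout);
>>>>>>>>         set_current_state(TASK_INTERRUPTIBLE);
>>>>>>>>     }
>>>>>>>>     __set_current_state(TASK_RUNNING);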
>>>>>>>
>>>>>>> If you say so, I haven't looked at them all. But if the code path 
>>>>>>> in question already aborts on signals then I am not sure what the 
>>>>>>> patch is fixing. I assumed you are trying to avoid the write 
>>>>>>> stuck in D forever, which then prevents driver unload and 
>>>>>>> everything, requiring the test runner to eventually reboot. If 
>>>>>>> you say SIGINT works then you can already recover from userspace, 
>>>>>>> no?
>>>>>>>
>>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>>>> most half of the test runner timeout.
>>>>>>>> This is supposed to be test clean up. This is not about how long 
>>>>>>>> a particular test takes to complete but about how long it takes 
>>>>>>>> to declare the system broken after the test has already 
>>>>>>>> finished. I would argue that even 10s is massively longer than 
>>>>>>>> required.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not convinced that wedging is correct though. Conceptually it 
>>>>>>>>> could be just that the timeout is too short. What does wedging 
>>>>>>>>> really give us, on top of limiting the wait, when the latter AFAIU 
>>>>>>>>> is the key factor which would prevent the need to reboot the 
>>>>>>>>> machine?
>>>>>>>>>
>>>>>>>> It gives us a system that knows what state it is in. If we can't 
>>>>>>>> idle the GT then something is very badly wrong. Wedging 
>>>>>>>> indicates that. It also ensure that a full GT reset will be 
>>>>>>>> attempted before the next test is run. Helping to prevent a 
>>>>>>>> failure on test X from propagating into failures of unrelated 
>>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because 
>>>>>>>> the system is really that badly broken then future tests will 
>>>>>>>> not run rather than report erroneous failures.
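>>>>>>>>
>>>>>>>> Schematically the change amounts to something like this (a 
>>>>>>>> sketch; assuming intel_gt_set_wedged() is the right entry point 
>>>>>>>> for the wedge/reset path, the exact error handling may differ):
>>>>>>>>
>>>>>>>>     ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
>>>>>>>>     if (ret == -ETIME) {
>>>>>>>>         /* Could not idle: state unknown, force a full GT
>>>>>>>>          * reset before the next test runs. */
>>>>>>>>         intel_gt_set_wedged(gt);
>>>>>>>>     } else if (ret) {
>>>>>>>>         return ret;
>>>>>>>>     }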
>>>>>>>>
>>>>>>>> This is not about getting a more stable system for end users by 
>>>>>>>> sweeping issues under the carpet and pretending all is well. End 
>>>>>>>> users don't run IGTs or explicitly call dodgy debugfs entry 
>>>>>>>> points. The sole motivation here is to get more accurate results 
>>>>>>>> from CI. That is, correctly identifying which test has hit a 
>>>>>>>> problem, getting valid debug analysis for that test (logs and 
>>>>>>>> such) and allowing further testing to complete correctly in the 
>>>>>>>> case where the system can be recovered.
>>>>>>>
>>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>>> just want a clear statement if this is something IGT / test 
>>>>>>> runner could already do or not. It can apply a timeout, it can 
>>>>>>> also send SIGINT, and it could even trigger a reset from outside. 
>>>>>>> Sure, it is debugfs hacks, so the general "kernel should not 
>>>>>>> implement policy" rule need not be strictly followed, but let's 
>>>>>>> make it clear what the options are.
>>>>>>
>>>>>> One conceptual problem with applying this policy is that the code is:
>>>>>>
>>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>         if (ret)
>>>>>>             return ret;
>>>>>>     }
>>>>>>
>>>>>>     if (val & DROP_IDLE) {
>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>         if (ret)
>>>>>>             return ret;
>>>>>>     }
>>>>>>
>>>>>> So if someone passes in DROP_IDLE, why would only the first branch 
>>>>>> have a short timeout and wedge? Yes, some bug happens to be there at 
>>>>>> the moment, but put a bug in a different place and you hang on the 
>>>>>> second branch and then need another patch. Versus perhaps making it 
>>>>>> all respect SIGINT and handling it from outside.
>>>>>>
>>>>> The pm_wait_for_idle can only be called after gt_wait_for_idle has 
>>>>> completed successfully. There is no route to skip the GT idle or to 
>>>>> do the PM idle if the GT idle fails. So the chances of the PM 
>>>>> idle failing are greatly reduced. There would have to be something 
>>>>> outside of a GT keeping the GPU awake and there isn't a whole lot 
>>>>> of hardware left at that point!
>>>>
>>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>>> here and we add a timeout, tomorrow bug is there and then the same 
>>>> dance. It can be just a sw bug which forgets to release the pm ref 
>>>> in some circumstances, doesn't really matter.
>>>>
>>> Huh?
>>>
>>> Greatly reduced is the whole point. Today there is a bug and it 
>>> causes a kernel hang which requires the CI framework to reboot the 
>>> system in an extremely unfriendly way which makes it very hard to 
>>> work out what happened. Logs are likely not available. We don't even 
>>> necessarily know which test was being run at the time. Etc. So we 
>>> replace the infinite timeout with a meaningful timeout. CI now 
>>> correctly marks the single test as failing, captures all the correct 
>>> logs, creates a useful bug report and continues on testing more stuff.
>>
>> So what is preventing CI from collecting logs if IGT is forever stuck 
>> in an interruptible wait? Surely it can collect the logs at that point if 
>> the kernel is healthy enough. If it isn't then I don't see how wedging 
>> the GPU will make the kernel any healthier.
>>
>> Is i915 preventing better log collection, or could the test runner be 
>> improved?
>>
>>> Sure, there is still the chance of hitting an infinite timeout. But 
>>> that one is significantly more complicated to remove. And the chances 
>>> of hitting that one are significantly smaller than the chances of 
>>> hitting the first one.
>>
>> This statement relies on intimate knowledge implementation details and 
>> a bit too much white box testing approach but that's okay, lets move 
>> past this one.
>>
>>> So you are arguing that because I can't fix the last 0.1% of possible 
>>> failures, I am not allowed to fix the first 99.9% of the failures?
>>
>> I am clearly not arguing for that. But we are also not talking about 
>> "fixing failures" here. Just how to make CI cope better with a class 
>> of i915 bugs.
>>
>>>>> Regarding signals, the PM idle code ends up at 
>>>>> wait_var_event_killable(). I assume that is interruptible via at 
>>>>> least a KILL signal, if not any signal, although it's not entirely 
>>>>> clear from trying to follow through the implementation of this code. 
>>>>> Also, I have no idea if there is a safe way to add a timeout to 
>>>>> that code (or why it wasn't already written with a timeout 
>>>>> included). Someone more familiar with the wakeref internals would 
>>>>> need to comment.
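>>>>>
>>>>> There is a wait_var_event_timeout() variant in the core kernel, so 
>>>>> hypothetically the wait could become something like the sketch 
>>>>> below, but whether that is actually safe with respect to the 
>>>>> wakeref internals is exactly the open question:
>>>>>
>>>>>     /* hypothetical - not checked against the wakeref rules */
>>>>>     if (!wait_var_event_timeout(&wf->wakeref,
>>>>>                                 !intel_wakeref_is_active(wf),
>>>>>                                 DROP_IDLE_TIMEOUT))
>>>>>         return -ETIME;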
>>>>>
>>>>> However, I strongly disagree that we should not fix the driver just 
>>>>> because it is possible to work around the issue by re-writing the CI 
>>>>> framework. Feel free to bring a redesign plan to the IGT WG and 
>>>>> whatever equivalent CI meetings in parallel. But we absolutely 
>>>>> should not have infinite waits in the kernel if there is a trivial 
>>>>> way to not have infinite waits.
>>>>
>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>
>>>> The rest of the paragraph I don't really care - point is moot 
>>>> because it's debugfs so we can do whatever, as long as it is not 
>>>> burdensome to i915, which this isn't. If either wasn't the case then 
>>>> we certainly wouldn't be adding any workarounds in the kernel if it 
>>>> can be achieved in IGT.
>>>>
>>>>> Also, sending a signal does not result in the wedge happening. I 
>>>>> specifically did not want to change that code path because I was 
>>>>> assuming there was a valid reason for it. If you have been 
>>>>> interrupted then you are in the territory of maybe it would have 
>>>>> succeeded if you just left it for a moment longer. Whereas, hitting 
>>>>> the timeout says that someone very deliberately said this is too 
>>>>> long to wait and therefore the system must be broken.
>>>>
>>>> I wanted to know specifically about wedging - why can't you 
>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>>> wherever, if that's what you say is the right thing? 
>>> Huh?
>>>
>>> DROP_IDLE has two waits. One that I am trying to change from infinite 
>>> to finite + wedge. One that would take considerable effort to change 
>>> and would be quite invasive to a lot more of the driver and which can 
>>> only be hit if the first wait actually completed successfully and 
>>> is therefore of less importance anyway. Both of those timeouts 
>>> appear to respect signal interrupts.
>>>
>>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>>> timeout expired? I915 is not controlling how much work there is 
>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>
>>> Because this is a debug test interface that is used solely by IGT 
>>> after it has finished its testing. This is not about wedging the 
>>> device at some random arbitrary point because an AI compute workload 
>>> takes three hours to complete. This is about a very specific test 
>>> framework cleaning up after testing is completed and making sure the 
>>> test did not fry the system.
>>>
>>> And even if an IGT test was calling DROP_IDLE in the middle of a test 
>>> for some reason, it should not be deliberately pushing 10+ seconds of 
>>> work through and then calling a debug only interface to flush it out. 
>>> If a test wants to verify that the system can cope with submitting a 
>>> minute's worth of rendering and then waiting for it to complete, then 
>>> the test should be using official channels for that wait.
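>>>
>>> "Official channels" being, for example, the gem_wait ioctl via IGT's 
>>> gem_sync() helper. A sketch, with fd and handle assumed to exist:
>>>
>>>     /* wait on the work itself rather than via a debug interface */
>>>     gem_sync(fd, handle);   /* wraps DRM_IOCTL_I915_GEM_WAIT */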
>>>
>>>>
>>>>> Plus, infinite wait is not a valid code path in the first place so 
>>>>> any change in behaviour is not really a change in behaviour. Code 
>>>>> can't be relying on a kernel call to never return for its correct 
>>>>> operation!
>>>>
>>>> Why infinite wait wouldn't be valid? Then you better change the 
>>>> other one as well. ;P
>>> In what universe is it ever valid to wait forever for a test to 
>>> complete?
>>
>> Well above you claimed both paths respect SIGINT. If that is so then 
>> the wait is as infinite as the IGT wanted it to be.
>>
>>> See above, the PM code would require much more invasive changes. This 
>>> was low-hanging fruit. It was supposed to be a two-minute change to a 
>>> very self contained section of code that would provide significant 
>>> benefit to debugging a small class of very hard to debug problems.
>>
>> Sure, but I'd still like to know why can't you do what you want from 
>> the IGT framework.
>>
>> Have the timeout reduction in i915, again that's fine assuming 10 
>> seconds is enough to not break something by accident.
> CI showed no regressions. And if someone does find a valid reason why a 
> post-test drop caches call should legitimately take a stupidly long time 
> then it is easy to track back where the ETIME error came from and bump 
> the timeout.
> 
>>
>> With that change you already have broken the "infinite wait". It makes 
>> the debugfs write return -ETIME in time much shorter than the test 
>> runner timeout(s). What is it that you cannot do from IGT at that 
>> point, is my question? You want to wedge then? Send DROP_RESET_ACTIVE 
>> to do it for you? If that doesn't work, add a new flag which will 
>> wedge explicitly.
>>
>> We are again degrading into a huge philosophical discussion and all I 
>> wanted to start with is to hear how exactly things go bad.
>>
> I have no idea what you are wanting. I am trying to have a technical 
> discussion about improving the stability of the driver during CI 
> testing. I have no idea if you are arguing that this change is good, 
> bad, broken, wrong direction or what.
> 
> Things go bad as explained in the commit message. The CI framework does 
> not use signals. The IGT framework does not use signals. There is no 
> watchdog that sends a TERM or KILL signal after a specified timeout. All 
> that happens is the IGT sits there forever waiting for the drop caches 
> IOCTL to return. The CI framework eventually gives up waiting for the 
> test to complete and tries to recover. There are many different CI 
> frameworks in use across Intel. Some timeout quickly, some timeout 
> slowly. But basically, they all eventually give up and don't bother 
> trying any kind of remedial action but just hit the reset button 
> (sometimes by literally power cycling the DUT). As a result, background 
> processes that are saving dmesg, stdout, etc do not necessarily 
> terminate cleanly. That results in logs that are at best truncated, at 
> worst missing entirely. It also results in some frameworks aborting 
> testing at that point. So no results are generated for all the other 
> tests that have yet to be run. Some frameworks also run tests in 
> batches. All they log is that something, somewhere in the batch died. So 
> you don't even know which specific test actually hit the problem.
> 
> Can the CI frameworks be improved? Undoubtedly. In very many ways. Is 
> that something we have the ability to do with a simple patch? No. Would 
> re-writing the IGT framework to add watchdog mechanisms improve things? 
> Yes. Can it be done with a simple patch? No. Would a simple patch to 
> i915 significantly improve the situation? Yes. Will it solve every 
> possible CI hang? No. Will it fix any actual end user visible bugs? No. 
> Will it introduce any new bugs? No. Will it help us to debug at least 
> some CI failures? Yes.

To unblock, I suggest you go with the patch which caps the wait only, 
and propose the wedging as an IGT patch to gem_quiescent_gpu(). That 
should involve the CI/IGT folks in a discussion on what logs will, or 
will not, be collected once gem_quiescent_gpu() fails due to -ETIME. In fact 
probably you should copy CI/IGT folks on the v2 of the i915 patch as 
well since I now think their acks would be good to have - from the point 
of view of the current test runner behaviour with hanging tests.
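
Something along these lines in gem_quiescent_gpu(), purely as a sketch 
against the existing igt_drop_caches_set() helper; the __drop_caches() 
variant which returns the errno instead of asserting is hypothetical 
and would need to be added:

    /* i915 caps the wait; on timeout, wedge explicitly from IGT */
    if (__drop_caches(fd, DROP_IDLE | DROP_ACTIVE) == -ETIME)
        igt_drop_caches_set(fd, DROP_RESET_ACTIVE); /* wedge + recover */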

Regards,

Tvrtko

