[Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Nov 8 09:08:50 UTC 2022
On 07/11/2022 19:45, John Harrison wrote:
> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>> On 04/11/2022 17:45, John Harrison wrote:
>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison at Intel.com wrote:
>>>>>>>>>>> From: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>>>
>>>>>>>>>>> At the end of each test, IGT does a drop caches call via
>>>>>>>>>>> sysfs with
>>>>>>>>>>
>>>>>>>>>> sysfs?
>>>>>>>> Sorry, that was meant to say debugfs. I've also been working on
>>>>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> special flags set. One of the possible paths waits for idle
>>>>>>>>>>> with an
>>>>>>>>>>> infinite timeout. That causes problems for debugging issues
>>>>>>>>>>> when CI
>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI
>>>>>>>>>>> system times
>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>>>>> reboots the system to recover it. Worst case, the CI system
>>>>>>>>>>> can't do
>>>>>>>>>>> anything at all and then times out (after 1000s) and simply
>>>>>>>>>>> reboots.
>>>>>>>>>>> Sometimes a serial port log of dmesg might be available,
>>>>>>>>>>> sometimes not.
>>>>>>>>>>>
>>>>>>>>>>> So rather than making life hard for ourselves, change the
>>>>>>>>>>> timeout to
>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>                    DROP_RESET_ACTIVE | \
>>>>>>>>>>>                    DROP_RESET_SEQNO | \
>>>>>>>>>>>                    DROP_RCU)
>>>>>>>>>>> +
>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT (HZ * 10)
>>>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also
>>>>>>>>>> only used
>>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> So move it here, dropping the i915 prefix, next to the newly
>>>>>>>>> proposed one?
>>>>>>>> Sure, can do that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>
>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>
>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in
>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>
>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>
>>>>>>>> These two are totally unrelated and in code not being touched by
>>>>>>>> this change. I would rather not conflate changing random other
>>>>>>>> things with fixing this specific issue.
>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>
>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>
>>>>>>>>>> My head spins.
>>>>>>>>>
>>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT
>>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>>> My original intention for the name was that it is the 'drop caches
>>>>>>>> timeout for intel_gt_wait_for_idle', which is quite the mouthful
>>>>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised
>>>>>>>> later that name can be conflated with the DROP_IDLE flag. Will
>>>>>>>> rename.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Things get refactored, code moves around, bits get left behind,
>>>>>>>>> who knows. No reason to get too worked up. :) As long as people
>>>>>>>>> are taking a wider view when touching the code base, and are
>>>>>>>>> not afraid to send cleanups, things should be good.
>>>>>>>> On the other hand, if every patch gets blocked in code review
>>>>>>>> because someone points out that some completely unrelated piece of
>>>>>>>> code could be a bit better, then nothing ever gets fixed. If you
>>>>>>>> spot something that you think should be improved, isn't the
>>>>>>>> general idea that you should post a patch yourself to improve it?
>>>>>>>
>>>>>>> There are two maintainers per branch and an order of magnitude or
>>>>>>> two more developers, so it'd be nice if cleanups would just come in
>>>>>>> on a self-initiative basis. ;)
>>>>>>>
>>>>>>>>> For the actual functional change at hand - it would be nice if the
>>>>>>>>> code paths in question could handle SIGINT and then we could
>>>>>>>>> punt the decision on how long someone wants to wait purely to
>>>>>>>>> userspace. But it's probably hard and it's only debugfs so
>>>>>>>>> whatever.
>>>>>>>>>
>>>>>>>> The code paths in question will already abort on a signal, won't
>>>>>>>> they? Both intel_gt_wait_for_idle() and
>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the
>>>>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending)
>>>>>>>> return -EINTR;' check. Beyond that, it sounds like what you are
>>>>>>>> asking for is a change in the IGT libraries and/or CI framework
>>>>>>>> to start sending signals after some specific timeout. That seems
>>>>>>>> like a significantly more complex change (in terms of the number
>>>>>>>> of entities affected and number of groups involved) and
>>>>>>>> unnecessary.
>>>>>>>
>>>>>>> If you say so, I haven't looked at them all. But if the code path
>>>>>>> in question already aborts on signals then I am not sure what the
>>>>>>> patch is fixing. I assumed you are trying to avoid the write being
>>>>>>> stuck in D forever, which then prevents driver unload and
>>>>>>> everything, requiring the test runner to eventually reboot. If
>>>>>>> you say SIGINT works then you can already recover from userspace,
>>>>>>> no?
>>>>>>>
>>>>>>>>> Whether or not 10s is enough, CI will hopefully tell us. I'd
>>>>>>>>> probably err on the side of safety and make it longer, but at
>>>>>>>>> most half of the test runner timeout.
>>>>>>>> This is supposed to be test cleanup. This is not about how long
>>>>>>>> a particular test takes to complete but about how long it takes
>>>>>>>> to declare the system broken after the test has already
>>>>>>>> finished. I would argue that even 10s is massively longer than
>>>>>>>> required.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not convinced that wedging is correct though. Conceptually it
>>>>>>>>> could just be that the timeout is too short. What does wedging
>>>>>>>>> really give us, on top of limiting the wait, when the latter AFAIU
>>>>>>>>> is the key factor which would prevent the need to reboot the
>>>>>>>>> machine?
>>>>>>>>>
>>>>>>>> It gives us a system that knows what state it is in. If we can't
>>>>>>>> idle the GT then something is very badly wrong. Wedging
>>>>>>>> indicates that. It also ensures that a full GT reset will be
>>>>>>>> attempted before the next test is run, helping to prevent a
>>>>>>>> failure on test X from propagating into failures of unrelated
>>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because
>>>>>>>> the system is really that badly broken then future tests will
>>>>>>>> not run rather than report erroneous failures.
>>>>>>>>
>>>>>>>> This is not about getting a more stable system for end users by
>>>>>>>> sweeping issues under the carpet and pretending all is well. End
>>>>>>>> users don't run IGTs or explicitly call dodgy debugfs entry
>>>>>>>> points. The sole motivation here is to get more accurate results
>>>>>>>> from CI. That is, correctly identifying which test has hit a
>>>>>>>> problem, getting valid debug analysis for that test (logs and
>>>>>>>> such) and allowing further testing to complete correctly in the
>>>>>>>> case where the system can be recovered.
>>>>>>>
>>>>>>> I don't really oppose shortening of the timeout in principle, I
>>>>>>> just want a clear statement on whether this is something IGT / the
>>>>>>> test runner could already do or not. It can apply a timeout, it can
>>>>>>> also send SIGINT, and it could even trigger a reset from outside.
>>>>>>> Sure, these are debugfs hacks so the general "kernel should not
>>>>>>> implement policy" rule need not be strictly followed, but let's be
>>>>>>> clear what the options are.
>>>>>>
>>>>>> One conceptual problem with applying this policy is that the code is:
>>>>>>
>>>>>> if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>         if (ret)
>>>>>>                 return ret;
>>>>>> }
>>>>>>
>>>>>> if (val & DROP_IDLE) {
>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>         if (ret)
>>>>>>                 return ret;
>>>>>> }
>>>>>>
>>>>>> So if someone passes in DROP_IDLE, why would only the first branch
>>>>>> have a short timeout and wedge? Yes, some bug happens to be there at
>>>>>> the moment, but put a bug in a different place and you hang on the
>>>>>> second branch and then need another patch. Versus perhaps making it
>>>>>> all respect SIGINT and handling it from outside.
>>>>>>
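For reference, a minimal sketch of what capping and wedging the first branch
could look like - this is not the actual patch hunk, and using
intel_gt_set_wedged() for the wedge step is my assumption:

        if (val & (DROP_IDLE | DROP_ACTIVE)) {
                /* Cap the previously infinite wait. */
                ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
                if (ret == -ETIME) {
                        /*
                         * Declare the GT broken so the next test starts
                         * from a full GT reset rather than from a hung GPU.
                         */
                        intel_gt_set_wedged(gt);
                }
                if (ret)
                        return ret;
        }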
>>>>> The pm_wait_for_idle can only be called after gt_wait_for_idle has
>>>>> completed successfully. There is no route that skips the GT idle wait
>>>>> or does the PM idle wait when the GT idle wait fails. So the chances of the PM
>>>>> idle failing are greatly reduced. There would have to be something
>>>>> outside of a GT keeping the GPU awake and there isn't a whole lot
>>>>> of hardware left at that point!
>>>>
>>>> Well, "greatly reduced" is beside my point. The point is that today the
>>>> bug is here and we add a timeout; tomorrow the bug is there and then it
>>>> is the same dance. It can just be a sw bug which forgets to release the
>>>> pm ref in some circumstances; it doesn't really matter.
>>>>
>>> Huh?
>>>
>>> Greatly reduced is the whole point. Today there is a bug and it
>>> causes a kernel hang which requires the CI framework to reboot the
>>> system in an extremely unfriendly way, which makes it very hard to
>>> work out what happened. Logs are likely not available. We don't even
>>> necessarily know which test was being run at the time. Etc. So we
>>> replace the infinite timeout with a meaningful timeout. CI now
>>> correctly marks the single test as failing, captures all the correct
>>> logs, creates a useful bug report and continues on testing more stuff.
>>
>> So what is preventing CI from collecting logs if IGT is stuck forever in
>> an interruptible wait? Surely it can collect the logs at that point if
>> the kernel is healthy enough. If it isn't then I don't see how wedging
>> the GPU will make the kernel any healthier.
>>
>> Is i915 preventing better log collection, or could the test runner be
>> improved?
>>
>>> Sure, there is still the chance of hitting an infinite timeout. But
>>> that one is significantly more complicated to remove. And the chances
>>> of hitting that one are significantly smaller than the chances of
>>> hitting the first one.
>>
>> This statement relies on intimate knowledge of implementation details and
>> a bit too much of a white box testing approach, but that's okay, let's
>> move past this one.
>>
>>> So you are arguing that because I can't fix the last 0.1% of possible
>>> failures, I am not allowed to fix the first 99.9% of the failures?
>>
>> I am clearly not arguing for that. But we are also not talking about
>> "fixing failures" here. Just how to make CI cope better with a class
>> of i915 bugs.
>>
>>>>> Regarding signals, the PM idle code ends up at
>>>>> wait_var_event_killable(). I assume that is interruptible via at
>>>>> least a KILL signal if not any signal, although it's not entirely
>>>>> clear when trying to follow the implementation of this code.
>>>>> Also, I have no idea if there is a safe way to add a timeout to
>>>>> that code (or why it wasn't already written with a timeout
>>>>> included). Someone more familiar with the wakeref internals would
>>>>> need to comment.
>>>>>
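For reference, the kernel does provide a bounded counterpart,
wait_var_event_timeout(); whether it can be used safely here is exactly the
open question above. A sketch of the shape it would take inside the wakeref
idle wait, with a timeout passed down from the caller (names follow my
reading of the existing intel_wakeref code, so treat them as approximate):

        long ret;

        /*
         * Bounded variant of the existing killable wait.
         * wait_var_event_timeout() returns 0 if the timeout elapsed
         * with the condition still false, and a positive value if
         * the wakeref went idle in time.
         */
        ret = wait_var_event_timeout(&wf->wakeref,
                                     !intel_wakeref_is_active(wf),
                                     timeout);
        if (!ret)
                return -ETIME;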
>>>>> However, I strongly disagree that we should not fix the driver just
>>>>> because it is possible to work around the issue by re-writing the CI
>>>>> framework. Feel free to bring a redesign plan to the IGT WG and
>>>>> whatever equivalent CI meetings in parallel. But we absolutely
>>>>> should not have infinite waits in the kernel if there is a trivial
>>>>> way to not have infinite waits.
>>>>
>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>
>>>> The rest of the paragraph I don't really care about - the point is moot
>>>> because it's debugfs so we can do whatever, as long as it is not
>>>> burdensome to i915, which this isn't. If either weren't the case then
>>>> we certainly wouldn't be adding workarounds in the kernel for something
>>>> that can be achieved in IGT.
>>>>
>>>>> Also, sending a signal does not result in the wedge happening. I
>>>>> specifically did not want to change that code path because I was
>>>>> assuming there was a valid reason for it. If you have been
>>>>> interrupted then you are in the territory of maybe it would have
>>>>> succeeded if you just left it for a moment longer. Whereas, hitting
>>>>> the timeout says that someone very deliberately said this is too
>>>>> long to wait and therefore the system must be broken.
>>>>
>>>> I wanted to know specifically about wedging - why can't you
>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or
>>>> wherever, if that's what you say is the right thing?
>>> Huh?
>>>
>>> DROP_IDLE has two waits. One that I am trying to change from infinite
>>> to finite + wedge. One that would take considerable effort to change
>>> and would be quite invasive to a lot more of the driver and which can
>>> only be hit if the first wait actually completed successfully and
>>> is therefore of less importance anyway. Both of those timeouts
>>> appear to respect signal interrupts.
>>>
>>>> That's a policy decision so why would i915 wedge if an arbitrary
>>>> timeout expired? I915 is not controlling how much work there is
>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>
>>> Because this is a debug test interface that is used solely by IGT
>>> after it has finished its testing. This is not about wedging the
>>> device at some random arbitrary point because an AI compute workload
>>> takes three hours to complete. This is about a very specific test
>>> framework cleaning up after testing is completed and making sure the
>>> test did not fry the system.
>>>
>>> And even if an IGT test was calling DROP_IDLE in the middle of a test
>>> for some reason, it should not be deliberately pushing 10+ seconds of
>>> work through and then calling a debug-only interface to flush it out.
>>> If a test wants to verify that the system can cope with submitting a
>>> minute's worth of rendering and then waiting for it to complete, then
>>> the test should be using official channels for that wait.
>>>
>>>>
>>>>> Plus, an infinite wait is not a valid code path in the first place, so
>>>>> any change in behaviour is not really a change in behaviour. Code
>>>>> can't rely on a kernel call never returning for its correct
>>>>> operation!
>>>>
>>>> Why wouldn't an infinite wait be valid? Then you'd better change the
>>>> other one as well. ;P
>>> In what universe is it ever valid to wait forever for a test to
>>> complete?
>>
>> Well, above you claimed both paths respect SIGINT. If that is so then
>> the wait is only as infinite as the IGT wanted it to be.
>>
>>> See above, the PM code would require much more invasive changes. This
>>> was low-hanging fruit. It was supposed to be a two-minute change to a
>>> very self-contained section of code that would provide significant
>>> benefit to debugging a small class of very hard to debug problems.
>>
>> Sure, but I'd still like to know why you can't do what you want from
>> the IGT framework.
>>
>> Have the timeout reduction in i915; again, that's fine assuming 10
>> seconds is enough to not break something by accident.
> CI showed no regressions. And if someone does find a valid reason why a
> post-test drop caches call should legitimately take a stupidly long time
> then it is easy to track back where the ETIME error came from and bump
> the timeout.
>
>>
>> With that change you have already broken the "infinite wait". It makes
>> the debugfs write return -ETIME in a time much shorter than the test
>> runner timeout(s). What is it that you cannot do from IGT at that point?
>> That is my question. You want to wedge then? Send DROP_RESET_ACTIVE to do
>> it for you? If that doesn't work, add a new flag which wedges explicitly.
>>
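To illustrate that last option, an explicit wedge flag would be a very small
addition - the flag name and bit below are hypothetical, just a sketch:

#define DROP_SET_WEDGED         BIT(10)

        /* ... and in the drop caches handler: */
        if (val & DROP_SET_WEDGED)
                intel_gt_set_wedged(gt);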
>> We are again degrading into a huge philosophical discussion, and all I
>> wanted to start with was to hear how exactly things go bad.
>>
> I have no idea what you want. I am trying to have a technical
> discussion about improving the stability of the driver during CI
> testing. I have no idea if you are arguing that this change is good,
> bad, broken, wrong direction or what.
>
> Things go bad as explained in the commit message. The CI framework does
> not use signals. The IGT framework does not use signals. There is no
> watchdog that sends a TERM or KILL signal after a specified timeout. All
> that happens is the IGT sits there forever waiting for the drop caches
> write to return. The CI framework eventually gives up waiting for the
> test to complete and tries to recover. There are many different CI
> frameworks in use across Intel. Some time out quickly, some time out
> slowly. But basically, they all eventually give up and don't bother
> trying any kind of remedial action but just hit the reset button
> (sometimes by literally power cycling the DUT). As a result, background
> processes that are saving dmesg, stdout, etc do not necessarily
> terminate cleanly. That results in logs that are at best truncated, at
> worst missing entirely. It also results in some frameworks aborting
> testing at that point. So no results are generated for all the other
> tests that have yet to be run. Some frameworks also run tests in
> batches. All they log is that something, somewhere in the batch died. So
> you don't even know which specific test actually hit the problem.
>
> Can the CI frameworks be improved? Undoubtedly. In very many ways. Is
> that something we have the ability to do with a simple patch? No. Would
> re-writing the IGT framework to add watchdog mechanisms improve things?
> Yes. Can it be done with a simple patch? No. Would a simple patch to
> i915 significantly improve the situation? Yes. Will it solve every
> possible CI hang? No. Will it fix any actual end user visible bugs? No.
> Will it introduce any new bugs? No. Will it help us to debug at least
> some CI failures? Yes.
To unblock, I suggest you go with the patch which caps the wait only, and
propose wedging as an IGT patch to gem_quiescent_gpu(). That should bring
the CI/IGT folks into the discussion on what logs will, or will not, be
collected once gem_quiescent_gpu() fails due to -ETIME. In fact, you should
probably copy the CI/IGT folks on the v2 of the i915 patch as well, since I
now think their acks would be good to have - from the point of view of the
current test runner behaviour with hanging tests.
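Roughly, on the IGT side I am thinking of something along these lines for
gem_quiescent_gpu() - only a sketch, not a tested patch, and the helper
names are from IGT's igt_debugfs library as far as I recall (device being
the DRM fd):

        int dfd = igt_debugfs_open(device, "i915_gem_drop_caches", O_WRONLY);
        char buf[32];

        snprintf(buf, sizeof(buf), "0x%" PRIx64,
                 (uint64_t)(DROP_ACTIVE | DROP_IDLE));
        if (write(dfd, buf, strlen(buf)) < 0 && errno == ETIME) {
                /*
                 * The capped kernel wait expired: recover from userspace,
                 * e.g. by asking for an explicit reset, instead of wedging
                 * inside the drop caches path.
                 */
                snprintf(buf, sizeof(buf), "0x%" PRIx64,
                         (uint64_t)DROP_RESET_ACTIVE);
                write(dfd, buf, strlen(buf));
        }
        close(dfd);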
Regards,
Tvrtko