[PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes

Aurabindo Pillai aurabindo.pillai at amd.com
Thu Dec 15 19:38:00 UTC 2022


On 12/15/22 10:17, Harry Wentland wrote:
>
> On 12/15/22 05:29, Michel Dänzer wrote:
>> On 12/15/22 09:09, Christian König wrote:
>>> Am 15.12.22 um 00:33 schrieb Alex Hung:
>>>> On 2022-12-14 16:06, Alex Deucher wrote:
>>>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung<alex.hung at amd.com>  wrote:
>>>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung<alex.hung at amd.com>  wrote:
>>>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung<alex.hung at amd.com>  wrote:
>>>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>>>> <aurabindo.pillai at amd.com>  wrote:
>>>>>>>>>>>> From: Alex Hung<alex.hung at amd.com>
>>>>>>>>>>>>
>>>>>>>>>>>> [Why]
>>>>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from
>>>>>>>>>>>> msleep no matter how long or short the time is.
>>>>>>>>>>>>
>>>>>>>>>>>> [How]
>>>>>>>>>>>> To replace msleep with mdelay.
>>>>>>>>>>> Can you provide a bit more info about the crash?  A lot of platforms
>>>>>>>>>>> don't support delay larger than 2-4ms so this change will generate
>>>>>>>>>>> errors on ARM and possibly other platforms.
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display
>>>>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at
>>>>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT
>>>>>>>>>> (https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/kms_bw.c#L197>>>>>>>>>>
>>>>>>>>>> It is possible to workaround this by reversing order of
>>>>>>>>>> igt_remove_fb(&buffer[i]), as the following example:
>>>>>>>>>>
>>>>>>>>>>        igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>>
>>>>>>>>>> Hangs:
>>>>>>>>>>        igt_remove_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>>>
>>>>>>>>>> No hangs:
>>>>>>>>>>        igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]
>>>>>>>>>>
>>>>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop
>>>>>>>>>> the hang from occurring.
>>>>>>>>>>
>>>>>>>>>> I also tried to remove the msleep completely and it also work, but I
>>>>>>>>>> didn't want to break the fix for the original problematic hardware
>>>>>>>>>> configuration.
>>>>>>>>> Why does sleep vs delay make a difference?  Is there some race that we
>>>>>>>>> are not locking against?
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>> That was my original thought but I did not find any previously. I will
>>>>>>>> investigate it again.
>>>>>>>>
>>>>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use
>>>>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove
>>>>>>>> the msleep for other platforms, something like
>>>>>>>>
>>>>>>>> -                       msleep(60);
>>>>>>>> +#ifdef CONFIG_X86_64
>>>>>>>> +                       mdelay(60);
>>>>>>>> +#endif
>>>>>>> That's pretty ugly.  I'd rather try and resolve the root cause.  How
>>>>>>> important is the IGT test?  What does it do?  Is the test itself
>>>>>>> correct?
>>>>>>>
>>>>>> Agreed, and I didn't want to add conditions around the mdelay for the
>>>>>> same reason. I will assume this is not an option now.
>>>>>>
>>>>>> As in the previous comment, IGT can be modified to avoid the crash by
>>>>>> reversing the order fb is removed - though I suspect I will receive
>>>>>> questions why this is not fixed in kernel.
>>>>>>
>>>>>> I wanted to fix this in kernel because nothing stops other user-space
>>>>>> applications to use the same way to crash kernel, so fixing IGT is the
>>>>>> second option.
>>>>>>
>>>>>> Apparently causing problems on other platforms isn't an option at all so
>>>>>> I will try to figure out an non-mdelay solution, and then maybe an IGT
>>>>>> solution instead.
>>>>> What hangs?  The test or the kernel or the hardware?
>>>> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.
>>> I agree with Alex that changing this is extremely questionable and not justified at all.
>>>
>>> My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else.
>>>
>>> This clearly points to missing locking or similar to protect concurrent execution of things.
>> Might another possibility be that this code gets called from an atomic context which can't sleep?
>>
>>
> It can come through handle_hpd_rx_irq but we're using a workqueue
> to queue interrupt handling so this shouldn't come from an atomic
> context. I currently don't see where else it might be used in an
> atomic context. Alex Hung, can you do a dump_stack() in this function
> to see where the problematic call is coming from?
>
> Fixing IGT will only mask the issue. Userspace should never be able
> to put the system in a state where it stops responding entirely. This
> will need some sort of fix in the kernel.
>
> Harry
>
>
I will drop this patch from the series.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20221215/84182205/attachment-0001.htm>


More information about the amd-gfx mailing list