[PATCH 05/18] drm/amd/display: Use mdelay to avoid crashes
Michel Dänzer
michel.daenzer at mailbox.org
Thu Dec 15 10:29:42 UTC 2022
On 12/15/22 09:09, Christian König wrote:
> Am 15.12.22 um 00:33 schrieb Alex Hung:
>> On 2022-12-14 16:06, Alex Deucher wrote:
>>> On Wed, Dec 14, 2022 at 5:56 PM Alex Hung <alex.hung at amd.com> wrote:
>>>> On 2022-12-14 15:35, Alex Deucher wrote:
>>>>> On Wed, Dec 14, 2022 at 5:25 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>> On 2022-12-14 14:54, Alex Deucher wrote:
>>>>>>> On Wed, Dec 14, 2022 at 4:50 PM Alex Hung <alex.hung at amd.com> wrote:
>>>>>>>> On 2022-12-14 13:48, Alex Deucher wrote:
>>>>>>>>> On Wed, Dec 14, 2022 at 3:22 PM Aurabindo Pillai
>>>>>>>>> <aurabindo.pillai at amd.com> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Alex Hung <alex.hung at amd.com>
>>>>>>>>>>
>>>>>>>>>> [Why]
>>>>>>>>>> When running IGT kms_bw test with DP monitor, some systems crash from
>>>>>>>>>> msleep no matter how long or short the time is.
>>>>>>>>>>
>>>>>>>>>> [How]
>>>>>>>>>> To replace msleep with mdelay.
>>>>>>>>>
>>>>>>>>> Can you provide a bit more info about the crash? A lot of platforms
>>>>>>>>> don't support delay larger than 2-4ms so this change will generate
>>>>>>>>> errors on ARM and possibly other platforms.
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>
>>>>>>>> The msleep was introduced in eec3303de3378 for non-compliant display
>>>>>>>> port monitors but IGT's kms_bw test can cause a recent h/w to hang at
>>>>>>>> msleep(60) when calling "igt_remove_fb" in IGT
>>>>>>>> (https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Figt-gpu-tools%2F-%2Fblob%2Fmaster%2Ftests%2Fkms_bw.c%23L197&data=05%7C01%7Calex.hung%40amd.com%7C81664450189542a7bbc408dade27d0e9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C638066559844526853%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=M%2BF4H2qddXItPoUZRVyhlc5N8UF1%2FHIh8PpfT%2BCmdZ4%3D&reserved=0)
>>>>>>>>
>>>>>>>> It is possible to workaround this by reversing order of
>>>>>>>> igt_remove_fb(&buffer[i]), as the following example:
>>>>>>>>
>>>>>>>> igt_create_color_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>
>>>>>>>> Hangs:
>>>>>>>> igt_remove_fb with the order buffer[0], buffer[1], buffer[2]
>>>>>>>>
>>>>>>>> No hangs:
>>>>>>>> igt_remove_fb with the reversed order buffer[2], buffer[1], buffer[0]
>>>>>>>>
>>>>>>>> However, IGT simply exposes the problem and it makes more sense to stop
>>>>>>>> the hang from occurring.
>>>>>>>>
>>>>>>>> I also tried to remove the msleep completely and it also work, but I
>>>>>>>> didn't want to break the fix for the original problematic hardware
>>>>>>>> configuration.
>>>>>>>
>>>>>>> Why does sleep vs delay make a difference? Is there some race that we
>>>>>>> are not locking against?
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>
>>>>>> That was my original thought but I did not find any previously. I will
>>>>>> investigate it again.
>>>>>>
>>>>>> If mdelay(>4) isn't usable on other platforms, is it an option to use
>>>>>> mdelay on x86_64 only and keep msleep on other platforms or just remove
>>>>>> the msleep for other platforms, something like
>>>>>>
>>>>>> - msleep(60);
>>>>>> +#ifdef CONFIG_X86_64
>>>>>> + mdelay(60);
>>>>>> +#endif
>>>>>
>>>>> That's pretty ugly. I'd rather try and resolve the root cause. How
>>>>> important is the IGT test? What does it do? Is the test itself
>>>>> correct?
>>>>>
>>>>
>>>> Agreed, and I didn't want to add conditions around the mdelay for the
>>>> same reason. I will assume this is not an option now.
>>>>
>>>> As in the previous comment, IGT can be modified to avoid the crash by
>>>> reversing the order fb is removed - though I suspect I will receive
>>>> questions why this is not fixed in kernel.
>>>>
>>>> I wanted to fix this in kernel because nothing stops other user-space
>>>> applications to use the same way to crash kernel, so fixing IGT is the
>>>> second option.
>>>>
>>>> Apparently causing problems on other platforms isn't an option at all so
>>>> I will try to figure out an non-mdelay solution, and then maybe an IGT
>>>> solution instead.
>>>
>>> What hangs? The test or the kernel or the hardware?
>>
>> The system becomes completely unresponsive - no keyboard, mouse nor remote accesses.
>
> I agree with Alex that changing this is extremely questionable and not justified at all.
>
> My educated guess is that by using mdelay() instead of msleep() we keep the CPU core busy and preventing something from happening at the same time as something else.
>
> This clearly points to missing locking or similar to protect concurrent execution of things.
Might another possibility be that this code gets called from an atomic context which can't sleep?
--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer
More information about the amd-gfx
mailing list