Re: 回复: [PATCH] drm/amdgpu: Ignore first evction failure during suspend
Felix Kuehling
felix.kuehling at amd.com
Wed Sep 13 13:54:33 UTC 2023
On 2023-09-13 4:07, Christian König wrote:
> [+Fleix]
>
> Well that looks like quite a serious bug.
>
> If I'm not completely mistaken the KFD work item tries to restore the
> process by moving BOs into memory even after the suspend freeze.
> Normally work items are frozen together with the user space processes
> unless explicitly marked as not freezable.
>
> That this causes problem during the first eviction phase is just the
> tip of the iceberg here. If a BO is moved into invisible memory during
> this we wouldn't be able to get it out of that in the second phase
> because SDMA and hw is already turned off.
>
> @Felix any idea how that can happen? Have you guys marked a work item
> / work queue as not freezable?
We don't set anything to non-freezable in KFD.
Regards,
Felix
> Or maybe the display guys?
>
> @Xinhui please investigate what work item that is and where that is
> coming from. Something like "if (adev->in_suspend) dump_stack();" in
> the right place should probably do it.
>
> Thanks,
> Christian.
>
> Am 13.09.23 um 07:13 schrieb Pan, Xinhui:
>>
>> [AMD Official Use Only - General]
>>
>>
>> I notice that only user space process are frozen on my side. kthread
>> and workqueue keeps running. Maybe some kernel configs are not enabled.
>> I made one module which just prints something like i++ with mutex
>> lock both in workqueue and kthread. I paste some logs below.
>> [438619.696196] XH: 14 from workqueue
>> [438619.700193] XH: 15 from kthread
>> [438620.394335] PM: suspend entry (deep)
>> [438620.399619] Filesystems sync: 0.001 seconds
>> [438620.403887] PM: Preparing system for sleep (deep)
>> [438620.409299] Freezing user space processes
>> [438620.414862] Freezing user space processes completed (elapsed
>> 0.001 seconds)
>> [438620.421881] OOM killer disabled.
>> [438620.425197] Freezing remaining freezable tasks
>> [438620.430890] Freezing remaining freezable tasks completed (elapsed
>> 0.001 seconds)
>> [438620.438348] PM: Suspending system (deep)
>> .....
>> [438623.746038] PM: suspend of devices complete after 3303.137 msecs
>> [438623.752125] PM: start suspend of devices complete after 3309.713
>> msecs
>> [438623.758722] PM: suspend debug: Waiting for 5 second(s).
>> [438623.792166] XH: 22 from kthread
>> [438623.824140] XH: 23 from workqueue
>>
>>
>> So BOs definitely can be in use during suspend.
>> Even if kthread or workqueue can be stopped with one special kernel
>> config. I think suspend can only stop the workqueue with its callback
>> finish.
>> otherwise something like below makes things crazy.
>> LOCK BO
>> do something
>> -> schedule or wait, anycode might sleep. Stopped by suspend
>> now? no, i think.
>> UNLOCK BO
>>
>> I do tests with cmds below.
>> echo devices > /sys/power/pm_test
>> echo 0 > /sys/power/pm_async
>> echo 1 > /sys/power/pm_print_times
>> echo 1 > /sys/power/pm_debug_messages
>> echo 1 > /sys/module/amdgpu/parameters/debug_evictions
>> ./kfd.sh --gtest_filter=KFDEvictTest.BasicTest
>> pm-suspend
>>
>> thanks
>> xinhui
>>
>>
>> ------------------------------------------------------------------------
>> *发件人:* Christian König <ckoenig.leichtzumerken at gmail.com>
>> *发送时间:* 2023年9月12日 17:01
>> *收件人:* Pan, Xinhui <Xinhui.Pan at amd.com>;
>> amd-gfx at lists.freedesktop.org <amd-gfx at lists.freedesktop.org>
>> *抄送:* Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig,
>> Christian <Christian.Koenig at amd.com>; Fan, Shikang <Shikang.Fan at amd.com>
>> *主题:* Re: [PATCH] drm/amdgpu: Ignore first evction failure during
>> suspend
>> When amdgpu_device_suspend() is called processes should be frozen
>> already. In other words KFD queues etc... should already be idle.
>>
>> So when the eviction fails here we missed something previously and that
>> in turn can cause tons amount of problems.
>>
>> So ignoring those errors is most likely not a good idea at all.
>>
>> Regards,
>> Christian.
>>
>> Am 12.09.23 um 02:21 schrieb Pan, Xinhui:
>> > [AMD Official Use Only - General]
>> >
>> > Oh yep, Pinned BO is moved to other LRU list, So eviction fails
>> because of other reason.
>> > I will change the comments in the patch.
>> > The problem is eviction fails as many reasons, say, BO is locked.
>> > ASAIK, kfd will stop the queues and flush some evict/restore work
>> in its suspend callback. SO the first eviction before kfd callback
>> likely fails.
>> >
>> > -----Original Message-----
>> > From: Christian König <ckoenig.leichtzumerken at gmail.com>
>> > Sent: Friday, September 8, 2023 2:49 PM
>> > To: Pan, Xinhui <Xinhui.Pan at amd.com>; amd-gfx at lists.freedesktop.org
>> > Cc: Deucher, Alexander <Alexander.Deucher at amd.com>; Koenig,
>> Christian <Christian.Koenig at amd.com>; Fan, Shikang <Shikang.Fan at amd.com>
>> > Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure
>> during suspend
>> >
>> > Am 08.09.23 um 05:39 schrieb xinhui pan:
>> >> Some BOs might be pinned. So the first eviction's failure will abort
>> >> the suspend sequence. These pinned BOs will be unpined afterwards
>> >> during suspend.
>> > That doesn't make much sense since pinned BOs don't cause eviction
>> failure here.
>> >
>> > What exactly is the error code you see?
>> >
>> > Christian.
>> >
>> >> Actaully it has evicted most BOs, so that should stil work fine in
>> >> sriov full access mode.
>> >>
>> >> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call
>> >> during device_suspend.")
>> >> Signed-off-by: xinhui pan <xinhui.pan at amd.com>
>> >> ---
>> >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>> >> 1 file changed, 5 insertions(+), 4 deletions(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> index 5c0e2b766026..39af526cdbbe 100644
>> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device
>> >> *dev, bool fbcon)
>> >>
>> >> adev->in_suspend = true;
>> >>
>> >> - /* Evict the majority of BOs before grabbing the full access */
>> >> - r = amdgpu_device_evict_resources(adev);
>> >> - if (r)
>> >> - return r;
>> >> + /* Try to evict the majority of BOs before grabbing the full
>> access
>> >> + * Ignore the ret val at first place as we will unpin some
>> BOs if any
>> >> + * afterwards.
>> >> + */
>> >> + (void)amdgpu_device_evict_resources(adev);
>> >>
>> >> if (amdgpu_sriov_vf(adev)) {
>> >> amdgpu_virt_fini_data_exchange(adev);
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20230913/75ec74d8/attachment-0001.htm>
More information about the amd-gfx
mailing list