<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<p style="font-family:Arial;font-size:10pt;color:#0000FF;margin:5pt;font-style:normal;font-weight:normal;text-decoration:none;" align="Left">
[AMD Official Use Only - General]<br>
</p>
<br>
<div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
I notice that only user-space processes are frozen on my side. Kthreads and workqueues keep running. Maybe some kernel config options are not enabled.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
I wrote a test module that increments a counter under a mutex and prints it, from both a workqueue and a kthread. Some of its log output is pasted below.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted0">
[438619.696196] XH: 14 from workqueue
<div class="ContentPasted0">[438619.700193] XH: 15 from kthread</div>
<div class="ContentPasted0">[438620.394335] PM: suspend entry (deep)</div>
<div class="ContentPasted0">[438620.399619] Filesystems sync: 0.001 seconds</div>
<div class="ContentPasted0">[438620.403887] PM: Preparing system for sleep (deep)</div>
<div class="ContentPasted0">[438620.409299] Freezing user space processes</div>
<div class="ContentPasted0">[438620.414862] Freezing user space processes completed (elapsed 0.001 seconds)</div>
<div class="ContentPasted0">[438620.421881] OOM killer disabled.</div>
<div class="ContentPasted0">[438620.425197] Freezing remaining freezable tasks</div>
<div class="ContentPasted0">[438620.430890] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)</div>
[438620.438348] PM: Suspending system (deep)<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
.....</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
[438623.746038] PM: suspend of devices complete after 3303.137 msecs
<div class="ContentPasted1">[438623.752125] PM: start suspend of devices complete after 3309.713 msecs</div>
<div class="ContentPasted1">[438623.758722] PM: suspend debug: Waiting for 5 second(s).</div>
<div class="ContentPasted1">[438623.792166] XH: 22 from kthread</div>
[438623.824140] XH: 23 from workqueue</div>
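<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
A captured log can be checked for this mechanically. The snippet below is only a rough sketch: the "XH:" prefix is what my test module prints, the function name late_module_lines is made up for this example, and it assumes the log text is in time order. It prints every "XH:" line that appears after the freezer reports the remaining freezable tasks as frozen.</div>
<pre>
```shell
#!/bin/sh
# Print every "XH:" line that shows up after the freezer reports that the
# remaining freezable tasks are frozen. Any output means a kthread or a
# workqueue item was still running at that point.
# late_module_lines is a hypothetical helper name, not an existing tool.
late_module_lines() {
    awk '
        /Freezing remaining freezable tasks completed/ { frozen = 1; next }
        frozen && /XH:/ { print }
    '
}

# Sample input: lines copied from the log above.
late_module_lines <<'EOF'
[438619.700193] XH: 15 from kthread
[438620.430890] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[438623.758722] PM: suspend debug: Waiting for 5 second(s).
[438623.792166] XH: 22 from kthread
[438623.824140] XH: 23 from workqueue
EOF
```
</pre>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
On a live system the same function could read from dmesg instead of the here-document.</div>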
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
So BOs definitely can be in use during suspend.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
Even if kthreads or workqueues can be stopped with some special kernel config, I think suspend can only stop a workqueue once its running callback has finished.
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
Otherwise a sequence like the one below would cause real trouble.</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
LOCK BO</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
do something<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
-> schedule or wait, or any code that might sleep. Is it stopped by suspend at this point? No, I think.<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted1">
UNLOCK BO<br>
</div>
<div id="appendonsend"></div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
I ran the tests with the commands below.<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
echo devices > /sys/power/pm_test
<div class="ContentPasted2">echo 0 > /sys/power/pm_async</div>
<div class="ContentPasted2">echo 1 > /sys/power/pm_print_times</div>
<div class="ContentPasted2">echo 1 > /sys/power/pm_debug_messages</div>
echo 1 > /sys/module/amdgpu/parameters/debug_evictions</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2 ContentPasted3">
./kfd.sh --gtest_filter=KFDEvictTest.BasicTest</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2 ContentPasted3">
pm-suspend<br>
</div>
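<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
For repeated runs, the sysfs writes above could be wrapped in a small function. This is just a sketch: pm_test_setup and its optional root argument are invented here so the writes can be dry-run against a scratch directory, and the paths are the ones listed above.</div>
<pre>
```shell
#!/bin/sh
# pm_test_setup [ROOT]: write the suspend debug knobs listed above.
# ROOT defaults to /sys. Passing another directory is a hypothetical
# dry-run mode for this example, not something the kernel provides.
pm_test_setup() {
    root="${1:-/sys}"
    echo devices > "$root/power/pm_test" &&
    echo 0 > "$root/power/pm_async" &&
    echo 1 > "$root/power/pm_print_times" &&
    echo 1 > "$root/power/pm_debug_messages" &&
    echo 1 > "$root/module/amdgpu/parameters/debug_evictions"
}

# Dry run against a temporary tree that mimics the sysfs layout.
scratch=$(mktemp -d)
mkdir -p "$scratch/power" "$scratch/module/amdgpu/parameters"
pm_test_setup "$scratch" && cat "$scratch/power/pm_test"
rm -rf "$scratch"
```
</pre>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">
Called without an argument as root, it writes to the real /sys files. The KFD test and pm-suspend are then started by hand as before.</div>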
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
thanks</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
xinhui</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
<br>
</div>
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof ContentPasted2">
<br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font style="font-size: 11pt; color: rgb(0, 0, 0);" face="Calibri, sans-serif"><b>From:</b> Christian König <ckoenig.leichtzumerken@gmail.com><br>
<b>Sent:</b> September 12, 2023 17:01<br>
<b>To:</b> Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org><br>
<b>Cc:</b> Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><br>
<b>Subject:</b> Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="PlainText">When amdgpu_device_suspend() is called processes should be frozen
<br>
already. In other words KFD queues etc... should already be idle.<br>
<br>
So when the eviction fails here we missed something previously and that <br>
in turn can cause a ton of problems.<br>
<br>
So ignoring those errors is most likely not a good idea at all.<br>
<br>
Regards,<br>
Christian.<br>
<br>
Am 12.09.23 um 02:21 schrieb Pan, Xinhui:<br>
> Oh yes, pinned BOs are moved to another LRU list, so eviction fails for some other reason.<br>
> I will change the comments in the patch.<br>
> The problem is that eviction can fail for many reasons, for example when a BO is locked.<br>
> AFAIK, KFD stops the queues and flushes some evict/restore work in its suspend callback, so the first eviction, which runs before the KFD callback, is likely to fail.<br>
><br>
> -----Original Message-----<br>
> From: Christian König <ckoenig.leichtzumerken@gmail.com><br>
> Sent: Friday, September 8, 2023 2:49 PM<br>
> To: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org<br>
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Fan, Shikang <Shikang.Fan@amd.com><br>
> Subject: Re: [PATCH] drm/amdgpu: Ignore first evction failure during suspend<br>
><br>
> Am 08.09.23 um 05:39 schrieb xinhui pan:<br>
>> Some BOs might be pinned. So the first eviction's failure will abort<br>
>> the suspend sequence. These pinned BOs will be unpinned afterwards<br>
>> during suspend.<br>
> That doesn't make much sense since pinned BOs don't cause eviction failure here.<br>
><br>
> What exactly is the error code you see?<br>
><br>
> Christian.<br>
><br>
>> Actually it has evicted most BOs, so that should still work fine in<br>
>> sriov full access mode.<br>
>><br>
>> Fixes: 47ea20762bb7 ("drm/amdgpu: Add an extra evict_resource call<br>
>> during device_suspend.")<br>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com><br>
>> ---<br>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----<br>
>> 1 file changed, 5 insertions(+), 4 deletions(-)<br>
>><br>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>> index 5c0e2b766026..39af526cdbbe 100644<br>
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c<br>
>> @@ -4148,10 +4148,11 @@ int amdgpu_device_suspend(struct drm_device<br>
>> *dev, bool fbcon)<br>
>><br>
>> adev->in_suspend = true;<br>
>><br>
>> - /* Evict the majority of BOs before grabbing the full access */<br>
>> - r = amdgpu_device_evict_resources(adev);<br>
>> - if (r)<br>
>> - return r;<br>
>> + /* Try to evict the majority of BOs before grabbing the full access<br>
>> + * Ignore the ret val at first place as we will unpin some BOs if any<br>
>> + * afterwards.<br>
>> + */<br>
>> + (void)amdgpu_device_evict_resources(adev);<br>
>><br>
>> if (amdgpu_sriov_vf(adev)) {<br>
>> amdgpu_virt_fini_data_exchange(adev);<br>
<br>
</div>
</span></font></div>
</div>
</body>
</html>