No subject
Gerry Liu
gerry at linux.alibaba.com
Mon Jan 13 01:19:57 UTC 2025
> On Jan 10, 2025, at 01:10, Mario Limonciello <mario.limonciello at amd.com> wrote:
>
> General note - don't use HTML for mailing list communication.
>
> I'm not sure if Apple Mail lets you switch this around.
>
> If not, you might try using Thunderbird instead. You can pick to reply in plain text or HTML by holding shift when you hit "reply all"
>
> For my reply I'll convert my reply to plain text, please see inline below.
>
> On 1/8/2025 23:34, Gerry Liu wrote:
>>> On Jan 9, 2025, at 00:33, Mario Limonciello <mario.limonciello at amd.com> wrote:
>>>
>>> On 1/8/2025 07:59, Jiang Liu wrote:
>>>> Subject: [RFC PATCH 00/13] Enhance device state machine to better support suspend/resume
>>>
>>> I'm not sure how this happened, but your subject didn't end up in the subject of the thread on patch 0 so the thread just looks like an unsubjected thread.
>> Maybe it’s caused by one extra blank line at the header.
>
> Yeah that might be it. Hopefully it doesn't happen on v2.
>
>>>
>>>> Recently we were testing suspend/resume functionality with AMD GPUs,
>>>> we have encountered several resource tracking related bugs, such as
>>>> double buffer free, use after free and unbalanced irq reference count.
>>>
>>> Can you share more about how you were hitting these issues? Are they specific to S3 or to s2idle flows? dGPU or APU?
>>> Are they only with SRIOV?
>>>
>>> Is there anything to do with the host influencing the failures to happen, or are you contriving the failures to find the bugs?
>>>
>>> I know we've had some reports about resource tracking warnings on the reset flows, but I haven't heard much about suspend/resume.
>> We are investigating to develop some advanced product features based on amdgpu suspend/resume.
>> So we started by testing the suspend/resume functionality of AMD 308x GPUs with the following simple script:
>> ```
>> echo platform >/sys/power/pm_test
>> i=0
>> while true; do
>>     echo mem >/sys/power/state
>>     let i=i+1
>>     echo $i
>>     sleep 1
>> done
>> ```
>> It succeeds on the first and second iterations but always fails on subsequent iterations on a bare metal server with eight MI308X GPUs.
>
> Can you share more about this server? Does it support suspend to ram or a hardware backed suspend to idle? If you don't know, you can check like this:
>
> ❯ cat /sys/power/mem_sleep
> s2idle [deep]
# cat /sys/power/mem_sleep
[s2idle]
>
> If it's suspend to idle, what does the FACP indicate? You can do this check to find out if you don't know.
>
> ❯ sudo cp /sys/firmware/acpi/tables/FACP /tmp
> ❯ sudo iasl -d /tmp/FACP
> ❯ grep "idle" -i /tmp/FACP.dsl
> Low Power S0 Idle (V5) : 0
>
With acpidump and `iasl -d facp.data`, we got:
[070h 0112 4] Flags (decoded below) : 000084A5
WBINVD instruction is operational (V1) : 1
WBINVD flushes all caches (V1) : 0
All CPUs support C1 (V1) : 1
C2 works on MP system (V1) : 0
Control Method Power Button (V1) : 0
Control Method Sleep Button (V1) : 1
RTC wake not in fixed reg space (V1) : 0
RTC can wake system from S4 (V1) : 1
32-bit PM Timer (V1) : 0
Docking Supported (V1) : 0
Reset Register Supported (V2) : 1
Sealed Case (V3) : 0
Headless - No Video (V3) : 0
Use native instr after SLP_TYPx (V3) : 0
PCIEXP_WAK Bits Supported (V4) : 0
Use Platform Timer (V4) : 1
RTC_STS valid on S4 wake (V4) : 0
Remote Power-on capable (V4) : 0
Use APIC Cluster Model (V4) : 0
Use APIC Physical Destination Mode (V4) : 0
Hardware Reduced (V5) : 0
Low Power S0 Idle (V5) : 0
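To cross-check the decoded dump above against the raw Flags value 000084A5, here is a minimal C sketch that tests individual bits. The bit positions follow the FADT Flags layout in the ACPI specification; the macro and function names are made up for illustration and are not taken from any kernel header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected FADT "Flags" bits (positions per the ACPI specification).
 * The macro names are illustrative only. */
#define FADT_RESET_REG_SUP       (1u << 10) /* Reset Register Supported (V2) */
#define FADT_USE_PLATFORM_CLOCK  (1u << 15) /* Use Platform Timer (V4) */
#define FADT_HW_REDUCED_ACPI     (1u << 20) /* Hardware Reduced (V5) */
#define FADT_LOW_POWER_S0_IDLE   (1u << 21) /* Low Power S0 Idle (V5) */

/* Return true if every bit in `mask` is set in `flags`. */
static bool fadt_flag_set(uint32_t flags, uint32_t mask)
{
    return (flags & mask) == mask;
}
```

With flags = 0x000084A5, bits 10 and 15 are set while bits 20 and 21 are clear, which agrees with the dump: the platform does not advertise Low Power S0 Idle.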
>> With some investigation we found that the gpu asic should be reset during the test,
>
> Yeah; but this comes back to my above questions. Typically there is an assumption that the power rails are going to be cut in system suspend.
>
> If that doesn't hold true, then you're doing a pure software suspend and have found a series of issues in the driver with how that's handled.
Yeah, we are trying to do a `pure software suspend`, letting the hypervisor save/restore system images instead of the guest OS.
During the suspend process, we hope to be able to cancel the suspend request at any later stage.
When we cancel suspend at a late stage, it does behave like a pure software suspend.
>
>> so we submitted a patch to fix the failure (https://github.com/ROCm/ROCK-Kernel-Driver/pull/181)
>
> Typically kernel patches don't go through that repo, they're discussed on the mailing lists. Can you bring this patch for discussion on amd-gfx?
Will post to amd-gfx after solving the conflicts.
Regards,
Gerry
>
>> While analyzing and root-causing the failure, we encountered several crashes, resource leaks and false alarms.
>
> Yeah; I think you found some real issues.
>
>> So I have worked out patch sets to solve the issues we encountered. The other patch set is https://lists.freedesktop.org/archives/amd-gfx/2025-January/118484.html
>
> Thanks!
>
>> With SRIOV in single VF mode, resume always fails. It seems some contexts/VRAM buffers get lost during suspend and are not restored on resume, which causes the failure.
>> We haven't tested SRIOV in multiple-VF mode yet. We need more help from the AMD side to make SR work for SRIOV :)
>>>
>>>> We have tried to solve these issues case by case, but found that may
>>>> not be the right way. Especially about the unbalanced irq reference
>>>> count, new issues keep appearing once we fix the currently known
>>>> issues. After analyzing related source code, we found that there may be
>>>> some fundamental implementaion flaws behind these resource tracking
>>>
>>> implementation
>>>
>>>> issues.
>>>> The amdgpu driver has two major state machines to drive the device
>>>> management flow, one is for ip blocks, the other is for ras blocks.
>>>> The hook points defined in struct amd_ip_funcs for device setup/teardown
>>>> are symmetric, but the implementation is asymmetric, sometimes even
>>>> ambiguous. The two most obvious issues we noticed are:
>>>> 1) amdgpu_irq_get() is called from .late_init() but amdgpu_irq_put()
>>>> is called from .hw_fini() instead of .early_fini().
>>>> 2) the way to reset ip_block.status.valid/sw/hw/late_initialized doesn't
>>>> match the way to set those flags.
>>>> When taking device suspend/resume into account, in addition to device
>>>> probe/remove, things get much more complex. Some issues arise because
>>>> many suspend/resume implementations directly reuse .hw_init/.hw_fini/
>>>> .late_init hook points.
>>>>
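The get/put asymmetry described above can be illustrated with a toy model. This is a minimal sketch, not amdgpu code: `struct toy_block` and all `toy_*` functions are hypothetical, and the single integer counter only stands in for an IRQ source reference count. It assumes a teardown path that reaches hw_fini without late_init having run, for example after a failure between hw_init and late_init.

```c
#include <stdbool.h>

/* Hypothetical toy model of one IP block (not the amdgpu structures). */
struct toy_block {
    int irq_refs;           /* stands in for the IRQ reference count */
    bool late_initialized;  /* set once late_init has taken its reference */
};

/* late_init takes the IRQ reference. */
static void toy_late_init(struct toy_block *b)
{
    b->irq_refs++;
    b->late_initialized = true;
}

/* Asymmetric teardown: hw_fini drops a reference that late_init took,
 * whether or not late_init actually ran. */
static void toy_hw_fini_asymmetric(struct toy_block *b)
{
    b->irq_refs--;
}

/* Symmetric teardown: early_fini undoes late_init, and only if it ran. */
static void toy_early_fini(struct toy_block *b)
{
    if (b->late_initialized) {
        b->irq_refs--;
        b->late_initialized = false;
    }
}

/* In the symmetric scheme hw_fini no longer touches the IRQ reference. */
static void toy_hw_fini_symmetric(struct toy_block *b)
{
    (void)b;
}
```

In this model, calling `toy_hw_fini_asymmetric()` on a block that never ran `toy_late_init()` drives the counter to -1, while the `toy_early_fini()` / `toy_hw_fini_symmetric()` pair leaves it at 0 on every path.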
>>>> So we try to fix those issues by two enhancements/refinements to current
>>>> device management state machines.
>>>> The first change is to make the ip block state machine and associated
>>>> status flags work in stack-like way as below:
>>>> Callback      Status Flags
>>>> early_init:   valid = true
>>>> sw_init:      sw = true
>>>> hw_init:      hw = true
>>>> late_init:    late_initialized = true
>>>> early_fini:   late_initialized = false
>>>> hw_fini:      hw = false
>>>> sw_fini:      sw = false
>>>> late_fini:    valid = false
>>>
>>> At a high level this makes sense to me, but I'd just call it 'late' or 'late_init'.
>>>
>>> Another idea if you make it stack like is to do it as a true enum for the state machine and store it all in one variable.
>> I will add a patch to convert those bool flags into an enum.
>
> Thanks!
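For discussion, the enum idea could look roughly like the sketch below. It only models the stack-like ordering from the callback table in the cover letter; the enum, its values and the two helpers are hypothetical names, not the contents of the planned patch.

```c
#include <stdbool.h>

/* Hypothetical single-variable replacement for the valid/sw/hw/
 * late_initialized bools. Each state implies all the states below it. */
enum ip_block_state {
    IP_STATE_NONE = 0,  /* before early_init, or after late_fini */
    IP_STATE_VALID,     /* early_init done */
    IP_STATE_SW,        /* sw_init done */
    IP_STATE_HW,        /* hw_init done */
    IP_STATE_LATE,      /* late_init done */
};

/* Init side: succeed only when `to` is exactly one level above the
 * current state, then record the new state. */
static bool ip_state_push(enum ip_block_state *s, enum ip_block_state to)
{
    if (to > IP_STATE_LATE || (int)to != (int)*s + 1)
        return false;
    *s = to;
    return true;
}

/* Fini side: succeed only when the block is currently in `from`,
 * then drop back one level. */
static bool ip_state_pop(enum ip_block_state *s, enum ip_block_state from)
{
    if (from == IP_STATE_NONE || *s != from)
        return false;
    *s = (enum ip_block_state)((int)from - 1);
    return true;
}
```

Under this mapping early_init/sw_init/hw_init/late_init push VALID/SW/HW/LATE, and early_fini/hw_fini/sw_fini/late_fini pop LATE/HW/SW/VALID. An early_fini on a block that never reached late_init is then rejected instead of silently unbalancing a counter.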