[RFC v2 10/15] drm/admgpu: make device state machine work in stack like way

Mario Limonciello mario.limonciello at amd.com
Wed Jan 15 19:36:42 UTC 2025


On 1/13/2025 19:58, Gerry Liu wrote:
> 
> 
>> 2025年1月14日 06:27,Mario Limonciello <mario.limonciello at amd.com> 写道:
>>
>> On 1/12/2025 19:42, Jiang Liu wrote:
>>> Make the device state machine work in stack like way to better support
>>> suspend/resume by following changes:
>>> 1. amdgpu_driver_load_kms()
>>> 	amdgpu_device_init()
>>> 		amdgpu_device_ip_early_init()
>>> 			ip_blocks[i].early_init()
>>> 			ip_blocks[i].status.valid = true
>>> 		amdgpu_device_ip_init()
>>> 			amdgpu_ras_init()
>>> 			ip_blocks[i].sw_init()
>>> 			ip_blocks[i].status.sw = true
>>> 			ip_blocks[i].hw_init()
>>> 			ip_blocks[i].status.hw = true
>>> 		amdgpu_device_ip_late_init()
>>> 			ip_blocks[i].late_init()
>>> 			ip_blocks[i].status.late_initialized = true
>>> 			amdgpu_ras_late_init()
>>> 				ras_blocks[i].ras_late_init()
>>> 					amdgpu_ras_feature_enable_on_boot()
>>> 2. amdgpu_pmops_suspend()/amdgpu_pmops_freeze()/amdgpu_pmops_poweroff()
>>> 	amdgpu_device_suspend()
>>> 		amdgpu_ras_early_fini()
>>> 			ras_blocks[i].ras_early_fini()
>>> 				amdgpu_ras_feature_disable()
>>> 		amdgpu_ras_suspend()
>>> 			amdgpu_ras_disable_all_features()
>>> +++		ip_blocks[i].early_fini()
>>> +++		ip_blocks[i].status.late_initialized = false
>>> 		ip_blocks[i].suspend()
>>> 3. amdgpu_pmops_resume()/amdgpu_pmops_thaw()/amdgpu_pmops_restore()
>>> 	amdgpu_device_resume()
>>> 		amdgpu_device_ip_resume()
>>> 			ip_blocks[i].resume()
>>> 		amdgpu_device_ip_late_init()
>>> 			ip_blocks[i].late_init()
>>> 			ip_blocks[i].status.late_initialized = true
>>> 			amdgpu_ras_late_init()
>>> 				ras_blocks[i].ras_late_init()
>>> 					amdgpu_ras_feature_enable_on_boot()
>>> 		amdgpu_ras_resume()
>>> 			amdgpu_ras_enable_all_features()
>>> 4. amdgpu_driver_unload_kms()
>>> 	amdgpu_device_fini_hw()
>>> 		amdgpu_ras_early_fini()
>>> 			ras_blocks[i].ras_early_fini()
>>> +++		ip_blocks[i].early_fini()
>>> +++		ip_blocks[i].status.late_initialized = false
>>> 		ip_blocks[i].hw_fini()
>>> 		ip_blocks[i].status.hw = false
>>> 5. amdgpu_driver_release_kms()
>>> 	amdgpu_device_fini_sw()
>>> 		amdgpu_device_ip_fini()
>>> 			ip_blocks[i].sw_fini()
>>> 			ip_blocks[i].status.sw = false
>>> ---			ip_blocks[i].status.valid = false
>>> +++			amdgpu_ras_fini()
>>> 			ip_blocks[i].late_fini()
>>> +++			ip_blocks[i].status.valid = false
>>> ---			ip_blocks[i].status.late_initialized = false
>>> ---			amdgpu_ras_fini()
>>> The main changes include:
>>> 1) invoke ip_blocks[i].early_fini in amdgpu_pmops_suspend().
>>> 2) set ip_blocks[i].status.late_initialized to false after calling
>>>     callback `early_fini`. We have auditted all usages of the
>>>     late_initialized flag and no functional changes found.
>>> 3) only set ip_blocks[i].status.valid = false after calling the
>>>     `late_fini` callback.
>>> 4) call amdgpu_ras_fini() before invoking ip_blocks[i].late_fini.
>>> There's one more task left to analyze GPU reset related state machine
>>> transitions.
>>> Signed-off-by: Jiang Liu <gerry at linux.alibaba.com>
>>
>> Ideally I think you should swap the order of patch 10 and 11, what do you think?
> I realized this when working patch 11, many changes introduced by patch 10 are changed again by patch 11.
> But swapping these patches will cause too much rework. How about folding these two patches instead?

It might be too big of a patch because it changes a lot all at same time.

Re-ordering is painful but leads to more reable patches and less 
ping-pong of code.

I think you can do something like this (I've done this myself on big 
patch series):

1) squash the two patches together
    git rebase -i HEAD~15

2) Spit it out as a patch file
    git format-patch -1 $SHA
3) Check out the commit before this combined one
    git checkout -b gerry/rebase $SHA~1
4) Apply the patch using patch -p1 (NOT git am)
    patch -p1 < foo.patch
5) Use vscode (specifically) to stage the lines that should go into 
patch 10.
6) Commit patch 10
7) Stage the lines that go into patch 11.
8) Commit patch 11
9) Note those two commit hashes
10) Go back to your original branch
     git checkout gerry/original
11) Run a rebase again, but swap out the "squashed" hash with the two 
hashes you made.

IE if your two hashes are aaaabbb and bbbccc and it's currently

pick abc123

change it to

pick aaaabbb
pick bbbccc

Then finish the rebase and it should swap them all out for you!

> 
> 
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 22 ++++++++++++++++++++--
>>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 6b503fb7e366..c2e4057ecd82 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3449,6 +3449,8 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>>>   		adev->ip_blocks[i].status.sw = false;
>>>   	}
>>>   +	amdgpu_ras_fini(adev);
>>> +
>>>   	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>>>   		if (!adev->ip_blocks[i].status.valid)
>>>   			continue;
>>> @@ -3457,8 +3459,6 @@ static int amdgpu_device_ip_fini(struct amdgpu_device *adev)
>>>   		adev->ip_blocks[i].status.valid = false;
>>>   	}
>>>   -	amdgpu_ras_fini(adev);
>>> -
>>>   	return 0;
>>>   }
>>>   @@ -3516,6 +3516,24 @@ static int amdgpu_device_ip_suspend_phase1(struct amdgpu_device *adev)
>>>   	if (amdgpu_dpm_set_df_cstate(adev, DF_CSTATE_DISALLOW))
>>>   		dev_warn(adev->dev, "Failed to disallow df cstate");
>>>   +	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>>> +		if (!adev->ip_blocks[i].status.valid)
>>> +			continue;
>>> +		if (!adev->ip_blocks[i].status.late_initialized)
>>> +			continue;
>>> +
>>> +		if (adev->ip_blocks[i].version->funcs->early_fini) {
>>> +			r = adev->ip_blocks[i].version->funcs->early_fini(&adev->ip_blocks[i]);
>>> +			if (r) {
>>> +				DRM_ERROR(" of IP block <%s> failed %d\n",
>>> +					  adev->ip_blocks[i].version->funcs->name, r);
>>> +				return r;
>>> +			}
>>> +		}
>>> +
>>> +		adev->ip_blocks[i].status.late_initialized = false;
>>> +	}
>>> +
>>>   	for (i = adev->num_ip_blocks - 1; i >= 0; i--) {
>>>   		if (!adev->ip_blocks[i].status.valid)
>>>   			continue;
> 



More information about the amd-gfx mailing list