[PATCH] drm/amdgpu: Add judgement to avoid infinite loop

Sat Jan 29 08:26:27 UTC 2022

I have a solution for this defect and am currently debugging the changes.

-----Original Message-----
From: Zhou1, Tao <Tao.Zhou1 at amd.com> 
Sent: Saturday, January 29, 2022 3:54 PM
To: Chai, Thomas <YiPeng.Chai at amd.com>; amd-gfx at lists.freedesktop.org
Cc: Zhang, Hawking <Hawking.Zhang at amd.com>; Clements, John <John.Clements at amd.com>
Subject: RE: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop

As a quick workaround, I agree with the solution. But regarding the root cause, the list is still corrupted.
Can we make ras_list a global variable shared across all cards, and add a list-empty check (or a flag indicating the registration status of each ras block) before list add, to avoid redundant registration?
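
To make the suggestion above concrete, here is a minimal userspace sketch, not driver code. It assumes a simplified stand-in for the kernel list (struct list_node, list_init, list_add_tail) and a hypothetical struct ras_block with a hypothetical registered flag; none of these names come from amdgpu_ras.c. It shows that adding one shared node to two per-device heads breaks traversal of the first head, and that a single global head plus a registration flag keeps the list intact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-in for the kernel's circular doubly linked list. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

/* Same pointer updates as a tail insertion into a circular list. */
static void list_add_tail(struct list_node *node, struct list_node *head)
{
	assert(node != NULL && head != NULL);
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical ras block: one shared instance with one embedded node. */
struct ras_block {
	struct list_node node;
	bool registered;	/* hypothetical "already registered" flag */
};

/* Add the block only once, however many devices are initialized. */
static bool ras_register_once(struct ras_block *blk, struct list_node *head)
{
	if (blk->registered)
		return false;	/* redundant registration is skipped */
	list_add_tail(&blk->node, head);
	blk->registered = true;
	return true;
}

/*
 * Count the nodes reachable from head. Returns -1 if the walk has not
 * come back to head after 'limit' steps, i.e. the list is corrupted.
 */
static int list_count_bounded(const struct list_node *head, int limit)
{
	const struct list_node *pos = head->next;
	int n = 0;

	while (pos != head) {
		if (++n > limit)
			return -1;
		pos = pos->next;
	}
	return n;
}

/* Two per-device heads, one shared node added to both. */
static int demo_unguarded(void)
{
	struct list_node dev0, dev1;
	struct ras_block blk = { .registered = false };

	list_init(&dev0);
	list_init(&dev1);
	list_add_tail(&blk.node, &dev0);
	list_add_tail(&blk.node, &dev1);	/* node now links to dev1 only */
	return list_count_bounded(&dev0, 16);	/* never returns to dev0 */
}

/* One global head, registration guarded by the flag. */
static int demo_guarded(void)
{
	struct list_node global_head;
	struct ras_block blk = { .registered = false };

	list_init(&global_head);
	ras_register_once(&blk, &global_head);	/* first device init */
	ras_register_once(&blk, &global_head);	/* second device init: skipped */
	return list_count_bounded(&global_head, 16);
}
```

In this sketch demo_unguarded() reports a corrupted list for the first head, while demo_guarded() sees exactly one node. The flag still matters with a global head, because adding the same node to the same head twice makes the node point to itself.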

Regards,
Tao

> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai at amd.com>
> Sent: Saturday, January 29, 2022 11:53 AM
> To: amd-gfx at lists.freedesktop.org
> Cc: Chai, Thomas <YiPeng.Chai at amd.com>; Zhang, Hawking 
> <Hawking.Zhang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, 
> John <John.Clements at amd.com>; Chai, Thomas <YiPeng.Chai at amd.com>
> Subject: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop
> 
> 1. The infinite loop that causes the soft lockup occurs on systems with
>    multiple amdgpu cards that support the ras feature.
> 2. This is a workaround patch. It is valid for multiple amdgpu cards of the
>    same type.
> 3. The root cause is that each GPU card device has a separate .ras_list
>    list head, but the instance and list node of each ras block are
>    unique. Each time a device is initialized, every ras instance adds
>    its single list node to that device's .ras_list again. As a result,
>    only the .ras_list of the last initialized device is completely
>    correct. The .ras_list->prev and .ras_list->next of a device
>    initialized earlier still point to the correct ras instances, but
>    the prev and next pointers of those ras instances point to the last
>    initialized device's .ras_list instead of the .ras_list where the
>    traversal began. When list_for_each_entry_safe searches for a
>    non-existent ras node on any device other than the last one, the
>    next pointer of the last ras instance never equals the starting
>    .ras_list, so the loop cannot terminate and the program enters an
>    infinite loop.
>  BTW: Since the data and initialization process of each card are the same,
>       the links between ras instances are not destroyed each time a
>       device is initialized.
>  4. The soft lockup logs are as follows:
> [  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
> [  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
> [  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
> [  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
> [  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
> [  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
> [  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
> [  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
> [  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
> [  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
> [  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
> [  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
> [  262.166267] Call Trace:
> [  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
> [  262.166529]  ? psi_task_switch+0xd2/0x250
> [  262.166537]  ? __switch_to+0x11d/0x460
> [  262.166542]  ? __switch_to_asm+0x36/0x70
> [  262.166549]  process_one_work+0x220/0x3c0
> [  262.166556]  worker_thread+0x4d/0x3f0
> [  262.166560]  ? process_one_work+0x3c0/0x3c0
> [  262.166563]  kthread+0x12b/0x150
> [  262.166568]  ? set_kthread_struct+0x40/0x40
> [  262.166571]  ret_from_fork+0x22/0x30
> 
> Signed-off-by: yipechai <YiPeng.Chai at amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d4e07d0acb66..3d533ef0783d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -884,6 +884,7 @@ static int amdgpu_ras_block_match_default(struct amdgpu_ras_block_object *block_
>  static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_device *adev,
>  					enum amdgpu_ras_block block, uint32_t sub_block_index)
>  {
> +	int loop_cnt = 0;
>  	struct amdgpu_ras_block_object *obj, *tmp;
> 
>  	if (block >= AMDGPU_RAS_BLOCK__LAST)
> @@ -900,6 +901,9 @@ static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_de
>  			if (amdgpu_ras_block_match_default(obj, block) == 0)
>  				return obj;
>  		}
> +
> +		if (++loop_cnt >= AMDGPU_RAS_BLOCK__LAST)
> +			break;
>  	}
> 
>  	return NULL;
> --
> 2.25.1