[PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
Christian König
ckoenig.leichtzumerken at gmail.com
Thu Jul 23 13:10:23 UTC 2020
I agree with Guchun as well.
When you have a dynamic module parameter and change the bad page
threshold at runtime, the GPU might just stop working suddenly.
That is not a good idea as far as I can see.
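To illustrate the concern, here is a minimal user-space sketch, not driver code. The struct, the function name and the numbers are made up for illustration; the only thing taken from the patch description is that the GPU is retired once the faulty page count exceeds the threshold.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-device RAS state; not the amdgpu structures. */
struct gpu_model {
	int bad_page_count;     /* faulty pages recorded so far */
	int bad_page_threshold; /* the GPU is retired above this count */
};

/* Retire when the recorded bad pages exceed the threshold. */
static bool gpu_should_retire(const struct gpu_model *gpu)
{
	return gpu->bad_page_count > gpu->bad_page_threshold;
}
```

With 40 recorded bad pages and a threshold of 100 the check returns false. If a runtime write lowers the threshold to 10, the very next check returns true and a GPU that was working fine is retired on the spot.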
Regards,
Christian.
On 23.07.20 at 05:47, Chen, Guchun wrote:
> [AMD Public Use]
>
> Hi Dennis,
>
> To be honest, I considered your suggestion when I started the design. My thought is that, in practice, the bad page threshold is a static configuration that should be set once at probe time.
> So a module parameter is an ideal choice for this.
>
> Regards,
> Guchun
>
> -----Original Message-----
> From: Li, Dennis <Dennis.Li at amd.com>
> Sent: Thursday, July 23, 2020 8:32 AM
> To: Chen, Guchun <Guchun.Chen at amd.com>; amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
> Subject: RE: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Guchun,
> It is better to let users change amdgpu_bad_page_threshold through sysfs, so that they do not need to reboot the system when they want to change their strategy.
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Chen, Guchun <Guchun.Chen at amd.com>
> Sent: Wednesday, July 22, 2020 11:14 AM
> To: amd-gfx at lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher at amd.com>; Zhang, Hawking <Hawking.Zhang at amd.com>; Li, Dennis <Dennis.Li at amd.com>; Yang, Stanley <Stanley.Yang at amd.com>; Zhou1, Tao <Tao.Zhou1 at amd.com>; Clements, John <John.Clements at amd.com>
> Cc: Chen, Guchun <Guchun.Chen at amd.com>
> Subject: [PATCH 1/5] drm/amdgpu: add bad page count threshold in module parameter
>
> bad_page_threshold can be specified to detect and retire a bad GPU if the number of faulty pages exceeds it.
>
> When it is -1, RAS will use the typical bad page failure value.
>
> Signed-off-by: Guchun Chen <guchun.chen at amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 06bfb8658dec..bb83ffb5e26a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -181,6 +181,7 @@ extern uint amdgpu_dm_abm_level;
> extern struct amdgpu_mgpu_info mgpu_info;
> extern int amdgpu_ras_enable;
> extern uint amdgpu_ras_mask;
> +extern int amdgpu_bad_page_threshold;
> extern int amdgpu_async_gfx_ring;
> extern int amdgpu_mcbp;
> extern int amdgpu_discovery;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index d28b95f721c4..f99671101746 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -161,6 +161,7 @@ struct amdgpu_mgpu_info mgpu_info = {
> };
> int amdgpu_ras_enable = -1;
> uint amdgpu_ras_mask = 0xffffffff;
> +int amdgpu_bad_page_threshold = -1;
>
> /**
> * DOC: vramlimit (int)
> @@ -801,6 +802,16 @@ module_param_named(tmz, amdgpu_tmz, int, 0444);
> MODULE_PARM_DESC(reset_method, "GPU reset method (-1 = auto (default), 0 = legacy, 1 = mode0, 2 = mode1, 3 = mode2, 4 = baco)");
> module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>
> +/**
> + * DOC: bad_page_threshold (int)
> + * Bad page threshold configuration is driven by RMA(Return Merchandise
> + * Authorization) policy, which is to specify the threshold value of faulty
> + * pages detected by ECC, which may result in GPU's retirement if total
> + * faulty pages by ECC exceed threshold value.
> + */
> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default typical value))");
> +module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> +
> static const struct pci_device_id pciidlist[] = {
> #ifdef CONFIG_DRM_AMDGPU_SI
> {0x1002, 0x6780, PCI_ANY_ID, PCI_ANY_ID, 0, 0, CHIP_TAHITI},
> --
> 2.17.1
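
For reference, the "-1 = auto" convention in the parameter description could be resolved once at probe time roughly as below. This is only a sketch: the function name is hypothetical, and the default value is a placeholder, since the patch does not say what the typical value is.

```c
/* Placeholder only; the patch does not state the "typical" value. */
#define EXAMPLE_DEFAULT_BAD_PAGE_THRESHOLD 100

/*
 * Hypothetical helper: map the module parameter to the threshold in use.
 * -1 selects the default; any other value is passed through unchanged.
 */
static int resolve_bad_page_threshold(int param)
{
	if (param == -1)
		return EXAMPLE_DEFAULT_BAD_PAGE_THRESHOLD;
	return param;
}
```

Because the parameter is registered with mode 0444, it can be read back through sysfs but not written, so a value resolved this way stays fixed for the lifetime of the loaded module.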