[PATCH] drm/amdgpu: change the fence ring wait timeout

Wed Jan 13 07:57:28 UTC 2021

Dear Roy,

Am 13.01.21 um 07:36 schrieb Roy Sun:
> This fix bug where when the engine hang, the fence ring will wait without quit and cause kernel crash

Please wrap commit message body after 75 characters. (checkpatch.pl 
should have shown you that.)

Some typos:

1.  fix*es*
2.  hang*s*
3.  Please add a dot/period at the end of sentences.

If there is a Linux kernel crash, please add the trace.

Please elaborate also. Where and why does the Linux kernel crash, and 
how does your patch fix this.

How can the bug be reproduced? On what hardware did you see it?

> Signed-off-by: Roy Sun <Roy.Sun at amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 48 ++++++++++++++++++++---
>   1 file changed, 43 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 6b0aeee61b8b..738ea65077ea 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -41,6 +41,8 @@
>   #include "amdgpu.h"
>   #include "amdgpu_trace.h"
>   
> +#define AMDGPU_FENCE_TIMEOUT  msecs_to_jiffies(1000)
> +#define AMDGPU_FENCE_GFX_XGMI_TIMEOUT msecs_to_jiffies(2000)

Where did you get the values from, or why did you choose them.

>   /*
>    * Fences
>    * Fences mark an event in the GPUs pipeline and are used
> @@ -104,6 +106,38 @@ static void amdgpu_fence_write(struct amdgpu_ring *ring, u32 seq)
>   		*drv->cpu_addr = cpu_to_le32(seq);
>   }
>   
> +/**
> + * amdgpu_fence_wait_timeout - get the fence wait timeout
> + *
> + * @ring: ring the fence is associated with
> + *
> + * Returns the value of the fence wait timeout.
> + */
> +long amdgpu_fence_wait_timeout(struct amdgpu_ring *ring)

See below, but it should be `signed long`.

> +{
> +	long tmo_gfx, tmo_mm, tmo;
> +	struct amdgpu_device *adev = ring->adev;
> +	tmo_mm = tmo_gfx = AMDGPU_FENCE_TIMEOUT;
> +	if (amdgpu_sriov_vf(adev)) {
> +		tmo_mm = 8 * AMDGPU_FENCE_TIMEOUT;
> +	}

See coding style. No {} needed for one line statements.

> +	if (amdgpu_sriov_runtime(adev)) {
> +		tmo_gfx = 8 * AMDGPU_FENCE_TIMEOUT;

So why not define a special SRIOV macro too. But add a comment, why it 
should be eight times that.

> +	} else if (adev->gmc.xgmi.hive_id) {
> +		tmo_gfx = AMDGPU_FENCE_GFX_XGMI_TIMEOUT;
> +	}

Ditto.

> +	if (ring->funcs->type == AMDGPU_RING_TYPE_UVD ||
> +		ring->funcs->type == AMDGPU_RING_TYPE_VCE ||
> +		ring->funcs->type == AMDGPU_RING_TYPE_UVD_ENC ||
> +		ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
> +		ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC ||
> +		ring->funcs->type == AMDGPU_RING_TYPE_VCN_JPEG)

Align on `ring`?

> +		tmo = tmo_mm;
> +	else
> +		tmo = tmo_gfx;
> +	return tmo;

The return could happen in the branches.

> +}
> +
>   /**
>    * amdgpu_fence_read - read a fence value
>    *
> @@ -166,10 +200,12 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f,
>   		rcu_read_unlock();
>   
>   		if (old) {
> -			r = dma_fence_wait(old, false);
> +			long timeout;
> +			timeout = amdgpu_fence_wait_timeout(ring);
> +			r = dma_fence_wait_timeout(old, false, timeout);

The signature is `signed long` as far as I see.

>   			dma_fence_put(old);
>   			if (r)
> -				return r;
> +				return r < 0 ? r : 0;
>   		}
>   	}
>   
> @@ -343,10 +379,12 @@ int amdgpu_fence_wait_empty(struct amdgpu_ring *ring)
>   		return 0;
>   	}
>   	rcu_read_unlock();
> -
> -	r = dma_fence_wait(fence, false);
> +	
> +	long timeout;
> +	timeout = amdgpu_fence_wait_timeout(ring);
> +	r = dma_fence_wait_timeout(fence, false, timeout);
>   	dma_fence_put(fence);
> -	return r;
> +	return r < 0 ? r : 0;
>   }
>   
>   /**
> 

It looks like this is your first patch submission, so welcome to the 
Linux kernel community. Please keep in mind, that reviewer time is 
precious and scarce, and that formal and simple mistakes should not 
happen. No idea, if AMD has some kind of “onboarding” process for their 
newly employed Linux kernel contributors.

I am looking forward to your the patch iteration and your next 
contributions.

Kind regards,

Paul