[PATCH] drm/i915/display: Change ret value type from int to long

Mon Jul 7 18:12:03 UTC 2025

On Fri, Jul 04, 2025 at 02:27:02PM +0200, Christian König wrote:
> On 04.07.25 14:00, Jani Nikula wrote:
> > On Fri, 04 Jul 2025, Aakash Deep Sarkar <aakash.deep.sarkar at intel.com> wrote:
> >> dma_fence_wait_timeout returns a long type but the driver is
> >> only using the lower 32 bits of the retval and discarding the
> >> upper 32 bits.
> >>
> >> This is particularly problematic if there are already signalled
> >> or stub fences on some of the hw planes. In this case the
> >> dma_fence_wait_timeout function will immediately return with
> >> timeout value MAX_SCHEDULE_TIMEOUT (0x7fffffffffffffff) since
> >> the fence is already signalled.
> 
> That is not correct. If the fence is signaled dma_fence_wait_timeout() returns the remaining timeout or at least 1 if the input timeout was 0.
> 

>From my look this looks correct.

> Could be that i915 has a separately implemented fence_ops->wait callback which incorrectly returns MAX_SCHEDULE_TIMEOUT, but i strongly doubt that because that would break tons of stuff.
> 
> > If the driver only uses the lower
> >> 32 bits of this return value then it'll interpret it as an error
> >> code (0xFFFFFFFF or (-1)) and skip the wait on the remaining fences.
> >>
> >> This issue was first observed with the Android compositor where
> >> the GPU composited layer was not properly waited on when there
> >> were stub fences in other overlay planes resulting in significant
> >> visual artifacts.
> > 
> > Thanks for the patch, good catch!
> 
> If I'm not completely mistaken this patch is not necessary.
> 
> dma_fence_wait_timeout() does *not* return MAX_SCHEDULE_TIMEOUT when the fence is signaled, but rather the remaining timeout which is the input value at maximum.
> 
> So when the input timeout fits into 32bits the returned timeout also fits into 32bits as well.
> 

Sure, but the input here is MAX_SCHEDULE_TIMEOUT, which is not 32 bits.
The interface accepts a 32-bit timeout input, so the return value can be
greater than 32 bits.

> And as far as I can see i915 is using the config option DRM_I915_FENCE_TIMEOUT here. So as long as you don't set this to very very large value this should work as expected.
>
> >> Test: No graphical artifacts with shadertoy
> > 
> > We've never used this commit trailer before, please let's not start now.
> > 
> > The subject should plainly state the "what", and the commit message
> > should answer the "why". You do have that here, but I think the subject
> > is still a bit too much nuts and bolts.
> > 
> > For example,
> > 
> > 	drm/i915/display: Fix dma_fence_wait_timeout() return value handling
> > 
> > would state the "what" in *much* more helpful terms for anyone looking
> > at git logs.
> > 
> > Presumably this has been an error for some time. You should do a little
> > bit of git blame on the offending lines. It'll find commit d59cf7bb73f3
> > ("drm/i915/display: Use dma_fence interfaces instead of i915_sw_fence").
> > 
> > Based on that, we should add:
> > 
> > Fixes: d59cf7bb73f3 ("drm/i915/display: Use dma_fence interfaces instead of i915_sw_fence")
> > Cc: Ville Syrjälä <ville.syrjala at linux.intel.com>
> > Cc: Maarten Lankhorst <maarten.lankhorst at linux.intel.com>
> > Cc: Jouni Högander <jouni.hogander at intel.com>
> > Cc: <stable at vger.kernel.org> # v6.8+
> > 
> > Then it occurs to me this looks like a common mistake to make. A little
> > bit of git grep on dma_fence_wait_timeout() quickly finds multiple
> > similar mistakes in drm, at least amdgpu, etnaviv, msm, and tegra. Cc
> > some maintainers FYI. This class of bugs could cause issues elsewhere.
> 
> I can only guess but I think all those use cases use a fixed small timeout as well. IIRC amdgpu uses a timeout in the millisecond range.
>

Ah, no. A quick grep shows multiple cases in AMDGPU where long +
MAX_SCHEDULE_TIMEOUT is used:

- amdgpu_vm_cpu_update
- amdgpu_vm_wait_idle
- amdgpu_bo_kmap
- amdgpu_hmm_invalidate_gfx
- amdgpu_gem_wait_idle_ioctl

I'm pretty confused by the pushback here—it's among the most confusing
I've ever seen.

Again, this patch is almost certainly correct. I've made this mistake at
least twice myself.

> > Let's also Cc Julia and Dan in case they have ideas to improve static
> > analysis to catch this class of bugs. Or maybe one exists already, but
> > we're just not running them!
> 
> Yeah, agree. A script which checks if the input timeout could be >32bit and the return value is assigned to something <=32bits is probably a really good idea.
>

Yes, I agree. -Wstrict-overflow or -Wconversion in GCC would catch this,
but enabling them causes the core kernel includes to quickly explode. A
static checker would be a good solution here, or we could fix the Linux
kernel so it compiles cleanly with these options.

Matt

> Regards,
> Christian.
> 
> > 
> > Finally, for the actual change,
> > 
> > Reviewed-by: Jani Nikula <jani.nikula at intel.com>
> > 
> > 
> >> Signed-off-by: Aakash Deep Sarkar <aakash.deep.sarkar at intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/display/intel_display.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/i915/display/intel_display.c b/drivers/gpu/drm/i915/display/intel_display.c
> >> index 456fc4b04cda..7035c1fc9033 100644
> >> --- a/drivers/gpu/drm/i915/display/intel_display.c
> >> +++ b/drivers/gpu/drm/i915/display/intel_display.c
> >> @@ -7092,7 +7092,8 @@ static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_stat
> >>  	struct drm_i915_private *i915 = to_i915(intel_state->base.dev);
> >>  	struct drm_plane *plane;
> >>  	struct drm_plane_state *new_plane_state;
> >> -	int ret, i;
> >> +	long ret;
> >> +	int i;
> >>  
> >>  	for_each_new_plane_in_state(&intel_state->base, plane, new_plane_state, i) {
> >>  		if (new_plane_state->fence) {
> > 
>