[PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency

Erico Nunes nunes.erico at gmail.com
Sun Jan 21 15:11:41 UTC 2024


On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825 at gmail.com> wrote:
>
> On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton at sina.com> wrote:
> >
> > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico at gmail.com>
> > >
> > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > >       struct lima_sched_task *task = to_lima_task(job);
> > >       struct lima_device *ldev = pipe->ldev;
> > > +     struct lima_ip *ip = pipe->processor[0];
> > > +
> > > +     /*
> > > +      * If the GPU managed to complete this jobs fence, the timeout is
> > > +      * spurious. Bail out.
> > > +      */
> > > +     if (dma_fence_is_signaled(task->done_fence)) {
> > > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > > +     }
> >
> > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> > and stop selling bandaid like this because you have options like locating
> > the reasons behind timeout.
>
> This chang do look like to aim for 2FPS apps. Maybe 500ms is too short
> for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses
> 10s timeout. So increasing the timeout seems to be an equivalent and better
> way?

Indeed 500ms might be too optimistic for the sort of applications that
users expect to run on this hardware currently. For a more similar
reference though, other embedded drivers like v3d and panfrost do
still set it to 500ms. Note that this patch is just exactly the same
as exists in Panfrost today and was already discussed with some common
arguments in the patches of this series:
https://patchwork.freedesktop.org/series/120820/

But I would agree to bump the timeout to a higher value for lima. Some
distributions are already doing this with module parameters anyway to
even be able to run some more demanding application stacks on a Mali
400.

Another thing we might consider (probably in a followup patchset to
not delay these fixes forever for the people hitting this issue) is to
configure the Mali hardware watchdog to the value that we would like
as a timeout. That way we would get timeout jobs going through the
same error irq path as other hardware error jobs and might be able to
delete(?)/simplify this software timeout code.

In the meantime for v2 of this series I'll make the change to account
for the multiple pp irqs. So do you agree it is ok to leave
drm_sched_stop() as it is at least for this series?

Thanks all for the reviews

Erico


More information about the lima mailing list