[RFC PATCH v2 10/17] WIP: gpu: host1x: Add no-recovery mode

Dmitry Osipenko digetx at gmail.com
Sun Sep 13 18:37:02 UTC 2020


13.09.2020 12:51, Mikko Perttunen wrote:
...
>> All waits that are internal to a job should only wait for relative sync
>> point increments.
>>
>> In the grate-kernel every job uses a unique-and-clean sync point (which
>> is also internal to the kernel driver) and a relative wait [1] is used
>> for the job's internal sync point increments [2][3][4]; thus, the kernel
>> driver simply jumps over a hung job by updating DMAGET to point at the
>> start of the next job.
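
To clarify the "relative" part: the wait threshold is the sync point value
sampled at job submission plus the number of increments the job itself
makes, compared in a wraparound-safe way. A minimal sketch of that
comparison, with made-up names rather than the actual grate-kernel code:

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a wraparound-safe "relative" wait: the job
 * records the sync point value at submit time and waits for a number
 * of increments on top of it, instead of an absolute threshold.
 * Names are made up; this is not the actual grate-kernel code.
 */
struct relative_wait {
        uint32_t base;          /* counter value sampled at job submit */
        uint32_t increments;    /* increments the job itself will make */
};

static bool relative_wait_done(const struct relative_wait *w, uint32_t now)
{
        /* signed difference keeps the comparison valid across wraparound */
        return (int32_t)(now - w->base) >= (int32_t)w->increments;
}
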
> 
> Issues I have with this approach:
> 
> * Both this and my approach require that, if a job hangs, userspace
> ensures all external waiters have timed out / been stopped before the
> syncpoint can be freed; if the syncpoint gets reused before then,
> false waiter completions can happen.
> 
> So freeing the syncpoint must be exposed to userspace. The kernel cannot
> do this since there may be waiters that the kernel is not aware of. My
> proposal only has one syncpoint, which I feel makes this part simpler, too.
> 
> * I believe this proposal requires allocating a syncpoint for each
> externally visible syncpoint increment that the job does. This can use
> up quite a few syncpoints, and it makes syncpoints a dynamically
> allocated resource with unbounded allocation latency. This is a problem
> for safety-related systems.

Maybe we could have a special type of "shared" sync point that is
allocated per hardware engine? Then the shared SP won't be a scarce
resource and jobs won't depend on it. The kernel or userspace driver may
take care of recovering the counter value of a shared SP when a job
hangs, or do whatever else is needed, without affecting the job's own
sync point.
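
Roughly something like this (purely hypothetical structures and names,
just to illustrate the ownership split I have in mind):

#include <stdint.h>

/*
 * Purely hypothetical structures, only to illustrate the ownership
 * split: the externally visible counter lives in a per-engine shared
 * sync point, each job keeps a private sync point for its internal
 * waits, and a hang only requires fixing up the shared counter.
 */
struct shared_syncpt {
        uint32_t hw_id;                 /* HW sync point backing the shared counter */
        uint32_t value;                 /* last value exposed to external waiters */
};

struct engine {
        struct shared_syncpt shared;    /* allocated once per HW engine */
};

struct job {
        struct engine *engine;
        uint32_t private_syncpt;        /* job-local, discarded with the job */
        uint32_t external_incrs;        /* increments promised to external waiters */
};

static void engine_recover_shared(struct engine *e, const struct job *hung)
{
        /*
         * Credit the increments the hung job promised externally; the
         * job's private sync point is simply thrown away with the job.
         */
        e->shared.value += hung->external_incrs;
        /* ... write e->shared.value back to the HW sync point here ... */
}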

Primarily, I'm not happy about retaining the job's sync point recovery
code: it was broken the last time I touched it, and grate-kernel works
fine without it.

> * If a job fails on a "virtual channel" (userctx), I think it's a
> reasonable expectation that further jobs on that "virtual channel" will
> not execute, and I think implementing that model is simpler than doing
> recovery.

Couldn't jobs just use explicit fencing? Then a second job won't be
executed if the first job hangs and an explicit dependency is expressed.
I'm not sure the concept of a "virtual channel" is applicable to
drm-scheduler.
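
By explicit fencing I mean roughly this submission pattern (the ioctl
wrapper, struct and field names below are made up for illustration; this
is not an existing UAPI):

#include <unistd.h>

/*
 * Hypothetical userspace submission pattern: the dependency between
 * two jobs is expressed as an explicit in-fence.  If job A hangs and
 * its out-fence signals with an error, job B is never executed.
 * submit_job() stands for a made-up wrapper around the driver's
 * submit ioctl; none of these names are real UAPI.
 */
struct submit_args {
        int in_fence_fd;        /* -1 when the job has no dependency */
        int out_fence_fd;       /* filled in by the kernel on return */
};

int submit_job(int drm_fd, struct submit_args *args);  /* hypothetical */

static int submit_job_pair(int drm_fd)
{
        struct submit_args a = { .in_fence_fd = -1 };
        struct submit_args b = { 0 };
        int err;

        err = submit_job(drm_fd, &a);   /* job A, no dependencies */
        if (err)
                return err;

        b.in_fence_fd = a.out_fence_fd; /* job B explicitly depends on A */
        err = submit_job(drm_fd, &b);

        close(a.out_fence_fd);
        if (!err)
                close(b.out_fence_fd);
        return err;
}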

I'll need to see a full-featured driver implementation and the test
cases that cover all the problems you're worried about, because I'm not
aware of all the T124+ needs and seeing the code should help. Maybe in
the end your approach will be the best, but for now it's not clear :)
