[RFC PATCH v2 10/17] WIP: gpu: host1x: Add no-recovery mode

Tue Sep 15 10:57:46 UTC 2020

On 9/13/20 9:37 PM, Dmitry Osipenko wrote:
> 13.09.2020 12:51, Mikko Perttunen пишет:
> ...
>>> All waits that are internal to a job should only wait for relative sync
>>> point increments. >
>>> In the grate-kernel every job uses unique-and-clean sync point (which is
>>> also internal to the kernel driver) and a relative wait [1] is used for
>>> the job's internal sync point increments [2][3][4], and thus, kernel
>>> driver simply jumps over a hung job by updating DMAGET to point at the
>>> start of a next job.
>>
>> Issues I have with this approach:
>>
>> * Both this and my approach have the requirement for userspace, that if
>> a job hangs, the userspace must ensure all external waiters have timed
>> out / been stopped before the syncpoint can be freed, as if the
>> syncpoint gets reused before then, false waiter completions can happen.
>>
>> So freeing the syncpoint must be exposed to userspace. The kernel cannot
>> do this since there may be waiters that the kernel is not aware of. My
>> proposal only has one syncpoint, which I feel makes this part simpler, too.
>>
>> * I believe this proposal requires allocating a syncpoint for each
>> externally visible syncpoint increment that the job does. This can use
>> up quite a few syncpoints, and it makes syncpoints a dynamically
>> allocated resource with unbounded allocation latency. This is a problem
>> for safety-related systems.
> 
> Maybe we could have a special type of a "shared" sync point that is
> allocated per-hardware engine? Then shared SP won't be a scarce resource
> and job won't depend on it. The kernel or userspace driver may take care
> of recovering the counter value of a shared SP when job hangs or do
> whatever else is needed without affecting the job's sync point.

Having a shared syncpoint opens up possibilities for interference 
between jobs (if we're not using the firewall, the HW cannot distinguish 
between jobs on the same channel), and doesn't work if there are 
multiple channels using the same engine, which we want to do for newer 
chips (for performance and virtualization reasons).

Even then, even if we need to allocate one syncpoint per job, the issue 
seems to be there.

> 
> Primarily I'm not feeling very happy about retaining the job's sync
> point recovery code because it was broken the last time I touched it and
> grate-kernel works fine without it.

I'm not planning to retain it any longer than necessary, which is until 
the staging interface is removed. Technically I can already remove it 
now -- that would cause any users of the staging interface to 
potentially behave weirdly if a job times out, but maybe we don't care 
about that all that much?

> 
>> * If a job fails on a "virtual channel" (userctx), I think it's a
>> reasonable expectation that further jobs on that "virtual channel" will
>> not execute, and I think implementing that model is simpler than doing
>> recovery.
> 
> Couldn't jobs just use explicit fencing? Then a second job won't be
> executed if first job hangs and explicit dependency is expressed. I'm
> not sure that concept of a "virtual channel" is applicable to drm-scheduler.

I assume what you mean is that each job incrementing a syncpoint would 
first wait for the preceding job incrementing that syncpoint to 
complete, by waiting for the preceding job's fence value.

I would consider what I do in this patch to be an optimization of that. 
Let's say we detect a timed out job and just skip that job in the CDMA 
pushbuffer (but do not CPU-increment syncpoints), then at every 
subsequent job using that syncpoint, we will be detecting a timeout and 
skipping it eventually. With the "NOPping" in this patch we just 
pre-emptively cancel those jobs so that we don't have to spend time 
waiting for timeouts in the future. Functionally these should be the 
same, though.

The wait-for-preceding-job-to-complete thing should already be there in 
form of the "serialize" operation if the jobs use the same syncpoint.

So, if DRM scheduler's current operation is just skipping the timing out 
job and continuing from the next job, that's functionally fine. But we 
could improve DRM scheduler to allow for also cancelling future jobs 
that we know will time out. That would be in essence "virtual channel" 
support.

Userspace still has options -- if it puts in other prefences, timeouts 
will happen as usual. If it wants to have multiple "threads" of 
execution where a timeout on one doesn't affect the others, it can use 
different syncpoints for them.

> 
> I'll need to see a full-featured driver implementation and the test
> cases that cover all the problems that you're worried about because I'm
> not aware about all the T124+ needs and seeing code should help. Maybe
> in the end yours approach will be the best, but for now it's not clear :)
> 

My primary goal is simplicity of programming model and implementation. 
Regarding the resource management concerns, I can of course create a 
test case that allocates a lot of resources, but what I'm afraid about 
is that once we put this into a big system, with several VMs with their 
own resource reservations (including syncpoints), and the GPU and camera 
subsystems using hundreds of syncpoints, dynamic usage of those 
resources will create uncertainty in the system, and bug reports.

And of course, if we want to make a safety-related system, you also need 
to document before-hand how you are ensuring that e.g. job submission 
(including syncpoint allocation if that is dynamic) happens under x 
microseconds.

I don't think the model used in the grate host1x driver is bad, and I 
think the code there and its integration with the existing kernel 
frameworks are beautiful, and that is definitely a goal for the mainline 
driver as well. But I think we can make things even simpler overall and 
more reliable.

Mikko