drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Christian König christian.koenig at amd.com
Wed May 3 11:40:50 UTC 2023


Hi Lucas,

On 03.05.23 12:28, Lucas Stach wrote:
> Hi Christian,
>
> On Wednesday, 2023-05-03 at 10:47 +0200, Christian König wrote:
>> Adding Luben as well.
>>
>> Am 03.05.23 um 10:16 schrieb Boris Brezillon:
>>> [SNIP]
>>>> To sum up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>> After the discussion I had with Matthew yesterday on IRC, I
>>> realized there was no clear agreement on this. Matthew uses those 3
>>> helpers in the Xe driver right now, and given that he intends to use a
>>> multi-threaded wq for his 1:1 schedulers' run queues, there's no way he
>>> can get away without calling drm_sched_{start,stop}().
>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method
>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>> AFAIU, is just about enforcing that the job-done fence (the one returned
>>> by ::run_job()) doesn't transition from a signaled to an unsignaled state.
>>>
>>> But probably more important than providing a generic helper, we should
>>> document the resubmit semantics (AKA what should and/or shouldn't be
>>> done with pending jobs when a recovery happens). Forbidding
>>> people from using a generic helper function doesn't give any guarantee
>>> that they'll do the right thing when coding their own logic, unless we
>>> give clues about what's considered right/wrong, and the current state
>>> of the doc is pretty unclear in this regard.
>> I should probably talk about the history of the re-submit feature a bit
>> more.
>>
>> Basically AMD came up with re-submission as a cheap way of increasing
>> the reliability of GPU resets. The problem is that it turned into an
>> absolute nightmare. We have tried for the last 5 years or so to get it
>> stable and it's still crashing.
>>
>> The first and most major problem is that the kernel doesn't even have
>> the information whether re-submitting jobs is possible or not. For
>> example, a job which has already been pushed to the hw could have
>> grabbed a binary semaphore, and re-submitting it will just wait forever
>> for the semaphore to be released.
>>
> I can follow this argument, but concluding that job resubmission is
> impossible is punishing simple GPUs. On Vivante GPUs we have exactly
> one job running at a time and all dependencies are visible to the
> scheduler, as we don't have/use any hardware synchronization mechanism;
> all synchronization is piped through kernel-visible fences.
>
> It's reasonably easy for the etnaviv driver to find the guilty job to
> skip while resubmitting all other jobs in the current hardware queue.
> I'm not really fond of making all applications deal with resets of
> innocent contexts when we can solve this via resubmission on simple HW.
>
> I know that more complex hardware and use-cases might still require the
> kernel driver for this HW to give up and shoot all contexts active at
> the time of the GPU reset, but that's the price you pay for the
> hardware being more capable. I don't see why we should also pay that
> price on really simple HW.

You can still re-create the hw state inside your driver to continue work
from some point if you know that this will work.

As I wrote below, the scheduler component can even provide help with
that, in the form of providing all the unsignaled hw or scheduler fences
for example.
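
Such a helper doesn't exist yet, but just to sketch the idea
(pending_list and s_fence->finished are the existing scheduler
internals, the callback is made up):

static void drm_sched_for_each_unsignaled(struct drm_gpu_scheduler *sched,
					  void (*cb)(struct drm_sched_job *job,
						     void *data),
					  void *data)
{
	struct drm_sched_job *job;

	/* Walk everything the scheduler has handed to the hw but hasn't
	 * seen completed yet, and let the driver decide what to do. */
	list_for_each_entry(job, &sched->pending_list, list)
		if (!dma_fence_is_signaled(&job->s_fence->finished))
			cb(job, data);
}

The driver would then use the unsignaled fences to figure out which
queues it can safely re-create.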

But what we absolutely should *not* do is have this re-submission
feature, because it requires re-using the dma_fence objects. In other
words, this dance of detaching the scheduler fence from the hw fence
and attaching a new one is what absolutely doesn't work.
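
For reference, this is roughly the dance drm_sched_resubmit_jobs()
performs today (heavily condensed, guilty handling and error paths
dropped):

	struct drm_sched_job *s_job;
	struct dma_fence *fence;

	list_for_each_entry(s_job, &sched->pending_list, list) {
		/* the old hw fence was already dropped in drm_sched_stop() */
		fence = sched->ops->run_job(s_job);

		/* re-point the scheduler fence at the new hw fence */
		s_job->s_fence->parent = dma_fence_get(fence);
		dma_fence_put(fence);	/* drop run_job's reference */
	}

The scheduler fence stays alive across the reset while its hw fence is
swapped underneath it, and that's exactly the part which can't be made
to work reliably.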

>> The second problem is that the dma_fence semantics don't allow a fence
>> to ever transition from signaled back to unsignaled. This means that
>> you can't re-use the hw fence and need to allocate a new one, but since
>> memory allocation is forbidden inside a reset handler as well (YES, we
>> need to better document that part) you actually need to keep a bunch of
>> hw fences pre-allocated to make this work. Amdgpu chose to illegally
>> re-use the hw fence instead, which only works with quite extreme hacks.
>>
> I'm with Boris here. Could you please explain when a fence would
> already be signaled in a GPU reset scenario and would need to go back
> to unsignaled, so we are on the same page here?

Take a look at how this re-submission feature of the scheduler works. 
The approach is basically that you detach the hw fence from the 
scheduler fence and then attach a new one.

Either you re-use the hw fence and violate the dma_fence semantics, or
you make sure you can always allocate a new hw fence object. Basically
you can only choose between evils, and all of them suck.
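
Just to illustrate why the second option isn't harmless either: since
allocation is forbidden in the reset path, you need a reserve that is
refilled from process context, something like this (my_ring/my_fence
and their fields are all made up):

static struct dma_fence *my_ring_get_reset_fence(struct my_ring *ring)
{
	struct my_fence *f;

	/* reserve is refilled from process context after the reset */
	f = list_first_entry_or_null(&ring->reserved_fences,
				     struct my_fence, node);
	if (!f)
		return NULL;	/* reserve exhausted, reset can't proceed */

	list_del(&f->node);
	dma_fence_init(&f->base, &my_fence_ops, &ring->fence_lock,
		       ring->fence_context, ++ring->seqno);
	return &f->base;
}

And now you have to prove that the reserve can never run dry while a
reset is in flight, which is yet another source of corner cases.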

> Also, the no-memory-allocation policy really needs some documentation.
> Currently etnaviv may allocate a bunch of memory to create the
> devcoredump before resetting the GPU, and I don't believe etnaviv is
> the only driver doing such a thing. The allocations are set up to just
> give up if there is memory pressure, as getting the GPU back in working
> order is obviously more important than writing GPU state data for post
> mortem analysis, but I also don't see where else this allocation could
> be done.

Well, doing some optional memory allocation for a devcoredump for post
mortem analysis is perfectly fine. If that fails you can just write
"sorry, can't dump because of lack of memory" into the syslog.

But what we can't do is rely on memory allocation for proper operation;
something like "I can't get the desktop working again because a memory
allocation in the GPU reset handler failed" is a no-go.
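
A sketch of the kind of best-effort dump I consider fine (the my_gpu_*
names are made up, the GFP flags are the important part):

static void my_gpu_coredump(struct my_gpu *gpu, size_t size)
{
	/* Best effort only: give up immediately under memory pressure
	 * instead of stalling the reset in reclaim. */
	void *buf = __vmalloc(size, GFP_NOWAIT | __GFP_NOWARN |
			      __GFP_NORETRY);

	if (!buf) {
		dev_warn(gpu->dev, "skipping devcoredump: no memory\n");
		return;	/* post-mortem data is optional, the reset isn't */
	}

	my_gpu_fill_dump(gpu, buf, size);
	/* devcoredump takes ownership of buf and vfree()s it later */
	dev_coredumpv(gpu->dev, buf, size, GFP_NOWAIT);
}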

>> The third problem is that the lifetime of the job object was actually
>> defined very well before we tried to use re-submission. Basically it's
>> just an intermediate state used between the IOCTL and pushing things to
>> the hw. Introducing this re-submit feature completely messed that up
>> and caused quite a number of use-after-free errors in the past, which
>> are again only solved by quite some hacks.
>>
> Isn't it still well-defined? The job object lives until its parent
> fence has signaled. Pulling an unsignaled parent fence and attaching a
> new one is not a problem.

Yeah, but the problem again is: which parent fence? The old one, the new
one? How do you prevent racing between fence signaling and job timeout?
Can we clean up the job from interrupt or atomic context? (It turned out
we can't.)

The idea of the job object was to give the scheduler something to work
with: you push it into the scheduler at one end and get it back at the
other end based on scheduling properties, dependencies, etc. Initially
that was even just a void pointer.

The hw fence then represents what's currently running on the hw, with 
timeout, resource allocation, etc...

If necessary the driver can even use the same object for both (like 
amdgpu does now), but the key point is that this is nothing the 
scheduler should be dealing with.
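
In API terms that separation is simply the ->run_job() contract: the
job goes in, the hw fence comes out (the my_* names are made up):

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = to_my_job(sched_job);

	/* Up to this point the job was just the scheduler's bookkeeping
	 * object; the returned fence is what represents the work on the
	 * hw from here on. */
	return my_ring_submit(job->ring, job);
}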

Regards,
Christian.

>
> Regards,
> Lucas
>
>> What we should do in the GPU scheduler instead is the following:
>>
>> 1. Don't support re-submission at all!
>>       Instead we can provide help to drivers to query which fences
>> (scheduler or hw) are still not signaled yet.
>>       This can then be used to re-create hw state if (and only if!) the
>> driver knows what it's doing and can actually guarantee that this will
>> work, e.g. the case for XE where the kernel driver knows which contexts
>> were not running at the time and can re-create their queues.
>>
>> 2. We can provide both a wq to use for single-threaded applications as
>> well as start/stop semantics.
>>       It's just that the start/stop semantics should never touch what
>> was already submitted, but rather just make sure that we don't get any
>> new submissions.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>
>>> Boris


