drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated
Christian König
christian.koenig at amd.com
Wed May 3 15:01:50 UTC 2023
On 03.05.23 15:10, Lucas Stach wrote:
> On Wednesday, 2023-05-03 at 13:40 +0200, Christian König wrote:
>> Hi Lucas,
>>
>> On 03.05.23 12:28, Lucas Stach wrote:
>>> Hi Christian,
>>>
>>> On Wednesday, 2023-05-03 at 10:47 +0200, Christian König wrote:
>>>> Adding Luben as well.
>>>>
>>>> On 03.05.23 10:16, Boris Brezillon wrote:
>>>>> [SNIP]
>>>>>> To sum-up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>>>> After the discussion I had with Matthew yesterday on IRC, I
>>>>> realized there was no clear agreement on this. Matthew uses those 3
>>>>> helpers in the Xe driver right now, and given that he intends to use a
>>>>> multi-threaded wq for his 1:1 schedulers' run queues, there's no way he
>>>>> can get away without calling drm_sched_{start,stop}().
>>>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method,
>>>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>>>> AFAIU, are just about enforcing that the job-done fence (the one returned
>>>>> by ::run_job()) doesn't transition from a signaled to an unsignaled state.
>>>>>
>>>>> But probably more important than providing a generic helper, we should
>>>>> document the resubmit semantics (AKA, what should and/or shouldn't be
>>>>> done with pending jobs when a recovery happens). Because forbidding
>>>>> people to use a generic helper function doesn't give any guarantee that
>>>>> they'll do the right thing when coding their own logic, unless we give
>>>>> clues about what's considered right/wrong, and the current state of the
>>>>> doc is pretty unclear in this regard.
>>>> I should probably talk about the history of the re-submit feature a bit
>>>> more.
>>>>
>>>> Basically AMD came up with re-submission as a cheap way of increasing
>>>> the reliability of GPU resets. The problem is that it turned into an
>>>> absolute nightmare. We have tried for the last 5 years or so to get it
>>>> stable and it's still crashing.
>>>>
>>>> The first and most major problem is that the kernel doesn't even have the
>>>> information whether re-submitting jobs is possible or not. For example, a
>>>> job which has already been pushed to the hw could have grabbed a binary
>>>> semaphore, and re-submitting it will just wait forever for the semaphore
>>>> to be released.
>>>>
>>> I can follow this argument, but concluding that job resubmission is
>>> impossible is punishing simple GPUs. On Vivante GPUs we have exactly
>>> one job running at a time and all dependencies are visible to the
>>> scheduler; since we don't have/use any hardware synchronization mechanism,
>>> all synchronization is piped through kernel-visible fences.
>>>
>>> It's reasonably easy for the etnaviv driver to find the guilty job to
>>> skip but resubmit all other jobs in the current hardware queue. I'm not
>>> really fond of having to make all applications deal with innocent
>>> context resets, while we can solve this via resubmission on simple HW.
>>>
>>> I know that more complex hardware and use-cases might still require the
>>> kernel driver for this HW to give up and shoot all contexts active at
>>> the time of the GPU reset, but that's the price you pay for the
>>> hardware being more capable. I don't see why we should also pay that
>>> price on really simple HW.
>> You can still re-create the hw state inside your driver to continue work
>> from some point if you know that this will work.
>>
>> As I wrote below, the scheduler component can even provide help with
>> that, in the form of providing all the unsignaled hw or scheduler
>> fences, for example.
>>
>> But what we absolutely should *not* do is have this re-submission
>> feature, because that requires re-using the dma_fence objects. In other
>> words, this dance of detaching the scheduler fence from the hw fence
>> and attaching a new one is what absolutely doesn't work.
>>
>>>> The second problem is that the dma_fence semantics don't allow a fence
>>>> to ever transition from the signaled back to the unsignaled state. This
>>>> means that you can't re-use the hw fence and need to allocate a new one,
>>>> but since memory allocation is forbidden inside a reset handler as well
>>>> (YES, we need to better document that part) you actually need to keep a
>>>> bunch of hw fences pre-allocated to make this work. Amdgpu chose
>>>> to illegally re-use the hw fence instead, which only works with quite
>>>> extreme hacks.
>>>>
>>> I'm with Boris here. Could you please explain when a fence would be
>>> already signaled in a GPU reset scenario and would need to go back to
>>> unsignaled, so we are on the same page here?
>> Take a look at how this re-submission feature of the scheduler works.
>> The approach is basically that you detach the hw fence from the
>> scheduler fence and then attach a new one.
>>
> Right, but this shouldn't be a problem, as long as the old fence isn't
> signaled yet, right? It becomes a problem when the GPU reset and fence
> signaling are racing each other, due to insufficient hardware/software
> state synchronization.
Exactly that.
> I'm sure that the necessary synchronization can be hard to get right,
> but it's not the act of switching one unsignaled fence to a new one or
> reusing the old unsignaled fence that's causing problems, but the
> complications of making sure that old fences don't signal after the
> timeout detection, right?
Well, sort of; switching the fences is the root of the problem.
Basically the initial hw fence for a submission should never signal
until the hw is done with that submission, having either completed or
aborted it.
The concept that the GPU reset aborts the existing fences and then
re-submits, which gives you new hw fences, is what doesn't work.
Either you re-use the old hw fence (in which case you wouldn't have to
switch it in the first place) or you allocate a new one (which violates
the no-memory-allocation requirement).
> To be clear: I'm not asking those questions because I think I know any
> better, but because I'm actually not sure and would like to understand
> your line of thought and background information when you say "this
> dance with detaching the scheduler fence from the hw fence and attach a
> new one is what absolutely doesn't work" above.
And that is really appreciated, we don't have enough people looking into
this and especially pushing back on bad ideas.
> Driver writers need to understand the limitations of the current
> resubmission scheduler code to do better in their driver
> implementation, otherwise we just end up with (worse) open-coded
> variations of that code in each driver.
Yes, completely agree.
Regards,
Christian.
>
> Regards,
> Lucas