drm/sched: Replacement for drm_sched_resubmit_jobs() is deprecated

Christian König christian.koenig at amd.com
Thu May 4 11:07:29 UTC 2023


On 04.05.23 at 06:54, Matthew Brost wrote:
> On Wed, May 03, 2023 at 10:47:43AM +0200, Christian König wrote:
>> Adding Luben as well.
>>
>> On 03.05.23 at 10:16, Boris Brezillon wrote:
>>> [SNIP]
>>>> To sum up, we shouldn't call drm_sched_{start,stop,resubmit_jobs}().
>>> After the discussion I had with Matthew yesterday on IRC, I
>>> realized there was no clear agreement on this. Matthew uses those 3
>>> helpers in the Xe driver right now, and given he intends to use a
>>> multi-threaded wq for its 1:1 schedulers run queue, there's no way he
>>> can get away without calling drm_sched_{start,stop}().
>>> drm_sched_resubmit_jobs() can be open-coded in each driver, but I'm
>>> wondering if it wouldn't be preferable to add a ::resubmit_job() method
>>> or extend the ::run_job() one to support the resubmit semantics, which,
>>> AFAIU, is just about enforcing that the job's done fence (the one returned
>>> by ::run_job()) doesn't transition from a signaled to an unsignaled state.
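
A purely hypothetical sketch of such a hook, just to illustrate the idea
(this is not an existing drm_sched interface):

struct drm_sched_backend_ops {
	/* ... existing run_job/timedout_job/free_job hooks ... */

	/*
	 * Re-push a job that was already pushed to the hw once.  Must
	 * return the same, still-unsignaled hw fence; handing back a
	 * fence that has already signaled would violate the dma_fence
	 * rules discussed below.
	 */
	struct dma_fence *(*resubmit_job)(struct drm_sched_job *sched_job);
};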
>>>
>>> But probably more important than providing a generic helper, we should
>>> document the resubmit semantics (AKA, what should and/or shouldn't be
>>> done with pending jobs when a recovery happens). Because forbidding
>>> people from using a generic helper function doesn't guarantee that
>>> they'll do the right thing when coding their own logic, unless we give
>>> clues about what's considered right/wrong, and the current state of the
>>> doc is pretty unclear in this regard.
>> I should probably talk about the history of the re-submit feature a bit
>> more.
>>
>> Basically AMD came up with re-submission as a cheap way of increasing the
>> reliability of GPU resets. The problem is that it turned into an absolute
>> nightmare. We have tried for the last 5 years or so to get it stable and
>> it's still crashing.
>>
>> The first and biggest problem is that the kernel doesn't even have the
>> information whether re-submitting jobs is possible or not. For example, a
>> job which has already been pushed to the hw could have grabbed a binary
>> semaphore, and re-submitting it will just wait forever for the semaphore to
>> be released.
>>
>> The second problem is that the dma_fence semantics don't allow the state of
>> a fence to ever transition from signaled back to unsignaled. This means
>> that you can't re-use the hw fence and need to allocate a new one, but since
>> memory allocation is forbidden inside a reset handler as well (YES, we need
>> to better document that part) you actually need to keep a bunch of hw fences
>> pre-allocated to make this work. Amdgpu chose to illegally re-use the hw
>> fence instead, which only works with quite extreme hacks.
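
A minimal sketch of that pre-allocation variant (the my_* names are made
up; only kzalloc() and dma_fence_init() are the real APIs):

/* Allocate the hw fence on the submission path, where GFP_KERNEL is
 * still allowed, so that the reset path never has to allocate. */
static int my_job_prepare(struct my_job *job)
{
	job->hw_fence = kzalloc(sizeof(*job->hw_fence), GFP_KERNEL);
	if (!job->hw_fence)
		return -ENOMEM;

	dma_fence_init(&job->hw_fence->base, &my_hw_fence_ops,
		       &job->ring->fence_lock, job->ring->fence_context,
		       ++job->ring->next_seqno);
	return 0;
}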
>>
>> The third problem is that the lifetime of the job object was actually
>> very well defined before we tried to use re-submission. Basically it's just
>> an intermediate state used between the IOCTL and pushing things to the hw;
>> introducing this re-submit feature completely messed that up and caused
>> quite a number of use-after-free errors in the past, which were again only
>> solved by quite some hacks.
>>
>> What we should do in the GPU scheduler instead is the following:
>>
>> 1. Don't support re-submission at all!
>>      Instead we can provide help to drivers to query which fences (scheduler
>> or hw) are not signaled yet (see the sketch below).
>>      This can then be used to re-create hw state if (and only if!) the driver
>> knows what it's doing and can actually guarantee that this will work.
>>      E.g. the case for XE, where the kernel driver knows the contexts which
>> were not running at the time and can re-create their queues.
>>
>> 2. We can provide both a wq to use for single-threaded applications as well
>> as start/stop semantics.
>>      It's just that the start/stop semantics should never touch what was
>> already submitted, but rather just make sure that we don't get any new
>> submissions.
>>
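Roughly what the query helper in 1. could look like -- a sketch, not an
existing drm_sched function (pending_list, s_fence->parent and
dma_fence_is_signaled() are real):

/* Walk the pending list of a *stopped* scheduler and hand every job
 * whose hw fence has not signaled yet back to the driver. */
static void my_sched_for_each_unfinished_job(struct drm_gpu_scheduler *sched,
					     void (*fn)(struct drm_sched_job *))
{
	struct drm_sched_job *job;

	list_for_each_entry(job, &sched->pending_list, list) {
		struct dma_fence *hw_fence = job->s_fence->parent;

		if (!hw_fence || !dma_fence_is_signaled(hw_fence))
			fn(job);
	}
}
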
> I pretty much agree with everything Christian has said here, and Xe
> aligns with it. Let me explain what Xe does.
>
> 1. Entity hang (TDR timeout of a job on an entity, firmware notifies Xe that
> an entity hung, entity IOMMU CAT error, etc.):
> 	- No re-submission at all
> 	- ban the entity
> 	- notify the UMD
> 	- clean up all pending jobs / fences
> 2. Entire GPU hang (worth mentioning that with good HW + KMD this *should*
> never happen):
> 	- stop all schedulers (same as for an entity in Xe, because 1 to 1)
> 	- clean up any odd entity state related to communication with the
> 	  firmware
> 	- check if an entity has a job that started but did not finish; if
> 	  so, ban it (same mechanism as above)
> 	- resubmit all jobs from good entities

As long as you can do this without creating new dma_fence objects for 
the hw submissions, everything should be fine.

> 	- start all schedulers (same as for an entity in Xe, because 1 to 1)
>
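
As pseudo-code, that GPU hang flow is roughly the following (simplified;
drm_sched_stop() and drm_sched_start() are the real helpers, everything
else is a made-up placeholder -- the real thing is in xe_guc_submit.c [1]):

	for_each_gpu_sched(sched, gt)			/* placeholder */
		drm_sched_stop(sched, NULL);

	reset_firmware_communication(gt);		/* placeholder */

	for_each_gpu_sched(sched, gt) {
		if (sched_has_started_unfinished_job(sched))
			ban_entity(sched);	/* same mechanism as in 1. */
		else
			resubmit_pending_jobs(sched);	/* no new dma_fences! */

		drm_sched_start(sched, true);
	}
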
> The implementation of this is in the following file [1]. Search for the
> drm scheduler functions and you should be able to find the implementation
> easily.
>
> If you want to use an ordered work queue to avoid the stop / start dance,
> great, do that; in Xe the stop / start dance works. I have extensively
> tested this and the flow is rock solid, so please trust me when I say
> this, as I worked on reset flows in the i915 and fought bugs for years. I
> still don't think it is solid in the i915, because we try to do
> resubmission + crazy races. Because of that I was incredibly paranoid when
> I implemented this + tested it extensively.
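
For completeness, the ordered-wq option is essentially just the snippet
below; alloc_ordered_workqueue() is the real API, while how the wq gets
handed to drm_sched_init() depends on the series in [2]:

	/* Work items on an ordered wq execute strictly one after the
	 * other, so the reset/timeout handler can be serialized against
	 * submission by queuing it on the same wq, instead of stopping
	 * and starting the scheduler. */
	struct workqueue_struct *submit_wq =
		alloc_ordered_workqueue("my-submit", 0);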
>
> I'll post an updated version of my DRM scheduler series [2] on the list
> soon for the WQ changes, plus whatever else is required for Xe.
>
> So my take from this discussion is that maybe, with a little documentation,
> we are good. Thoughts?

For XE, what you describe above basically sounds perfectly fine to me.

I strongly assume that when you re-submit things you just tell your hw
scheduler to pick up at the place it left off? E.g. set the read and write
pointers of a ring buffer appropriately?
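
I.e. something like this (register and helper names are made up):

static void my_ring_resume(struct my_ring *ring)
{
	/* The ring contents are untouched after the reset; just point
	 * the hw back at the first entry that didn't complete yet and
	 * let it run. */
	my_ring_write_reg(ring, MY_RING_RPTR, ring->last_completed);
	my_ring_write_reg(ring, MY_RING_WPTR, ring->wptr);
}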

Christian.

>
> Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c
> [2] https://patchwork.freedesktop.org/series/116055/
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>
>>> Boris


