[RFC] Host1x/TegraDRM UAPI (sync points)

Thu Jul 2 12:10:57 UTC 2020

On 7/1/20 3:22 AM, Dmitry Osipenko wrote:
> 30.06.2020 13:26, Mikko Perttunen пишет:
>> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>>
>>> Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync
>>> point, correct? Or are they integrated with Host1x HW?
>>>
>>
>> They can access syncpoints directly. (That's what I alluded to in the
>> "Introduction to the hardware" section :) all those things have hardware
>> access to syncpoints)
> 
> Should we CC all the Nouveau developers then, or is it a bit too early? :)

I think we have a few other issues still to resolve before that :)

> 
>>>
>>> .. rest ..
>>>
>>
>> Let me try to summarize once more for my own understanding:
>>
>> * When submitting a job, you would allocate new syncpoints for the job
> 
> - Yes
> 
>> * After submitting the job, those syncpoints are not usable anymore
> 
> - Yes
> 
> Although, thinking a bit more about it, this needs to be relaxed.
> 
> It should be a userspace agreement/policy how to utilize sync points.
> 
> For example, if we know that userspace will have multiple application
> instances all using Tegra DRM UAPI, like a mesa or VDPAU drivers, then
> this userspace should consider to return sync points into the pool for
> sharing them with others. While something like an Opentegra Xorg driver,
> which usually has a single instance, could keep sync points pre-allocated.
> 
> The job's sync point counter will be reset to 0 by the kernel driver
> during of the submission process for each job, so we won't have the sync
> point recovery problem.
> 
>> * Postfences of that job would keep references to those syncpoints so
>> they aren't freed and cleared before the fences have been released
> 
> - No
> 
> I suggested that fence shouldn't refcount the sync point and *only* have
> a reference to it, this reference will be invalidated once fence is
> signaled by sync point reaching the threshold or once sync point is
> released.
> 
> The sync point will have a reference to every active fence (waiting for
> the signal) that is using this sync point until the threshold is reached.
> 
> So fence could detach itself from the sync point + sync point could
> detach all the fences from itself.
> 
> There will be more about this below, please see example with a dead
> process in the end of this email.
> 
>> * Once postfences have been released, syncpoints would be returned to
>> the pool and reset to zero
> 
> - No
> 
> I'm suggesting that sync point should be returned to the pool once its
> usage refcount reaches 0. This means that both userspace that created
> this sync point + the executed job will both keep the sync point alive
> until it is closed by userspace + job is completed.
> 
>> The advantage of this would be that at any point in time, there would be
>> a 1:1 correspondence between allocated syncpoints and jobs; so you could
>>   shuffle the jobs around channels or reorder them.
> 
> - Yes
> 
>> Please correct if I got that wrong :)
>>
>> ---
>>
>> I have two concerns:
>>
>> * A lot of churn on syncpoints - any time you submit a job you might not
>> get a syncpoint for an indefinite time. If we allocate syncpoints
>> up-front at least you know beforehand, and then you have the syncpoint
>> as long as you need it.
> 
> If you'll have a lot of active application instances all allocating sync
> points, then inevitably the sync points pool will be exhausted.
> 
> But my proposal doesn't differ from yours in this regards, correct?
> 
> And maybe there is a nice solution, please see more below!
> 
>> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
>> until waits on it have completed, is more work than just having the
>> ID/threshold. This is probably mainly a problem for downstream, where
>> updating code for this would be difficult. I know that's not a proper
>> argument but I hope we can reach something that works for both worlds.
> 
> You could have ID/threshold! :)
> 
> But, you can't use the *job's* ID/threshold because you won't know them
> until kernel driver scheduler will *complete(!)* the job's execution!
> The job may be re-pushed multiple times by the scheduler to recovered
> channel if a previous jobs hang!
> 
> Now, you could allocate *two* sync points:
> 
>    1. For the job itself (job's sync point).
> 
>    2. For the userspace to wait (user's sync point).
> 
> The job will have to increment both these sync points (example of
> multiple sync points usage) and you know the user's sync point ID/threshold!
> 
> If job times out, you *could* increment the user's sync point on CPU
> from userspace!
> 
> The counter of the user's sync point won't be touched by the kernel
> driver if job hangs!

Ok, so we would have two kinds of syncpoints for the job; one for kernel 
job tracking; and one that userspace can manipulate as it wants to.

Could we handle the job tracking syncpoint completely inside the kernel, 
i.e. allocate it in kernel during job submission, and add an increment 
for it at the end of the job (with condition OP_DONE)? For MLOCKing, the 
kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + 
MLOCK_RELEASE sequence at the end of each job.

> 
>> Here's a proposal in between:
>>
>> * Keep syncpoint allocation and submission in jobs as in my original
>> proposal
> 
> Yes, we could keep it.
> 
> But, as I suggested in my other email, we may want to extend the
> allocation IOCTL for the multi-syncpoint allocation support.
> 
> Secondly, if we'll want to have the multi-syncpoint support for the job,
> then we may want improve the SUBMIT IOCTL like this:
> 
> struct drm_tegra_channel_submit {
>          __u32 num_usr_syncpt_incrs;
>          __u64 usr_sync_points_ptr;
> 
>          __u32 num_job_syncpt_incrs;
>          __u32 job_syncpt_handle;
> };
> 
> If job doesn't need to increment user's sync points, then there is no
> need to copy them from userspace, hence num_usr_syncpt_incrs should be
> 0. I.e. one less copy_from_user() operation.
> 
>> * Don't attempt to recover user channel contexts. What this means:
>>    * If we have a hardware channel per context (MLOCKing), just tear down
>> the channel
> 
> !!!
> 
> Hmm, we actually should be able to have a one sync point per-channel for
> the job submission, similarly to what the current driver does!
> 
> I'm keep forgetting about the waitbase existence!

Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs 
anyway, can't we just recalculate wait thresholds at that time?

Maybe a more detailed sequence list or diagram of what happens during 
submission and recovery would be useful.

> 
> Please read more below.
> 
>>    * Otherwise, we can just remove (either by patching or by full
>> teardown/resubmit of the channel) all jobs submitted by the user channel
>> context that submitted the hanging job. Jobs of other contexts would be
>> undisturbed (though potentially delayed, which could be taken into
>> account and timeouts adjusted)
> 
> The DRM scheduler itself has an assumption/requirement that when channel
> hangs, it must be fully reset. The hanged job will be killed by the
> scheduler (maybe dependent jobs will be killed too, but I don't remember
> details right now) and then scheduler will re-submit jobs to the
> recovered channel [1].
> 
> [1]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/scheduler.c#L206
> 
> Hence, if we could assign a sync point per-channel, then during of the
> channel's recovery, the channel's sync point will be reset as well! Only
> the waitbases of the re-submitted jobs will differ!
> 
> It also means that userspace won't need to allocate sync point for each job!
> 
> So far it sounds great! I'll try to think more thoroughly about this.
> 
>> * If this happens, we can set removed jobs' post-fences to error status
>> and user will have to resubmit them.
>> * We should be able to keep the syncpoint refcounting based on fences.
> 
> The fence doesn't need the sync point itself, it only needs to get a
> signal when the threshold is reached or when sync point is ceased.
> 
> Imagine:
> 
>    - Process A creates sync point
>    - Process A creates dma-fence from this sync point
>    - Process A exports dma-fence to process B
>    - Process A dies
> 
> What should happen to process B?
> 
>    - Should dma-fence of the process B get a error signal when process A
> dies?
>    - Should process B get stuck waiting endlessly for the dma-fence?
> 
> This is one example of why I'm proposing that fence shouldn't be coupled
> tightly to a sync point.

As a baseline, we should consider process B to get stuck endlessly 
(until a timeout of its choosing) for the fence. In this case it is 
avoidable, but if the ID/threshold pair is exported out of the fence and 
is waited for otherwise, it is unavoidable. I.e. once the ID/threshold 
are exported out of a fence, the waiter can only see the fence being 
signaled by the threshold being reached, not by the syncpoint getting freed.

> 
>> This can be made more fine-grained by not caring about the user channel
>> context, but tearing down all jobs with the same syncpoint. I think the
>> result would be that we can get either what you described (or how I
>> understood it in the summary in the beginning of the message), or a more
>> traditional syncpoint-per-userctx workflow, depending on how the
>> userspace decides to allocate syncpoints.
>>
>> If needed, the kernel can still do e.g. reordering (you mentioned job
>> priorities) at syncpoint granularity, which, if the userspace followed
>> the model you described, would be the same thing as job granularity.
>>
>> (Maybe it would be more difficult with current drm_scheduler, sorry,
>> haven't had the time yet to read up on that. Dealing with clearing work
>> stuff up before summer vacation)
> 
> Please take yours time! You definitely will need take a closer look at
> the DRM scheduler.
>