[PATCH] drm/atomic: Perform blocking commits on workqueue

Thu Oct 5 21:04:55 UTC 2023

Hi,

On Thu, Oct 5, 2023 at 5:57 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> So imo the trouble with this is that we suddenly start to make
> realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi
> change, because even limited to the case of !ALLOW_MODESET we do best
> effort guarantees at best.
So you're saying there's a best effort to not use up process CPU time,
but Christian was trying to argue nearly the opposite, that any needed
CPU time for the operation should get accounted toward the process.

What you're saying makes more sense to me and it tracks with what I'm
getting skimming over the code. I see it potentially sleeps during
blocking drmModeAtomicCommit() calls in several places:

- copy_from_user when copying properties
- fence and vblank event allocation
- waiting on modeset locks
- And even plane changes:

for_each_new_plane_in_state(state, plane, new_plane_state, i) {•
...
→       /*•
→        * If waiting for fences pre-swap (ie: nonblock), userspace can•
→        * still interrupt the operation. Instead of blocking until the•
→        * timer expires, make the wait interruptible.•
→        */•
→       ret = dma_fence_wait(new_plane_state->fence, pre_swap);•
...
}•

(Ignore the typo where it says "nonblock" instead of "!nonblock",
the point is while waiting on plane state changes it sleeps.)

So, in general, the ioctl sleeps when it needs to. It's not trying
to be some non-premptible RT task with a dedicated cpu budget that it
strictly adheres to. That makes sense to me.

> And that state recomputation has to happen synchronously, because
> it always influences the ioctl errno return value.

Not sure I'm following. The task doesn't have to stay in RUNNING the
entire time the ioctl is getting carried out. It can get preempted,
and rescheduled later, all before returning from the ioctl and
delivering the errno back to userspace. What am I missing?

The problem isn't that the ioctl blocks, the problem is that when it
blocks it shouldn't be leaving the task state in RUNNING.

> My take is that you're papering over a performance problem here of the
> "the driver is too slow/wastes too much cpu time". We should fix the
> driver, if that's possible.

I think everyone agrees the amdgpu DC code doing several 100ms busy
loops in a row with no break in between is buggy behavior, and getting
that fixed is important.

Also, there's no dispute that the impetus for my proposed change was
that misbehavior in the driver code.

Neither of those facts mean we shouldn't improve
drm_atomic_helper_commit too. Making the change is a good idea. It was
even independently proposed a year ago, for reasons unrelated to the
current situation. It makes the nonblock and the
!nonblock code paths behave closer to the same. it makes the userspace
experience better in the face of driver bugs. You said best effort
earlier, this change helps provide a best effort.

Realistically, if it was done this way from the start, no one would be
batting an eye, right? It just improves things all around. I think
maybe you're worried about a slippery slope?

> We can also try to change the atomic uapi to give some hard real-time
> guarantees so that running compositors as SCHED_RT is possible, but that
> - means a very serious stream of bugs to fix all over
> - therefore needs some very wide buy-in from drivers that they're willing
>   to make this guarantee
> - probably needs some really carefully carved out limitations, because
>   there's imo flat-out no way we'll make all atomic ioctl hard time limit
>   bound

We don't need a guarantee that reprogramming ast BMC framebuffer
offsets or displayport link training (or whatever) takes less than
200ms.  If a driver has to sit and wait 32ms for vblank before
twiddling things in hardware to keep crud from showing up on screen or
something, that's fine.  But in none of these cases should the
userspace process be kept in RUNNING while it does it!

I take your point that there are a lot of drivers that may be doing
slow things, but surely they should let the process nap while they do
their work?

> Also, as König has pointed out, you can roll this duct-tape out in
> userspace by making the commit non-blocking and immediately waiting for
> the fences.

Sure, userspace can do that (even, it turns out, in the all crtc
disabled case which was an open question earlier in the thread).

That's not really an argument against fixing the !nonblock case
though.

> One thing I didn't see mention is that there's a very subtle uapi
> difference between non-blocking and blocking:
> - non-blocking is not allowed to get ahead of the previous commit, and
>   will return EBUSY in that case. See the comment in
>   drm_atomic_helper_commit()
> - blocking otoh will just block until any previous pending commit has
>   finished
>
> Not taking that into account in your patch here breaks uapi because
> userspace will suddenly get EBUSY when they don't expect that.

I don't think that's right, drm_atomic_helper_setup_commit has several
chunks of code like this:

→       →       if (nonblock && old_conn_state->commit &&•
→       →
!try_wait_for_completion(&old_conn_state->commit->flip_done)) {•
...
→       →       →       return -EBUSY;•
→       →       }•

So it only returns EBUSY for DRM_MODE_ATOMIC_NONBLOCK cases.

There's also this code earlier in the process:

→       list_for_each_entry(commit, &crtc->commit_list, commit_entry) {•
...
→       →       →       completed =
try_wait_for_completion(&commit->flip_done);•
...
→       →       →       if (!completed && nonblock) {•
...
→       →       →       →       return -EBUSY;•
→       →       →       }•
...
→       }•
...
→       ret = wait_for_completion_interruptible_timeout(&stall_commit->cleanup_done,
→       →       →       →       →       →       →       10*HZ);•
...

So it looks like it sleeps (not leaving the system in RUNNING state!) if
there's an outstanding commit.

--Ray