[PATCH] drm/atomic: Perform blocking commits on workqueue

Thu Sep 28 06:56:18 UTC 2023

Hi Ray,

Am 27.09.23 um 22:25 schrieb Ray Strode:
> Hi,
>
> On Wed, Sep 27, 2023 at 4:05 AM Christian König
> <christian.koenig at amd.com> wrote:
>> I'm not an expert for that stuff, but as far as I know the whole purpose
>> of the blocking functionality is to make sure that the CPU overhead
>> caused by the commit is accounted to the right process.
> I'm not an expert either, but that's clearly wrong.
>
> The point of blocking functionality in drmAtomicCommit is for the
> ioctl to block until the operation is completed, so userspace knows
> that they can commit again after it returns without getting EBUSY, and
> they can be sure, e.g., the mode is set or the displays are disabled
> or whatever.
>
> To say the "whole point" is about CPU overhead accounting sounds
> rather absurd to me. Is that really what you meant?

Yes, absolutely. See the functionality you try to implement already exists.

After making a non blocking commit userspace can still wait on the 
commit to finish by looking at the out fence.

This just happens asynchronously so you can use (for example) 
select()/poll() etc on multiple events.

> You could try to make the argument that one of the points of the
> blocking call is about CPU accounting (instead of the whole point),
> maybe that's what you meant? That seems likely wrong to me too. I mean
> just check the docs:
>
>         RLIMIT_RTTIME (since Linux 2.6.25)
>                This is a limit (in microseconds) on the amount of CPU
> time that a process scheduled under a real‐time scheduling policy may
> consume without making a blocking system call.  For  the  purpose  of
>                this limit, each time a process makes a blocking system
> call, the count of its consumed CPU time is reset to zero.
> drmAtomicCommit() is making a blocking system call so the limit should
> be reset, in my opinion.

Exactly that's the point why I'm rejecting this. As far as I can see 
this opinion does not match what RLIMIT_RTTIME is intended to do.

A blocking system call in the sense of RLIMIT_RTTIME means something 
which results in the process listening for external events, e.g. calling 
select(), epoll() or read() on (for example) a network socket etc...

As far as I can see drmAtomicCommit() is *not* meant with that what 
similar to for example yield() also doesn't reset the RLIMIT_RTTIME counter.

> Furthermore, a lot happens under the covers as part of drmAtomicCommit.
>
> Are you telling me that in all the drivers handling drmAtomicCommit,
> none of the work from those drivers gets deferred outside of process
> context if DRM_MODE_ATOMIC_NONBLOCK isn't set? I find that very hard
> to believe. But it would have to be true, if one of the main points of
> the blocking call is about CPU accounting, right? You can't say the
> point is about CPU accounting if some of the work is handled in a
> helper thread or work queue or whatever.
>
> ╎❯ git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run'
> | wc -l
> 182
>
> There seems to be a lot of non-process context execution going on in the tree...
>
>> So what you are suggesting here is to actually break that functionality
>> and that doesn't seem to make sense.
> What's the use case here that could break?

Well you are breaking the RLIMIT_RTTIME functionality.

The purpose of that functionality is to allow debugging and monitoring 
applications to make sure that they keep alive and can react to external 
signals.

 From the RLIMIT_RTTIME documentation: "The intended use of this limit 
is to stop a runaway real-time process from locking up the system."

And when drmAtomicCommit() is triggering this then we either have a 
problem with the application doing something it is not supposed to do 
(like blocking for vblank while it should listen to network traffic) or 
the driver is somehow buggy.

>   Walk me through a
> real-world, sensible example where a program depends on
> drmAtomicCommit() blocking and continuing to eat into process cpu time
> while it blocks. I just want to see where you are coming from. Maybe
> I'm being unimaginative but I just don't see it. I do see actual
> main-stream display servers breaking today because the current
> behavior defies expectations.
>
>> When it's really not desirable to account the CPU overhead to the
>> process initiating it then you probably rather want to use an non
>> blocking commit plus a dma_fence to wait for the work to end from userspace.
> Well, first I don't think that's very convenient. You're talking about
> a per-plane property, so there would need to be a separate file
> descriptor allocated for every plane, right? and user-space would have
> to block on all of them before proceeding? Also, are you sure that
> works in all cases? The problematic case we're facing right now is
> when all planes and all crtcs are getting disabled. I'm just skimming
> over the code here, but I see this:
>
> →       for_each_new_crtc_in_state(state, crtc, crtc_state, i) {•
> ...
> →       →       if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {•
> ...
> →       →       →       e = create_vblank_event(crtc, arg->user_data);•
> ...
> →       →       →       crtc_state->event = e;•
> →       →       }•
> ...
> →       →       if (fence_ptr) {•
> ...
> →       →       →       fence = drm_crtc_create_fence(crtc);•
> ...
> →       →       →       ret = setup_out_fence(&f[(*num_fences)++], fence);•
> ...
> →       →       →       crtc_state->event->base.fence = fence;•
> →       →       }•
>
> Is the code really going to allocate a vblank_event when all the
> crtc's are disabled ? I guess it's possible, but that's
> counterintuitive to me. I really don't know. You're confident a set of
> fences will actually work?

No when you disable everything there is of course no fence allocated.

But then you also should never see anything waiting for to long to 
actually be able to trigger the RLIMIT_RTTIME.

> Regardless, this seems like a roundabout way to address a problem that
> we could just ... fix.

Well to make it clear: This is not a problem but intended behavior!

 From what I know RLIMIT_RTTIME usually works in units of multiple 
seconds. Milliseconds are possible as well, but when an application sets 
a low millisecond timeout and then blocks for a vblank which can easily 
take 30ms to complete that means you have a bug inside your application.

With this commit you are completely papering over that.

Regards,
Christian.

>
> --Ray