[PATCH] drm/atomic: Perform blocking commits on workqueue

Fri Oct 13 14:04:02 UTC 2023

Hi

On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel at ffwll.ch> wrote:
> > I mean we're not talking about scientific computing, or code
> > compilation, or seti at home. We're talking about nearly the equivalent
> > of `while (1) __asm__ ("nop");`
>
> I don't think anyone said this shouldn't be fixed or improved.

Yea but we apparently disagree that it would be an improvement to stop
discrediting user space for driver problems.
You see, to me, there are two problems 1) The behavior itself is not
nice and should be fixed 2) The blame/accounting/attribution for a
driver problem is getting directed to user space. I think both issues
should be fixed. One brought the other issue to light, but both
problems, in my mind, are legitimate and should be addressed. You
think fixing the second problem is some tactic to paper over the first
problem. Christian thinks the second problem isn't a problem at all
but the correct design.  So none of us are completely aligned and I
don't see anyone changing their mind anytime soon.

> What I'm saying is that the atomic ioctl is not going to make guarantees
> that it will not take up to much cpu time (for some extremely vague value
> of "too much") to the point that userspace can configure it's compositor
> in a way that it _will_ get killed if we _ever_ violate this rule.
yea I find that strange, all kernel tasks have a certain implied
baseline responsibility to be good citizens to the system.
And how user space is configured seems nearly immaterial.

But we're talking in circles.

> Fixing the worst offenders I don't think will see any pushback at all.
Yea we all agree fixing this one busy loop is a problem but we don't
agree on where the cpu time blame should go.

> > But *this* feels like duct tape: You've already said there's no
> > guarantee the problem won't also happen during preliminary computation
> > during non-blocking commits or via other drm entry points. So it
> > really does seem like a fix that won't age well. I won't be surprised
> > if in ~3 years (or whatever) in some RHEL release there's a customer
> > bug leading to the real-time thread getting blocklisted for obscure
> > server display hardware because it's causing the session to tank on a
> > production machine.
>
> Yes the atomic ioctl makes no guarantees across drivers and hw platforms
> of guaranteed "we will never violate, you can kill your compositor if you
> do" cpu bound limits. We'll try to not suck too badly, and we'll try to
> focus on fixing the suck where it upsets the people the most.
>
> But there is fundamentally no hard real time guarantee in atomic. At least
> not with the current uapi.

So in your mind mutter should get out of the real-time thread business entirely?

> The problem isn't about wasting cpu time. It's about you having set up the
> system in a way so that mutter gets killed if we ever waste too much cpu
> time, ever.

mutter is not set up in a way to kill itself if the driver wastes too
much cpu time,
ever. mutter is set up in a way to kill itself if it, itself, wastes
too much cpu time, ever.
The fact that the kernel is killing it when a kernel driver is wasting
cpu time is the
bug that led to the patch in the first place!

> The latter is flat out not a guarantee the kernel currently makes, and I
> see no practical feasible way to make that happen. And pretending we do
> make this guarantee will just result in frustrated users filling bugs that
> they desktop session died and developers closing them as "too hard to
> fix".

I think all three of us agree busy loops are not nice (though maybe we
aren't completely aligned on severity). And I'll certainly concede
that fixing all the busy loops can be hard. Some of the cpu-bound code
paths may even be doing legitimate computation.  I still assert that
if the uapi calls into driver code that might potentially be doing
something slow where it can't sleep, it should be doing it on a
workqueue or thread. That seems orthogonal to fixing all the places
where the drivers aren't acting nicely.

> Maybe as a bit more useful outcome of this entire discussion: Could you
> sign up to write a documentation patch for the atomic ioctl that adds a
> paragraph stating extremely clearly that this ioctl does not make hard
> real time guarantees, but only best effort soft realtime, and even then
> priority of the effort is focused entirely on the !ALLOW_MODESET case,
> because that is the one users care about the most.

I don't think I'm the best positioned to write such documentation. I
still think what the kernel is doing is wrong here and I don't even
fully grok what you mean by best effort.

--Ray