<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Am 09.10.24 um 16:17 schrieb Thomas Hellström:<br>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<pre class="moz-quote-pre" wrap="">On Wed, 2024-10-09 at 15:39 +0200, Thomas Hellström wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Mon, 2024-10-07 at 11:08 +0200, Christian König wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Hi Thomas,
I'm on sick leave, but I will try to answer questions and share
some
thoughts on this to unblock you.
Am 18.09.24 um 14:57 schrieb Thomas Hellström:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Sima, Christian
I've updated the shrinker series now with a guarded for_each
macro
instead:
<a class="moz-txt-link-freetext" href="https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9">https://patchwork.freedesktop.org/patch/614514/?series=131815&rev=9</a>
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Yeah that looks like a big step in the right direction.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">(Note I forgot to remove the export of the previous LRU walker).
so the midlayer argument is now not an issue anymore. The
cleanup.h
guard provides some additional protection against drivers exiting
the
LRU loop early.
So remaining is the question whether the driver is allowed to
discard a
suggested bo to shrink from TTM.
Arguments for:
1) Not allowing that would require teaching TTM about purgeable
objects.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I think that is actually not correct. TTM already knows about
purgeable
objects.
E.g. when TTM asks the driver what to do with evicted objects it is
perfectly valid to return an empty placement list indicating that
the
backing store can just be thrown away.
We use this for things like temporary buffers for example.
That this doesn't apply to swapping looks like a design bug in the
driver/TTM interface to me.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Yes we can do that with TTM, but for shrinking there's no point in
trying to shrink when there is no swap-space left, other than
purgeable
object. The number of shrinkable objects returned in shrinker::count
needs to reflect that, and *then* only those objects should be
selected
for shrinking. If we were to announce that to TTM using a callback,
we're actually back to version 1 of this series which was rejected by
you exactly since it was using callbacks a year or so ago????</pre>
</blockquote>
</blockquote>
<br>
Yeah that indeed doesn't sound like a good idea.<br>
<br>
On the other hand the decision that only purgeable objects should be
selected when there is no swap space left sounds like something TTM
should do and not the driver.<br>
<br>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">2) Devices who need the blitter during shrinking would want to
punt
runtime_pm_get() to kswapd to avoid sleeping direct reclaim.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I think the outcome of the discussion is that runtime PM should
never
be
mixed into TTM swapping.
You can power up blocks to enable a HW blitter for swapping, but
this
then can't be driven by the runtime PM framework.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Still that power-on might be sleeping, so what's the difference using
runtime-pm or not? Why should the driver implement yet another power
interface, just because TTM refuses to publish a sane LRU walk
interface?</pre>
</blockquote>
</blockquote>
<br>
See the discussion with Sima around the PM functions.<br>
<br>
My understanding might be wrong, but I now think that with local
memory you can't do the i915 approach where the PM functions are so
low level that they can also be called inside the shrinker.<br>
<br>
So you basically have PM functions for your whole device and PM
functions for only the HW blocks necessary for the shrinker.<br>
<br>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">3) If those devices end up blitting (LNL) to be able to shrink,
they
would want to punt waiting for the fence to signal to kswapd to
avoid
waiting in direct reclaim.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Mhm, what do you mean with that?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
When we fired the blitter from direct reclaim, we get a fence. If we
wait for it in direct reclaim we will be sleeping waiting for gpu,
which is bad form. We'd like return a failure so the object is
retried
when idle, or from kswapd.</pre>
</blockquote>
</blockquote>
<br>
Oh, that is a really good point. So basically you want to avoid
dependencies on the dma_fence.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">4) It looks like we need to resort to folio_trylock in the shmem
backup
backend when shrinking is called for gfp_t = GFP_NOFS. A failing
trylock will require a new bo.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Why would a folio trylock succeed with one BO and not another?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Good point. We'd fail anyway but would perhaps need to call
SHRINK_STOP..
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
And why would that trylock something the device specific driver
should
handle?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
It happens in the TTM shrinker helper called from the driver in the
spirit of demidlayering.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Arguments against:
None really. I thought the idea of demidlayering would be to
allow
the
driver more freedom.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Well that is a huge argument against it. Giving drivers more
freedom
is
absolutely not something which turned out to be valuable in the
past.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
So then what's the point of demidlayering?</pre>
</blockquote>
</blockquote>
<br>
So that drivers can prepare the environment for TTM to work with
instead of TTM asking for it.<br>
<br>
In your case for example that means powering up HW blocks so that
BOs could be moved.<br>
<br>
<span style="white-space: pre-wrap">
</span>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
Instead we should put device drivers in a very strict corset of
validated approaches to do things.
Background is that in my experience driver developers are perfectly
willing to do unclean approaches which only work in like 99% of all
cases just to get a bit more performance or simpler driver
implementation.
Those approaches are not legal and in my opinion as subsystem
maintainer
I think we need to be more strict and push back much harder on
stuff
like that.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Still, historically that has made developers abandon common
components
for driver-specific solutions.
And the question is still not answered.
Exactly *why* can't the driver fail and continue traversing the LRU,
because all our argumentation revolves around this and you have yet
to
provide an objective reason why.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
And in the end, while I think there definitely things worthy of
discussion in this series,
I don't think there is a point in trying to land a LNL shrinker
operating against a crippled TTM LRU walk interface, nor do I think
anyone would want to attempt to port i915 over, and reworking it three
times I'm now pretty familiar with what works and what doesn't.
So question becomes, with proper reviews can I merge the series here as
is, *with* the de-midlayered LRU walk and noting your advise against
it, or not?</pre>
</blockquote>
<br>
More or less yes, I still have a bad feeling about it but we
probably need to see the whole thing getting used to judge if it
really works or not.<br>
<br>
I mean it's not UAPI we are talking about, so even if we find in
5years from now that it was a bad idea we can still roll it back and
try something else.<br>
<br>
So yeah, feel free to go ahead.<br>
<br>
Regards,<br>
Christian.<br>
<br>
<br>
<blockquote type="cite" cite="mid:ff1d2e900d66664bd2ee6d29955ed7a858c0e44d.camel@linux.intel.com">
<pre class="moz-quote-pre" wrap="">
Thanks,
Thomas
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
/Thomas
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
Regards,
Christian.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
So any feedback appreciated. If that is found acceptable we can
proceed
with reviewing this patch and also with the shrinker series.
Thanks,
Thomas
On Mon, 2024-09-02 at 13:07 +0200, Daniel Vetter wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Wed, Aug 28, 2024 at 02:20:34PM +0200, Christian König
wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Am 27.08.24 um 19:53 schrieb Daniel Vetter:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Tue, Aug 27, 2024 at 06:52:13PM +0200, Daniel Vetter
wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Thu, Aug 22, 2024 at 03:19:29PM +0200, Christian König
wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Completely agree that this is complicated, but I still
don't
see the need
for it.
Drivers just need to use pm_runtime_get_if_in_use()
inside
the shrinker and
postpone all hw activity until resume.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Not good enough, at least long term I think. Also
postponing hw
activity
to resume doesn't solve the deadlock issue, if you still
need
to grab ttm
locks on resume.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Pondered this specific aspect some more, and I think you
still
have a race
here (even if you avoid the deadlock): If the condiditional
rpm_get call
fails there's no guarantee that the device will
suspend/resume
and clean
up the GART mapping.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">Well I think we have a major disconnect here. When the device
is
powered
down there is no GART mapping to clean up any more.
In other words GART is a table in local memory (VRAM) when
the
device is
powered down this table is completely destroyed. Any BO which
was
mapped
inside this table is now not mapped any more.
So when the shrinker wants to evict a BO which is marked as
mapped
to GART
and the device is powered down we just skip the GART
unmapping
part
because
that has already implicitly happened during power down.
Before mapping any BO into the GART again we power the GPU up
through the
runtime PM calls. And while powering it up again the GART is
restored.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">My point is that you can't tell whether the device will power
down or
not,
you can only tell whether there's a chance it might be powering
down
and
so you can't get at the rpm reference without deadlock issues.
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">The race gets a bit smaller if you use
pm_runtime_get_if_active(), but even then you might catch
it
right when
resume almost finished.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">What race are you talking about?
The worst thing which could happen is that we restore a GART
entry
which
isn't needed any more, but that is pretty much irrelevant
since
we
only
clear them to avoid some hw bugs.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">The race I'm seeing is where you thought the GART entry is not
issue,
tossed an object, but the device didn't suspend, so might still
use
it.
I guess if we're clearly separating the sw allocation of the
TTM_TT
with
the physical entries in the GART that should all work, but
feels
a
bit
tricky. The race I've seen is essentially these two getting out
of
sync.
So maybe it was me who's stuck.
What I wonder is whether it works in practice, since on the
restore
side
you need to get some locks to figure out which gart mappings
exist
and
need restoring. And that's the same locks as the shrinker needs
to
figure
out whether it might need to reap a gart mapping.
Or do you just copy the gart entries over and restore them
exactly
as-is,
so that there's no shared locks?
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">That means we'll have ttm bo hanging around with GART
allocations/mappings
which aren't actually valid anymore (since they might
escape
the
cleanup
upon resume due to the race). That doesn't feel like a
solid
design
either.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">I'm most likely missing something, but I'm really scratching
my
head where
you see a problem here.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">I guess one issue is that at least traditionally, igfx drivers
have
nested
runtime pm within dma_resv lock. And dgpu drivers the other way
round.
Which is a bit awkward if you're trying for common code.
Cheers, Sima
</pre>
</blockquote>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
</pre>
</blockquote>
<br>
</body>
</html>