[PATCH v4] drm/ttm: Clarify that TTM_PL_SYSTEM is under TTM's control

Zack Rusin zackr at vmware.com
Mon Nov 22 15:05:56 UTC 2021


On Mon, 2021-11-22 at 15:15 +0100, Christian König wrote:
> Am 16.11.21 um 16:53 schrieb Zack Rusin:
> > > On Nov 16, 2021, at 03:20, Christian König
> > > <christian.koenig at amd.com> wrote:
> > > 
> > > Am 16.11.21 um 08:43 schrieb Thomas Hellström:
> > > > On 11/16/21 08:19, Christian König wrote:
> > > > > Am 13.11.21 um 12:26 schrieb Thomas Hellström:
> > > > > > Hi, Zack,
> > > > > > 
> > > > > > On 11/11/21 17:44, Zack Rusin wrote:
> > > > > > > On Wed, 2021-11-10 at 09:50 -0500, Zack Rusin wrote:
> > > > > > > > TTM takes full control over TTM_PL_SYSTEM placed
> > > > > > > > buffers. This makes driver-internal usage of
> > > > > > > > TTM_PL_SYSTEM prone to errors, because it requires the
> > > > > > > > drivers to manually handle all interactions between the
> > > > > > > > driver and TTM, which can swap out those buffers
> > > > > > > > whenever it thinks it's the right thing to do.
> > > > > > > > 
> > > > > > > > CPU buffers which need to be fenced and shared with
> > > > > > > > accelerators should be placed in driver-specific
> > > > > > > > placements that can explicitly handle CPU/accelerator
> > > > > > > > buffer fencing.
> > > > > > > > Currently, apart from things silently failing, nothing
> > > > > > > > is enforcing that requirement, which means that it's
> > > > > > > > easy for drivers and new developers to get this wrong.
> > > > > > > > To avoid the confusion we can document this requirement
> > > > > > > > and clarify the solution.
> > > > > > > > 
> > > > > > > > This came up during a discussion on dri-devel:
> > > > > > > > https://lore.kernel.org/dri-devel/232f45e9-8748-1243-09bf-56763e6668b3@amd.com
> > > > > > I took a slightly deeper look into this. I think we need
> > > > > > to formalize this a bit more to understand the pros and
> > > > > > cons and what the restrictions are really all about.
> > > > > > Anybody looking at the previous discussion will mostly see
> > > > > > arguments similar to "this is stupid and difficult" and
> > > > > > "it has always been this way", which are not really
> > > > > > constructive.
> > > > > > 
> > > > > > First, disregarding all the accounting stuff, I think this
> > > > > > all boils down to TTM_PL_SYSTEM having three distinct
> > > > > > states:
> > > > > > 1) POPULATED
> > > > > > 2) LIMBO (Or whatever we want to call it. No pages present)
> > > > > > 3) SWAPPED.
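> > > > > > 
> > > > > > For illustration, those states could be sketched as an enum
> > > > > > (hypothetical, for discussion only; TTM actually derives
> > > > > > this from bo->ttm and its page state rather than storing a
> > > > > > single field):
> > > > > > 
> > > > > >   /* Allocation states of a TTM_PL_SYSTEM buffer object */
> > > > > >   enum sys_bo_state {
> > > > > >       SYS_POPULATED, /* bo->ttm exists, pages allocated */
> > > > > >       SYS_LIMBO,     /* no pages present yet */
> > > > > >       SYS_SWAPPED,   /* pages swapped out to shmem */
> > > > > >   };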
> > > > > > 
> > > > > > The ttm_bo_move_memcpy() helper understands these states,
> > > > > > and any standalone driver implementation of the move()
> > > > > > callback _currently_ needs to understand them as well
> > > > > > (unless it simply wraps ttm_bo_move_memcpy()).
> > > > > > 
> > > > > > Now, using a bounce domain to proxy SYSTEM means that the
> > > > > > driver can forget about the SWAPPED state; it's
> > > > > > automatically handled by the move setup code. However,
> > > > > > another pitfall is LIMBO, in that when you move from
> > > > > > SYSTEM/LIMBO to your bounce domain, the BO will be
> > > > > > populated. So any naive accelerated move() implementation
> > > > > > creating a 1GB BO in fixed memory, like VRAM, will
> > > > > > needlessly allocate and free 1GB of system memory in the
> > > > > > process instead of just performing a clear operation. It
> > > > > > looks like amdgpu suffers from this?
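> > > > > > 
> > > > > > A driver could dodge that by special-casing an unpopulated
> > > > > > source in its move() callback, roughly like this (a
> > > > > > sketch; driver_clear_vram() is a made-up helper):
> > > > > > 
> > > > > >   /* SYSTEM source with no pages backing it: nothing to
> > > > > >    * copy, so clear the VRAM destination instead of
> > > > > >    * populating (and then freeing) 1GB of system memory. */
> > > > > >   if (old_mem->mem_type == TTM_PL_SYSTEM &&
> > > > > >       (!bo->ttm || !ttm_tt_is_populated(bo->ttm)))
> > > > > >       return driver_clear_vram(bo, new_mem);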
> > > > > > 
> > > > > > I think what is really needed is either
> > > > > > 
> > > > > > a) A TTM helper that helps move() callback implementations
> > > > > > resolve the issue of populating system memory from LIMBO
> > > > > > or SWAP, and then also formalize driver notification for
> > > > > > swapping. At a minimum, I think the swap_notify() callback
> > > > > > needs to be able to return a late error (see the sketch
> > > > > > after b) below).
> > > > > > 
> > > > > > b) Make LIMBO and SWAPPED distinct memory regions. (I think
> > > > > > I'd vote for this without looking into it in detail).
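> > > > > > 
> > > > > > For a), the minimal change would be letting swap_notify()
> > > > > > in struct ttm_device_funcs fail, i.e. something like this
> > > > > > (a sketch of the idea, not an actual patch):
> > > > > > 
> > > > > >   /* Returning an int lets the driver report a late error,
> > > > > >    * e.g. when unbinding from the GPU fails. */
> > > > > >   int (*swap_notify)(struct ttm_buffer_object *bo);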
> > > > > > 
> > > > > > In both these cases, we should really make SYSTEM bindable
> > > > > > by the GPU, otherwise we'd just be trading one pitfall for
> > > > > > another related one without really resolving the root
> > > > > > problem.
> > > > > > 
> > > > > > As for fencing not being supported by SYSTEM, I'm not sure
> > > > > > why we wouldn't want that, because not supporting it
> > > > > > would, for example, prohibit async ttm_move_memcpy(), and
> > > > > > also async unbinding of ttm_tt memory like MOBs on vmwgfx
> > > > > > (I think that's still sync).
> > > > > > 
> > > > > > There might be an accounting issue related to this as well,
> > > > > > but I guess Christian would need to chime in on this. If
> > > > > > so, I think it needs to be well understood and documented
> > > > > > (in TTM, not in AMD drivers).
> > > > > I think the problem goes deeper than what has been mentioned
> > > > > here so far.
> > > > > 
> > > > > Having fences attached to BOs in the system domain is
> > > > > probably ok, but the key point is that BOs in the system
> > > > > domain are under TTM's control and should not be touched by
> > > > > the driver.
> > > > > 
> > > > > What we have now is that TTM's internals, like the
> > > > > allocation state of BOs in system memory (the populated,
> > > > > limbo and swapped states you mentioned above), are leaking
> > > > > into the drivers, and I think exactly that is the part which
> > > > > doesn't work reliably here. You can of course get that
> > > > > working, but it requires knowledge of internal state which
> > > > > in my eyes was always illegal.
> > > > > 
> > > > Well, I tend to agree to some extent, but then, as said
> > > > above, even disregarding swap will cause trouble with the
> > > > limbo state, because the driver's move() callback would need
> > > > knowledge of it to implement limbo -> VRAM moves efficiently.
> > > Well, my long-term plan is to audit the code base once more
> > > and remove the limbo state from the SYSTEM domain.
> > > 
> > > E.g. instead of a SYSTEM BO without pages, you allocate a BO
> > > without a resource in general, which is now possible since
> > > bo->resource is a pointer.
> > > 
> > > This would still allow us to allocate "empty shell" BOs, but
> > > validating such a BO wouldn't cause a move; it would rather
> > > just allocate the resource for the first time.
> > > 
> > > The problem so far was just that we access bo->resource way
> > > too often without checking it.
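> > > 
> > > I.e. every access would need a guard along these lines
> > > (sketch):
> > > 
> > >   /* bo->resource may now be NULL for an "empty shell" BO */
> > >   if (!bo->resource)
> > >       return 0; /* or allocate the resource on first validation */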
> > So all in all this would be a two-step process: 1) eliminate
> > the “limbo” state, and 2) split PL_SYSTEM into PL_SYSTEM, which
> > drivers could use for really anything, and PL_SWAP, which would
> > be under complete TTM control, yes? If that’s the case, that
> > would certainly make my life a lot easier (because the drivers
> > wouldn’t need to introduce/manage their own system placements),
> > and I think it would make the code a lot easier to understand.
> 
> We also have a couple of other use cases for this, yes.
> 
> > That’s a bit of work though, so the question of what happens
> > until this lands comes to mind. Is introducing VMW_PL_SYSTEM and
> > documenting that PL_SYSTEM is under complete TTM control (like
> > this patch does) the way to go, or do we want to start working
> > on the above immediately? Because I’d love to be able to unload
> > vmwgfx without kernel oopsing =)
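> > 
> > For reference, such a driver-private placement sits on top of
> > TTM_PL_PRIV, along these lines (a sketch; the exact index is an
> > assumption):
> > 
> >   /* System placement fully under vmwgfx's, not TTM's, control */
> >   #define VMW_PL_SYSTEM (TTM_PL_PRIV + 2)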
> 
> I think documenting and getting into a clean state should be a
> prerequisite for new development (even if the fix for the unload
> is trivial).

That sounds great (both points). Could you double-check the wording
and ack it (ideally both the docs and the VMW_PL_SYSTEM patch) if you
think it's ready to go in?

z

