[Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage
Christian König
christian.koenig at amd.com
Thu Sep 1 08:00:06 UTC 2022
On 31.08.22 at 18:32, Matthew Auld wrote:
> On 31/08/2022 15:53, Matthew Auld wrote:
>> On 31/08/2022 14:34, Christian König wrote:
>>> On 31.08.22 at 14:50, Matthew Auld wrote:
>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>> On 31.08.22 at 14:06, Matthew Auld wrote:
>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>> On 31.08.22 at 12:37, Matthew Auld wrote:
>>>>>>>> [SNIP]
>>>>>>>>>>
>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is
>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate()
>>>>>>>>>> with empty placements to force the pipeline-gutting path,
>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since
>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it
>>>>>>>>>> looks like that will now also nuke the bo->resource, instead
>>>>>>>>>> of just leaving it in system memory. My assumption is that
>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the
>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating
>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>
>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain,
>>>>>>>>> don't you? This should result in re-populating the tt object,
>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>
>>>>>>>> AFAIK for domains we either have system memory (which uses
>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But
>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in
>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>
>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>
>>>>>>>> - We have some system memory only object. After doing
>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call
>>>>>>>> ttm_bo_validate() at some later point, but here we don't need
>>>>>>>> to copy anything, but it also looks like
>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt for us
>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> - We have a local-memory only object, which was evicted to
>>>>>>>> shmem, and then swapped out by the shrinker like above. The
>>>>>>>> bo->resource is NULL. However this time when calling
>>>>>>>> ttm_bo_validate() we need to actually do a copy in
>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt.
>>>>>>>> i915_ttm_move() was taking care of this, but now we just call
>>>>>>>> ttm_bo_move_null().
>>>>>>>>
>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>
>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>
>>>>>>> See, previously you should have got a transition from SYSTEM->GTT in
>>>>>>> i915_ttm_move() to re-create your backing store. Now you get
>>>>>>> NULL->SYSTEM which is handled by ttm_bo_move_null() and then
>>>>>>> SYSTEM->GTT.
>>>>>>
>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is
>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear,
>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the
>>>>>> pages, and we don't call it again, unless it was evicted (and
>>>>>> potentially swapped out).
>>>>>
>>>>> Well GTT means TTM_PL_TT.
>>>>>
>>>>> And calling it only once is perfectly fine, TTM will internally
>>>>> see that we need two hops to reach TTM_PL_TT and so does the
>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>
>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT
>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>
>>> Mhm, it could be that we then have a problem and the i915 driver
>>> only sees NULL->TT directly. But I really don't know the i915 driver
>>> code well enough to judge that.
>>>
>>> Can you take a look at this and test it maybe?
>>
>> I'll grab a machine and try to see what is going on here.
>
> Well at least the issue with the firmware not loading looks to be
> fixed now.
>
> So running some eviction + oom tests it looks like it now does:
>
> /* eviction kicks in */
> i915_ttm_move(bo): LMEM -> PL_SYSTEM
>
> /* shrinker/oom kicks in at some point */
> i915_ttm_shrink(bo):
> bo->resource = NULL, /* pipeline_gutting */
> shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> /* user touches the same object later */
> i915_ttm_move(bo): NULL -> LMEM, bo_move_null()
>
> So it seems to incorrectly skip swapping it back in and then copying
> over to lmem. It just allocates directly in lmem.
>
> And previously the last two steps would have been:
>
> i915_ttm_shrink(bo):
> bo->resource = PL_SYSTEM, /* pipeline_gutting */
> shmem ttm_tt is unpopulated and pages are correctly swapped out
>
> i915_ttm_move(bo):
> PL_SYSTEM -> LMEM,
> ttm_tt is repopulated and pages are copied over to lmem

Mhm, crap. That indeed looks like it won't work.

How about changing the code like this:

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
index c420d1ab605b..1ee7d81ddcbc 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
@@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
 	bool clear;
 	int ret;
 
-	if (GEM_WARN_ON(!obj) || !bo->resource) {
+	if (GEM_WARN_ON(!obj)) {
+		ttm_bo_move_null(bo, dst_mem);
+		return 0;
+	}
+
+	if (!bo->resource) {
+		if (dst_mem->mem_type != TTM_PL_SYSTEM) {
+			hop->mem_type = TTM_PL_SYSTEM;
+			hop->flags = TTM_PL_FLAG_TEMPORARY;
+			return -EMULTIHOP;
+		}
 		ttm_bo_move_null(bo, dst_mem);
 		return 0;
 	}

That should result in allocating a TTM_PL_SYSTEM resource first and then
moving from that to the final destination.
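
To make the intended sequence concrete, here is a small userspace toy model of the NULL -> SYSTEM -> LMEM hop. It is only a sketch under assumptions: every toy_* name, the struct fields and TOY_EMULTIHOP are made up for illustration and are not the TTM or i915 API, and the retry loop in toy_validate() is a simplified stand-in for what TTM does internally when a move callback returns -EMULTIHOP.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; NOT the kernel's TTM_PL_* or errno values. */
enum toy_pl { TOY_PL_NONE, TOY_PL_SYSTEM, TOY_PL_LMEM };
#define TOY_EMULTIHOP 72

struct toy_bo {
	enum toy_pl resource;	/* TOY_PL_NONE models bo->resource == NULL */
	bool tt_populated;	/* ttm_tt pages are present in system memory */
	bool in_swap;		/* contents currently live in shmem swap */
	bool data_in_lmem;	/* contents have been copied into lmem */
	bool hop_fix;		/* true: request a SYSTEM hop when resource is NULL */
};

/* Toy move callback, loosely following the shape of the proposed change. */
static int toy_move(struct toy_bo *bo, enum toy_pl dst, enum toy_pl *hop)
{
	if (bo->resource == TOY_PL_NONE) {
		if (bo->hop_fix && dst != TOY_PL_SYSTEM) {
			*hop = TOY_PL_SYSTEM;
			return -TOY_EMULTIHOP;
		}
		bo->resource = dst;	/* "move_null": adopt dst, copy nothing */
		return 0;
	}

	if (bo->resource == TOY_PL_LMEM && dst == TOY_PL_SYSTEM) {
		/* Eviction: contents are copied into the populated tt. */
		bo->tt_populated = true;
		bo->data_in_lmem = false;
	} else if (bo->resource == TOY_PL_SYSTEM && dst == TOY_PL_LMEM) {
		/* Swap the pages back in if needed, then copy to lmem. */
		if (bo->in_swap) {
			bo->tt_populated = true;
			bo->in_swap = false;
		}
		bo->data_in_lmem = bo->tt_populated;
	}
	bo->resource = dst;
	return 0;
}

/* Toy shrinker: unpopulate the tt, swap out, drop the resource. */
static void toy_shrink(struct toy_bo *bo)
{
	assert(bo->resource == TOY_PL_SYSTEM);
	bo->tt_populated = false;
	bo->in_swap = true;
	bo->resource = TOY_PL_NONE;
}

/* Toy validate: on -TOY_EMULTIHOP, move to the hop first, then retry. */
static int toy_validate(struct toy_bo *bo, enum toy_pl dst)
{
	enum toy_pl hop = TOY_PL_NONE;
	int ret = toy_move(bo, dst, &hop);

	if (ret == -TOY_EMULTIHOP) {
		ret = toy_move(bo, hop, &hop);
		if (ret)
			return ret;
		ret = toy_move(bo, dst, &hop);
	}
	return ret;
}
```

In this model, evicting an lmem object, shrinking it and validating it back to lmem ends with the contents in lmem when hop_fix is set. With hop_fix cleared, the same sequence ends with the contents still sitting in swap, which matches the behaviour Matthew reported.
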
Thanks,
Christian.