[Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage

Matthew Auld matthew.auld at intel.com
Thu Sep 1 12:52:49 UTC 2022


On 01/09/2022 09:00, Christian König wrote:
> On 31.08.22 at 18:32, Matthew Auld wrote:
>> On 31/08/2022 15:53, Matthew Auld wrote:
>>> On 31/08/2022 14:34, Christian König wrote:
>>>> On 31.08.22 at 14:50, Matthew Auld wrote:
>>>>> On 31/08/2022 13:35, Christian König wrote:
>>>>>> On 31.08.22 at 14:06, Matthew Auld wrote:
>>>>>>> On 31/08/2022 12:03, Christian König wrote:
>>>>>>>> On 31.08.22 at 12:37, Matthew Auld wrote:
>>>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> That hopefully just leaves i915_ttm_shrink(), which is 
>>>>>>>>>>> swapping out shmem ttm_tt and is calling ttm_bo_validate() 
>>>>>>>>>>> with empty placements to force the pipeline-gutting path, 
>>>>>>>>>>> which importantly unpopulates the ttm_tt for us (since 
>>>>>>>>>>> ttm_tt_unpopulate is not exported it seems). But AFAICT it 
>>>>>>>>>>> looks like that will now also nuke the bo->resource, instead 
>>>>>>>>>>> of just leaving it in system memory. My assumption is that 
>>>>>>>>>>> when later calling ttm_bo_validate(), it will just do the 
>>>>>>>>>>> bo_move_null() in i915_ttm_move(), instead of re-populating 
>>>>>>>>>>> the ttm_tt and then potentially copying it back to local-memory?
>>>>>>>>>>
>>>>>>>>>> Well you do ttm_bo_validate() with something like GTT domain, 
>>>>>>>>>> don't you? This should result in re-populating the tt object, 
>>>>>>>>>> but I'm not 100% sure if that really works as expected.
>>>>>>>>>
>>>>>>>>> AFAIK for domains we either have system memory (which uses 
>>>>>>>>> ttm_tt and might be shmem underneath) or local-memory. But 
>>>>>>>>> perhaps i915 is doing something wrong here, or abusing TTM in 
>>>>>>>>> some way. I'm not sure tbh.
>>>>>>>>>
>>>>>>>>> Anyway, I think we have two cases here:
>>>>>>>>>
>>>>>>>>> - We have some system memory only object. After doing 
>>>>>>>>> i915_ttm_shrink(), bo->resource is now NULL. We then call 
>>>>>>>>> ttm_bo_validate() at some later point. Here we don't need 
>>>>>>>>> to copy anything, but it also looks like 
>>>>>>>>> ttm_bo_handle_move_mem() won't populate the ttm_tt for us 
>>>>>>>>> either, since mem_type == TTM_PL_SYSTEM. It looks like 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> - We have a local-memory only object, which was evicted to 
>>>>>>>>> shmem, and then swapped out by the shrinker like above. The 
>>>>>>>>> bo->resource is NULL. However this time when calling 
>>>>>>>>> ttm_bo_validate() we need to actually do a copy in 
>>>>>>>>> i915_ttm_move(), as well as re-populate the ttm_tt. 
>>>>>>>>> i915_ttm_move() was taking care of this, but now we just call 
>>>>>>>>> ttm_bo_move_null().
>>>>>>>>>
>>>>>>>>> Perhaps i915 is doing something wrong in the above two cases?
>>>>>>>>
>>>>>>>> Mhm, as far as I can see that should still work.
>>>>>>>>
>>>>>>>> Previously you should have got a transition from SYSTEM->GTT in 
>>>>>>>> i915_ttm_move() to re-create your backing store. Now you get 
>>>>>>>> NULL->SYSTEM, which is handled by ttm_bo_move_null(), and then 
>>>>>>>> SYSTEM->GTT.
>>>>>>>
>>>>>>> What is GTT here in TTM world? Also I'm not seeing where there is 
>>>>>>> this SYSTEM->GTT transition? Maybe I'm blind. Just to be clear, 
>>>>>>> i915 is only calling ttm_bo_validate() once when acquiring the 
>>>>>>> pages, and we don't call it again, unless it was evicted (and 
>>>>>>> potentially swapped out).
>>>>>>
>>>>>> Well GTT means TTM_PL_TT.
>>>>>>
>>>>>> And calling it only once is perfectly fine, TTM will internally 
>>>>>> see that we need two hops to reach TTM_PL_TT and so does the 
>>>>>> NULL->SYSTEM transition and then SYSTEM->TT.
>>>>>
>>>>> Ah interesting, so that's what the multi-hop thing does. But AFAICT 
>>>>> i915 is not using either TTM_PL_TT or -EMULTIHOP.
>>>>
>>>> Mhm, it could be that we then have a problem and the i915 driver 
>>>> only sees NULL->TT directly. But I really don't know the i915 driver 
>>>> code well enough to judge that.
>>>>
>>>> Can you take a look at this and test it maybe?
>>>
>>> I'll grab a machine and try to see what is going on here.
>>
>> Well at least the issue with the firmware not loading looks to be 
>> fixed now.
>>
>> So running some eviction + oom tests, it looks like it now does:
>>
>> /* eviction kicks in */
>> i915_ttm_move(bo):  LMEM -> PL_SYSTEM
>>
>> /* shrinker/oom kicks in at some point */
>> i915_ttm_shrink(bo):
>>     bo->resource = NULL, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> /* user touches the same object later */
>> i915_ttm_move(bo):  NULL -> LMEM, bo_move_null()
>>
>> So it seems to incorrectly skip swapping it back in and then copying it 
>> over to lmem. It just allocates directly in lmem.
>>
>> And previously the last two steps would have been:
>>
>> i915_ttm_shrink(bo):
>>     bo->resource = PL_SYSTEM, /* pipeline_gutting */
>>     shmem ttm_tt is unpopulated and pages are correctly swapped out
>>
>> i915_ttm_move(bo):
>>     PL_SYSTEM -> LMEM,
>>     ttm_tt is repopulated and pages are copied over to lmem
> 
> Mhm, crap. That indeed looks like it won't work.
> 
> How about changing the code like this:
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> index c420d1ab605b..1ee7d81ddcbc 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
> @@ -560,7 +560,17 @@ int i915_ttm_move(struct ttm_buffer_object *bo, bool evict,
>          bool clear;
>          int ret;
> 
> -       if (GEM_WARN_ON(!obj) || !bo->resource) {
> +       if (GEM_WARN_ON(!obj)) {
> +               ttm_bo_move_null(bo, dst_mem);
> +               return 0;
> +       }
> +
> +       if (!bo->resource) {
> +               if (dst_mem->mem_type != TTM_PL_SYSTEM) {
> +                       hop->mem_type = TTM_PL_SYSTEM;
> +                       hop->flags = TTM_PL_FLAG_TEMPORARY;
> +                       return -EMULTIHOP;
> +               }
>                  ttm_bo_move_null(bo, dst_mem);
>                  return 0;
>          }
> 
> That should result in allocating a TTM_PL_SYSTEM resource first and then 
> moving from that to the final destination.

Ok, I played around with this some more. The final diff looks like:
https://patchwork.freedesktop.org/patch/500786/?series=108027&rev=1

It looks like we had some more places where bo->resource == NULL didn't 
end well. It at least now seems to survive my local testing.
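
For readers following along, the two-hop idea in the quoted diff can be 
modelled in plain user-space C. This is only a toy sketch, not kernel code: 
every toy_* name is hypothetical, TOY_NONE stands in for bo->resource == NULL, 
and the retry loop is an assumption about how a caller might react to 
-EMULTIHOP, not a copy of what TTM does internally.

```c
/* Toy user-space model, NOT kernel code. All toy_* names are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EMULTIHOP
#define EMULTIHOP 72 /* fallback for libcs that do not define it */
#endif

/* TOY_NONE models bo->resource == NULL after the shrinker has run. */
enum toy_place { TOY_NONE, TOY_SYSTEM, TOY_LMEM };

struct toy_bo {
	enum toy_place resource; /* current placement */
	bool tt_populated;       /* models the shmem ttm_tt being swapped back in */
	bool copied;             /* models the copy from system pages to lmem */
};

/* Models the move callback with the proposed -EMULTIHOP check. */
static int toy_move(struct toy_bo *bo, enum toy_place dst, enum toy_place *hop)
{
	assert(bo && hop);

	if (bo->resource == TOY_NONE) {
		if (dst != TOY_SYSTEM) {
			/* Ask the caller to bounce through system memory first. */
			*hop = TOY_SYSTEM;
			return -EMULTIHOP;
		}
		/* Models ttm_bo_move_null(): adopt the placement, no copy. */
		bo->resource = TOY_SYSTEM;
		return 0;
	}

	if (bo->resource == TOY_SYSTEM && dst == TOY_LMEM) {
		bo->tt_populated = true; /* swap the pages back in */
		bo->copied = true;       /* then copy them over to lmem */
	}
	bo->resource = dst;
	return 0;
}

/* Models a caller that retries through the requested hop. */
static int toy_validate(struct toy_bo *bo, enum toy_place dst)
{
	enum toy_place hop = TOY_NONE;
	int ret = toy_move(bo, dst, &hop);

	if (ret == -EMULTIHOP) {
		ret = toy_move(bo, hop, &hop); /* first hop: NULL -> SYSTEM */
		if (ret)
			return ret;
		ret = toy_move(bo, dst, &hop); /* second hop: SYSTEM -> dst */
	}
	return ret;
}
```

In this model, calling toy_validate() with TOY_LMEM on an object whose 
resource is TOY_NONE goes NULL -> SYSTEM -> LMEM and sets both tt_populated 
and copied, whereas a TOY_SYSTEM destination takes the single move-null step 
with no copy.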

> 
> Thanks,
> Christian.

