[Intel-gfx] [PATCH 1/3] drm/i915: audit bo->resource usage

Christian König christian.koenig at amd.com
Wed Aug 31 09:38:37 UTC 2022


Am 31.08.22 um 11:26 schrieb Matthew Auld:
> On 31/08/2022 09:16, Christian König wrote:
>> Hi Matthew,
>>
>> Am 30.08.22 um 12:45 schrieb Matthew Auld:
>>> Hi,
>>>
>>> On 30/08/2022 08:33, Christian König wrote:
>>>> Hi guys,
>>>>
>>>> can we get an rb/acked-by for this i915 change?
>>>>
>>>> Basically we are just making sure that the driver doesn't crash 
>>>> when bo->resource is NULL and a bo doesn't have any backing store 
>>>> assigned to it.
>>>>
>>>> The Intel CI seems to be happy with this change, so I'm pretty sure 
>>>> it is correct.
>>>
>>> It looks like DG2/DG1 (which happen to use TTM here) are no longer 
>>> loading the module:
>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fintel-gfx-ci.01.org%2Ftree%2Fdrm-tip%2FPatchwork_107680v1%2Findex.html&data=05%7C01%7Cchristian.koenig%40amd.com%7Ccaca567b3279492450fd08da8b32e598%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637975347967354305%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=apanfOjzSWD2vduINzr2j6F2DiIBC93hLRnnGJcGQ5o%3D&reserved=0? 
>>>
>>>
>>> According to the logs the firmware is failing to load, so perhaps 
>>> related to I915_BO_ALLOC_CPU_CLEAR, since that is one of the rare 
>>> users. See below.
>>>
>>>>
>>>> Thanks,
>>>> Christian.
>>>>
>>>> Am 24.08.22 um 16:23 schrieb Luben Tuikov:
>>>>> From: Christian König <christian.koenig at amd.com>
>>>>>
>>>>> Make sure we can at least move and alloc TT objects without 
>>>>> backing store.
>>>>>
>>>>> Signed-off-by: Christian König <christian.koenig at amd.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm.c      | 6 ++----
>>>>>   drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 2 +-
>>>>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> index bc9c432edffe03..45ce2d1f754cc4 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm.c
>>>>> @@ -271,8 +271,6 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>   {
>>>>>       struct drm_i915_private *i915 = container_of(bo->bdev, 
>>>>> typeof(*i915),
>>>>>                                bdev);
>>>>> -    struct ttm_resource_manager *man =
>>>>> -        ttm_manager_type(bo->bdev, bo->resource->mem_type);
>>>>>       struct drm_i915_gem_object *obj = i915_ttm_to_gem(bo);
>>>>>       unsigned long ccs_pages = 0;
>>>>>       enum ttm_caching caching;
>>>>> @@ -286,8 +284,8 @@ static struct ttm_tt 
>>>>> *i915_ttm_tt_create(struct ttm_buffer_object *bo,
>>>>>       if (!i915_tt)
>>>>>           return NULL;
>>>>> -    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR &&
>>>>> -        man->use_tt)
>>>>> +    if (obj->flags & I915_BO_ALLOC_CPU_CLEAR && bo->resource &&
>>>>> +        ttm_manager_type(bo->bdev, bo->resource->mem_type)->use_tt)
>>>>>           page_flags |= TTM_TT_FLAG_ZERO_ALLOC;
>>>
>>> AFAICT i915 was massively relying on everything starting out in a 
>>> "dummy" system memory resource (ttm_tt), where it then later 
>>> "transitions" to the real resource. And if we need to clear the 
>>> memory we rely on ZERO_ALLOC being set before calling into the 
>>> i915_ttm_move() callback (even when allocating local-memory).
>>>
>>> For ttm_bo_type_device objects (userspace stuff) it looks like this 
>>> was previously handled by ttm_bo_validate() always doing:
>>>
>>>   ret = ttm_tt_create(bo, true); /* clear = true */
>>>
>>> Which we would always hit since the resource was always "compatible" 
>>> for the dummy case. But it looks like this is no longer even called, 
>>> since we can now call into ttm_move with bo->resource == NULL, which 
>>> still calls tt_create eventually, which not always with clear = true.
>>>
>>> All other objects are then ttm_bo_type_kernel so we don't care about 
>>> clearing, except in the rare case of ALLOC_CPU_CLEAR, which was 
>>> handled as per above in i915_ttm_tt_create(). But I think here 
>>> bo->resource is NULL at the start when first creating the object, 
>>> which will skip setting ZERO_ALLOC, which might explain the CI failure.
>>>
>>> The other possible concern (not sure since CI didn't get that far) 
>>> is around ttm_bo_pipeline_gutting(), which now leaves bo->resource = 
>>> NULL. It looks like i915_ttm_shrink() was relying on that to 
>>> unpopulate the ttm_tt. When later calling ttm_bo_validate(), 
>>> i915_ttm_move() would see the SWAPPED flag set on the ttm_tt , 
>>> re-populate it and then potentially move it back to local-memory.
>>>
>>> What are your thoughts here? Also sorry if i915 is making a bit of 
>>> mess here.
>>
>> First of all thanks a lot for taking a look. We previously got 
>> reports about strange crashes with this patch, but couldn't really 
>> reproduce them (even not by sending them out again). This explains 
>> that a bit.
>>
>> The simplest solution would be to just change the && into a ||, e.g. 
>> when previously either no resource is allocated or the resource 
>> requires to use a tt we clear it.
>>
>> That should give you the same behavior as before, but I agree that 
>> this is a bit messy.
>
> Yeah, that should do the trick.
>
> That hopefully just leaves i915_ttm_shrink(), which is swapping out 
> shmem ttm_tt and is calling ttm_bo_validate() with empty placements to 
> force the pipeline-gutting path, which importantly unpopulates the 
> ttm_tt for us (since ttm_tt_unpopulate is not exported it seems). But 
> AFAICT it looks like that will now also nuke the bo->resource, instead 
> of just leaving it in system memory. My assumption is that when later 
> calling ttm_bo_validate(), it will just do the bo_move_null() in 
> i915_ttm_move(), instead of re-populating the ttm_tt and then 
> potentially copying it back to local-memory?

Well you do ttm_bo_validate() with something like GTT domain, don't you? 
This should result in re-populating the tt object, but I'm not 100% sure 
if that really works as expected.

Thanks,
Christian.

>
>>
>> I've been considering to replacing the ttm_bo_type with a bunch of 
>> behavior flags for a bo. I'm hoping that this will clean things up a 
>> bit.
>>
>> Regards,
>> Christian.
>>
>>>
>>>>>       caching = i915_ttm_select_tt_caching(obj);
>>>>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c 
>>>>> b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> index 9a7e50534b84bb..c420d1ab605b6f 100644
>>>>> --- a/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c
>>>>> @@ -560,7 +560,7 @@ int i915_ttm_move(struct ttm_buffer_object 
>>>>> *bo, bool evict,
>>>>>       bool clear;
>>>>>       int ret;
>>>>> -    if (GEM_WARN_ON(!obj)) {
>>>>> +    if (GEM_WARN_ON(!obj) || !bo->resource) {
>>>>>           ttm_bo_move_null(bo, dst_mem);
>>>>>           return 0;
>>>>>       }
>>>>
>>



More information about the Intel-gfx mailing list