[Intel-gfx] [PATCH] drm/i915: Fix a VMA UAF for multi-gt platform

Tue Jun 6 09:06:18 UTC 2023

On 6/5/2023 10:27 PM, Andi Shyti wrote:
> Hi Nirmoy,
>
> On Mon, Jun 05, 2023 at 10:10:21PM +0200, Nirmoy Das wrote:
>> Ensure correct handling of closed VMAs on multi-gt platforms to prevent
>> Use-After-Free. Currently, when GT0 goes idle, closed VMAs that are
>> exclusively added to GT0's closed_vma link (gt->closed_vma) and
>> subsequently freed by i915_vma_parked(), which assumes the entire GPU is
>> idle. However, on platforms with multiple GTs, such as MTL, GT1 may
>> remain active while GT0 is idle. This causes GT0 to mistakenly consider
>> the closed VMAs in its closed_vma list as unnecessary, potentially
>> leading to Use-After-Free issues if a job for GT1 attempts to access a
>> freed VMA.
>>
>> Although we do take a wakeref for GT0 but it happens later, after
>> evaluating VMAs. To mitigate this, it is necessary to hold a GT0 wakeref
>> early.
> hooray! this is great, Nirmoy! I will give it a shot.
>
>> Cc: Joonas Lahtinen <joonas.lahtinen at linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi at intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin at linux.intel.com>
>> Cc: Thomas Hellström <thomas.hellstrom at linux.intel.com>
>> Cc: Chris Wilson <chris.p.wilson at intel.com>
>> Cc: Andi Shyti <andi.shyti at linux.intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda at intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das at intel.com>
>> ---
>>   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> index 5fb459ea4294..adcf8837dfe6 100644
>> --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
>> @@ -2692,6 +2692,7 @@ static int
>>   eb_select_engine(struct i915_execbuffer *eb)
>>   {
>>   	struct intel_context *ce, *child;
>> +	struct intel_gt *gt;
>>   	unsigned int idx;
>>   	int err;
>>   
>> @@ -2715,10 +2716,16 @@ eb_select_engine(struct i915_execbuffer *eb)
>>   		}
>>   	}
>>   	eb->num_batches = ce->parallel.number_children + 1;
>> +	gt = ce->engine->gt;
>>   
>>   	for_each_child(ce, child)
>>   		intel_context_get(child);
>>   	intel_gt_pm_get(ce->engine->gt);
>> +	/* Keep GT0 active on MTL so that i915_vma_parked() doesn't
>> +	 * free VMAs while execbuf ioctl is validating VMAs.
>> +	 */
>> +	if (gt != to_gt(gt->i915))
> you can use gt->info.id
>
>> +		intel_gt_pm_get(to_gt(ce->engine->gt->i915));
>>   
>>   	if (!test_bit(CONTEXT_ALLOC_BIT, &ce->flags)) {
>>   		err = intel_context_alloc_state(ce);
>> @@ -2757,6 +2764,9 @@ eb_select_engine(struct i915_execbuffer *eb)
>>   	return err;
>>   
>>   err:
>> +	if (ce->engine->gt != to_gt(ce->engine->gt->i915))
> 	if (gt->info.id)

Thanks, was wondering if there is a better way to detect that.

>
> gt is already ce->engine->gt
>
>> +		intel_gt_pm_get(to_gt(ce->engine->gt->i915));
>> +
>>   	intel_gt_pm_put(ce->engine->gt);
>>   	for_each_child(ce, child)
>>   		intel_context_put(child);
>> @@ -2770,6 +2780,8 @@ eb_put_engine(struct i915_execbuffer *eb)
>>   	struct intel_context *child;
>>   
>>   	i915_vm_put(eb->context->vm);
>> +	if (eb->gt != to_gt(eb->gt->i915))
>> +		intel_gt_pm_put(to_gt(eb->gt->i915));
> this wakeref going up and down is a bit ugly... Perhaps we can
> add some flag about the GT type in the info structure.

Chris pointed out that there is a patch[1] in internal branch which 
fixes a regression by moving the

intel_context_enter() before vma lookup. That should have same effect as 
this one.

But for now I didn't mange to make it work on drm-tip(causing some 
crash) yet.

[1] drm/i915/gem: Push wait-for-execbuf outside of ww_mutex

>   MTL is a
> weird multi-gt platform and, indeed, you can't shut down GT0
> without affecting GT1.
>
> For now it's OK, though, as to test it.

Looking forward to that. I did test it extensively and ChromeOS team as 
well.

Regards,

Nirmoy

>
> Andi
>
>>   	intel_gt_pm_put(eb->gt);
>>   	for_each_child(eb->context, child)
>>   		intel_context_put(child);
>> -- 
>> 2.39.0