[Intel-gfx] [PATCH v2 1/1] drm/i915/reset: Fix error_state_read ptr + offset use
Teres Alexis, Alan Previn
alan.previn.teres.alexis at intel.com
Thu Mar 10 01:45:11 UTC 2022
On 3/9/2022 5:18 PM, John Harrison wrote:
> On 3/8/2022 11:47, Teres Alexis, Alan Previn wrote:
>> On 3/1/2022 1:37 PM, John Harrison wrote:
>>> On 2/25/2022 22:27, Alan Previn wrote:
>>>> ...
>>>> This fixes a kernel page fault can happen when
>>>> multiple tests are running concurrently in a loop
>>>> and one is producing engine resets and consuming
>>>> the i915 error_state dump while the other is
>>>> forcing full GT resets. (takes a while to trigger).
>>> Does need a fixes tag given that it is fixing a bug in an earlier
>>> patch. Especially when it is a kernel BUG.
>>> E.g.:
>>> 13:23> dim fixes 0e39037b31655
>>> Fixes: 0e39037b3165 ("drm/i915: Cache the error string")
>>>
>> okay, will add that.
>>
>> diff --git a/drivers/gpu/drm/i915/i915_sysfs.c
>> b/drivers/gpu/drm/i915/i915_sysfs.c
>>>> index a4d1759375b9..c40e51298066 100644
>>>> --- a/drivers/gpu/drm/i915/i915_sysfs.c
>>>> +++ b/drivers/gpu/drm/i915/i915_sysfs.c
>>>> @@ -432,7 +432,7 @@ static ssize_t error_state_read(struct file
>>>> *filp, struct kobject *kobj,
>>>> struct device *kdev = kobj_to_dev(kobj);
>>>> struct drm_i915_private *i915 = kdev_minor_to_i915(kdev);
>>>> struct i915_gpu_coredump *gpu;
>>>> - ssize_t ret;
>>>> + ssize_t ret = 0;
>>>> gpu = i915_first_error_state(i915);
>>>> if (IS_ERR(gpu)) {
>>>> @@ -444,8 +444,10 @@ static ssize_t error_state_read(struct file
>>>> *filp, struct kobject *kobj,
>>>> const char *str = "No error state collected\n";
>>>> size_t len = strlen(str);
>>>> - ret = min_t(size_t, count, len - off);
>>>> - memcpy(buf, str + off, ret);
>>>> + if (off < len) {
>>>> + ret = min_t(size_t, count, len - off);
>>>> + memcpy(buf, str + off, ret);
>>>> + }
>>> So the problem is that the error dump disappeared while it was being
>>> read? That seems like it cause more problems than just this
>>> out-of-range access. E.g. what if the dump was freed and a new one
>>> created? The entity doing the partial reads would end up with half
>>> of one dump and half of the next.
>>>
>>> I think we should at least add a FIXME comment to the code that fast
>>> recycling dumps could cause inconsistent sysfs reads.
>>>
>>> I guess you could do something like add a unique id to the gpu
>>> coredump structure. Then, when a partial sysfs read occurs starting
>>> at zero (i.e. a new read), take a note of the id somewhere (e.g.
>>> inside the i915 structure). When the next non-zero read comes in,
>>> compare the save id with the current coredump's id and return an
>>> error if there is a mis-match.
>>>
>>> Not sure if this would be viewed as an important enough problem to
>>> be worth fixing. I'd be happy with just a FIXME comment for now.
>> For now I shall add a FIXME additional checks might impact IGT's that
>> currently dump and check the error state. I would assume with that
>> additional check in kernel, we would need to return a specific error
>> value like -ENODATA or -ENOLINK and handle it accordingly in the igt.
> If an an extra check against returning invalid data affects an
> existing IGT then that IGT is already broken. The check would to
> prevent userland reading the first half of one capture and then
> getting the second half of a completely different one. If that is
> already happening then the returned data is meaningless gibberish and
> even if the IGT says it is passing, it really isn't.
>
>
>>>
>>> I would also change the test to be 'if (off)' rather than 'if (off <
>>> len)'. Technically, the user could have read the first 10 bytes of a
>>> coredump and then gets "tate collected\n" as the remainder, for
>>> example. If we ensure that 'off' is zero then we know that this is a
>>> fresh read from scratch.
>>>
>>> John.
>>>
>> I'm a little confused, did u mean: in the case we dont have a
>> gpu-state to report, and then the user offset is zero (i.e. "if
>> (!off)" ) then we copy the string vs if we dont have a gpu-state to
>> report and the user-offset is non-zero, then we return an error (like
>> the -ENOLINK?). If thats what you meant, then it does mean we are
>> assuming that (in the case we dont have a gpu-state) the user will
>> never come in with a first-time-read where the user's buffer size of
>> less than the string length and be expected continue to read the rest
>> of the string-length. This i guess is okay since even 6 chars are
>> enough to say "No err". :)
> Sorry, yes. That was meant to say 'if (!off)'.
>
> Hmm, I guess technically the user could be issuing single byte reads.
> User's can be evil.
>
> Okay. Maybe just add a FIXME about needing to check for
> changed/missing/new error state since last read if the offset is
> non-zero and otherwise leave it as is.
>
> John.
>
Sounds good - will do. Apologies on the tardiness with the responses.
>
>>>> }
>>>> return ret;
>>>
>
More information about the Intel-gfx
mailing list