[Intel-gfx] [PATCH 3/4] drm/i915/uc: Inject load errors into intel_uc_init_hw
Michal Wajdeczko
michal.wajdeczko at intel.com
Tue Jul 30 10:56:11 UTC 2019
On Tue, 30 Jul 2019 11:27:51 +0200, Chris Wilson
<chris at chris-wilson.co.uk> wrote:
> Quoting Chris Wilson (2019-07-30 09:47:37)
>> Quoting Michal Wajdeczko (2019-07-29 16:23:00)
>> > Inject load errors into intel_uc_init_hw to make sure we
>> > correctly handle uC initialization failures.
>> >
>> > To avoid complains from CI about inserted errors or warnings,
>> > use helper macro that checks if there was an error injection.
>> >
>> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko at intel.com>
>> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio at intel.com>
>> > Cc: Chris Wilson <chris at chris-wilson.co.uk>
>> > ---
>> > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 8 ++++++++
>> > drivers/gpu/drm/i915/i915_drv.h | 7 ++++++-
>> > drivers/gpu/drm/i915/i915_gem.c | 2 +-
>> > 3 files changed, 15 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> > index fafa9be1e12a..9e1156c29cb1 100644
>> > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
>> > @@ -400,6 +400,14 @@ int intel_uc_init_hw(struct intel_uc *uc)
>> > if (!intel_uc_is_using_guc(uc))
>> > return 0;
>> >
>> > + ret = i915_inject_load_error(i915, -EIO);
>> > + if (ret)
>> > + return ret;
>> > +
>> > + ret = i915_inject_load_error(i915, -ENXIO);
>> > + if (ret)
>> > + return ret;
>> > +
>> > GEM_BUG_ON(!intel_uc_fw_supported(&guc->fw));
>> >
>> > guc_reset_interrupts(guc);
>> > diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> > index 6b059d51aaff..36f7a146f06a 100644
>> > --- a/drivers/gpu/drm/i915/i915_drv.h
>> > +++ b/drivers/gpu/drm/i915/i915_drv.h
>> > @@ -137,9 +137,14 @@ bool i915_error_injected(void);
>> >
>> > #define i915_inject_probe_failure(i915)
>> i915_inject_load_error((i915), -ENODEV)
>> >
>> > -#define i915_probe_error(i915, fmt,
>> ...) \
>> > +#define I915_ERROR(i915, fmt, ...) \
>> > __i915_printk(i915, i915_error_injected() ? KERN_DEBUG :
>> KERN_ERR, \
>> > fmt, ##__VA_ARGS__)
>> > +#define I915_WARN(i915, fmt, ...) \
>> > + __i915_printk(i915, i915_error_injected() ? KERN_DEBUG :
>> KERN_WARNING, \
>> > + fmt, ##__VA_ARGS__)
>>
>> I didn't see I915_WARN be used in this series. Is it likely? We either
>> abort the module load, in which it is an error, or we are quite happy to
>> continue in which case I'd vote for a "normal but significant condition"
>> i.e. KERN_NOTICE.
>
> Spotted the I915_WARN, I guess that's reasonable.
>
> However, I do dislike I915_ERROR / I915_WARN on principle. Especially as
> they have the coupling to the probe-failure magic and I don't feel like
> it is a good idea to have that spread too far.
But what else can we do in i915 alone to
a) have messages that show origin of the problem as soon as possible,
b) ignore these fake errors if caused by injected failures ?
otherwise we risk that all new injected errors will be immediately
reported as regressions and we stop at CI.BAT
There are other options but we need some buy-in from the CI team:
- we can use existing messages to temporary ignore errors between
"injecting" and "failed"
i915 0000:00:02.0: Injecting failure %d at checkpoint %u [file:line]
i915 0000:00:02.0: Device initialization failed (%d)
- we can use existing messages to temporary ignore errors between
"injecting" and new "completed"
i915 0000:00:02.0: Injecting failure %d at checkpoint %u [file:line]
i915 0000:00:02.0: Checkpoint %u failure recovery completed
btw, I'm also thinking about introducing 'recoverable' failures, which
could be used not only to make sure that we don't break the driver while
unwinding, but also to make sure that driver was finally able handle such
failures in expected way:
i915_inject_load_error(i915, -ENOMEM); // probe must fail
i915_inject_recoverable_error(i915, -EIO, WEDGED); // probe ok but gpu
wedged
i915_inject_recoverable_error(i915, -ENOEXEC, NORMAL); // probe must be ok
More information about the Intel-gfx
mailing list