[Intel-gfx] [PATCH 07/15] drm/i915: Defer default hardware context initialisation until first open

Daniel Vetter daniel at ffwll.ch
Wed Jun 24 03:15:07 PDT 2015


On Fri, Jun 19, 2015 at 10:19:04AM +0100, Dave Gordon wrote:
> On 17/06/15 13:18, Daniel Vetter wrote:
> > On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
> >> In order to fully initialise the default contexts, we have to execute
> >> batchbuffer commands on the GPU engines. But in the case of GuC-based
> >> batch submission, we can't do that until any required firmware has
> >> been loaded, which may not be possible during driver load, because the
> >> filesystem(s) containing the firmware may not be mounted until later.
> >>
> >> Therefore, we now allow the first call to the firmware-loading code to
> >> return -EAGAIN to indicate that it's not yet ready, and that it should
> >> be retried when the device is first opened from user code, by which
> >> time we expect that all required filesystems will have been mounted.
> >> The late-retry code will then re-attempt to load the firmware if the
> >> early attempt failed.
> >>
> >> If the late retry fails, the current open-in-progress will fail, but
> >> the recovery code will disable GuC submission and reset the GPU and
> >> driver. The next open will therefore be in non-GuC mode, and will be
> >> allowed to complete even if the GuC cannot be loaded or used.
> >>
> >> Issue: VIZ-4884
> >> Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
> >> Signed-off-by: Alex Dai <yu.dai at intel.com>
> > 
> > I'm not really sold on this super-flexible fallback scheme implemented
> > here. Because such fallback schemes mean more code to test (which no one
> > will likely do) or just even bigger fireworks when we actually hit them in
> > reality when something goes wrong. Imo if anything goes wrong in the setup
> > we just throw in the towel and fail the driver loading.
> 
> Firstly, GuC submission is an OPTION. That means we already have code to
> work with or without a GuC. The fallback just allows us to keep going
> after finding that although GuC submission has been requested, and we do
> have a GuC, nonetheless the request cannot be satisfied. That's no
> different from automatically disabling PPGTT or execlist mode if they're
> requested on platforms where we don't support them.

It is different, since we make the automatic ppgtt/execlist/whatever
disabling decision once at driver load and then stick to it. Well, you can
sometimes change it at runtime and it might work, but it's not something we
test or recommend - it autotaints the kernel even when you just touch these
options.
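
To illustrate "decide once at load and stick to it", here is a rough
userspace sketch. All names in it (hw_caps, submit_mode,
sanitize_submit_mode) are hypothetical and not taken from i915; it only
models the idea that a requested mode is downgraded a single time against
what the hardware supports, and never revisited afterwards.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these exist in i915. */
struct hw_caps {
	bool has_execlists;
	bool has_guc;
};

enum submit_mode { SUBMIT_RING, SUBMIT_EXECLISTS, SUBMIT_GUC };

/*
 * Meant to be called exactly once, at driver load: downgrade the requested
 * mode step by step until it matches what the hardware supports. The result
 * is then used for the lifetime of the driver instance.
 */
static enum submit_mode sanitize_submit_mode(enum submit_mode requested,
					     const struct hw_caps *caps)
{
	if (requested == SUBMIT_GUC && !caps->has_guc)
		requested = SUBMIT_EXECLISTS;
	if (requested == SUBMIT_EXECLISTS && !caps->has_execlists)
		requested = SUBMIT_RING;
	return requested;
}
```

The point of the sketch is that the function has no runtime state to consult:
the answer depends only on the request and the hardware, so it cannot change
behind anyone's back after load.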

> > There's only one exception: If something fails with GT init we declare the
> > gpu wedged but proceed with all the modeset setup. This makes sense
> > because we need all the code to handle a wedged gpu anyway, dead-on-boot
> > gpus happen occasionally and it's really not nice to greet the user with a
> > black screen. But more fallbacks are imo just headache.
> > 
> > Hence when the guc fails we imo really shouldn't bother with fallbacks,
> > but instead just declare the thing wedged and carry on.
> 
> So the strategy here is exactly the same as for GT init; declare the GPU
> wedged, but after disabling GuC mode. The recovery will then get us into
> the same state as if there were no GuC, or GuC mode had not been
> selected in the first place. We can't switch between GuC and execlists
> arbitrarily; the only switchover is from GuC to non-GuC, and it can only
> happen ONCE.

The existing wedged logic is a terminal state (except when developers
reset it through debugfs). There's never any automatic recovery/fallback if
we can't get the gpu up and running in the mode we want it to run in.
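
A minimal model of what "terminal" means here, as a userspace sketch. The
struct and function names are made up for illustration and are not the i915
ones; the only things it tries to capture are that the wedged flag is set
once when GT init fails, that modeset setup still proceeds, and that nothing
short of a manual reset ever clears the flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model; these are not the i915 data structures. */
struct gpu_state {
	bool wedged;        /* terminal once set; only a manual reset clears it */
	bool modeset_ready; /* display side, brought up regardless of the GT */
};

/* If GT init failed, mark the GPU wedged but still bring up modesetting. */
static void driver_load(struct gpu_state *gpu, int gt_init_err)
{
	gpu->wedged = (gt_init_err != 0);
	gpu->modeset_ready = true;
}

/* While wedged, every submission fails; nothing retries or falls back. */
static int submit_batch(const struct gpu_state *gpu)
{
	return gpu->wedged ? -EIO : 0;
}

/* Stand-in for the developer-only manual reset. */
static void debug_manual_reset(struct gpu_state *gpu)
{
	gpu->wedged = false;
}
```

Note there is deliberately no code path in the model that clears wedged on its
own, which is the difference from an automatic GuC to non-GuC switchover.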

> To test this is easy; just rename your firmware blob so the driver can't
> find it and reboot. It should automatically run in execlist mode, with a
> log message telling you what went wrong (f/w file not found). Much nicer
> than your screen staying blank because you upgraded the driver and not
> the firmware, or vice versa.

The screen will not stay blank since we'll still enable the modeset driver
of i915, and at least basic userspace drivers know how to fall back to sw
rendering. The entire point of declaring the gpu wedged if init fails is
to increase the chances that we can get a bug report.

> > That should also allow us to simplify the firmware loading: We can do that
> > in an async worker and if the blob isn't there in time then we just move
> > on.
> > -Daniel
> 
> Under no circumstances can you ever load the firmware from an async
> worker thread, because Bad Things Will Happen if there is hardware
> activity already in progress when the GuC f/w starts up.

Whether you load the firmware through an async work item in a kernel
thread or from a userspace process (in open) doesn't materially change
things at all - it's concurrent either way and you need to cope with it.
And dev->struct_mutex is a big lock (way too big, and one of the most
serious if not the worst piece of technical debt we carry around), but it
does not protect against all concurrent access to the hardware.

The upside of doing the init in an explicit async worker is that it's
explicit, looks scary and you don't have any illusions about it ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

