[Intel-gfx] [PATCH 07/15] drm/i915: Defer default hardware context initialisation until first open

Fri Jun 19 02:19:04 PDT 2015

On 17/06/15 13:18, Daniel Vetter wrote:
> On Mon, Jun 15, 2015 at 07:36:25PM +0100, Dave Gordon wrote:
>> In order to fully initialise the default contexts, we have to execute
>> batchbuffer commands on the GPU engines. But in the case of GuC-based
>> batch submission, we can't do that until any required firmware has
>> been loaded, which may not be possible during driver load, because the
>> filesystem(s) containing the firmware may not be mounted until later.
>>
>> Therefore, we now allow the first call to the firmware-loading code to
>> return -EAGAIN to indicate that it's not yet ready, and that it should
>> be retried when the device is first opened from user code, by which
>> time we expect that all required filesystems will have been mounted.
>> The late-retry code will then re-attempt to load the firmware if the
>> early attempt failed.
>>
>> If the late retry fails, the current open-in-progress will fail, but
>> the recovery code will disable GuC submission and reset the GPU and
>> driver. The next open will therefore be in non-GuC mode, and will be
>> allowed to complete even if the GuC cannot be loaded or used.
>>
>> Issue: VIZ-4884
>> Signed-off-by: Dave Gordon <david.s.gordon at intel.com>
>> Signed-off-by: Alex Dai <yu.dai at intel.com>
> 
> I'm not really sold on this super-flexible fallback scheme implemented
> here. Because such fallback schemes mean more code to test (which no one
> will likely do) or just even bigger fireworks when we actually hit them in
> reality when something goes wrong. Imo if anything goes wrong in the setup
> we just throw in the towel and fail the driver loading.

Firstly, GuC submission is an OPTION. That means we already have code to
work with or without a GuC. The fallback just allows us to keep going
after finding that although GuC submission has been requested, and we do
have a GuC, nonetheless the request cannot be satisfied. That's no
different from automatically disabling PPGTT or execlist mode if they're
requested on platforms where we don't support them.

> There's only one exception: If something fails with GT init we declare the
> gpu wedged but proceed with all the modeset setup. This makes sense
> because we need all the code to handle a wedged gpu anyway, dead-on-boot
> gpus happen occasionally and it's really not nice to greet the user with a
> black screen. But more fallbacks are imo just headache.
> 
> Hence when the guc fails we imo really shouldn't bother with fallbacks,
> but instead just declare the thing wedged and carry on.

So the strategy here is exactly the same as for GT init; declare the GPU
wedged, but after disabling GuC mode. The recovery will then get us into
the same state as if there were no GuC, or GuC mode had not been
selected in the first place. We can't switch between GuC and execlists
arbitrarily; the only switchover is from GuC to non-GuC, and it can only
happen ONCE.
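
To make the flow concrete, here is a minimal standalone sketch of the
load/open sequence described above. It is not the code from the patch:
every name in it (sketch_dev, sketch_fw_load, and so on) is hypothetical,
the filesystem and firmware state are modelled as plain flags, and the
-EIO returned from the failing open is an assumed error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical submission modes; illustrative names, not i915's. */
enum submit_mode { SUBMIT_EXECLIST, SUBMIT_GUC };

struct sketch_dev {
	enum submit_mode mode;	/* currently selected submission mode */
	bool fs_mounted;	/* stand-in: firmware filesystem is reachable */
	bool fw_present;	/* stand-in: firmware blob can be found */
	bool fw_loaded;		/* firmware has been loaded */
	bool ctx_ready;		/* default contexts fully initialised */
};

/*
 * Returns 0 on success (or if no firmware is needed), -EAGAIN if called
 * early and the filesystem is not there yet, -ENOENT if the blob is missing.
 */
static int sketch_fw_load(struct sketch_dev *dev, bool early)
{
	if (dev->mode != SUBMIT_GUC || dev->fw_loaded)
		return 0;
	if (!dev->fs_mounted)
		return early ? -EAGAIN : -ENOENT;
	if (!dev->fw_present)
		return -ENOENT;
	dev->fw_loaded = true;
	return 0;
}

/* One-way switch: GuC -> non-GuC only, so it can happen at most once. */
static void sketch_disable_guc(struct sketch_dev *dev)
{
	assert(dev->mode == SUBMIT_GUC);
	dev->mode = SUBMIT_EXECLIST;
}

/* Driver load: try the firmware early, defer context init if not ready. */
static int sketch_driver_load(struct sketch_dev *dev)
{
	if (sketch_fw_load(dev, true) == 0)
		dev->ctx_ready = true;
	return 0;	/* load itself succeeds; first open retries if needed */
}

/* First open: late retry; on failure fall back and fail this open only. */
static int sketch_open(struct sketch_dev *dev)
{
	if (dev->ctx_ready)
		return 0;
	if (sketch_fw_load(dev, false) != 0) {
		sketch_disable_guc(dev);	/* GPU/driver reset would go here */
		return -EIO;			/* assumed error code */
	}
	dev->ctx_ready = true;
	return 0;
}
```

With the blob missing, the first sketch_open() fails and flips the mode;
the second one finds nothing left to load and completes in execlist mode.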

Testing this is easy: just rename your firmware blob so the driver can't
find it and reboot. It should automatically run in execlist mode, with a
log message telling you what went wrong (f/w file not found). Much nicer
than your screen staying blank because you upgraded the driver and not
the firmware, or vice versa.

> That should also allow us to simplify the firmware loading: We can do that
> in an async worker and if the blob isn't there in time then we just move
> on.
> -Daniel

Under no circumstances can you ever load the firmware from an async
worker thread, because Bad Things Will Happen if there is hardware
activity already in progress when the GuC f/w starts up.

.Dave.