[Intel-gfx] [RFC 00/11] TDR/watchdog timeout support for gen8
Tomas Elf
tomas.elf at intel.com
Tue Jun 23 03:47:16 PDT 2015
On 23/06/2015 11:05, Daniel Vetter wrote:
> Your patches don't apply cleanly any more and I can't find a suitable
> baseline where they would. But I'd like to see it all in context to check
> a few things.
>
> Can you pls push a git branch with these somewhere?
>
Here's the baseline for my local tree:
cd07637 - drm-intel-nightly: 2015y-04m-13d-09h-46m-59s UTC integration
manifest <daniel.vetter at ffwll.ch>
I haven't updated it in a while obviously since I thought that could
wait until we'd worked our way through the RFC series and I could get to
work on the first real patch series.
Is it possible for you to set up a local tree of your own with my
baseline and my RFC patches on top or would you prefer it if I push my
branch to drm-private?
Thanks,
Tomas
> Thanks, Daniel
>
> On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
>> This patch series introduces the following features:
>>
>> * Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
>>
>> TDR is an umbrella term for anything that goes into detecting and recovering
>> from GPU hangs and is a term more widely used outside of the upstream driver.
>> This feature introduces an extensible framework that currently supports gen8
>> but that can be easily extended to support gen7 as well (which is already the
>> case in GMIN but unfortunately in a not quite upstreamable form). The code
>> contained in this submission represents the essentials of what is currently in
>> GMIN merged with what is currently in upstream (as of the time when this work
>> commenced a few months back).
>>
>> This feature adds a new hang recovery path alongside the legacy GPU reset path,
>> which takes care of engine recovery only. Aside from adding support for
>> per-engine recovery this feature also introduces rules for how to promote a
>> potential per-engine reset to a legacy, full GPU reset.
>>
>> The hang checker now integrates with the error handler in a slightly different
>> way in that it allows hang recovery on multiple engines at the same time by
>> passing an engine flag mask to the error handler where flags representing all
>> of the hung engines are set. This allows us to schedule hang recovery once for
>> all currently hung engines instead of one hang recovery per detected engine
>> hang. Previously, when only full GPU reset was supported this was all the same
>> since it wouldn't matter if one or four engines were hung at any given point
>> since it would all amount to the same thing - the GPU getting reset. As it
>> stands now the behaviour is different depending on which engine is hung since
>> each engine is reset separately from all the other engines, therefore we have
>> to think about this in terms of scheduling cost and recovery latency. (see
>> open question below)
>>
>> OPEN QUESTIONS:
>>
>> 1. Do we want to investigate the possibility of per-engine hang
>> detection? In the current upstream driver there is only one work queue
>> that handles the hang checker and everything from initial hang
>> detection to final hang recovery runs in this thread. This makes sense
>> if you're only supporting one form of hang recovery - using full GPU
>> reset and nothing tied to any particular engine. However, as part
>> of this patch series we're changing that by introducing per-engine
>> hang recovery. It could make sense to introduce multiple work
>> queues - one per engine - to run multiple hang checking threads in
>> parallel.
>>
>> This would potentially allow savings in terms of recovery latency since
>> we don't have to scan all engines every time the hang checker is
>> scheduled and the error handler does not have to scan all engines every
>> time it is scheduled. Instead, we could implement one work queue per
>> engine that would invoke the hang checker that only checks _that_
>> particular engine and then the error handler is invoked for _that_
>> particular engine. If one engine has hung the latency for getting to
>> the hang recovery path for that particular engine would be (Time For
>> Hang Checking One Engine) + (Time For Error Handling One Engine) rather
>> than the time it takes to do hang checking for all engines + the time
>> it takes to do error handling for all engines that have been detected
>> as hung (which in the worst case would be all engines). There would
>> potentially be as many hang checker and error handling threads going on
>> concurrently as there are engines in the hardware but they would all be
>> running in parallel without any significant locking. The first time
>> where any thread needs exclusive access to the driver is at the point
>> of the actual hang recovery but the time it takes to get there would
>> theoretically be lower and the time it actually takes to do per-engine
>> hang recovery is quite a lot lower than the time it takes to actually
>> detect a hang reliably.
>>
>> How much we would save by such a change still needs to be analysed and
>> compared against the current single-thread model but it makes sense
>> from a theoretical design point of view.
>>
>> 2. How does per-engine reset integrate with the public reset stats
>> IOCTL? These stats are used for the GL robustness interface and
>> currently these tests are failing when running per-engine hang recovery
>> since we treat per-engine recovery differently from full GPU recovery,
>> which is nothing that userland knows anything about. When userland
>> expects to hang the hardware it expects the reset stat interface to
>> reflect this, which is something that has changed as part of this code
>> submission. There's more than one way to solve this. Here are two options:
>>
>> 1. Expose per-engine reset statistics and set contexts as
>> guilty the same way for per-engine reset as for full GPU
>> resets.
>>
>> That would make this change to the hang recovery mechanism
>> transparent to userland but it would change the semantics since
>> an active context in the reset stats no longer implies that the
>> GPU was fully reset.
>>
>> 2. Add a new set of statistics for per-engine reset (one group
>> of statistics for each engine) to reflect the extended
>> capabilities that per-engine hang recovery offers.
>>
>> Would that be breaking the ABI?
>>
>> ... Or are there any other way of doing this?
>>
>> * Feature 2: Watchdog Timeout (a.k.a "media engine reset") for gen8.
>>
>> This feature allows userland applications to control whether or not individual
>> batch buffers should have a first-level, fine-grained, hardware-based hang
>> detection mechanism on top of the ordinary, software-based periodic hang
>> checker that is already in the driver. The advantage over relying solely on the
>> current software-based hang checker is that the watchdog timeout mechanism is
>> about 1000x quicker and more precise. Since it's not a full driver-level hang
>> detection mechanism but only targetting one individual batch buffer at a time
>> it can afford to be that quick without risking an increase in false positive
>> hang detection.
>>
>> This feature includes the following changes:
>>
>> a) Watchdog timeout interrupt service routine for handling watchdog interrupts
>> and connecting these to per-engine hang recovery.
>>
>> b) Injection of watchdog timer enablement/cancellation instructions
>> before/after the batch buffer start instruction in the ring buffer so that
>> watchdog timeout is connected to the submission of an individual batch buffer.
>>
>> c) Extension of the DRM batch buffer interface, exposing the watchdog timeout
>> feature to userland. We've got two open source groups in VPG currently in the
>> process of integrating support for this feature, which should make it
>> principally possible to upstream this extension.
>>
>> There is currently full watchdog timeout support for gen7 in GMIN and it is
>> quite similar to the gen8 implementation so there is nothing obvious that
>> prevents us from upstreaming that code along with the gen8 code. However,
>> watchdog timeout is fully dependent on the per-engine hang recovery path and
>> that is not part of this code submission for gen7. Therefore watchdog timeout
>> support for gen7 has been excluded until per-engine hang recovery support for
>> gen7 has landed upstream.
>>
>> As part of this submission we've had to reinstate the work queue that was
>> previously in place between the error handler and the hang recovery path. The
>> reason for this is that the per-engine recovery path is called directly from
>> the interrupt handler in the case of watchdog timeout. In that situation
>> there's no way of grabbing the struct_mutex, which is a requirement for the
>> hang recovery path. Therefore, by reinstating the work queue we provide a
>> unified execution context for the hang recovery code that allows the hang
>> recovery code to grab whatever locks it needs without sacrificing interrupt
>> latency too much or sleeping indefinitely in hard interrupt context.
>>
>> * Feature 3. Context Submission Status Consistency checking
>>
>> Something that becomes apparent when you run long-duration operations tests
>> with concurrent rendering processes with intermittently injected hangs is that
>> it seems like the GPU forgets to send context completion interrupts to the
>> driver under some circumstances. What this means is that the driver sometimes
>> gets stuck on a context that never seems to finish, all the while the hardware
>> has completed and is waiting for more work.
>>
>> The problem with this is that the per-engine hang recovery path relies on
>> context resubmission to kick off the hardware again following an engine reset.
>> This can only be done safely if the hardware and driver share the same opinion
>> about the current state. Therefore we've extended the periodic hang checker to
>> check for context submission state inconsistencies aside from the hang checking
>> it already does.
>>
>> If such a state is detected it is assumed (based on experience) that a context
>> completion interrupt has been lost somehow. If this state persists for some
>> time an attempt to correct it is made by faking the presumably lost context
>> completion interrupt by manually calling the execlist interrupt handler, which
>> is normally called from the main interrupt handler cued by a received context
>> event interrupt. Just because an interrupt goes missing does not mean that the
>> context status buffer (CSB) does not get appropriately updated by the hardware,
>> which means that we can expect to find all the recent changes to the context
>> states for each engine captured there. If there are outstanding context status
>> changes in store there then the faked context event interrupt will allow the
>> interrupt handler to act on them. In the case of lost context completion
>> interrupts this will prompt the driver to remove the already completed context
>> from the execlist queue and move on to the next pending piece of work and
>> thereby eliminating the inconsistency.
>>
>> * Feature 4. Debugfs extensions for per-engine hang recovery and TDR/watchdog trace
>> points.
>>
>>
>> Tomas Elf (11):
>> drm/i915: Early exit from semaphore_waits_for for execlist mode.
>> drm/i915: Introduce uevent for full GPU reset.
>> drm/i915: Add reset stats entry point for per-engine reset.
>> drm/i915: Adding TDR / per-engine reset support for gen8.
>> drm/i915: Extending i915_gem_check_wedge to check engine reset in
>> progress
>> drm/i915: Disable warnings for TDR interruptions in the display
>> driver.
>> drm/i915: Reinstate hang recovery work queue.
>> drm/i915: Watchdog timeout support for gen8.
>> drm/i915: Fake lost context interrupts through forced CSB check.
>> drm/i915: Debugfs interface for per-engine hang recovery.
>> drm/i915: TDR/watchdog trace points.
>>
>> drivers/gpu/drm/i915/i915_debugfs.c | 146 +++++-
>> drivers/gpu/drm/i915/i915_dma.c | 79 +++
>> drivers/gpu/drm/i915/i915_drv.c | 201 ++++++++
>> drivers/gpu/drm/i915/i915_drv.h | 91 +++-
>> drivers/gpu/drm/i915/i915_gem.c | 93 +++-
>> drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
>> drivers/gpu/drm/i915/i915_irq.c | 378 ++++++++++++--
>> drivers/gpu/drm/i915/i915_params.c | 10 +
>> drivers/gpu/drm/i915/i915_reg.h | 13 +
>> drivers/gpu/drm/i915/i915_trace.h | 298 +++++++++++
>> drivers/gpu/drm/i915/intel_display.c | 16 +-
>> drivers/gpu/drm/i915/intel_lrc.c | 858 ++++++++++++++++++++++++++++++-
>> drivers/gpu/drm/i915/intel_lrc.h | 16 +-
>> drivers/gpu/drm/i915/intel_lrc_tdr.h | 40 ++
>> drivers/gpu/drm/i915/intel_ringbuffer.c | 87 +++-
>> drivers/gpu/drm/i915/intel_ringbuffer.h | 109 ++++
>> drivers/gpu/drm/i915/intel_uncore.c | 241 ++++++++-
>> include/uapi/drm/i915_drm.h | 5 +-
>> 18 files changed, 2589 insertions(+), 94 deletions(-)
>> create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
>>
>> --
>> 1.7.9.5
>>
>> _______________________________________________
>> Intel-gfx mailing list
>> Intel-gfx at lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
More information about the Intel-gfx
mailing list