[Intel-gfx] [PATCH 0/8] DPF (GPU l3 parity detection) improvements
Ben Widawsky
benjamin.widawsky at intel.com
Tue Sep 17 06:15:42 CEST 2013
I see. I had thought the hang bit was part of the test injection, when
it's actually modifying the behavior of L3 errors. Any opinions on
what the default should be (agreed that policy should be controlled by
user space, but we can control the default)? What does a "hang" mean
exactly? Is the rest of memory still responsive? Is L3?
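
To make sure I have the ordering problem from the quoted message below right, here is a minimal sketch of it. The event names and both helper functions are made up for illustration and are not driver code. It only models the three steps (1)-(3) and says nothing about what the hang bit does in hardware.

```python
from itertools import permutations

# Hypothetical labels for the three steps in the quoted scenario.
ERROR = "l3_error"        # (1) L3 error occurs
COMPLETE = "completion"   # (2) completion reported to the user mode driver
HANDLED = "irq_handled"   # (3) L3 error interrupt handled


def reports_false_success(order):
    """True if completion is reported after the error but before it is handled."""
    e, c, h = (order.index(x) for x in (ERROR, COMPLETE, HANDLED))
    return e < c < h


def racy_orderings():
    """Orderings in which the error precedes its handling and the race bites."""
    return [
        o
        for o in permutations((ERROR, COMPLETE, HANDLED))
        if o.index(ERROR) < o.index(HANDLED) and reports_false_success(o)
    ]
```

Only the ordering error, completion, handled is a problem; if the interrupt is handled before completion is reported, user space can still be told the result is suspect.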
On Mon, Sep 16, 2013 at 5:52 PM, Bell, Bryan J <bryan.j.bell at intel.com> wrote:
> The "hang" injection is for the scenarios like:
> (1) L3 error occurs
> (2) Workload completion, reported to user mode driver, e.g. OpenCL
> (3) L3 error interrupt, handled.
>
> If (2) occurs before (3), it's possible to report that a GPGPU workload successfully completed when in fact it did not due to the L3 error.
>
> It should be up to user mode whether the "hang" bit is set.
>
> --Thanks
> Bryan
> -----Original Message-----
> From: Ben Widawsky [mailto:benjamin.widawsky at intel.com]
> Sent: Thursday, September 12, 2013 10:28 PM
> To: intel-gfx at lists.freedesktop.org
> Cc: Venkatesh, Vishnu; Bell, Bryan J; Widawsky, Benjamin
> Subject: [PATCH 0/8] DPF (GPU l3 parity detection) improvements
>
> Since IVB, our driver has supported GPU L3 cacheline remapping for parity errors. This is known as "DPF", for Dynamic Parity Feature. I am told such an error is a good predictor of a subsequent error in the same part of the cache. To address this possible issue for workloads requiring precise and correct data, like GPGPU workloads, the HW has extra space in the cache which can be dynamically remapped to fill in the old, faulting parts of the cache. I should also note that, to my knowledge, no such error has actually been seen on either Ivybridge or Haswell in the wild.
>
> Note, and reminder: GPU L3 is not the same thing as "L3." It is a special (usually incoherent) cache that is only used by certain components within the GPU.
>
> Included in the patches:
> 1. Fix HSW test cases previously submitted and bikeshedded by Ville.
> 2. Support for an extra area of L3 added in certain HSW SKUs.
> 3. Error injection support from user space for testing.
> 4. A reference daemon for listening to the parity error events.
>
> Caveats:
> * I've not implemented the "hang" injection. I was not clear what it does, and
> I don't really see how it benefits testing the software I have written.
>
> * I am currently missing a test which uses the error injection.
> Volunteers who want to help, please raise your hand. If not, I'll get
> to it as soon as possible.
>
> * We do have a race with the udev mechanism of error delivery. If I
> understand the way udev works, if we have more than 1 event before the
> daemon is woken, the properties will get us the failing cache location
> of the last error only. I think this is okay because of the earlier statement
> that a parity error is a good indicator of a future parity error. One thing
> which I've not done is track when errors are missed, which should be
> possible even if the info about the location of the error can't be
> retrieved.
>
> * There is no way to read out the per context remapping information through
> sysfs. I only expose whether or not a context has outstanding remaps through
> debugfs. This does affect the testability a bit, but the implementation is
> simple enough that I'm not terribly worried.
>
> Ben Widawsky (8):
> drm/i915: Remove extra "ring"
> drm/i915: Round l3 parity reads down
> drm/i915: Fix l3 parity user buffer offset
> drm/i915: Fix HSW parity test
> drm/i915: Add second slice l3 remapping
> drm/i915: Make l3 remapping use the ring
> drm/i915: Keep a list of all contexts
> drm/i915: Do remaps for all contexts
>
> drivers/gpu/drm/i915/i915_debugfs.c | 23 ++++++---
> drivers/gpu/drm/i915/i915_drv.h | 13 +++--
> drivers/gpu/drm/i915/i915_gem.c | 46 +++++++++---------
> drivers/gpu/drm/i915/i915_gem_context.c | 20 +++++++-
> drivers/gpu/drm/i915/i915_irq.c | 84 +++++++++++++++++++++------------
> drivers/gpu/drm/i915/i915_reg.h | 6 +++
> drivers/gpu/drm/i915/i915_sysfs.c | 57 +++++++++++++++-------
> drivers/gpu/drm/i915/intel_ringbuffer.c | 6 +--
> include/uapi/drm/i915_drm.h | 8 ++--
> 9 files changed, 175 insertions(+), 88 deletions(-)
>
> --
> 1.8.4
>
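
On the udev race in the caveats above: one way missed errors could be tracked, even when only the last location survives, is a running sequence count that the daemon compares against on each wakeup. The following is a rough sketch under that assumption. The class names, the `seq` counter, and the shape of the location tuple are all hypothetical; nothing like this is in the series as posted.

```python
class ParityErrorSource:
    """Hypothetical stand-in for the driver side: keeps only the latest error."""

    def __init__(self):
        self.seq = 0      # total number of errors raised so far
        self.last = None  # location of the most recent error only

    def raise_error(self, location):
        # A newer error overwrites the previous location, as in the udev caveat.
        self.seq += 1
        self.last = location


class Daemon:
    """Hypothetical listener that counts errors it never saw individually."""

    def __init__(self):
        self.seen = 0    # sequence number at the last wakeup
        self.missed = 0  # errors whose location was overwritten before we woke

    def wake(self, src):
        """Return the latest error location, or None if nothing new happened."""
        if src.seq == self.seen:
            return None
        # Everything between the last wakeup and the newest error was lost.
        self.missed += src.seq - self.seen - 1
        self.seen = src.seq
        return src.last
```

If three errors fire before the daemon runs, `wake()` returns the third location and `missed` becomes 2, so the daemon at least knows how many errors it did not get details for.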