[Intel-gfx] [PATCH] drm/i915: Workaround incoherence between fences and LLC across multiple CPUs
Daniel Vetter
daniel at ffwll.ch
Mon Jun 10 16:37:50 CEST 2013
On Mon, Jun 10, 2013 at 3:17 PM, Bloomfield, Jon
<jon.bloomfield at intel.com> wrote:
>> -----Original Message-----
>> From: daniel.vetter at ffwll.ch [mailto:daniel.vetter at ffwll.ch] On Behalf Of
>> Daniel Vetter
>> Sent: Saturday, June 08, 2013 10:22 PM
>> To: Carsten Emde
>> Cc: Chris Wilson; Jesse Barnes; Intel Graphics Development; Bloomfield, Jon;
>> Steven Rostedt; Christoph Mathys; stable
>> Subject: Re: [Intel-gfx] [PATCH] drm/i915: Workaround incoherence
>> between fences and LLC across multiple CPUs
>>
>> On Sat, Jun 8, 2013 at 10:41 PM, Carsten Emde <C.Emde at osadl.org> wrote:
>> > On 04/08/2013 11:54 AM, Daniel Vetter wrote:
>> >>
>> >> On Fri, Apr 05, 2013 at 09:03:27AM -0700, Jesse Barnes wrote:
>> >>>
>> >>> On Thu, 4 Apr 2013 21:31:03 +0100
>> >>> Chris Wilson <chris at chris-wilson.co.uk> wrote:
>> >>>
>> >>>> In order to fully serialize access to the fenced region and the
>> >>>> update to the fence register we need to take extreme measures on
>> >>>> SNB+, and manually flush writes to memory prior to writing the
>> >>>> fence register in conjunction with the memory barriers placed around the register write.
>> >>>>
>> >>>> Fixes i-g-t/gem_fence_thrash
>> >>>>
>> >>>> v2: Bring a bigger gun
>> >>>> v3: Switch the bigger gun for heavier bullets (Arjan van de Ven)
>> >>>> v4: Remove changes for working generations.
>> >>>> v5: Reduce to a per-cpu wbinvd() call prior to updating the fences.
>> >>>> v6: Rewrite comments to elide forgotten history.
>> >>>>
>> >>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62191
>> >>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>> >>>> Cc: Jon Bloomfield <jon.bloomfield at intel.com>
>> >>>> Tested-by: Jon Bloomfield <jon.bloomfield at intel.com> (v2)
>> >>>> Cc: stable at vger.kernel.org
>> >>>> ---
>> >>>>  drivers/gpu/drm/i915/i915_gem.c | 28 +++++++++++++++++++++++-----
>> >>>>  1 file changed, 23 insertions(+), 5 deletions(-)
>> >>>>
>> >>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> >>>> index fa4ea1a..8f7739e 100644
>> >>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> >>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> >>>> @@ -2689,17 +2689,35 @@ static inline int fence_number(struct drm_i915_private *dev_priv,
>> >>>> return fence - dev_priv->fence_regs;
>> >>>> }
>> >>>>
>> >>>> +static void i915_gem_write_fence__ipi(void *data)
>> >>>> +{
>> >>>> + wbinvd();
>> >>>> +}
>> >>>> +
>> >>>>  static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
>> >>>>                                           struct drm_i915_fence_reg *fence,
>> >>>>                                           bool enable)
>> >>>>  {
>> >>>> - struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
>> >>>> - int reg = fence_number(dev_priv, fence);
>> >>>> -
>> >>>> - i915_gem_write_fence(obj->base.dev, reg, enable ? obj : NULL);
>> >>>> + struct drm_device *dev = obj->base.dev;
>> >>>> + struct drm_i915_private *dev_priv = dev->dev_private;
>> >>>> + int fence_reg = fence_number(dev_priv, fence);
>> >>>> +
>> >>>> + /* In order to fully serialize access to the fenced region and
>> >>>> + * the update to the fence register we need to take extreme
>> >>>> + * measures on SNB+. In theory, the write to the fence register
>> >>>> + * flushes all memory transactions before, and coupled with the
>> >>>> + * mb() placed around the register write we serialise all memory
>> >>>> + * operations with respect to the changes in the tiler. Yet, on
>> >>>> + * SNB+ we need to take a step further and emit an explicit wbinvd()
>> >>>> + * on each processor in order to manually flush all memory
>> >>>> + * transactions before updating the fence register.
>> >>>> + */
>> >>>> + if (HAS_LLC(obj->base.dev))
>> >>>> + on_each_cpu(i915_gem_write_fence__ipi, NULL, 1);
>> >>>> + i915_gem_write_fence(dev, fence_reg, enable ? obj : NULL);
>> >>>>
>> >>>> if (enable) {
>> >>>> - obj->fence_reg = reg;
>> >>>> + obj->fence_reg = fence_reg;
>> >>>> fence->obj = obj;
>> >>>> list_move_tail(&fence->lru_list, &dev_priv->mm.fence_list);
>> >>>> } else {
>> >>>
>> >>>
>> >>> I almost wish x86 just gave up and went fully weakly ordered. At
>> >>> least then we'd know we need barriers everywhere, rather than just
>> >>> in mystical places.
>> >>>
>> >>> Reviewed-by: Jesse Barnes <jbarnes at virtuousgeek.org>
>> >>
>> >>
>> >> Queued for -next, thanks for the patch. I am a bit unhappy that the
>> >> testcase doesn't work too well and can't reliably reproduce this on
>> >> all affected machines. But I've tried a bit to improve things and
>> >> it's very fickle. I guess we just have to accept this :(
>> >
>> >
>> > Under real-time conditions, where we expect a deterministic response to
>> > external and internal events at any time within a couple of
>> > microseconds, invalidating and flushing the entire cache on a running
>> > system is unacceptable. It introduces latencies of several
>> > milliseconds, as was clearly shown in RT regression tests on a
>> > kernel with this patch applied (https://www.osadl.org/?id=1543#c7602).
>> > We therefore have to revert it in the RT patch queue - kind of a
>> > workaround of a workaround.
>> >
>> > It would simply be wonderful if we could get rid of the hateful wbinvd().
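[As a rough illustration of the cost being objected to here (not part of Chris's patch), a throwaway kernel module along these lines could time a single on_each_cpu(wbinvd) round trip - the same operation the workaround performs on every fence update. Module and symbol names are made up for the sketch, and the measured number will vary widely with cache size, core count and concurrent load:

#include <linux/module.h>
#include <linux/smp.h>
#include <linux/ktime.h>
#include <asm/special_insns.h>

/* Same shape as i915_gem_write_fence__ipi() in the patch above. */
static void wbinvd_ipi(void *unused)
{
        wbinvd();       /* write back and invalidate this CPU's caches */
}

static int __init wbinvd_cost_init(void)
{
        ktime_t start = ktime_get();

        /* IPI every online CPU and wait until each one has flushed. */
        on_each_cpu(wbinvd_ipi, NULL, 1);

        pr_info("on_each_cpu(wbinvd) took %lld us\n",
                (long long)ktime_to_us(ktime_sub(ktime_get(), start)));
        return 0;
}

static void __exit wbinvd_cost_exit(void)
{
}

module_init(wbinvd_cost_init);
module_exit(wbinvd_cost_exit);
MODULE_LICENSE("GPL");
]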
>>
>> I'm somewhat surprised people have not started to scream earlier about this
>> one ;-)
>>
>> We're trying to figure out whether there's a less costly w/a (we have some
>> benchmark where it kills performance, too) but until that's done we pretty
>> much need to stick with it. If you want to avoid any bad side-effects due to
>> that w/a you can alternatively pin all gpu rendering tasks to the same cpu
>> core/thread.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
>
> This is such a serious bug that I'm surprised it hasn't been seen elsewhere already. We can't leave it unfixed just because the only known fix could have a performance impact. There's no point in having fast tiled accesses that can't be relied upon. We need to release an interim fix while we look for better solutions.
>
> Daniel, how does pinning to a single CPU avoid the wbinvd()? Surely the page-fault code doesn't know to skip the workaround in that case?
It's only that the incoherency shows up when you both migrate
threads between different cpus and also need to steal fence registers.
If you pin all threads to the same cpu thread/core (e.g. with
numactl or sched_setaffinity) we haven't yet seen any corruption.
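For illustration, a minimal userspace sketch of that kind of pinning (assuming CPU 0 is the chosen core; the affinity mask is inherited by forked/exec'd children, so this is roughly equivalent to launching the workload under "taskset -c 0"):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);      /* allow this process to run on CPU 0 only */

        /* pid 0 means the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                perror("sched_setaffinity");
                return 1;
        }

        /* ... start or continue the GPU-using workload from here ... */
        return 0;
}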
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch