[Intel-gfx] [PATCH] drm/i915: Workaround incoherence between fences and LLC across multiple CPUs

Daniel Vetter daniel at ffwll.ch
Sat Jun 8 23:22:00 CEST 2013


On Sat, Jun 8, 2013 at 10:41 PM, Carsten Emde <C.Emde at osadl.org> wrote:
> On 04/08/2013 11:54 AM, Daniel Vetter wrote:
>>
>> On Fri, Apr 05, 2013 at 09:03:27AM -0700, Jesse Barnes wrote:
>>>
>>> On Thu,  4 Apr 2013 21:31:03 +0100
>>> Chris Wilson <chris at chris-wilson.co.uk> wrote:
>>>
>>>> In order to fully serialize access to the fenced region and the update
>>>> to the fence register we need to take extreme measures on SNB+, and
>>>> manually flush writes to memory prior to writing the fence register in
>>>> conjunction with the memory barriers placed around the register write.
>>>>
>>>> Fixes i-g-t/gem_fence_thrash
>>>>
>>>> v2: Bring a bigger gun
>>>> v3: Switch the bigger gun for heavier bullets (Arjan van de Ven)
>>>> v4: Remove changes for working generations.
>>>> v5: Reduce to a per-cpu wbinvd() call prior to updating the fences.
>>>> v6: Rewrite comments to elide forgotten history.
>>>>
>>>> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=62191
>>>> Signed-off-by: Chris Wilson <chris at chris-wilson.co.uk>
>>>> Cc: Jon Bloomfield <jon.bloomfield at intel.com>
>>>> Tested-by: Jon Bloomfield <jon.bloomfield at intel.com> (v2)
>>>> Cc: stable at vger.kernel.org
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem.c |   28 +++++++++++++++++++++++-----
>>>>   1 file changed, 23 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>> index fa4ea1a..8f7739e 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>> @@ -2689,17 +2689,35 @@ static inline int fence_number(struct drm_i915_private *dev_priv,
>>>>         return fence - dev_priv->fence_regs;
>>>>   }
>>>>
>>>> +static void i915_gem_write_fence__ipi(void *data)
>>>> +{
>>>> +       wbinvd();
>>>> +}
>>>> +
>>>>   static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
>>>>                                          struct drm_i915_fence_reg *fence,
>>>>                                          bool enable)
>>>>   {
>>>> -       struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
>>>> -       int reg = fence_number(dev_priv, fence);
>>>> -
>>>> -       i915_gem_write_fence(obj->base.dev, reg, enable ? obj : NULL);
>>>> +       struct drm_device *dev = obj->base.dev;
>>>> +       struct drm_i915_private *dev_priv = dev->dev_private;
>>>> +       int fence_reg = fence_number(dev_priv, fence);
>>>> +
>>>> +       /* In order to fully serialize access to the fenced region and
>>>> +        * the update to the fence register we need to take extreme
>>>> +        * measures on SNB+. In theory, the write to the fence register
>>>> +        * flushes all memory transactions before, and coupled with the
>>>> +        * mb() placed around the register write we serialise all memory
>>>> +        * operations with respect to the changes in the tiler. Yet, on
>>>> +        * SNB+ we need to take a step further and emit an explicit wbinvd()
>>>> +        * on each processor in order to manually flush all memory
>>>> +        * transactions before updating the fence register.
>>>> +        */
>>>> +       if (HAS_LLC(obj->base.dev))
>>>> +               on_each_cpu(i915_gem_write_fence__ipi, NULL, 1);
>>>> +       i915_gem_write_fence(dev, fence_reg, enable ? obj : NULL);
>>>>
>>>>         if (enable) {
>>>> -               obj->fence_reg = reg;
>>>> +               obj->fence_reg = fence_reg;
>>>>                 fence->obj = obj;
>>>>                 list_move_tail(&fence->lru_list, &dev_priv->mm.fence_list);
>>>>         } else {
>>>
>>>
>>> I almost wish x86 just gave up and went fully weakly ordered.  At least
>>> then we'd know we need barriers everywhere, rather than just in
>>> mystical places.
>>>
>>> Reviewed-by: Jesse Barnes <jbarnes at virtuousgeek.org>
>>
>>
>> Queued for -next, thanks for the patch. I am a bit unhappy that the
>> testcase doesn't work too well and can't reliably reproduce this on all
>> affected machines. I've tried a bit to improve things, but it's very
>> fickle. I guess we just have to accept this :(
>
>
> Under real-time conditions, when we expect deterministic response to
> external and internal events at any time within a couple of
> microseconds, invalidating and flushing the entire cache in a running
> system is unacceptable. This introduces latencies of several
> milliseconds, as was clearly shown in RT regression tests on a kernel
> with this patch applied (https://www.osadl.org/?id=1543#c7602). We
> therefore have to revert it in the RT patch queue - kind of a
> workaround of a workaround.
>
> It would simply be wonderful if we could get rid of the hateful wbinvd().

I'm somewhat surprised people haven't started screaming about this one
earlier ;-)
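
(For anyone who wants to see the latency hit on their own machine: the
OSADL report linked above comes from cyclictest-style measurements, so
something along the lines of

  $ sudo cyclictest -m -Sp99 -i200

from the rt-tests suite should make multi-millisecond spikes from a
broadcast wbinvd() stand out. The exact options are only a guess at
their setup, not taken from the report.)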

We're trying to figure out whether there's a less costly w/a (we have
some benchmarks where it kills performance, too), but until that's done
we pretty much need to stick with it. If you want to avoid any bad side
effects due to that w/a, you can alternatively pin all gpu rendering
tasks to the same cpu core/thread: the incoherence this patch works
around only manifests when the fenced region is accessed from multiple
CPUs. A sketch of such a pinning wrapper follows below.
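
For reference, the pinning can be done from the shell with taskset(1),
e.g.

  $ taskset -c 0 my-rendering-app

or programmatically before exec'ing the real workload. Here is a
minimal userspace sketch using sched_setaffinity(2); CPU 0 and the
wrapper style are arbitrary choices of this example, not anything the
driver mandates:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	cpu_set_t mask;

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* Restrict this process - and everything it execs - to CPU 0. */
	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	if (sched_setaffinity(0, sizeof(mask), &mask)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Hand over to the actual rendering task. */
	execvp(argv[1], argv + 1);
	perror("execvp");
	return 1;
}

With every user of the fenced region confined to one cpu the cross-cpu
incoherence from the bug report shouldn't be able to trigger; the
obvious cost is that you lose the parallelism.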
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


