[Intel-gfx] [PATCH 07/10] drm/i915: Gate engine stats collection with a static key

Thu Oct 5 09:04:42 UTC 2017

Quoting Tvrtko Ursulin (2017-10-05 08:07:03)
> 
> On 04/10/2017 18:49, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2017-10-04 18:38:09)
> >>
> >> On 03/10/2017 11:17, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2017-09-29 13:34:57)
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
> >>>>
> >>>> This reduces the cost of the software engine busyness tracking
> >>>> to a single no-op instruction when there are no listeners.
> >>>>
> >>>> We add a new i915 ordered workqueue to be used only for tasks
> >>>> not needing struct mutex.
> >>>>
> >>>> v2: Rebase and some comments.
> >>>> v3: Rebase.
> >>>> v4: Checkpatch fixes.
> >>>> v5: Rebase.
> >>>> v6: Use system_long_wq to avoid being blocked by struct_mutex
> >>>>       users.
> >>>> v7: Fix bad conflict resolution from last rebase. (Dmitry Rogozhkin)
> >>>> v8: Rebase.
> >>>> v9:
> >>>>    * Fix race between unordered enable followed by disable.
> >>>>      (Chris Wilson)
> >>>>    * Prettify order of local variable declarations. (Chris Wilson)
> >>>
> >>> Ok, I can't see a downside to enabling the optimisation even if it will
> >>> be global and not per-device/per-engine.
> >>
> >> For this one I did a quick test with gem_exec_nop and I've seen around
> >> 0.5% reduction in time spent in intel_lrc_irq_handler in the case where
> >> PMU is not active.
> > 
> > Hmm, gem_exec_nop isn't going to be favourable as there we are just
> > extending the busyness coverage of an engine. I think you want something
> > like gem_sync/sequential (or gem_exec_whisper), as there each engine
> > will be starting and stopping, and delays between engines will
> > accumulate.
> 
> Not sure if we are on the same page. Here I was referring to the CPU 
> usage in the "irq" (tasklet) handler. gem_exec_nop generates a good 
> number of interrupts (8k/s) and shows up in the profile at ~1.8% CPU 
> without the static branch optimisation, and ~1.3% with it.

I'm just suggesting that some of the other tests are more sensitive to
delays incurred there, where that ~0.5 percentage point difference in CPU
usage (1.8% vs 1.3%) may be visible to userspace. They may make a more
convincing argument?
-Chris
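
For readers following the thread, the gating idea under discussion can be
sketched in plain userspace C. This is only an illustration under stated
assumptions, not the code from the patch: every name below (stats_listeners,
engine_stats, context_in, context_out) is hypothetical, and an atomic
listener count stands in for the kernel static key. A real static key
patches the branch site so the disabled path costs a single no-op, whereas
this stand-in still pays a load and a compare. __builtin_expect is a
GCC/Clang extension.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a static key: number of active listeners. */
static atomic_int stats_listeners;

/* Hypothetical per-engine busyness bookkeeping. */
struct engine_stats {
	uint64_t busy_ns;    /* accumulated busy time */
	uint64_t start_ns;   /* timestamp when the engine became busy */
	unsigned int active; /* contexts currently executing */
};

/* Hint to the compiler that the common case is "nobody is listening". */
static inline bool stats_enabled(void)
{
	int n = atomic_load_explicit(&stats_listeners, memory_order_relaxed);

	return __builtin_expect(n > 0, 0);
}

/* Refcounted enable/disable, so several listeners can coexist. */
static void stats_enable(void)
{
	atomic_fetch_add(&stats_listeners, 1);
}

static void stats_disable(void)
{
	atomic_fetch_sub(&stats_listeners, 1);
}

/* Called when a context starts executing on the engine. */
static void context_in(struct engine_stats *s, uint64_t now_ns)
{
	if (!stats_enabled())
		return;

	if (s->active++ == 0)
		s->start_ns = now_ns;
}

/* Called when a context stops executing on the engine. */
static void context_out(struct engine_stats *s, uint64_t now_ns)
{
	if (!stats_enabled())
		return;

	if (s->active > 0 && --s->active == 0)
		s->busy_ns += now_ns - s->start_ns;
}
```

With no listener registered, context_in() and context_out() return before
touching the stats, which is the path whose cost the quoted 1.8% vs 1.3%
profile figures are about. After stats_enable(), a context_in() at 1000 ns
followed by a context_out() at 1500 ns adds 500 ns to busy_ns.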