[Intel-gfx] [RFC v3 00/11] i915 PMU and engine busy stats

Rogozhkin, Dmitry V dmitry.v.rogozhkin at intel.com
Tue Sep 12 02:03:04 UTC 2017


Hi,

Just tried v3 series. perf-stat works fine. From the IGT tests which I wrote for i915 PMU
(https://patchwork.freedesktop.org/series/29313/) all pass (assuming pmu.enabled will be exposed
in debugfs) except cpu_online subtest. And this is pretty interesting - see details below.

Ok, be prepared for the long mail:)...

So, cpu_online subtest loads RCS0 engine 100% and starts to put CPUs offline one by one starting
from CPU0 (don't forget to have CONFIG_BOOTPARAM_HOTPLUG_CPU0=y in .config). Test expectation is
that i915 PMU will report almost 100% (with 2% tolerance) busyness of RCS0. Test requests metric
just twice: before running on RCS0 and right after. This fails as follows:

Executed on rcs0 for 32004262184us
  i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed(&pm.metrics[0]) > (1-USAGE_TOLERANCE) * elapsed_ns(&start, &now)
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [__real_main773+0xff1]
  #2 [main+0x35]
  #3 [__libc_start_main+0xf5]
  #4 [_start+0x29]
  #5 [<unknown>+0x29]
Subtest cpu_online failed.
**** DEBUG ****
(perf_pmu:6325) DEBUG: Test requirement passed: is_hotplug_cpu0()
(perf_pmu:6325) INFO: perf_init: enabled 1 metrics from 1 requested
(perf_pmu:6325) ioctl-wrappers-DEBUG: Test requirement passed: gem_has_ring(fd, ring)
(perf_pmu:6325) INFO: Executed on rcs0 for 32004262184us
(perf_pmu:6325) INFO:   i915/rcs0-busy/: 2225723999us
(perf_pmu:6325) CRITICAL: Test assertion failure function test_cpu_online, file perf_pmu.c:719:
(perf_pmu:6325) CRITICAL: Failed assertion: perf_elapsed(&pm.metrics[0]) > (1-USAGE_TOLERANCE) * elapsed_ns(&start, &now)

Now. Looks like that by itself PMU context migration works. For example, if you will comment out
"perf_pmu_migrate_context(&pmu->base, cpu, target)" you will get:

    Executed on rcs0 for 32004434918us
      i915/rcs0-busy/:     76623707us

Compare with previous:
    Executed on rcs0 for 32004262184us
      i915/rcs0-busy/:    2225723999us

This test passed on the previous set of patches, I mean Tvrtko's v2 series + my patches.

So, it seems we are loosing counter values somehow. I saw in the patches that this place really was modified - you
have added subtraction from initial counter value:
static void i915_pmu_event_read(struct perf_event *event)
{

	local64_set(&event->count,
		    __i915_pmu_event_read(event) -
		    local64_read(&event->hw.prev_count));
}

But looks like the problem is that with the PMU context migration we get sequence of events start/stop (or maybe
add/del) which eventually call our i915_pmu_enable/disable. Here is the dmesg log with the obvious printk:

[  153.971096] [IGT] perf_pmu: starting subtest cpu_online
[  153.971151] i915_pmu_enable: event->hw.prev_count=0
[  154.036015] i915_pmu_disable: event->hw.prev_count=0
[  154.048027] i915_pmu_enable: event->hw.prev_count=0
[  154.049343] smpboot: CPU 0 is now offline
[  155.059028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  155.155078] smpboot: CPU 1 is now offline
[  156.161026] x86: Booting SMP configuration:
[  156.161027] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  156.197065] IRQ 122: no longer affine to CPU2
[  156.198087] smpboot: CPU 2 is now offline
[  157.208028] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  157.263093] smpboot: CPU 3 is now offline
[  158.273026] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  158.310026] i915_pmu_disable: event->hw.prev_count=76648307
[  158.319020] i915_pmu_enable: event->hw.prev_count=76648307
[  158.319098] IRQ 124: no longer affine to CPU4
[  158.320368] smpboot: CPU 4 is now offline
[  159.326030] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  159.365306] smpboot: CPU 5 is now offline
[  160.371030] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  160.421077] IRQ 125: no longer affine to CPU6
[  160.422093] smpboot: CPU 6 is now offline
[  161.429030] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  161.467091] smpboot: CPU 7 is now offline
[  162.473027] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  162.527019] i915_pmu_disable: event->hw.prev_count=4347548222
[  162.546017] i915_pmu_enable: event->hw.prev_count=4347548222
[  162.547317] smpboot: CPU 0 is now offline
[  163.553028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  163.621089] smpboot: CPU 1 is now offline
[  164.627028] x86: Booting SMP configuration:
[  164.627029] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  164.669308] smpboot: CPU 2 is now offline
[  165.679025] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  165.717089] smpboot: CPU 3 is now offline
[  166.723025] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  166.775016] i915_pmu_disable: event->hw.prev_count=8574197312
[  166.787016] i915_pmu_enable: event->hw.prev_count=8574197312
[  166.788309] smpboot: CPU 4 is now offline
[  167.794025] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  167.837114] smpboot: CPU 5 is now offline
[  168.847025] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  168.889312] smpboot: CPU 6 is now offline
[  169.899030] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  169.944104] smpboot: CPU 7 is now offline
[  170.954032] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  171.000016] i915_pmu_disable: event->hw.prev_count=12815138319
[  171.008017] i915_pmu_enable: event->hw.prev_count=12815138319
[  171.009304] smpboot: CPU 0 is now offline
[  172.017028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  172.096104] smpboot: CPU 1 is now offline
[  173.106025] x86: Booting SMP configuration:
[  173.106026] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  173.147078] smpboot: CPU 2 is now offline
[  174.153025] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  174.192093] smpboot: CPU 3 is now offline
[  175.198028] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  175.229042] i915_pmu_disable: event->hw.prev_count=17035889079
[  175.242030] i915_pmu_enable: event->hw.prev_count=17035889079
[  175.242163] IRQ fixup: irq 120 move in progress, old vector 131
[  175.242165] IRQ fixup: irq 121 move in progress, old vector 147
[  175.242171] IRQ 124: no longer affine to CPU4
[  175.243435] smpboot: CPU 4 is now offline
[  176.248040] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  176.285328] smpboot: CPU 5 is now offline
[  177.296039] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  177.325067] IRQ 125: no longer affine to CPU6
[  177.326087] smpboot: CPU 6 is now offline
[  178.335036] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  178.377063] IRQ 122: no longer affine to CPU7
[  178.378086] smpboot: CPU 7 is now offline
[  179.388028] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  179.454030] i915_pmu_disable: event->hw.prev_count=21269856967
[  179.470026] i915_pmu_enable: event->hw.prev_count=21269856967
[  179.471110] smpboot: CPU 0 is now offline
[  180.481028] smpboot: Booting Node 0 Processor 0 APIC 0x0
[  180.551075] smpboot: CPU 1 is now offline
[  181.558029] x86: Booting SMP configuration:
[  181.558030] smpboot: Booting Node 0 Processor 1 APIC 0x2
[  181.595096] smpboot: CPU 2 is now offline
[  182.605029] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  182.657084] smpboot: CPU 3 is now offline
[  183.668030] smpboot: Booting Node 0 Processor 3 APIC 0x6
[  183.709017] i915_pmu_disable: event->hw.prev_count=25497358644
[  183.727016] i915_pmu_enable: event->hw.prev_count=25497358644
[  183.728305] smpboot: CPU 4 is now offline
[  184.734027] smpboot: Booting Node 0 Processor 4 APIC 0x1
[  184.767090] smpboot: CPU 5 is now offline
[  185.777036] smpboot: Booting Node 0 Processor 5 APIC 0x3
[  185.823096] smpboot: CPU 6 is now offline
[  186.829051] smpboot: Booting Node 0 Processor 6 APIC 0x5
[  186.856350] smpboot: CPU 7 is now offline
[  187.862051] smpboot: Booting Node 0 Processor 7 APIC 0x7
[  187.871216] [IGT] perf_pmu: exiting, ret=99
[  187.889199] Console: switching to colour frame buffer device 240x67
[  187.889583] i915_pmu_disable: event->hw.prev_count=29754080941

And the result which I got in userspace for this run were
    Executed on rcs0 for 32003587981us
      i915/rcs0-busy/: 2247436461us

After that I decided to roll back the change with counting values which I mentioned before, i.e.:
static void i915_pmu_event_read(struct perf_event *event)
{

	local64_set(&event->count,
		    __i915_pmu_event_read(event) /*-
		    local64_read(&event->hw.prev_count)*/);
}

And I got test PASSED :):
    Executed on rcs0 for 32002282603us                                                                                                                                   
      i915/rcs0-busy/: 31998855052us                                                                                                                                     
    Subtest cpu_online: SUCCESS (33.950s)

At this point I need to go home :). Maybe you will have time to look into this issue? If not, I will continue
tomorrow.

Regards,
Dmitry.




More information about the Intel-gfx mailing list