[Intel-gfx] [PATCH 2/2] drm/i915/pmu: Fix CPU hotplug with multiple GPUs
Tvrtko Ursulin
tvrtko.ursulin at linux.intel.com
Tue Oct 20 12:33:12 UTC 2020
On 20/10/2020 13:10, Chris Wilson wrote:
> Quoting Chris Wilson (2020-10-20 12:59:57)
>> Quoting Tvrtko Ursulin (2020-10-20 11:08:22)
>>> From: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>>
>>> Since we keep a driver global mask of online CPUs and base the decision
>>> whether PMU needs to be migrated upon it, we need to make sure the
>>> migration is done for all registered PMUs (so GPUs).
>>>
>>> To do this we need to track the current CPU for each PMU and base the
>>> decision on whether to migrate on a comparison between global and local
>>> state.
>>>
>>> At the same time, since dynamic CPU hotplug notification slots are a
>>> scarce resource and given how we already register the multi instance type
>>> state, we can and should add multiple instances of the i915 PMU to this
>>> same state and not allocate a new one for every GPU.
>>>
>>> v2:
>>> * Use pr_notice. (Chris)
>>>
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin at intel.com>
>>> Suggested-by: Daniel Vetter <daniel.vetter at intel.com> # dynamic slot optimisation
>>> Cc: Chris Wilson <chris at chris-wilson.co.uk>
>>> ---
>>> drivers/gpu/drm/i915/i915_pci.c | 7 ++++-
>>> drivers/gpu/drm/i915/i915_pmu.c | 50 ++++++++++++++++++++-------------
>>> drivers/gpu/drm/i915/i915_pmu.h | 6 +++-
>>> 3 files changed, 41 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
>>> index 27964ac0638a..a384f51c91c1 100644
>>> --- a/drivers/gpu/drm/i915/i915_pci.c
>>> +++ b/drivers/gpu/drm/i915/i915_pci.c
>>> @@ -1150,9 +1150,13 @@ static int __init i915_init(void)
>>> return 0;
>>> }
>>>
>>> + i915_pmu_init();
>>> +
>>> err = pci_register_driver(&i915_pci_driver);
>>> - if (err)
>>> + if (err) {
>>> + i915_pmu_exit();
>>> return err;
>>> + }
>>>
>>> i915_perf_sysctl_register();
>>> return 0;
>>> @@ -1166,6 +1170,7 @@ static void __exit i915_exit(void)
>>> i915_perf_sysctl_unregister();
>>> pci_unregister_driver(&i915_pci_driver);
>>> i915_globals_exit();
>>> + i915_pmu_exit();
>>> }
>>>
>>> module_init(i915_init);
>>> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
>>> index 51ed7d0efcdc..0d6c0945621e 100644
>>> --- a/drivers/gpu/drm/i915/i915_pmu.c
>>> +++ b/drivers/gpu/drm/i915/i915_pmu.c
>>> @@ -30,6 +30,7 @@
>>> #define ENGINE_SAMPLE_BITS (1 << I915_PMU_SAMPLE_BITS)
>>>
>>> static cpumask_t i915_pmu_cpumask;
>>> +static unsigned int i915_pmu_target_cpu = -1;
>>>
>>> static u8 engine_config_sample(u64 config)
>>> {
>>> @@ -1049,25 +1050,32 @@ static int i915_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
>>> static int i915_pmu_cpu_offline(unsigned int cpu, struct hlist_node *node)
>>> {
>>> struct i915_pmu *pmu = hlist_entry_safe(node, typeof(*pmu), cpuhp.node);
>>> - unsigned int target;
>>> + unsigned int target = i915_pmu_target_cpu;
>>
>> So we still have multiple callbacks, one per pmu. But each callback is
>> now stored in a list from the cpuhp_slot instead of each callback having
>> its own slot.
>>
>>>
>>> GEM_BUG_ON(!pmu->base.event_init);
>>>
>>> if (cpumask_test_and_clear_cpu(cpu, &i915_pmu_cpumask)) {
>>
>> On first callback...
>>
>>> target = cpumask_any_but(topology_sibling_cpumask(cpu), cpu);
>>
>> Pick any other cpu.
>>
>>> +
>>> /* Migrate events if there is a valid target */
>>> if (target < nr_cpu_ids) {
>>> cpumask_set_cpu(target, &i915_pmu_cpumask);
>>> - perf_pmu_migrate_context(&pmu->base, cpu, target);
>>> + i915_pmu_target_cpu = target;
>>
>> Store target for all callbacks.
>>
>>> }
>>> }
>>>
>>> + if (target < nr_cpu_ids && target != pmu->cpuhp.cpu) {
>>
>> If global [i915_pmu_target_cpu] target has changed, update perf.
>>
>>> + perf_pmu_migrate_context(&pmu->base, cpu, target);
>>> + pmu->cpuhp.cpu = target;
>>
>> It is claimed that cpuhp_state_remove_instance() will call the offline
>> callback for all online cpus... Do we need a pmu->base.state != STOPPED
>> guard?
>
> s/claimed/it definitely does :)/
>
> Or rather pmu->closed.
Hm, why? Do you think perf_pmu_migrate_context() accesses something in the
PMU outside of the already protected entry points?
Regards,
Tvrtko
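
For anyone following the logic without the tree at hand, here is a rough userspace model of the offline callback as discussed above. It is only a sketch under stated assumptions, not the driver code: struct model_pmu, model_cpu_offline(), any_online_but(), NR_CPUS and the plain bitmask are hypothetical stand-ins for struct i915_pmu, i915_pmu_cpu_offline(), cpumask_any_but(), nr_cpu_ids and cpumask_t. The target is picked from a caller-supplied online mask rather than topology_sibling_cpumask(cpu), and the migration itself is reduced to a counter.

```c
#include <assert.h>

#define NR_CPUS 4u                     /* hypothetical CPU count for the model */
#define NO_CPU  ((unsigned int)-1)     /* mirrors the "= -1" initialiser in the patch */

/* Driver-global state, shared by every PMU instance. */
static unsigned int pmu_cpumask;              /* bit n set: events live on CPU n */
static unsigned int pmu_target_cpu = NO_CPU;  /* last migration target chosen */

/* Per-GPU state; 'cpu' stands in for pmu->cpuhp.cpu. */
struct model_pmu {
	unsigned int cpu;
	unsigned int migrations;       /* counts modelled migrate calls */
};

/* Simplified stand-in for cpumask_any_but(): any online CPU except 'cpu'. */
static unsigned int any_online_but(unsigned int online, unsigned int cpu)
{
	for (unsigned int i = 0; i < NR_CPUS; i++)
		if (i != cpu && (online & (1u << i)))
			return i;
	return NR_CPUS;                /* no valid target */
}

/* Called once per registered PMU instance when 'cpu' goes offline. */
static void model_cpu_offline(struct model_pmu *pmu, unsigned int cpu,
			      unsigned int online)
{
	unsigned int target = pmu_target_cpu;

	assert(cpu < NR_CPUS);

	/* Only the first callback sees the bit set and picks a new target. */
	if (pmu_cpumask & (1u << cpu)) {
		pmu_cpumask &= ~(1u << cpu);
		target = any_online_but(online, cpu);
		if (target < NR_CPUS) {
			pmu_cpumask |= 1u << target;
			pmu_target_cpu = target;   /* visible to later callbacks */
		}
	}

	/* Every instance compares its local CPU against the global target. */
	if (target < NR_CPUS && target != pmu->cpu) {
		pmu->migrations++;             /* models the context migration */
		pmu->cpu = target;
	}
}
```

With two instances both starting on CPU 0 (an assumption of the model), taking CPU 0 offline runs the callback once per instance: the first call picks the target and migrates, the second only migrates. A repeated call for an instance already on the target does nothing, which is the comparison between global and local state that the commit message describes.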