[PATCH 1/3] drm/amdgpu: fix a typo
Marek Olšák
maraeo at gmail.com
Sat Jun 24 00:27:21 UTC 2017
On Fri, Jun 23, 2017 at 3:01 PM, Christian König
<deathsimple at vodafone.de> wrote:
> The key point here is while optimizing this is nice the much bigger pile is
> the locking done for each BO.
>
> In other words even when we optimize all the other locks involved into
> atomics or RCU, the BO reservation lock will still dominate everything.
>
> One possible solution to this would be per process resources like I
> suggested multiple times now.
Mesa can set a per-process resource flag on all resources except
displayable ones. The question is: would it help if an IB contained
1000 per-process resources and 1-2 inter-process sharable ones?
Marek
>
> Christian.
>
>
> Am 23.06.2017 um 13:37 schrieb Marek Olšák:
>>
>> I agree with you about the spinlock. You seem to be good at this.
>>
>> It's always good to do measurements to validate that a code change
>> improves something, especially when the code size and code complexity
>> has to be increased. A CPU profiler such as sysprof can show you
>> improvements on the order of 1/10000th = 0.01% if you record enough
>> samples. Sometimes you have to un-inline a function to make it visible
>> there. If you see a function that takes 0.3% of CPU time and you
>> optimize it down to 0.1% using the profiler as the measurement tool,
>> you have evidence that the improvement is there and nobody can reject
>> the idea anymore. It also proves that the code size increase is worth
>> it. It's always "added code size and loss of simplicity" vs benefit.
>> It's a transaction. You trade one for the other. You lose something to
>> get something else. OK, we know the code complexity. Now, what's the
>> benefit? Can you do some measurements? The accuracy of 1/10000th
>> should be enough for anybody.
>>
>> I know the feeling when you spend many days working on something,
>> adding 100s or 1000s of lines of code, solving many problems to get
>> there and increasing code complexity significantly, and then you do
>> the measurement and it doesn't improve anything. I know the feeling
>> very well. It sucks. The frustration comes from the investment of time
>> and getting no return on the investment. Many frustrations in life are
>> like that.
>>
>> Marek
>>
>>
>> On Fri, Jun 23, 2017 at 4:23 AM, axie <axie at amd.com> wrote:
>>>
>>> Hi Marek,
>>>
>>>
>>> So do you agree that a spinlock disables CPU preemption, contrary to your
>>> original idea?
>>>
>>>
>>> If you have a new reason why this patch does not improve things, please
>>> speak out.
>>>
>>>
>>> Many patches in a GPU driver aim at improving performance and power
>>> efficiency. Do most patches submitted to AMDGPU require benchmarking
>>> first?
>>>
>>> If all developers were required to answer such questions during code
>>> review, I am afraid that most open source community developers could not
>>> meet that requirement and would stop working on AMDGPU.
>>>
>>>
>>> To improve performance, there are many bottlenecks to clear. Only when
>>> the last few bottlenecks are cleared does the performance gain show up
>>> significantly.
>>>
>>> My past profiling experience has told me that removing a lock can improve
>>> performance for some drivers by anywhere from 0.3% to a much bigger
>>> percentage. It depends on many factors, and even on the application
>>> itself.
>>>
>>>
>>> This is not the first bottleneck fixed. This is surely not the last one.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin
>>>
>>>
>>>
>>> On 2017-06-22 07:54 PM, Marek Olšák wrote:
>>>>
>>>> That's all nice, but does it improve performance? Have you been able
>>>> to measure some performance difference with that code? Were you
>>>> targeting a specific inefficiency you had seen e.g. with a CPU
>>>> profiler?
>>>>
>>>> Marek
>>>>
>>>> On Thu, Jun 22, 2017 at 8:19 PM, axie <axie at amd.com> wrote:
>>>>>
>>>>> To clarify, local IRQs are disabled by calling raw_local_irq_save(flags).
>>>>>
>>>>> The function __lock_acquire then double-checks that local IRQs are really
>>>>> disabled.
>>>>>
>>>>>
>>>>>
>>>>> On 2017-06-22 01:34 PM, axie wrote:
>>>>>>
>>>>>> Hi Marek,
>>>>>>
>>>>>> Spin lock and spin unlock are fast, but not as fast as an atomic
>>>>>> operation, which is a single CPU instruction on x86.
>>>>>>
>>>>>>
>>>>>> 1. A spinlock does NOT allow preemption on the local CPU. Let us have a
>>>>>> look at how spin lock is implemented.
>>>>>>
>>>>>> static inline void __raw_spin_lock(raw_spinlock_t *lock)
>>>>>> {
>>>>>>         preempt_disable();    <<<<< This is a memory barrier operation too.
>>>>>>         spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>>>>>>         LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>>>>>> }
>>>>>>
>>>>>> 2. The function __lock_acquire is called on the spinlock path (when
>>>>>> lockdep is enabled). The function is so long that I will not attach all
>>>>>> of it here.
>>>>>>
>>>>>> There is an atomic operation inside, plus 12 metadata updates and 14 if
>>>>>> statements, and it calls quite a few other functions.
>>>>>>
>>>>>> Note that it requires local IRQs to be disabled...
>>>>>>
>>>>>> static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>>>                           int trylock, int read, int check, int hardirqs_off,
>>>>>>                           struct lockdep_map *nest_lock, unsigned long ip,
>>>>>>                           int references, int pin_count)
>>>>>> {
>>>>>>         struct task_struct *curr = current;
>>>>>>         struct lock_class *class = NULL;
>>>>>>         struct held_lock *hlock;
>>>>>>         unsigned int depth;
>>>>>>         int chain_head = 0;
>>>>>>         int class_idx;
>>>>>>         u64 chain_key;
>>>>>>
>>>>>>         if (unlikely(!debug_locks))
>>>>>>                 return 0;
>>>>>>
>>>>>>         /*
>>>>>>          * Lockdep should run with IRQs disabled, otherwise we could
>>>>>>          * get an interrupt which would want to take locks, which would
>>>>>>          * end up in lockdep and have you got a head-ache already?
>>>>>>          */
>>>>>>         if (DEBUG_LOCKS_WARN_ON(!irqs_disabled()))   <<<<< checks that IRQs are already disabled
>>>>>>                 return 0;
>>>>>>
>>>>>>         ....
>>>>>>
>>>>>> 3. Another function called on the spinlock path at a higher level:
>>>>>>
>>>>>> void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>>>                   int trylock, int read, int check,
>>>>>>                   struct lockdep_map *nest_lock, unsigned long ip)
>>>>>> {
>>>>>>         unsigned long flags;
>>>>>>
>>>>>>         if (unlikely(current->lockdep_recursion))
>>>>>>                 return;
>>>>>>
>>>>>>         raw_local_irq_save(flags);
>>>>>>         check_flags(flags);
>>>>>>
>>>>>>         current->lockdep_recursion = 1;
>>>>>>         trace_lock_acquire(lock, subclass, trylock, read, check,
>>>>>>                            nest_lock, ip);
>>>>>>         __lock_acquire(lock, subclass, trylock, read, check,
>>>>>>                        irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
>>>>>>         current->lockdep_recursion = 0;
>>>>>>         raw_local_irq_restore(flags);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex Bin
>>>>>>
>>>>>>
>>>>>> On 2017-06-22 12:27 PM, Marek Olšák wrote:
>>>>>>>
>>>>>>> On Thu, Jun 22, 2017 at 5:33 PM, Xie, AlexBin <AlexBin.Xie at amd.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>>
>>>>>>>> In fact, the change from spinlock to atomic is quite painful. When I
>>>>>>>> started, I thought it was easy, but later I found there might be race
>>>>>>>> conditions here and there. Now I think the change looks more robust.
>>>>>>>> In the kernel source, several other drivers use the same trick.
>>>>>>>>
>>>>>>>> On the other hand, I think the logic itself might be optimized
>>>>>>>> considering the locking. But I have spent quite some effort to
>>>>>>>> maintain the original logic.
>>>>>>>
>>>>>>> It seems quite complicated and I don't know if there is any
>>>>>>> performance benefit. Spinlocks are nice because they allow
>>>>>>> preemption.
>>>>>>>
>>>>>>> It would be more interesting to merge the CS and BO_LIST ioctls into
>>>>>>> one.
>>>>>>>
>>>>>>> Marek
>>>>>>
>>>>>>
>
More information about the amd-gfx mailing list