[PATCH 1/3] drm/amdgpu: fix a typo

Fri Jun 23 06:57:03 UTC 2017

Hi Alex,

actually Marek is right, command submission is actually not much of a 
bottleneck to us because it is handled from a separate userspace thread.

So those micro optimizations you do here on CPU cycles are actually 
rather superfluous.

But giving the CS IOCTL an option for directly specifying the BOs 
instead of a BO list like Marek suggested would indeed save us some time 
here.

Regards,
Christian.

Am 23.06.2017 um 04:23 schrieb axie:
> Hi Marek,
>
>
> So do you agree that spinlock disables CPU preemption, contrary to 
> your original idea?
>
>
> If you have new reason that this patch does not improve, please speak 
> out.
>
>
> Many patches in GPU driver aim at improving performance and power 
> efficiency. Does most patches submitted in AMDGPU requires a 
> benchmarking first?
>
> If all developers are required to always answer your questions when 
> code review, I am afraid that most open source community developers 
> cannot meet that requirement and stop working on AMDGPU.
>
>
> To improve performance, there are many bottlenecks to clear. When the 
> last several bottlenecks are clear, the performance will show faster 
> more significantly.
>
> My pass profiling experience told me that clearing a lock can improve 
> performance for some driver like 0.3% to much bigger percentage. It 
> depends on many factors, even depends on the application itself.
>
>
> This is not the first bottleneck fixed. This is surely not the last one.
>
>
> Thanks,
>
> Alex Bin
>
>
> On 2017-06-22 07:54 PM, Marek Olšák wrote:
>> That's all nice, but does it improve performance? Have you been able
>> to measure some performance difference with that code? Were you
>> targeting a specific inefficiency you had seen e.g. with a CPU
>> profiler?
>>
>> Marek
>>
>> On Thu, Jun 22, 2017 at 8:19 PM, axie <axie at amd.com> wrote:
>>> To clarify, local IRQ is disabled by calling raw_local_irq_save(flags);
>>>
>>> Function __lock_acquire double checks that the local IRQ is really 
>>> disabled.
>>>
>>>
>>>
>>> On 2017-06-22 01:34 PM, axie wrote:
>>>> Hi Marek,
>>>>
>>>> Spin lock and spin unlock is fast. But it is not so fast compared with
>>>> atomic, which is a single CPU instruction in x86.
>>>>
>>>>
>>>> 1. spinlock does NOT allow preemption at local CPU. Let us have a 
>>>> look at
>>>> how spin lock was implemented.
>>>>
>>>> static inline void __raw_spin_lock(raw_spinlock_t *lock)
>>>> {
>>>>      preempt_disable(); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<--This is
>>>> memory barrier operation too.
>>>>      spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>>>>      LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>>>> }
>>>>
>>>> 2.  A function  __lock_acquire called by spinlock. The function is 
>>>> so long
>>>> that I would not attach all of it here.
>>>>
>>>> There is atomic operation inside and 12 meta data updates and 14 if
>>>> statements and it calls quite some other functions.
>>>>
>>>> Note that it disable IRQ...
>>>>
>>>> static int __lock_acquire(struct lockdep_map *lock, unsigned int 
>>>> subclass,
>>>>                int trylock, int read, int check, int hardirqs_off,
>>>>                struct lockdep_map *nest_lock, unsigned long ip,
>>>>                int references, int pin_count)
>>>> {
>>>>      struct task_struct *curr = current;
>>>>      struct lock_class *class = NULL;
>>>>      struct held_lock *hlock;
>>>>      unsigned int depth;
>>>>      int chain_head = 0;
>>>>      int class_idx;
>>>>      u64 chain_key;
>>>>
>>>>      if (unlikely(!debug_locks))
>>>>          return 0;
>>>>
>>>>      /*
>>>>       * Lockdep should run with IRQs disabled, otherwise we could
>>>>       * get an interrupt which would want to take locks, which would
>>>>       * end up in lockdep and have you got a head-ache already?
>>>>       */
>>>>      if (DEBUG_LOCKS_WARN_ON(!irqs_disabled())) 
>>>> <<<<<<<<<<<<<<<Disable IRQ
>>>>          return 0;
>>>>
>>>> ....
>>>>
>>>> 3. Another function called by spinlock in a higher level:
>>>>
>>>> void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>
>>>>                int trylock, int read, int check,
>>>>                struct lockdep_map *nest_lock, unsigned long ip)
>>>> {
>>>>      unsigned long flags;
>>>>
>>>>      if (unlikely(current->lockdep_recursion))
>>>>          return;
>>>>
>>>>      raw_local_irq_save(flags);
>>>>      check_flags(flags);
>>>>
>>>>      current->lockdep_recursion = 1;
>>>>      trace_lock_acquire(lock, subclass, trylock, read, check, 
>>>> nest_lock,
>>>> ip);
>>>>      __lock_acquire(lock, subclass, trylock, read, check,
>>>>                 irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
>>>>      current->lockdep_recursion = 0;
>>>>      raw_local_irq_restore(flags);
>>>> }
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Alex Bin
>>>>
>>>>
>>>> On 2017-06-22 12:27 PM, Marek Olšák wrote:
>>>>> On Thu, Jun 22, 2017 at 5:33 PM, Xie, AlexBin <AlexBin.Xie at amd.com>
>>>>> wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>>
>>>>>> In fact, the change from spinlock to atomic is quite painful. When I
>>>>>> started, I thought it was easy but later I found there might be race
>>>>>> condition here and there. Now I think the change looks more 
>>>>>> robust. In
>>>>>> kernel source, there are several other drivers used the same trick.
>>>>>>
>>>>>>
>>>>>> On the other hand, I think the logic itself might be optimized
>>>>>> considering
>>>>>> the locking. But I had spent quite some effort to maintain original
>>>>>> logic.
>>>>> It seems quite complicated and I don't know if there is any
>>>>> performance benefit. Spinlocks are nice because they allow 
>>>>> preemption.
>>>>>
>>>>> It would be more interesting to merge the CS and BO_LIST ioctls 
>>>>> into one.
>>>>>
>>>>> Marek
>>>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx