[BUG] KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]

Mirsad Goran Todorovac mirsad.todorovac at alu.unizg.hr
Thu Aug 24 17:46:52 UTC 2023


Thank you, Christian.

Glad to hear about that.

However, I guess this assumes that this piece of code between

-----<>-----
      preempt_disable();

      tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
      WRITE_ONCE(*tail, node);
      atomic_inc(&queue->job_count);

      /*
       * In case of first element verify new node will be visible to the consumer
       * thread when we ping the kernel thread that there is new work to do.
       */
      smp_wmb();

      preempt_enable();
-----<>-----

... executes only on one CPU/core/thread?

I understood that preempt_disable() disables only interrupts on one core/CPU:

https://kernelnewbies.kernelnewbies.narkive.com/6LTlgsAe/preempt-disable-disables-preemption-on-all-processors

So, we might have a race in theory between WRITE_ONCE() and atomic_inc().

Kind regards,
Mirsad


On 8/21/2023 8:22 PM, Christian König wrote:
> I'm not sure about that.
> 
> On the one hand it might generate some noise. I know tons of cases where logic is: Ok if we see the updated value immediately it will optimize things, but if not it's unproblematic because there is another check after the next memory barrier.
> 
> On the other hand we probably have cases where this is not correctly implemented. So double checking those would most like be good idea.
> 
> Regards,
> Christian.
> 
> Am 21.08.23 um 16:28 schrieb Mirsad Todorovac:
>> Hi Christian,
>>
>> Thank you for the update.
>>
>> Should I continue reporting what KCSAN gives? I will try to filter these to save your time for
>> evaluation ...
>>
>> Kind regards,
>> Mirsad
>>
>> On 8/21/23 15:20, Christian König wrote:
>>> Hi Mirsad,
>>>
>>> well this is a false positive.
>>>
>>> That drm_sched_entity_is_ready() doesn't see the data written by drm_sched_entity_push_job() is part of the logic here.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 18.08.23 um 15:44 schrieb Mirsad Todorovac:
>>>> On 8/17/23 21:54, Mirsad Todorovac wrote:
>>>>> Hi,
>>>>>
>>>>> This is your friendly bug reporter.
>>>>>
>>>>> The environment is vanilla torvalds tree kernel on Ubuntu 22.04 LTS and a Ryzen 7950X box.
>>>>>
>>>>> Please find attached the complete dmesg output from the ring buffer and lshw output.
>>>>>
>>>>> NOTE: The kernel reports tainted kernel, but to my knowledge there are no proprietary (G) modules,
>>>>>        but this taint is turned on by the previous bugs.
>>>>>
>>>>> dmesg excerpt:
>>>>>
>>>>> [ 8791.864576] ==================================================================
>>>>> [ 8791.864648] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>>>>>
>>>>> [ 8791.864776] write (marked) to 0xffff9b74491b7c40 of 8 bytes by task 3807 on cpu 18:
>>>>> [ 8791.864788]  drm_sched_entity_push_job+0xf4/0x2a0 [gpu_sched]
>>>>> [ 8791.864852]  amdgpu_cs_ioctl+0x3888/0x3de0 [amdgpu]
>>>>> [ 8791.868731]  drm_ioctl_kernel+0x127/0x210 [drm]
>>>>> [ 8791.869222]  drm_ioctl+0x38f/0x6f0 [drm]
>>>>> [ 8791.869711]  amdgpu_drm_ioctl+0x7e/0xe0 [amdgpu]
>>>>> [ 8791.873660]  __x64_sys_ioctl+0xd2/0x120
>>>>> [ 8791.873676]  do_syscall_64+0x58/0x90
>>>>> [ 8791.873688]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
>>>>>
>>>>> [ 8791.873710] read to 0xffff9b74491b7c40 of 8 bytes by task 1119 on cpu 27:
>>>>> [ 8791.873722]  drm_sched_entity_is_ready+0x16/0x50 [gpu_sched]
>>>>> [ 8791.873786]  drm_sched_select_entity+0x1c7/0x220 [gpu_sched]
>>>>> [ 8791.873849]  drm_sched_main+0xd2/0x500 [gpu_sched]
>>>>> [ 8791.873912]  kthread+0x18b/0x1d0
>>>>> [ 8791.873924]  ret_from_fork+0x43/0x70
>>>>> [ 8791.873939]  ret_from_fork_asm+0x1b/0x30
>>>>>
>>>>> [ 8791.873955] value changed: 0x0000000000000000 -> 0xffff9b750ebcfc00
>>>>>
>>>>> [ 8791.873971] Reported by Kernel Concurrency Sanitizer on:
>>>>> [ 8791.873980] CPU: 27 PID: 1119 Comm: gfx_0.0.0 Tainted: G             L 6.5.0-rc6-net-cfg-kcsan-00038-g16931859a650 #35
>>>>> [ 8791.873994] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>>>> [ 8791.874002] ==================================================================
>>>>
>>>> P.S.
>>>>
>>>> According to Mr. Heo's instructions, I am adding the unwound trace here:
>>>>
>>>> [ 1879.706518] ==================================================================
>>>> [ 1879.706616] BUG: KCSAN: data-race in drm_sched_entity_is_ready [gpu_sched] / drm_sched_entity_push_job [gpu_sched]
>>>>
>>>> [ 1879.706737] write (marked) to 0xffff8f3672748c40 of 8 bytes by task 4087 on cpu 10:
>>>> [ 1879.706748] drm_sched_entity_push_job (./include/drm/spsc_queue.h:74 drivers/gpu/drm/scheduler/sched_entity.c:574) gpu_sched
>>>> [ 1879.706808] amdgpu_cs_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1375 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1469) amdgpu
>>>> [ 1879.710589] drm_ioctl_kernel (drivers/gpu/drm/drm_ioctl.c:788) drm
>>>> [ 1879.711068] drm_ioctl (drivers/gpu/drm/drm_ioctl.c:892) drm
>>>> [ 1879.711551] amdgpu_drm_ioctl (drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:2748) amdgpu
>>>> [ 1879.715319] __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:870 fs/ioctl.c:856 fs/ioctl.c:856)
>>>> [ 1879.715334] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
>>>> [ 1879.715345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
>>>>
>>>> [ 1879.715365] read to 0xffff8f3672748c40 of 8 bytes by task 1098 on cpu 11:
>>>> [ 1879.715376] drm_sched_entity_is_ready (drivers/gpu/drm/scheduler/sched_entity.c:134) gpu_sched
>>>> [ 1879.715435] drm_sched_select_entity (drivers/gpu/drm/scheduler/sched_main.c:248 drivers/gpu/drm/scheduler/sched_main.c:893) gpu_sched
>>>> [ 1879.715495] drm_sched_main (drivers/gpu/drm/scheduler/sched_main.c:1019) gpu_sched
>>>> [ 1879.715554] kthread (kernel/kthread.c:389)
>>>> [ 1879.715563] ret_from_fork (arch/x86/kernel/process.c:145)
>>>> [ 1879.715575] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
>>>>
>>>> [ 1879.715590] value changed: 0x0000000000000000 -> 0xffff8f360663dc00
>>>>
>>>> [ 1879.715604] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 1879.715612] CPU: 11 PID: 1098 Comm: gfx_0.0.0 Tainted: G             L     6.5.0-rc6+ #47
>>>> [ 1879.715624] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>>> [ 1879.715631] ==================================================================
>>>>
>>>> It seems that the line in question might be:
>>>>
>>>>     first = spsc_queue_push(&entity->job_queue, &sched_job->queue_node);
>>>>
>>>> which expands to:
>>>>
>>>> static inline bool spsc_queue_push(struct spsc_queue *queue, struct spsc_node *node)
>>>> {
>>>>     struct spsc_node **tail;
>>>>
>>>>     node->next = NULL;
>>>>
>>>>     preempt_disable();
>>>>
>>>>     tail = (struct spsc_node **)atomic_long_xchg(&queue->tail, (long)&node->next);
>>>>     WRITE_ONCE(*tail, node);
>>>>     atomic_inc(&queue->job_count);
>>>>
>>>>     /*
>>>>      * In case of first element verify new node will be visible to the consumer
>>>>      * thread when we ping the kernel thread that there is new work to do.
>>>>      */
>>>>     smp_wmb();
>>>>
>>>>     preempt_enable();
>>>>
>>>>     return tail == &queue->head;
>>>> }
>>>>
>>>> According to the manual, preempt_disable() only guaranteed exclusion on a single CPU/core/thread, so
>>>> we might be plagued with the slow, old fashioned locking unless anyone had a better idea.
>>>>
>>>> Best regards,
>>>> Mirsad Todorovac

-- 
Mirsad Todorovac
Sistem inženjer
Grafički fakultet | Akademija likovnih umjetnosti
Sveučilište u Zagrebu

System engineer
Faculty of Graphic Arts | Academy of Fine Arts
University of Zagreb, Republic of Croatia
tel. +385 (0)1 3711 451
mob. +385 91 57 88 355


More information about the dri-devel mailing list