[Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

Tvrtko Ursulin tvrtko.ursulin at linux.intel.com
Thu Jan 12 18:21:19 UTC 2023



On 11/01/2023 17:52, Matthew Brost wrote:
> On Wed, Jan 11, 2023 at 09:09:45AM +0000, Tvrtko Ursulin wrote:

[snip]

>> Anyway, since you are not buying any arguments on paper, perhaps you are
>> more open to testing. If you were to adapt gem_wsim for Xe, you would be
>> able to spawn N simulated transcode sessions on any Gen11+ machine and
>> try it out.
>>
>> For example:
>>
>> gem_wsim -w benchmarks/wsim/media_load_balance_fhd26u7.wsim -c 36 -r 600
>>
>> That will run 36 parallel transcoding sessions for 600 frames each. No
>> client setup is needed whatsoever apart from compiling IGT.
>>
>> In the past that was quite a handy tool for identifying scheduling issues,
>> or for validating changes against. All workloads with the media prefix have
>> actually been hand-crafted by looking at what real media pipelines do with
>> real data. A few years back, at least.
>>
> 
> Porting this is non-trivial as this is 2.5k. Also, in Xe we are trending
> toward using UMD benchmarks to determine if there are performance problems,
> as in i915 we had tons of microbenchmarks / IGT benchmarks that we found
> meant absolutely nothing. I can't say if this benchmark falls into that
> category.

I explained what it does, so it should have been obvious that it is not a 
microbenchmark.

2.5k what, lines of code? The difficulty of adding Xe support does not scale 
with LOC but with how much of the kernel API it uses. You'd essentially 
need to handle context/engine creation and a different execbuf.
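
To make that concrete, here is a purely illustrative sketch of the seam a
port could introduce. None of these names exist in gem_wsim today; they are
assumptions about where the i915/Xe differences would be isolated:

/*
 * Hypothetical backend vtable for gem_wsim -- illustrative only.
 * The i915 and Xe paths differ mainly in how a context/engine is
 * created and how a batch is submitted, so those two operations
 * are the natural seam for a port.
 */
#include <stdint.h>

struct wsim_backend {
	/* Create a context (i915) or exec queue (Xe) for an engine class. */
	int (*ctx_create)(int drm_fd, unsigned int engine_class);
	/* Submit one batch on the given context; returns a fence or -errno. */
	int (*submit)(int drm_fd, int ctx_id, uint32_t bb_handle);
};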

It's not trivial, no, but it would save you downloading gigabytes of test 
streams, building a bunch of tools and libraries, etc., and so overall, in 
my experience, it *significantly* improves the driver development 
turn-around time.

> We have VK and compute benchmarks running and haven't found any major issues
> yet. The media UMD hasn't been ported because of the VM bind dependency,
> so I can't say if there are any issues with the media UMD + Xe.
> 
> What I can do is hack up xe_exec_threads to really hammer Xe - change it to
> 128x xe_engines + 8k execs per thread. Each exec is super simple, it
> just stores a dword. It creates a thread per hardware engine, so on TGL
> this is 5x threads.
> 
> Results below:
> root@DUT025-TGLU:mbrost# xe_exec_threads --r threads-basic
> IGT-Version: 1.26-ge26de4b2 (x86_64) (Linux: 6.1.0-rc1-xe+ x86_64)
> Starting subtest: threads-basic
> Subtest threads-basic: SUCCESS (1.215s)
> root@DUT025-TGLU:mbrost# dumptrace | grep job | wc
>    40960  491520 7401728
> root@DUT025-TGLU:mbrost# dumptrace | grep engine | wc
>      645    7095   82457
> 
> So with 640 xe_engines (5 of the 645 are VM engines) it takes 1.215 seconds
> of test time to run 40960 execs. That seems to indicate we do not have a
> scheduling problem.
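
For context, the "stores a dword" exec above boils down to a batch of
roughly this shape. A minimal sketch assuming the Gen8+ MI_STORE_DWORD_IMM
encoding, not the actual xe_exec_threads code:

#include <stdint.h>

#define MI_STORE_DWORD_IMM	((0x20 << 23) | 2)	/* 64-bit address form */
#define MI_BATCH_BUFFER_END	(0xa << 23)

/* Emit the whole batch: one dword store, then end of batch. */
static void emit_store_dword(uint32_t *bb, uint64_t gpu_addr, uint32_t value)
{
	int i = 0;

	bb[i++] = MI_STORE_DWORD_IMM;
	bb[i++] = (uint32_t)gpu_addr;		/* address low dword */
	bb[i++] = (uint32_t)(gpu_addr >> 32);	/* address high dword */
	bb[i++] = value;			/* the stored dword */
	bb[i++] = MI_BATCH_BUFFER_END;
}

So each exec really is about as small as a submission can get.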
> 
> This is an 8-core (or at least 8-thread) TGL:
> 
> root@DUT025-TGLU:mbrost# cat /proc/cpuinfo
> ...
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 140
> model name      : 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
> stepping        : 1
> microcode       : 0x3a
> cpu MHz         : 2344.098
> cache size      : 12288 KB
> physical id     : 0
> siblings        : 8
> core id         : 3
> cpu cores       : 4
> ...
> 
> Enough data to be convinced there is no issue with this design? I can
> also hack up Xe to use fewer GPU schedulers with kthreads, but again that
> isn't trivial and doesn't seem necessary based on these results.

Not yet. It's not only about how many somethings per second you can do; 
it is also about the effect it has on the rest of the system.

Anyway, I think you said in a different sub-thread that you will move away 
from system_wq, so we can close this one. With that plan, at least I don't 
have to worry that my mouse will stutter and my audio will glitch while Xe 
is churning away.
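
For illustration, "moving away from system_wq" could be as simple as giving
the scheduler its own workqueue. A minimal sketch under assumed names --
xe_sched_wq and the flag choice here are guesses, not the actual patch:

#include <linux/workqueue.h>

static struct workqueue_struct *xe_sched_wq;

static void xe_sched_run_job_work(struct work_struct *work)
{
	/* Dequeue the next job and hand it to the backend/firmware. */
}

static int xe_sched_wq_init(void)
{
	/*
	 * A dedicated queue isolates scheduler work from unrelated
	 * system_wq users; WQ_UNBOUND lets it run on any idle CPU
	 * rather than pinning to the submitting one.
	 */
	xe_sched_wq = alloc_workqueue("xe_sched", WQ_UNBOUND, 0);
	return xe_sched_wq ? 0 : -ENOMEM;
}

Each scheduler instance would then queue_work(xe_sched_wq, ...) instead of
relying on the shared system_wq.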

Regards,

Tvrtko

