Slow memory access when using OpenCL without X11
Lauri Ehrenpreis
laurioma at gmail.com
Thu Mar 14 16:41:14 UTC 2019
Yes, it affects it a bit, but it doesn't bring the speed back up to the
"normal" level. I got the best results with "profile_peak": then the memcpy
speed on the CPU is about 1/3 of what it is without OpenCL initialization:
echo "profile_peak" >
/sys/class/drm/card0/device/power_dpm_force_performance_level
./cl_slow_test 1 5
got 1 platforms 1 devices
speed 3710.360352 avg 3710.360352 mbytes/s
speed 3713.660400 avg 3712.010254 mbytes/s
speed 3797.630859 avg 3740.550537 mbytes/s
speed 3708.004883 avg 3732.414062 mbytes/s
speed 3796.403076 avg 3745.211914 mbytes/s
Without calling clCreateContext:
./cl_slow_test 0 5
speed 7299.201660 avg 7299.201660 mbytes/s
speed 9298.841797 avg 8299.021484 mbytes/s
speed 9360.181641 avg 8652.742188 mbytes/s
speed 9004.759766 avg 8740.746094 mbytes/s
speed 9414.607422 avg 8875.518555 mbytes/s
--
Lauri
On Thu, Mar 14, 2019 at 5:46 PM Ernst Sjöstrand <ernstp at gmail.com> wrote:
> Does
> echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level
> or setting cpu scaling governor to performance affect it at all?
>
> Regards
> //Ernst
>
> On Thu, Mar 14, 2019 at 2:31 PM Lauri Ehrenpreis <laurioma at gmail.com> wrote:
> >
> > I also tried with these two boards:
> > https://www.asrock.com/MB/AMD/Fatal1ty%20B450%20Gaming-ITXac/index.asp
> > https://www.msi.com/Motherboard/B450I-GAMING-PLUS-AC
> >
> > Both are using the latest BIOS, Ubuntu 18.10, and kernel
> > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0.2/
> >
> > There are some differences in dmesg (the ASRock board has an amdgpu
> > assert in dmesg) but otherwise the results are exactly the same.
> > In a desktop environment cl_slow_test runs fast; over an ssh terminal it
> > doesn't. If I move the mouse, then it starts running fast in the terminal
> > as well.
> >
> > So one can't use OpenCL without a monitor and a desktop environment
> > running, and this happens with two different chipsets (B350 & B450), the
> > latest BIOS from three different vendors, the latest kernel, and the
> > latest ROCm. This doesn't look like an edge case with an unusual setup to
> > me.
> >
> > Attached dmesg, dmidecode, and clinfo from both boards.
> >
> > --
> > Lauri
> >
> > On Wed, Mar 13, 2019 at 10:15 PM Lauri Ehrenpreis <laurioma at gmail.com> wrote:
> >>
> >> For reproduction, only the tiny cl_slow_test.cpp is needed, which is
> >> attached to the first e-mail.
> >>
> >> System information is as follows:
> >> CPU: Ryzen 5 2400G
> >> Main board: Gigabyte AMD B450 AORUS mini-ITX:
> >> https://www.gigabyte.com/Motherboard/B450-I-AORUS-PRO-WIFI-rev-10#kf
> >> BIOS: F5 8.47 MB 2019/01/25 (latest)
> >> Kernel: https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/ (amd64)
> >> OS: Ubuntu 18.04 LTS
> >> rocm-opencl-dev installation:
> >> wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
> >> echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
> >> sudo apt install rocm-opencl-dev
> >>
> >> Also exactly the same issue happens with this board:
> >> https://www.gigabyte.com/Motherboard/GA-AB350-Gaming-3-rev-1x#kf
> >>
> >> I have MSI and ASRock mini-ITX boards ready as well. So far I didn't get
> >> amdgpu & OpenCL working there, but I'll try again tomorrow.
> >>
> >> --
> >> Lauri
> >>
> >>
> >> On Wed, Mar 13, 2019 at 8:51 PM Kuehling, Felix <Felix.Kuehling at amd.com> wrote:
> >>>
> >>> Hi Lauri,
> >>>
> >>> I still think the SMU is doing something funny, but rocm-smi isn't
> >>> showing enough information to really see what's going on.
> >>>
> >>> On APUs the SMU firmware is embedded in the system BIOS. Unlike discrete
> >>> GPUs, the SMU firmware is not loaded by the driver. You could try
> >>> updating your system BIOS to the latest version available from your main
> >>> board vendor and see if that makes a difference. It may include a newer
> >>> version of the SMU firmware, potentially with a fix.
> >>>
> >>> If that doesn't help, we'd have to reproduce the problem in house to see
> >>> what's happening, which may require the same main board and BIOS version
> >>> you're using. We can ask our SMU firmware team if they've ever
> >>> encountered your type of problem. But I don't want to give you too much
> >>> hope. It's a tricky problem involving HW, firmware and multiple driver
> >>> components in a fairly unusual configuration.
> >>>
> >>> Regards,
> >>> Felix
> >>>
> >>> On 2019-03-13 7:28 a.m., Lauri Ehrenpreis wrote:
> >>> > What I observe is that moving the mouse made the memory speed go up
> >>> > and also it made mclk=1200Mhz in rocm-smi output.
> >>> > However if I force mclk to 1200Mhz myself then memory speed is still
> >>> > slow.
> >>> >
> >>> > So rocm-smi output when memory speed went fast due to mouse movement:
> >>> > rocm-smi
> >>> > ======================== ROCm System Management Interface
> >>> > ========================
> >>> >
> ================================================================================================
> >>> > GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf
> >>> > PwrCap SCLK OD MCLK OD GPU%
> >>> > GPU[0] : WARNING: Empty SysFS value: pclk
> >>> > GPU[0] : WARNING: Unable to read
> >>> > /sys/class/drm/card0/device/gpu_busy_percent
> >>> > 0 44.0c N/A 400Mhz 1200Mhz N/A 0% manual N/A
> >>> > 0% 0% N/A
> >>> >
> ================================================================================================
> >>> > ======================== End of ROCm SMI Log
> >>> > ========================
> >>> >
> >>> > And rocm-smi output when I forced memclk=1200MHz myself:
> >>> > rocm-smi --setmclk 2
> >>> > rocm-smi
> >>> > ======================== ROCm System Management Interface
> >>> > ========================
> >>> >
> ================================================================================================
> >>> > GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf
> >>> > PwrCap SCLK OD MCLK OD GPU%
> >>> > GPU[0] : WARNING: Empty SysFS value: pclk
> >>> > GPU[0] : WARNING: Unable to read
> >>> > /sys/class/drm/card0/device/gpu_busy_percent
> >>> > 0 39.0c N/A 400Mhz 1200Mhz N/A 0% manual N/A
> >>> > 0% 0% N/A
> >>> >
> ================================================================================================
> >>> > ======================== End of ROCm SMI Log
> >>> > ========================
> >>> >
> >>> > So the only difference is that the temperature shows 44C when the
> >>> > memory speed was fast and 39C when it was slow. But mclk was 1200MHz
> >>> > and sclk was 400MHz in both cases.
> >>> > Can it be that rocm-smi just has a bug in reporting, and mclk was not
> >>> > actually 1200MHz when I forced it with rocm-smi --setmclk 2?
> >>> > That would explain the different behaviour.
> >>> >
> >>> > If so, then is there a programmatic way to really guarantee the
> >>> > high-speed mclk? Basically I want to do in my program something
> >>> > similar to what happens when I move the mouse in a desktop
> >>> > environment, and this way guarantee the normal memory speed each
> >>> > time the program starts.
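A possible sketch of such a workaround, assuming root access and that pinning the DPM level via sysfs is sufficient (per the profile_peak results at the top of the thread, it may only partially help; this is not a confirmed fix):

```shell
# Hypothetical wrapper (not a confirmed fix): pin the DPM performance
# level while a compute job runs, then restore the previous level.
# Assumes root access and amdgpu's sysfs interface.
DPM="${DPM:-/sys/class/drm/card0/device/power_dpm_force_performance_level}"

run_pinned() {
    old=$(cat "$DPM")
    echo profile_peak > "$DPM"
    "$@"                 # the compute job, e.g. ./cl_slow_test 1 5
    rc=$?
    echo "$old" > "$DPM"
    return "$rc"
}
```

Usage would be e.g. `run_pinned ./cl_slow_test 1 5`.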
> >>> >
> >>> > --
> >>> > Lauri
> >>> >
> >>> >
> >>> > On Tue, Mar 12, 2019 at 11:36 PM Deucher, Alexander
> >>> > <Alexander.Deucher at amd.com <mailto:Alexander.Deucher at amd.com>>
> wrote:
> >>> >
> >>> > Forcing the sclk and mclk high may impact the CPU frequency since
> >>> > they share TDP.
> >>> >
> >>> > Alex
> >>> >
> ------------------------------------------------------------------------
> >>> > *From:* amd-gfx <amd-gfx-bounces at lists.freedesktop.org
> >>> > <mailto:amd-gfx-bounces at lists.freedesktop.org>> on behalf of
> Lauri
> >>> > Ehrenpreis <laurioma at gmail.com <mailto:laurioma at gmail.com>>
> >>> > *Sent:* Tuesday, March 12, 2019 5:31 PM
> >>> > *To:* Kuehling, Felix
> >>> > *Cc:* Tom St Denis; amd-gfx at lists.freedesktop.org
> >>> > <mailto:amd-gfx at lists.freedesktop.org>
> >>> > *Subject:* Re: Slow memory access when using OpenCL without X11
> >>> > However it's not only related to mclk and sclk. I tried this:
> >>> > rocm-smi --setsclk 2
> >>> > rocm-smi --setmclk 3
> >>> > rocm-smi
> >>> > ======================== ROCm System Management Interface
> >>> > ========================
> >>> >
> ================================================================================================
> >>> > GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf
> >>> > PwrCap SCLK OD MCLK OD GPU%
> >>> > GPU[0] : WARNING: Empty SysFS value: pclk
> >>> > GPU[0] : WARNING: Unable to read
> >>> > /sys/class/drm/card0/device/gpu_busy_percent
> >>> > 0 34.0c N/A 1240Mhz 1333Mhz N/A 0%
> >>> > manual N/A 0% 0% N/A
> >>> >
> ================================================================================================
> >>> > ======================== End of ROCm SMI Log
> >>> > ========================
> >>> >
> >>> > ./cl_slow_test 1
> >>> > got 1 platforms 1 devices
> >>> > speed 3919.777100 avg 3919.777100 mbytes/s
> >>> > speed 3809.373291 avg 3864.575195 mbytes/s
> >>> > speed 585.796814 avg 2771.649170 mbytes/s
> >>> > speed 188.721848 avg 2125.917236 mbytes/s
> >>> > speed 188.916367 avg 1738.517090 mbytes/s
> >>> >
> >>> > So despite forcing max sclk and mclk, the memory speed is still slow.
> >>> >
> >>> > --
> >>> > Lauri
> >>> >
> >>> >
> >>> > On Tue, Mar 12, 2019 at 11:21 PM Lauri Ehrenpreis
> >>> > <laurioma at gmail.com <mailto:laurioma at gmail.com>> wrote:
> >>> >
> >>> > In the case when memory is slow, rocm-smi outputs this:
> >>> > ======================== ROCm System Management
> >>> > Interface ========================
> >>> >
> ================================================================================================
> >>> > GPU Temp AvgPwr SCLK MCLK PCLK Fan
> >>> > Perf PwrCap SCLK OD MCLK OD GPU%
> >>> > GPU[0] : WARNING: Empty SysFS value: pclk
> >>> > GPU[0] : WARNING: Unable to read
> >>> > /sys/class/drm/card0/device/gpu_busy_percent
> >>> > 0 30.0c N/A 400Mhz 933Mhz N/A 0%
> >>> > auto N/A 0% 0% N/A
> >>> >
> ================================================================================================
> >>> > ======================== End of ROCm SMI Log
> >>> > ========================
> >>> >
> >>> > The normal memory speed case gives the following:
> >>> > ======================== ROCm System Management
> >>> > Interface ========================
> >>> >
> ================================================================================================
> >>> > GPU Temp AvgPwr SCLK MCLK PCLK Fan
> >>> > Perf PwrCap SCLK OD MCLK OD GPU%
> >>> > GPU[0] : WARNING: Empty SysFS value: pclk
> >>> > GPU[0] : WARNING: Unable to read
> >>> > /sys/class/drm/card0/device/gpu_busy_percent
> >>> > 0 35.0c N/A 400Mhz 1200Mhz N/A 0%
> >>> > auto N/A 0% 0% N/A
> >>> >
> ================================================================================================
> >>> > ======================== End of ROCm SMI Log
> >>> > ========================
> >>> >
> >>> > So there is a difference in MCLK - can this cause such a huge
> >>> > slowdown?
> >>> >
> >>> > --
> >>> > Lauri
> >>> >
> >>> > On Tue, Mar 12, 2019 at 6:39 PM Kuehling, Felix
> >>> > <Felix.Kuehling at amd.com <mailto:Felix.Kuehling at amd.com>>
> wrote:
> >>> >
> >>> > [adding the list back]
> >>> >
> >>> > I'd suspect a problem related to memory clock. This is an APU where
> >>> > system memory is shared with the CPU, so if the SMU changes memory
> >>> > clocks, that would affect CPU memory access performance. If the
> >>> > problem only occurs when OpenCL is running, then the compute power
> >>> > profile could have an effect here.
> >>> >
> >>> > Lauri, can you monitor the clocks during your tests using rocm-smi?
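Polling can also be scripted directly from sysfs; a minimal sketch, assuming amdgpu's pp_dpm_mclk file, where the active level is marked with '*':

```shell
# Hypothetical helper: report the currently active memory-clock DPM
# level. Assumes amdgpu exposes pp_dpm_mclk with the active level
# marked by a trailing '*'. Run it in a loop next to the test, e.g.:
#   while sleep 1; do active_mclk; done
MCLK_FILE="${MCLK_FILE:-/sys/class/drm/card0/device/pp_dpm_mclk}"

active_mclk() {
    grep '\*' "$MCLK_FILE"
}
```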
> >>> >
> >>> > Regards,
> >>> > Felix
> >>> >
> >>> > On 2019-03-11 1:15 p.m., Tom St Denis wrote:
> >>> > > Hi Lauri,
> >>> > >
> >>> > > I don't have ROCm installed locally (not on that team at AMD)
> >>> > > but I can rope in some of the KFD folk and see what they say :-).
> >>> > >
> >>> > > (In the meantime I should look into installing the ROCm stack
> >>> > > on my Ubuntu disk for experimentation...)
> >>> > >
> >>> > > Only other thing that comes to mind is some sort of stutter
> >>> > > due to power/clock gating (or gfx off/etc). But that typically
> >>> > > affects the display/gpu side, not the CPU side.
> >>> > >
> >>> > > Felix: Any known issues with Raven and ROCm interacting over
> >>> > > memory bus performance?
> >>> > >
> >>> > > Tom
> >>> > >
> >>> > > On Mon, Mar 11, 2019 at 12:56 PM Lauri Ehrenpreis
> >>> > > <laurioma at gmail.com <mailto:laurioma at gmail.com>> wrote:
> >>> > >
> >>> > > Hi!
> >>> > >
> >>> > > The 100x memory slowdown is hard to believe indeed. I attached
> >>> > > the test program with my first e-mail; it depends only on the
> >>> > > rocm-opencl-dev package. Would you mind compiling it and
> >>> > > checking whether it slows down memory for you as well?
> >>> > >
> >>> > > steps:
> >>> > > 1) g++ cl_slow_test.cpp -o cl_slow_test -I /opt/rocm/opencl/include/ -L /opt/rocm/opencl/lib/x86_64/ -lOpenCL
> >>> > > 2) log out from the desktop env and disconnect hdmi/displayport etc.
> >>> > > 3) log in over ssh
> >>> > > 4) run the program: ./cl_slow_test 1
> >>> > >
> >>> > > For me it reproduced even without step 2 as well, but less
> >>> > > reliably. Moving the mouse, for example, could make the memory
> >>> > > speed fast again.
> >>> > >
> >>> > > --
> >>> > > Lauri
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Mon, Mar 11, 2019 at 6:33 PM Tom St Denis
> >>> > <tstdenis82 at gmail.com <mailto:tstdenis82 at gmail.com>
> >>> > > <mailto:tstdenis82 at gmail.com
> >>> > <mailto:tstdenis82 at gmail.com>>> wrote:
> >>> > >
> >>> > > Hi Lauri,
> >>> > >
> >>> > > There's really no connection between the two other than they
> >>> > > run in the same package. I too run a 2400G (as my workstation)
> >>> > > and I got the same ~6.6GB/sec transfer rate but without a CL
> >>> > > app running ... The only logical reason is your CL app is
> >>> > > bottlenecking the APU's memory bus, but you claim "simply
> >>> > > opening a context is enough", so something else is going on.
> >>> > >
> >>> > > Your last reply though says "with it running in the
> >>> > > background", so it's entirely possible the CPU isn't busy but
> >>> > > the package memory controller (shared between both the CPU and
> >>> > > GPU) is busy. For instance, running xonotic in a 1080p window
> >>> > > on my 4K display reduced the memory test to 5.8GB/sec, and
> >>> > > that's hardly a heavy memory-bound GPU app.
> >>> > >
> >>> > > The only other possible connection is the GPU is generating so
> >>> > > much heat that it's throttling the package, which is also
> >>> > > unlikely if you have a proper HSF attached (I use the ones
> >>> > > that came in the retail boxes).
> >>> > >
> >>> > > Cheers,
> >>> > > Tom
> >>> > >
> >>> >
> >
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx at lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>