slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

Liang, Liang (Leo) Liang.Liang at amd.com
Sat Mar 13 04:04:44 UTC 2021


[AMD Public Use]

Hi David,

Which benchmark tool do you prefer? Memtest86+ or something else?

BRs,
Leo
-----Original Message-----
From: David Hildenbrand <david at redhat.com> 
Sent: Saturday, March 13, 2021 12:47 AM
To: Liang, Liang (Leo) <Liang.Liang at amd.com>; Deucher, Alexander <Alexander.Deucher at amd.com>; linux-kernel at vger.kernel.org; amd-gfx list <amd-gfx at lists.freedesktop.org>; Andrew Morton <akpm at linux-foundation.org>
Cc: Huang, Ray <Ray.Huang at amd.com>; Koenig, Christian <Christian.Koenig at amd.com>; Mike Rapoport <rppt at linux.ibm.com>; Rafael J. Wysocki <rafael at kernel.org>; George Kennedy <george.kennedy at oracle.com>
Subject: Re: slow boot with 7fef431be9c9 ("mm/page_alloc: place pages to tail in __free_pages_core()")

On 12.03.21 17:19, Liang, Liang (Leo) wrote:
> [AMD Public Use]
> 
> Dmesg attached.
> 


So, it looks like the "real" slowdown starts once the buddy allocator is up and running (no surprise).


[    0.044035] Memory: 6856724K/7200304K available (14345K kernel code, 9699K rwdata, 5276K rodata, 2628K init, 12104K bss, 343324K reserved, 0K cma-reserved)
[    0.044045] random: get_random_u64 called from __kmem_cache_create+0x33/0x460 with crng_init=1
[    0.049025] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.050036] ftrace: allocating 47158 entries in 185 pages
[    0.097487] ftrace: allocated 185 pages with 5 groups
[    0.109210] rcu: Hierarchical RCU implementation.

vs.

[    0.041115] Memory: 6869396K/7200304K available (14345K kernel code, 3433K rwdata, 5284K rodata, 2624K init, 6088K bss, 330652K reserved, 0K cma-reserved)
[    0.041127] random: get_random_u64 called from __kmem_cache_create+0x31/0x430 with crng_init=1
[    0.041309] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.041335] ftrace: allocating 47184 entries in 185 pages
[    0.055719] ftrace: allocated 185 pages with 5 groups
[    0.055863] rcu: Hierarchical RCU implementation.


And it gets especially bad during ACPI table processing:

[    4.158303] ACPI: Added _OSI(Module Device)
[    4.158767] ACPI: Added _OSI(Processor Device)
[    4.159230] ACPI: Added _OSI(3.0 _SCP Extensions)
[    4.159705] ACPI: Added _OSI(Processor Aggregator Device)
[    4.160551] ACPI: Added _OSI(Linux-Dell-Video)
[    4.161359] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    4.162264] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[   17.713421] ACPI: 13 ACPI AML tables successfully acquired and loaded
[   18.716065] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[   20.743828] ACPI: EC: EC started
[   20.744155] ACPI: EC: interrupt blocked
[   20.945956] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
[   20.946618] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle transactions
[   20.947348] ACPI: Interpreter enabled
[   20.951278] ACPI: (supports S0 S3 S4 S5)
[   20.951632] ACPI: Using IOAPIC for interrupt routing

vs.

[    0.216039] ACPI: Added _OSI(Module Device)
[    0.216041] ACPI: Added _OSI(Processor Device)
[    0.216043] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.216044] ACPI: Added _OSI(Processor Aggregator Device)
[    0.216046] ACPI: Added _OSI(Linux-Dell-Video)
[    0.216048] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.216049] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.228259] ACPI: 13 ACPI AML tables successfully acquired and loaded
[    0.229527] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    0.231663] ACPI: EC: EC started
[    0.231666] ACPI: EC: interrupt blocked
[    0.233664] ACPI: EC: EC_CMD/EC_SC=0x666, EC_DATA=0x662
[    0.233667] ACPI: \_SB_.PCI0.LPC0.EC0_: Boot DSDT EC used to handle transactions
[    0.233670] ACPI: Interpreter enabled
[    0.233685] ACPI: (supports S0 S3 S4 S5)
[    0.233687] ACPI: Using IOAPIC for interrupt routing

The jump from 4.1s -> 17.7s is especially bad.

This might in fact indicate that the slowdown is related to using some very special, slow (ACPI?) memory for ordinary purposes, which then interferes with actual ACPI users?

But again, this is just a wild guess: the system remains extremely slow afterwards, yet we don't see any other pause without signs of life for that long.


It would be interesting to run a simple memory bandwidth benchmark on the fast kernel with increasing allocation sizes, up to the point of running out of memory, to see whether there really is some memory that is just horribly slow once allocated and used.
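
Something along the following lines might work as such a benchmark (a minimal userspace sketch, purely illustrative and not a tool anyone in this thread has posted; the 256 MiB step size and the memset()-based measurement are arbitrary assumptions). The idea is that the reported bandwidth should drop sharply once the allocation grows into a slow physical range:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	/* Start at 256 MiB and grow by 256 MiB per step (arbitrary choice). */
	const size_t step = 256UL << 20;

	for (size_t size = step; ; size += step) {
		/* MAP_POPULATE faults the pages in up front. */
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
				 -1, 0);
		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		double t0 = now_sec();
		memset(buf, 0xaa, size);	/* write the whole region */
		double t1 = now_sec();

		printf("%6zu MiB: %8.1f MiB/s\n",
		       size >> 20, (size >> 20) / (t1 - t0));

		munmap(buf, size);
	}
	return 0;
}

Built with e.g. "gcc -O2 bench.c -o bench" and run on the fast kernel until the OOM killer steps in, this should show whether some part of memory is dramatically slower than the rest once it gets allocated and used.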

--
Thanks,

David / dhildenb

