[Bug 16148] page allocation failure. order:1, mode:0x50d0

Wed Jun 16 06:54:03 PDT 2010

Hi Dave,

On Wed, Jun 16, 2010 at 08:57:09AM +1000, Dave Airlie wrote:
> > (switching back to email, actually)
> > 
> > On Sun, 13 Jun 2010 13:01:57 GMT
> > bugzilla-daemon at bugzilla.kernel.org wrote:
> > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=16148
> > >
> > >
> > > Mikko C. <mikko.cal at gmail.com> changed:
> > >
> > >            What    |Removed                     |Added
> > > ----------------------------------------------------------------------------
> > >                  CC|                            |mikko.cal at gmail.com
> > >
> > >
> > >
> > >
> > > --- Comment #8 from Mikko C. <mikko.cal at gmail.com>  2010-06-13 13:01:53 ---
> > > I have been getting this with 2.6.35-rc2 and rc3.
> > > Could it be the same problem?
> > >
> > >
> > > X: page allocation failure. order:0, mode:0x4
> > > Pid: 1514, comm: X Not tainted 2.6.35-rc3 #1
> > > Call Trace:
> > >  [<ffffffff8108ce49>] ? __alloc_pages_nodemask+0x629/0x680
> > >  [<ffffffff8108c920>] ? __alloc_pages_nodemask+0x100/0x680
> > >  [<ffffffffa00db8f3>] ? ttm_get_pages+0x2c3/0x448 [ttm]
> > >  [<ffffffffa00d4658>] ? __ttm_tt_get_page+0x98/0xc0 [ttm]
> > >  [<ffffffffa00d4988>] ? ttm_tt_populate+0x48/0x90 [ttm]
> > >  [<ffffffffa00d4a26>] ? ttm_tt_bind+0x56/0xa0 [ttm]
> > >  [<ffffffffa00d5230>] ? ttm_bo_handle_move_mem+0x1d0/0x430 [ttm]
> > >  [<ffffffffa00d76d6>] ? ttm_bo_move_buffer+0x166/0x180 [ttm]
> > >  [<ffffffffa00b9736>] ? drm_mm_kmalloc+0x26/0xc0 [drm]
> > >  [<ffffffff81030ea9>] ? get_parent_ip+0x9/0x20
> > >  [<ffffffffa00d7786>] ? ttm_bo_validate+0x96/0x130 [ttm]
> > >  [<ffffffffa00d7b35>] ? ttm_bo_init+0x315/0x390 [ttm]
> > >  [<ffffffffa0122eb8>] ? radeon_bo_create+0x118/0x210 [radeon]
> > >  [<ffffffffa0122fb0>] ? radeon_ttm_bo_destroy+0x0/0xb0 [radeon]
> > >  [<ffffffffa013704c>] ? radeon_gem_object_create+0x8c/0x110 [radeon]
> > >  [<ffffffffa013711f>] ? radeon_gem_create_ioctl+0x4f/0xe0 [radeon]
> > >  [<ffffffffa00b10e6>] ? drm_ioctl+0x3d6/0x470 [drm]
> > >  [<ffffffffa01370d0>] ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
> > >  [<ffffffff810b965f>] ? do_sync_read+0xbf/0x100
> > >  [<ffffffff810c8965>] ? vfs_ioctl+0x35/0xd0
> > >  [<ffffffff810c8b28>] ? do_vfs_ioctl+0x88/0x530
> > >  [<ffffffff81031ed7>] ? sub_preempt_count+0x87/0xb0
> > >  [<ffffffff810c9019>] ? sys_ioctl+0x49/0x80
> > >  [<ffffffff810ba4fe>] ? sys_read+0x4e/0x90
> > >  [<ffffffff810024ab>] ? system_call_fastpath+0x16/0x1b
> > 
> > That's different.  ttm_get_pages() looks pretty busted to me.  It's not
> > using __GFP_WAIT and it's not using __GFP_FS.  It's using a plain
> > GFP_DMA32 so it's using atomic allocations even though it doesn't need
> > to.  IOW, it's shooting itself in the head.
> > 

This is an accurate assessment.

> > Given that it will sometimes use GFP_HIGHUSER which includes __GFP_FS
> > and __GFP_WAIT, I assume it can always include __GFP_FS and __GFP_WAIT.
> > If so, it should very much do so.  If not then the function is
> > misdesigned and should be altered to take a gfp_t argument so the
> > caller can tell ttm_get_pages() which is the strongest allocation mode
> > which it may use.
> > 
> > > [TTM] Unable to allocate page.
> > > radeon 0000:01:05.0: object_init failed for (7827456, 0x00000002)
> > > [drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object (7827456,
> > > 2, 4096, -12)
> > 
> > This bug actually broke stuff for you.
> > 
> > Something like this:
> > 
> > --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c~a
> > +++ a/drivers/gpu/drm/ttm/ttm_page_alloc.c
> > @@ -677,7 +677,7 @@ int ttm_get_pages(struct list_head *page
> >  	/* No pool for cached pages */
> >  	if (pool == NULL) {
> >  		if (flags & TTM_PAGE_FLAG_DMA32)
> > -			gfp_flags |= GFP_DMA32;
> > +			gfp_flags |= GFP_KERNEL|GFP_DMA32;
> >  		else
> >  			gfp_flags |= GFP_HIGHUSER;
> >  
> > _
> > 
> > although I wonder whether it should be using pool->gfp_flags.
> > 
> > 
> > It's a shame that this code was developed and merged in secret :( Had
> > we known, we could have looked at enhancing mempools to cover the
> > requirement, or at implementing this in some generic fashion rather
> > than hiding it down in drivers/gpu/drm.
> 
> Its been post to lkml at least once or twice over the past few years
> though not as much as it was posted to dri-devel, but that was because
> we had never seen anyone show any interest in it outside of kernel
> hackers. Originally I was going to use the generic allocator stuff ia64
> uses for uncached allocations but it allocates memory ranges not pages
> so it wasn't useful. I also suggested getting a page flag for uncached
> allocator stuff, I was told to go write the code in my own corner and
> prove it was required. So I did, and it was cleaned up and worked on by
> others and I merged it. So can we lay off with the "in secret", the
> original code is nearly 2 years old at this point and just because -mm
> hackers choose to ignore it isn't our fault. Patches welcome.
> 

No comment on this aspect of things.

> So now back to the bug:
> 
> So the page pools are setup with gfp flags, in the normal case, 4 pools,
> one WC GFP_HIGHUSER pages, one UC HIGHUSER pages, one WC GFP_USER|
> GFP_DMA32, one UC GFP_USER|GFP_DMA32, so the pools are all fine, the
> problem here is the same as before we added the pools, which is the
> normal page allocation path, which needs the GFP_USER added instead of
> GFP_KERNEL.
> 
> That said I've noticed a lot more page allocation failure reports in
> 2.6.35-rcX than we've gotten for a long time, in code that hasn't
> changed (the AGP ones the other day for example) has something in the
> core MM regressed (again... didn't this happen back in 2.6.31 or
> something).
> 
> (cc'ing Mel who tracked these down before).
> 

Ok, so there has been very little changed in the page allocator that
might have caused this. The only possibility I can think of is
[6dda9d: page allocator: reduce fragmentation in buddy allocator by
adding buddies that are merging to the tail of the free lists] but I
would have expected the patch to help, not hinder, this situation.

In the last few kernel releases when there has been a spike in high-order
allocation, it has been largely due to an increased source of high-order
callers from somewhere. I am 99% sure the last time I encountered this, I used
the following script to debug it even though it's a bit of a hack. It uses
ftrace to track high-order allocators and reports the number of high-order
allocations (either atomic or normal) and the count of unique backtraces
(to point the finger at bad callers)

Unfortunately, I don't have access to a suitable machine to test this with
right now (my laptop isn't making high-order allocations to confirm yet)
but I'm pretty sure it's the right one!

If you do some fixed set of tests with 2.6.34 and 2.6.35-rc2 with this script
running and compare the reports, it might point the finger at a new source
of frequent high-order allocations that might be leading to more failures.
Do something like

./watch-highorder.pl -o highorder-2.6.34-report.txt
PID=$!
Do your tests
# Two sigints in quick succession to make the monitor exit
kill $PID
kill $PID 

==== CUT HERE ====
#!/usr/bin/perl
# This is a tool that analyses the trace output related to page allocation,
# sums up the number of high-order allocations taking place and who the
# callers are
#
# Example usage: trace-pagealloc-highorder.pl -o output-report.txt
# options
# -o, --output	Where to store the report
#
# Copyright (c) IBM Corporation 2009
# Author: Mel Gorman <mel at csn.ul.ie>
use strict;
use Getopt::Long;

# Tracepoint events
use constant MM_PAGE_ALLOC		=> 1;
use constant EVENT_UNKNOWN		=> 7;

use constant HIGH_NORMAL_HIGHORDER_ALLOC	=> 10;
use constant HIGH_ATOMIC_HIGHORDER_ALLOC	=> 11;

my $opt_output;
my %stats;
my %unique_events;
my $last_updated = 0;

$stats{HIGH_NORMAL_HIGHORDER_ALLOC} = 0;
$stats{HIGH_ATOMIC_HIGHORDER_ALLOC} = 0;

# Catch sigint and exit on request
my $sigint_report = 0;
my $sigint_exit = 0;
my $sigint_pending = 0;
my $sigint_received = 0;
sub sigint_handler {
	my $current_time = time;
	if ($current_time - 2 > $sigint_received) {
		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
		$sigint_report = 1;
	} else {
		if (!$sigint_exit) {
			print "Second SIGINT received quickly, exiting\n";
		}
		$sigint_exit++;
	}

	if ($sigint_exit > 3) {
		print "Many SIGINTs received, exiting now without report\n";
		exit;
	}

	$sigint_received = $current_time;
	$sigint_pending = 1;
}
$SIG{INT} = "sigint_handler";

# Parse command line options
GetOptions(
	'output=s'    => \$opt_output,
);

# Defaults for dynamically discovered regex's
my $regex_pagealloc_default = 'page=([0-9a-f]*) pfn=([0-9]*) order=([-0-9]*) migratetype=([-0-9]*) gfp_flags=([A-Z_|]*)';

# Dyanically discovered regex
my $regex_pagealloc;

# Static regex used. Specified like this for readability and for use with /o
#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
my $regex_statname = '[-0-9]*\s\((.*)\).*';
my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';

sub generate_traceevent_regex {
	my $event = shift;
	my $default = shift;
	my $regex;

	# Read the event format or use the default
	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
		$regex = $default;
	} else {
		my $line;
		while (!eof(FORMAT)) {
			$line = <FORMAT>;
			if ($line =~ /^print fmt:\s"(.*)",.*/) {
				$regex = $1;
				$regex =~ s/%p/\([0-9a-f]*\)/g;
				$regex =~ s/%d/\([-0-9]*\)/g;
				$regex =~ s/%lu/\([0-9]*\)/g;
				$regex =~ s/%s/\([A-Z_|]*\)/g;
				$regex =~ s/\(REC->gfp_flags\).*/REC->gfp_flags/;
				$regex =~ s/\",.*//;
			}
		}
	}

	# Verify fields are in the right order
	my $tuple;
	foreach $tuple (split /\s/, $regex) {
		my ($key, $value) = split(/=/, $tuple);
		my $expected = shift;
		if ($key ne $expected) {
			print("WARNING: Format not as expected '$key' != '$expected'");
			$regex =~ s/$key=\((.*)\)/$key=$1/;
		}
	}

	if (defined shift) {
		die("Fewer fields than expected in format");
	}

	return $regex;
}
$regex_pagealloc = generate_traceevent_regex("kmem/mm_page_alloc",
			$regex_pagealloc_default,
			"page", "pfn",
			"order", "migratetype", "gfp_flags");

sub process_events {
	my $traceevent;
	my $process_pid = 0;
	my $cpus;
	my $timestamp;
	my $tracepoint;
	my $details;
	my $statline;
	my $nextline = 1;

	# Read each line of the event log
EVENT_PROCESS:
	while (($traceevent = <TRACING>) && !$sigint_exit) {
SKIP_LINEREAD:

		if ($traceevent eq "") {
			last EVENT_PROCSS;
		}

		if ($traceevent =~ /$regex_traceevent/o) {
			$process_pid = $1;
			$tracepoint = $4;
		} else {
			next;
		}

		# Perl Switch() sucks majorly
		if ($tracepoint eq "mm_page_alloc") {
			my ($page, $order, $gfp_flags, $type);
			my ($atomic);
			my $details = $5;

			if ($details !~ /$regex_pagealloc/o) {
				print "WARNING: Failed to parse mm_page_alloc as expected\n";
				print "$details\n";
				print "$regex_pagealloc\n";
				print "\n";
				next;
			}
			$page = $1;
			$order = $3;
			$gfp_flags = $5;

			# Only concerned with high-order allocs
			if ($order == 0) {
				next;
			}

			$stats{MM_PAGE_ALLOC}++;

			if ($gfp_flags =~ /ATOMIC/) {
				$stats{HIGH_ATOMIC_HIGHORDER_ALLOC}++;
				$type = "atomic";
			} else {
				$stats{HIGH_NORMAL_HIGHORDER_ALLOC}++;
				$type = "normal";
			}

			# Record the stack trace
			$traceevent = <TRACING>;
			if ($traceevent !~ /stack trace/) {
				goto SKIP_LINEREAD;
			}
			my $event = "order=$order $type gfp_flags=$gfp_flags\n";;
			while ($traceevent = <TRACING>) {
				if ($traceevent !~ /^ =>/) {
					$unique_events{$event}++;
					my $current = time;

					if ($current - $last_updated > 60) {
						$last_updated = $current;
						update_report();
					}
					goto SKIP_LINEREAD;
				}
				$event .= $traceevent;
			}
		} else {
			$stats{EVENT_UNKNOWN}++;
		}

		if ($sigint_pending) {
			last EVENT_PROCESS;
		}
	}
}

sub update_report() {
	my $event;
	open (REPORT, ">$opt_output") || die ("Failed to open $opt_output for writing");

	foreach $event (keys %unique_events) {
		print REPORT $unique_events{$event} . " instances $event\n";
	}
	print REPORT "High-order normal allocations: " . $stats{HIGH_NORMAL_HIGHORDER_ALLOC} . "\n";
	print REPORT "High-order atomic allocations: " . $stats{HIGH_ATOMIC_HIGHORDER_ALLOC} . "\n";

	close REPORT;
}

sub print_report() {
	print "\nReport\n";
	open (REPORT, $opt_output) || die ("Failed to open $opt_output for reading");
	while (<REPORT>) {
		print $_;
	}
	close REPORT;
}

# Process events or signals until neither is available
sub signal_loop() {
	my $sigint_processed;
	do {
		$sigint_processed = 0;
		process_events();

		# Handle pending signals if any
		if ($sigint_pending) {
			my $current_time = time;

			if ($sigint_exit) {
				print "Received exit signal\n";
				$sigint_pending = 0;
			}
			if ($sigint_report) {
				if ($current_time >= $sigint_received + 2) {
					update_report();
					print_report();
					$sigint_report = 0;
					$sigint_pending = 0;
					$sigint_processed = 1;
					$sigint_exit = 0;
				}
			}
		}
	} while ($sigint_pending || $sigint_processed);
}

sub set_traceoption($) {
	my $option = shift;

	open(TRACEOPT, ">/sys/kernel/debug/tracing/trace_options") || die("Failed to open trace_options");
	print TRACEOPT $option;
	close TRACEOPT;
}

sub enable_tracing($) {
	my $tracing = shift;

	open(TRACING, ">/sys/kernel/debug/tracing/events/$tracing/enable") || die("Failed to open tracing event $tracing");
	print TRACING "1";
	close TRACING;
}

sub disable_tracing($) {
	my $tracing = shift;

	open(TRACING, ">/sys/kernel/debug/tracing/events/$tracing/enable") || die("Failed to open tracing event $tracing");
	print TRACING "0";
	close TRACING;

}

set_traceoption("stacktrace");
set_traceoption("sym-offset");
set_traceoption("sym-addr");
enable_tracing("kmem/mm_page_alloc");
open(TRACING, "/sys/kernel/debug/tracing/trace_pipe") || die("Failed to open trace_pipe");
signal_loop();
close TRACING;
disable_tracing("kmem/mm_page_alloc");
update_report();
print_report();