[systemd-devel] [PATCH] systemd: add RDTCacheReservation= option to support CAT (Cache Allocation Technology)

Marcelo Tosatti mtosatti at redhat.com
Mon Jan 9 11:10:46 UTC 2017


On Fri, Jan 06, 2017 at 07:51:05PM +0100, Lennart Poettering wrote:
> On Fri, 06.01.17 11:59, Marcelo Tosatti (mtosatti at redhat.com) wrote:
> 
> > Cache Allocation Technology is a feature on selected recent Intel Xeon
> > processors which allows control over L3 cache allocation.
> 
> What precisely is the benefit of making this configurable? Can you
> describe a basic usecase why I'd want to set this on some service?

http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html

"On systems with multiple workloads running simultaneously, which
heavily utilize the cache; system administrators can utilize CAT to
improve performance. CAT can be used to better utilize the caches and
as a result some applications are more likely to hit the cache which
results in fewer accesses to system memory. This results in lower
latency and greater performance."

--------

For example, with a job with large memory reaccess distances (where
caching hardly brings any benefit), you can limit that job to reclaim a
small portion of L3 cache (so you don't throw away data from other jobs
from L3 cache).

> 
> > Index: systemd/man/systemd.exec.xml
> > ===================================================================
> > --- systemd.orig/man/systemd.exec.xml
> > +++ systemd/man/systemd.exec.xml
> 
> Hmm, this doesn't look like a git patch? We generally prefer
> submissions as github PRs these days, btw.
> 
> > @@ -233,6 +233,31 @@
> >        </varlistentry>
> >  
> >        <varlistentry>
> > +        <term><varname>RDTCacheReservation=</varname></term>
> > +
> > +        <listitem><para>Creates a L3 CAT reservation in resctrlfs
> > +        for the executed processes. Takes a ";" separated list of
> > +        tuples (optionally triples) containing type,size and
> > +        optionally cacheid:
> 
> if I read this, then I am puzzled what "RDT Cache" is, what "L3 CAT"
> is. Can you explain?

RDT: Resource Director Technology(RDT)
CAT: Cache Allocation Technology (CAT), a subcomponent of RDT 
(which includes other things such as CMT: Cache Monitoring Technology).

> 
> > +            type=type,size=size,cacheid=id;
> > +            type=type,size=size,cacheid=id;...
> > +
> > +        Where:
> > +
> > +            type is one of: both, data, code.
> > +            size is size in kbytes.
> > +            cacheid (optional) is the cacheid for
> > +            the reservation.
> 
> what is a "cacheid"?

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).  To find the ID for each
logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id


> 
> > +
> > +        Rules for the parameters:
> > +            * type=code must be paired with type=data entry.
> > +
> > +        See the output of resctrltool for more details.
> > +
> > +        </para></listitem>
> > +      </varlistentry>
> 
> This syntax doesn't appear to be in line with how we'd do this in
> systemd usually. Ideally, we try to hide the fact that our settings
> syntax is one-dimensional as much as we can. Hence, a syntax like this
> sounds more appropriate to me:
> 
>        RDTCacheReservation=4K code
>        RDTCacheReservation=16K data
> 
> and so on...

What you suggest is to essentially to split

RDTCacheReservation=type=type,size=size,cacheid=id;type=type,size=size,cacheid=id

Into

RDTCacheReservation=type=type,size=size,cacheid=id
RDTCacheReservation=type=type,size=size,cacheid=id

(yes sure can get rid of the commas as well, looks better).

> But then again, I have no idea what all of this actually really is,
> hence I am not sure my proposed syntax makes much sense.
> 
> sizes are generally parsed with parse_bytes() in systemd, so that K,
> M, G suffixes are properly recognized...
> 
> > +        ExecContext *c = userdata;
> > +        int r;
> > +        int i;
> > +
> > +        assert(bus);
> > +        assert(reply);
> > +        assert(c);
> > +
> > +        r = sd_bus_message_open_container(reply, 'a', "s");
> > +        if (r < 0)
> > +                return r;
> > +
> > +        for (i = 0; i < c->argmidx; i++) {
> > +            r = sd_bus_message_append(reply, "s", c->argm[i]);
> > +            if (r < 0)
> > +                return r;
> > +        }
> 
> indentation is borked. Please indent to multiples of 8ch.
> 
> > +static char* convert_rdt_back(char *argm[], int argmidx)
> > +{
> 
> Please follow coding style, place opening bracket on same name as
> funciton name.
> 
> > +    	int i, ret;
> > +    	char *buf, *str;
> > +    	int size = 0;
> > +    	int printcomma = 0;
> > +    	int printsemicolon = 0;
> > +    	int localmode = 0;
> 
> Please use C99 "bool" when you actually mean a boolean.
> 
> > +
> > +    	for (i=0; i < argmidx; i++) {
> > +        	/* ',', ';', '=' */
> > +        	size += strlen(argm[i]) + 3;
> > +    	}
> > +
> > +    	buf = malloc(size);
> > +    	if (buf == NULL)
> > +		return NULL;
> > +    	memset(buf, 0, size);
> 
> Use "buf = new0(char, size)" for allocations of this kind.
> 
> > +
> > +    	str = buf;
> > +
> > +    	/* cache-id specified */
> > +    	for (i=0; i < argmidx; i++) {
> > +        	if (strlen(argm[i]) == 10) {
> > +            		if (strcmp(argm[i], "--cache-id") == 0)
> > +                		localmode = 1;
> > +        	}
> > +    	}
> 
> Please use "streq()" instead of strcmp() to compare strings for equality.
> 
> > +
> > +    	for (i=0; i<argmidx; i++) {
> > +        	char *cur = argm[i];
> > +
> > +        	if (strlen(cur) == 0)
> > +                	return buf;
> > +
> > +        	if (strlen(cur) == 6) {
> > +            		if (strcmp(cur, "--type") == 0) {
> > +                		ret = sprintf(str, "type=");
> > +                		str = str + ret;
> > +                		printcomma = 1;
> > +                		continue;
> > +            		}
> > +
> > +            		if (strcmp(cur, "--size") == 0) {
> > +                		ret = sprintf(str, "size=");
> > +                		str = str + ret;
> > +                		if (localmode == 1)
> > +                    			printcomma = 1;
> > +                		else
> > +                    			printsemicolon = 1;
> > +
> > +                		continue;
> > +            		}
> > +        	}
> > +
> > +        	if (strlen(cur) == 10) {
> > +            		if (strcmp(cur, "--cache-id") == 0) {
> > +                		ret = sprintf(str, "cache-id=");
> > +                		str = str + ret;
> > +                		printsemicolon = 1;
> > +                		continue;
> > +            		}
> > +        	}
> > +
> > +	        ret = sprintf(str, "%s", cur);
> > +        	str = str + ret;
> > +
> > +        	if (i != argmidx - 1) {
> > +            		if (printsemicolon) {
> > +                		ret = sprintf(str, ";");
> > +                		str = str + 1;
> > +            		} else if (printcomma) {
> > +                		ret = sprintf(str, ",");
> > +                		str = str + 1;
> > +            		}
> > +
> > +            		printcomma = 0;
> > +            		printsemicolon = 0;
> > +        	}
> > +    	}
> > +
> > +	return buf;
> > +}
> > +
> >  int bus_exec_context_set_transient_property(
> >                  Unit *u,
> >                  ExecContext *c,
> > @@ -1377,6 +1492,46 @@ int bus_exec_context_set_transient_prope
> >                  }
> >  
> >                  return 1;
> > +        } else if (streq(name, "RDTCacheReservation")) {
> > +
> > +                r = sd_bus_message_enter_container(message, 'a', "s");
> > +                if (r < 0)
> > +                        return r;
> > +
> > +                while ((r = sd_bus_message_enter_container(message, 'r', "s")) > 0) {
> > +                        const char *str;
> > +
> > +                        r = sd_bus_message_read(message, "s", &str);
> > +                        if (r < 0)
> > +                                return r;
> > +
> > +                        if (mode != UNIT_CHECK)
> > +                            c->argm[c->argmidx++] = strdup(str);
> 
> Indentation borked. Missing OOM check.
> 
> > +
> > +                        r = sd_bus_message_exit_container(message);
> > +                        if (r < 0)
> > +                                return r;
> > +                }
> > +
> > +		if (mode != UNIT_CHECK) {
> > +			char *conv;
> > +
> > +			conv = convert_rdt_back(c->argm, c->argmidx);
> > +			if (conv == NULL) {
> > +				return sd_bus_error_setf(error,
> > +					SD_BUS_ERROR_NO_MEMORY,
> > +					"out of memory for rdt string
> > convertion");
> 
> you may return -ENOMEM here, the right thing will happen.
> 
> > +			}
> > +
> > +			unit_write_drop_in_private_format(u, mode, name, "RDTCacheReservation=%s", conv);
> > +			free(conv);
> > +		}
> 
> indentation really borked.
> 
> 
> > +static int fork_exec_resctrltool(Unit *u, char *argv[])
> > +{
> 
> coding style... opening bracket on same name as function.
> 
> > +        int outpipe[2];
> > +        int errpipe[2];
> > +        pid_t cpid;
> > +
> > +        pipe(outpipe);
> > +        pipe(errpipe);
> 
> missing error checks.
> 
> > +        cpid = fork();
> 
> Umpff... This is not how we work in systemd. We do not glue things
> together form the various system tools we shell out to. Sorry, but
> this isn't OK to add this way.

Fine, will write a C library.

> If access to the reservation file system is straight-forward, please
> add some glue code for it to src/basic/ or so, so that we can do the
> accesses directly. Or, alternatively, prepare a proper C library that
> we can link against, maintained as part of resctrltool.
> 
> Sorry, but this is a total blocker. Either we do the fs access
> oursvles, or we call into a properly designed C library we link
> against, but shelling out external tools is not an option.
> 
> > +        if(cpid == 0) {
> > +                int r;
> > +
> > +                dup2(errpipe[1], STDERR_FILENO);
> > +                dup2(outpipe[1], STDOUT_FILENO);
> 
> error checking...
> 
> > +
> > +                r = execve(argv[0], argv, NULL);
> > +                if (r == -1) {
> > +                        log_open();
> > +                        log_unit_error_errno(u, r, "Failed to execve resctrltool, ignoring: %m");
> > +                        log_close();
> > +                        return -errno;
> > +                }
> > +        } else {
> > +                char buffer[100];
> 
> arbitrarily sized buffers... are urks.  please use LINE_MAX or so at least....
> 
> > +                fd_set fdset;
> > +                int count;
> > +                int ret;
> > +                int nfds;
> > +                int status = 0;
> > +                int retst;
> > +
> > +                ret = waitpid(cpid, &status, 0);
> > +                if (ret == -1) {
> > +                        log_open();
> > +                        log_unit_error_errno(u, ret, "Failed to waitpid resctrltool, ignoring: %m");
> > +                        log_close();
> > +                        return -errno;
> > +                }
> > +
> > +                if (!WIFEXITED(status)) {
> > +                        log_open();
> > +                        log_unit_error_errno(u, ret, "resctrltool exits abnormally, ignoring: %m");
> > +                        log_close();
> > +                        return -1;
> > +                } else {
> > +                        retst = WEXITSTATUS(status);
> > +
> > +                        if (retst == 0) {
> > +                        	return 0;
> > +                        }
> > +                }
> > +
> > +                /*
> > +                 * error, read stderr and stdout and log.
> > +                 */
> > +
> > +                FD_ZERO(&fdset);
> > +                FD_SET(outpipe[0], &fdset);
> > +                FD_SET(errpipe[0], &fdset);
> > +
> > +                if (outpipe[0] > errpipe[0])
> > +                        nfds = outpipe[0] + 1;
> > +                else
> > +                        nfds = errpipe[0] + 1;
> > +
> > +                ret = select(nfds, &fdset, NULL, NULL, NULL);
> 
> select() is not an option. We regularly have to deal with fds > 1024,
> and select() doesn't support those. it's not just slow, it simply
> won't work.
> 
> Really, select() is obsolete IRL.
> 
> > +                if (ret == -1) {
> > +                        log_open();
> > +                        log_unit_error_errno(u, ret, "select error, ignoring RDTCacheReservation: %m");
> > +                        log_close();
> > +                        return -1;
> > +                }
> > +
> > +                if (FD_ISSET(outpipe[0], &fdset)) {
> > +                        count = read(outpipe[0], buffer, sizeof(buffer)-1);
> > +                        if (count >= 0) {
> > +                                buffer[count] = 0;
> > +                                log_open();
> > +                                log_unit_error(u, "resctrltool stdout: %s", buffer);
> > +                                log_close();
> > +                        } else {
> > +                                log_open();
> > +                                log_unit_error(u, "resctrltool io error\n");
> > +                                log_close();
> > +                        }
> > +                }
> > +
> > +                if (FD_ISSET(errpipe[0], &fdset)) {
> > +                        count = read(errpipe[0], buffer, sizeof(buffer)-1);
> > +                        if (count >= 0) {
> > +                                buffer[count] = 0;
> > +                                log_open();
> > +                                log_unit_error(u, "resctrltool stderr: %s", buffer);
> > +                                log_close();
> > +                        } else {
> > +                                log_open();
> > +                                log_unit_error(u, "resctrltool io error\n");
> > +                                log_close();
> > +                        }
> > +                }
> > +
> > +                return retst;
> 
> 
> This will deadlock if the tool prints more than PIPE_BUF output...
> 
> > +        }
> > +
> > +    return -1;
> 
> Please do not make up fake error codes such as "-1". We pretty
> universally follow kernel style "return -EFOOBAR".
> 
> > +}
> > +
> > +static int setup_rdt_cat(Unit *u, const ExecContext *c, pid_t pid)
> > +{
> > +        int ret, i;
> > +        const char *arg[5];
> > +        char pidstr[100];
> 
> fixed size arrays... If you only write a pid into this, then use DECIMAL_STR_MAX(pid_t)...
> 
> > +        char *name = u->id;
> > +        const char *largm[ARGMSIZE + 4];
> > +
> > +        sprintf(pidstr, "%d", pid);
> 
> Use PID_FMT for this...
> 
> > +
> > +        arg[0] = "/usr/bin/resctrltool";
> > +        arg[1] = "delete";
> > +        arg[2] = name;
> > +        arg[3] = 0;
> > +
> > +        ret = fork_exec_resctrltool(u, (char **)arg);
> > +        /* we want to ignore 'reservation does not exist'
> > +         * errors, so skip error checking entirely.
> > +         * any other errors will be caught when trying
> > +         * to execute create below
> > +         */
> > +
> > +        memset(largm, 0, sizeof(largm));
> 
> Use memzero()...
> 
> Please have a look at CODING_STYLE.
> 
> And please: the shelling out is not acceptable. Sorry!
> 
> Lennart
> 
> -- 
> Lennart Poettering, Red Hat


More information about the systemd-devel mailing list