[Intel-gfx] [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression
Linus Torvalds
torvalds at linux-foundation.org
Sun Nov 26 20:23:57 UTC 2023
On Sun, 19 Nov 2023 at 23:11, kernel test robot <oliver.sang at intel.com> wrote:
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")
Ok, so __fget_light() is one of our more performance-critical things,
and that commit ends up making it call a rather more expensive version
in __get_file_rcu(), so we have:
> 30.90 ą 4% -20.6 10.35 ą 2% perf-profile.self.cycles-pp.__fget_light
> 0.00 +26.5 26.48 perf-profile.self.cycles-pp.__get_file_rcu
and that "20% decrease balanced by 26% increase elsewhere" then
directly causes the ~3% regression.
I took a look at the code generation, and honestly, I think we're
better off just making __fget_files_rcu() have special logic for this
all, and not use __get_file_rcu().
The 'fd' case really is special because we need to do that
non-speculative pointer access.
Because it turns out that we already have to use array_index_nospec()
to safely generate that pointer to the fd entry, and once you have to
do that "use non-speculative accesses to generate a safe pointer", you
might as well just go whole hog.
So this takes a different approach, and uses the
array_index_mask_nospec() thing that we have exactly to generate that
non-speculative mask for these things:
/* Mask is a 0 for invalid fd's, ~0 for valid ones */
mask = array_index_mask_nospec(fd, fdt->max_fds);
and then it does something you can consider either horribly clever, or
horribly ugly (this first part is basically just
array_index_nospec()):
/* fdentry points to the 'fd' offset, or fdt->fd[0] */
fdentry = fdt->fd + (fd&mask);
and then we can just *unconditionally* do the load - but since we
might be loading fd[0] for an invalid case, we need to mask the result
too:
/* Do the load, then mask any invalid result */
file = rcu_dereference_raw(*fdentry);
file = (void *)(mask & (unsigned long)file);
but now we have done everything without any conditionals, and the only
conditional is now "did we load NULL" - which includes that "we masked
the bad value".
Then we just do that atomic_long_inc_not_zero() on the file, and
re-check the pointer chain we used.
I made files_lookup_fd_raw() do the same thing.
The end result is much nicer code generation at least from my quick
check. And I assume the regression will be gone, and hopefully even
turned into an improvement since this is so hot.
Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that
__get_file_rcu() does, and I don't get it. Both things come from
volatile accesses, I don't see the point of those games, but I also
didn't care, since it's no longer in a critical code path.
Christian?
NOTE! This patch is not well tested. I verified an earlier version of
this, but have been playing with it since, so caveat emptor.
IOW, I might have messed up some "trivial cleanup" when prepping for
sending it out...
Linus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 4120 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/intel-gfx/attachments/20231126/9bb9baa7/attachment.bin>
More information about the Intel-gfx
mailing list