[Intel-gfx] [PATCH] file, i915: fix file reference for mmap_singleton()
Jann Horn
jannh at google.com
Wed Oct 25 15:23:42 UTC 2023
On Wed, Oct 25, 2023 at 2:01 PM Christian Brauner <brauner at kernel.org> wrote:
> Today we got a report at [1] for rcu stalls on the i915 testsuite in [2]
> due to the conversion of files to SLAB_TYPSSAFE_BY_RCU. Afaict,
> get_file_rcu() goes into an infinite loop trying to carefully verify
> that i915->gem.mmap_singleton hasn't changed - see the splat below.
>
> So I stared at this code to figure out what it actually does. It seems
> that the i915->gem.mmap_singleton pointer itself never had rcu semantics.
>
> The i915->gem.mmap_singleton is replaced in
> file->f_op->release::singleton_release():
>
> static int singleton_release(struct inode *inode, struct file *file)
> {
> struct drm_i915_private *i915 = file->private_data;
>
> cmpxchg(&i915->gem.mmap_singleton, file, NULL);
> drm_dev_put(&i915->drm);
>
> return 0;
> }
>
> The cmpxchg() is ordered against a concurrent update of
> i915->gem.mmap_singleton from mmap_singleton(). IOW, when
> mmap_singleton() fails to get a reference on i915->gem.mmap_singleton
> via mmap_singleton():
>
> rcu_read_lock();
> file = get_file_rcu(&i915->gem.mmap_singleton);
> rcu_read_unlock();
>
> mmap_singleton() allocates a new file via anon_inode_getfile() and does
>
> smp_store_mb(i915->gem.mmap_singleton, file);
>
> So, afaiu, then what happens in the case of this bug is that at some
> point fput() is called and drops the file->f_count to zero but obviously
> leaving the pointer in i915->gem.mmap_singleton in tact until
> file->f_op->release::singleton_release() is called.
>
> Now, there might be possibly larger delays until
> file->f_op->release::singleton_release() is called and
> i915->gem.mmap_singleton is set to NULL?
>
> Say concurrently another task hits mmap_singleton() and does:
>
> rcu_read_lock();
> file = get_file_rcu(&i915->gem.mmap_singleton);
> rcu_read_unlock();
>
> When get_file_rcu() fails to get a reference via atomic_inc_not_zero()
> it will try the reload from i915->gem.mmap_singleton assuming it has
> comparable semantics as __fget_files_rcu() that also reloads on
> atomic_inc_not_zero() failure.
>
> But since i915->gem.mmap_singleton doesn't have proper rcu semantics it
> reloads the same pointer again, trying the same atomic_inc_not_zero()
> again and doing so until file->f_op->release::singleton_release() of the
> old file has been called.
>
> So, in contrast to __fget_files_rcu() here we want to not retry when
> atomic_inc_not_zero() has failed. We only want to retry in case we
> managed to get a reference but the pointer did change on reload.
[...]
> Link: [1]: https://lore.kernel.org/intel-gfx/SJ1PR11MB6129CB39EED831784C331BAFB9DEA@SJ1PR11MB6129.namprd11.prod.outlook.com
> Link: [2]: https://intel-gfx-ci.01.org/tree/linux-next/next-20231013/bat-dg2-11/igt@i915_selftest@live@mman.html#dmesg-warnings10963
> Signed-off-by: Christian Brauner <brauner at kernel.org>
Patch makes sense to me. I'm not sure why you changed EAGAIN to
EINVAL, and it's obviously a bit ugly, but it looks like a valid fix
for what the SLUB_TYPESAFE_BY_RCU conversion broke.
Reviewed-by: Jann Horn <jannh at google.com>
But as a side note, I am a bit confused about how the concurrency of
the existing code is supposed to work... it looks like two parallel
calls to mmap_singleton() can end up returning different pointers, and
then the singleton is not actually a singleton anymore? If that part
of the existing code is wrong even before the SLAB_TYPESAFE_BY_RCU
conversion, we might later have to open-code get_file_rcu() in
mmap_singleton(), so that we can do a cmpxchg at the end that checks
whether the i915->gem.mmap_singleton pointer is still the same?
Like:
static struct file *mmap_singleton(struct drm_i915_private *i915)
{
struct file *file, *new_file;
rcu_read_lock();
file = READ_ONCE(i915->gem.mmap_singleton);
if (file && atomic_long_inc_not_zero(&file->f_count)) {
rcu_read_unlock();
return file;
}
rcu_read_unlock();
new_file = anon_inode_getfile("i915.gem", &singleton_fops,
i915, O_RDWR);
if (IS_ERR(new_file))
return new_file;
/* Everyone shares a single global address space */
new_file->f_mapping = i915->drm.anon_inode->i_mapping;
if (try_cmpxchg(&i915->gem.mmap_singleton, &file, new_file)) {
// something with drm_dev refcount ?
return new_file;
}
// something with drm_dev refcount ?
fput(new_file);
return file;
}
It would be nice to get some i915 maintainer's comment on how the
singleton stuff is supposed to work.
> ---
>
> Jann, Linus,
>
> Since you've been quite involved, can you check that what I'm babbling
> here makes sense? If this isn't the fix then I would have to drop the
> SLAB_TYPESAFE_BY_RCU conversion patch for now since I'd like to send PRs
> by the end of the week.
>
> This is on top of
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc
>
> Appreciate the help, thanks!
> Christian
>
> ---
> drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +-
> fs/file.c | 17 ++++++++++++++++-
> include/linux/fs.h | 1 +
> 3 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> index 97bf10861cad..da97e61b18d4 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
> @@ -916,7 +916,7 @@ static struct file *mmap_singleton(struct drm_i915_private *i915)
> struct file *file;
>
> rcu_read_lock();
> - file = get_file_rcu(&i915->gem.mmap_singleton);
> + file = get_file_rcu_once(&i915->gem.mmap_singleton);
> rcu_read_unlock();
> if (file)
> return file;
> diff --git a/fs/file.c b/fs/file.c
> index 1a475d7d636e..2c64d6836f0c 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -867,7 +867,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
> return NULL;
>
> if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
> - return ERR_PTR(-EAGAIN);
> + return ERR_PTR(-EINVAL);
>
> file_reloaded = rcu_dereference_raw(*f);
>
> @@ -927,6 +927,21 @@ struct file *get_file_rcu(struct file __rcu **f)
> }
> EXPORT_SYMBOL_GPL(get_file_rcu);
>
> +struct file *get_file_rcu_once(struct file __rcu **f)
> +{
> + for (;;) {
> + struct file __rcu *file;
> +
> + file = __get_file_rcu(f);
> + if (!IS_ERR(file))
> + return file;
> +
> + if (PTR_ERR(file) == -EINVAL)
> + return NULL;
> + }
> +}
> +EXPORT_SYMBOL_GPL(get_file_rcu_once);
> +
> static inline struct file *__fget_files_rcu(struct files_struct *files,
> unsigned int fd, fmode_t mask)
> {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index cb8bfa1f8ecb..657bbd824490 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1044,6 +1044,7 @@ static inline struct file *get_file(struct file *f)
> }
>
> struct file *get_file_rcu(struct file __rcu **f);
> +struct file *get_file_rcu_once(struct file __rcu **f);
>
> #define file_count(x) atomic_long_read(&(x)->f_count)
>
> --
> 2.34.1
>
More information about the Intel-gfx
mailing list