    8f72c31f
    Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
    Linus Torvalds authored
    Pull misc vfs updates from Christian Brauner:
     "This contains the usual pile of misc updates:
    
      Features:
    
       - Add F_CREATED_QUERY fcntl() that allows userspace to query whether
         a file was actually created. Often userspace wants to know whether
         an O_CREAT request actually created a file without using O_EXCL.
         The current approach is to first attempt to open the file without
         O_CREAT | O_EXCL and, if ENOENT is returned, to try again with
         both flags. If that succeeds all is well. If it now reports EEXIST
         userspace retries from the start.
    
         That works fairly well but some corner cases make this more
         involved. If this operates on a dangling symlink the first openat()
         without O_CREAT | O_EXCL will return ENOENT but the second openat()
         with O_CREAT | O_EXCL will fail with EEXIST.
    
         The reason is that openat() without O_CREAT | O_EXCL follows the
         symlink while O_CREAT | O_EXCL doesn't for security reasons. So
         it's not something we can really change unless we add an explicit
         opt-in via O_FOLLOW which seems really ugly.
    
         All available workarounds are really nasty (fanotify, bpf lsm etc)
         so add a simple fcntl().
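
         A minimal userspace sketch of how the new fcntl() might be used.
         The helper name open_and_query_created() is hypothetical, and the
         fallback value for F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4) is
         defined locally on the assumption that the installed libc headers
         may predate 6.12. Kernels without the command are expected to
         fail the fcntl() with EINVAL, which the sketch reports as
         "unknown":

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallback for older headers: F_LINUX_SPECIFIC_BASE (1024) + 4. */
#ifndef F_CREATED_QUERY
#define F_CREATED_QUERY (1024 + 4)
#endif

/*
 * Hypothetical helper: open path with O_CREAT but without O_EXCL and ask
 * the kernel whether this particular open() created the file.
 *
 * Returns the new fd, or -1 if open() failed. On success *created is
 *    1  if this open created the file,
 *    0  if the file already existed,
 *   -1  if the kernel could not answer (for example EINVAL before 6.12).
 */
static int open_and_query_created(const char *path, int *created)
{
    assert(path != NULL && created != NULL);

    int fd = open(path, O_CREAT | O_RDWR | O_CLOEXEC, 0600);
    if (fd < 0)
        return -1;

    /* The command takes no meaningful argument; 0 is passed as a placeholder. */
    int ret = fcntl(fd, F_CREATED_QUERY, 0);
    *created = (ret < 0) ? -1 : ret;
    return fd;
}
```

         A caller would check *created after a successful open and fall
         back to the retry loop described above only when it is -1.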
    
       - Try an opportunistic lookup for O_CREAT. Today, when opening a file
         we'll typically do a fast lookup, but if O_CREAT is set, the kernel
         always takes the exclusive inode lock. This was likely done with
         the expectation that O_CREAT means that we always expect to do the
         create, but that's often not the case. Many programs set O_CREAT
         even in scenarios where the file already exists (see related
         F_CREATED_QUERY patch motivation above).
    
         The series contained in this pull request rearranges the
         pathwalk-for-open code to also attempt a fast lookup in certain
         O_CREAT cases. If a positive dentry is found, the inode_lock can
         be avoided altogether and the walk can stay in rcuwalk mode for
         the last step_into.
    
       - Expose the 64-bit mount ID via name_to_handle_at()

         Now that we provide a unique 64-bit mount ID interface in statx(2),
         we can provide a race-free way for name_to_handle_at(2) to return
         a file handle and the corresponding mount without needing to worry
         about racing with /proc/mountinfo parsing or having to open a file
         just to do statx(2).
    
         While this is not necessary if you are using AT_EMPTY_PATH and
         don't care about an extra statx(2) call, users that pass full paths
         into name_to_handle_at(2) need to know which mount the file handle
         comes from (to make sure they don't try to call open_by_handle_at(2)
         on a file handle from a different filesystem), and switching to
         AT_EMPTY_PATH would require allocating a file for every
         name_to_handle_at(2) call.
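
         A hedged sketch of requesting the handle and the 64-bit mount ID
         in a single call. The helper name handle_with_unique_mnt_id() is
         hypothetical. The flag AT_HANDLE_MNT_ID_UNIQUE and its value
         0x001 are taken from the 6.12 uapi header and defined locally in
         case libc does not provide them yet. The glibc prototype still
         declares the mount ID argument as int *, so the sketch casts a
         uint64_t pointer, on the assumption that the kernel writes a full
         u64 when the flag is set:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>

/* Fallback for older headers; value as in the 6.12 uapi fcntl.h. */
#ifndef AT_HANDLE_MNT_ID_UNIQUE
#define AT_HANDLE_MNT_ID_UNIQUE 0x001
#endif

/*
 * Hypothetical helper: encode a file handle for path and fetch the unique
 * 64-bit mount ID in the same name_to_handle_at(2) call.
 *
 * Returns 0 on success and stores a malloc()ed handle in *out (the caller
 * frees it) and the mount ID in *mnt_id. Returns -1 with errno set on
 * failure; kernels that do not know the flag are expected to fail with
 * EINVAL.
 */
static int handle_with_unique_mnt_id(const char *path,
                                     struct file_handle **out,
                                     uint64_t *mnt_id)
{
    assert(path != NULL && out != NULL && mnt_id != NULL);

    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL)
        return -1;

    fh->handle_bytes = MAX_HANDLE_SZ;
    *mnt_id = 0;

    /* With the flag set, the mount_id pointer is treated as a u64 target. */
    if (name_to_handle_at(AT_FDCWD, path, fh, (int *)mnt_id,
                          AT_HANDLE_MNT_ID_UNIQUE) < 0) {
        free(fh);
        return -1;
    }

    *out = fh;
    return 0;
}
```

         The returned ID can then be compared against the stx_mnt_id that
         statx(2) reports with STATX_MNT_ID_UNIQUE before the handle is
         passed to open_by_handle_at(2).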
    
       - Add a per-dentry expire timeout to autofs

         There are two fairly well known automounter map formats, the autofs
         format and the amd format (more or less System V and Berkeley).
    
         Some time ago Linux autofs added an amd map format parser that
         implemented a fair amount of the amd functionality. This was done
         within the autofs infrastructure and some functionality wasn't
         implemented because it either didn't make sense or required extra
         kernel changes. The idea was to restrict changes to be within the
         existing autofs functionality as much as possible and leave changes
         with a wider scope to be considered later.
    
         One of these changes is implementing the amd options:
          1) "unmount", expire this mount according to a timeout (same as
             the current autofs default).
          2) "nounmount", don't expire this mount (same as setting the
             autofs timeout to 0 except only for this specific mount).
          3) "utimeout=<seconds>", expire this mount using the specified
             timeout (again same as setting the autofs timeout but only for
             this mount).
    
         To implement these options, per-dentry expire timeouts need to be
         implemented for autofs indirect mounts. This is because all map
         keys (mounts) for autofs indirect mounts use an expire timeout
         stored in the autofs mount super block info structure, so all
         indirect mounts use the same expire timeout.
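
         The mapping from the three amd options to a per-mount timeout can
         be sketched as below. This is an illustration only, not the
         automount daemon's parser or the kernel interface: the function
         name amd_opt_to_timeout() and the DEFAULT_TIMEOUT value are
         hypothetical, and a timeout of 0 is used to mean "never expire"
         as described above.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mount-wide default timeout, in seconds. */
#define DEFAULT_TIMEOUT 600u

/*
 * Map one amd option string onto an expire timeout in seconds.
 * A timeout of 0 means the mount never expires.
 * Returns 0 on success, -1 if opt is not one of the three options
 * or the utimeout value is not a plain decimal number.
 */
static int amd_opt_to_timeout(const char *opt, unsigned int *timeout)
{
    assert(opt != NULL && timeout != NULL);

    if (strcmp(opt, "unmount") == 0) {
        *timeout = DEFAULT_TIMEOUT;   /* same as the mount-wide default */
        return 0;
    }
    if (strcmp(opt, "nounmount") == 0) {
        *timeout = 0;                 /* never expire this mount */
        return 0;
    }
    if (strncmp(opt, "utimeout=", 9) == 0) {
        const char *val = opt + 9;
        char *end;

        /* Require a leading digit so signs and whitespace are rejected. */
        if (*val < '0' || *val > '9')
            return -1;

        errno = 0;
        unsigned long secs = strtoul(val, &end, 10);
        if (errno != 0 || *end != '\0' || secs > UINT_MAX)
            return -1;

        *timeout = (unsigned int)secs;
        return 0;
    }
    return -1;
}
```

         With a per-dentry timeout, a value computed this way would apply
         to a single map key instead of to every indirect mount under the
         same autofs super block.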
    
      Fixes:
    
       - Fix missing fput for FSCONFIG_SET_FD in autofs
    
       - Use param->file for FSCONFIG_SET_FD in coda
    
       - Delete the 'fs/netfs' proc subtree when netfs module exits
    
       - Make sure that struct uid_gid_map fits into a single cacheline
    
       - Don't flush in-flight wb switches for superblocks without cgroup
         writeback
    
       - Correct the idmapping mount example in the idmapping
         documentation
    
       - Fix a race between evict_inodes() and find_inode() and iput()
    
       - Refine the show_inode_state() macro definition in writeback code
    
       - Prevent dump_mapping() from accessing invalid dentry.d_name.name
    
       - Show actual source for debugfs in /proc/mounts
    
       - Annotate data-race of busy_poll_usecs in eventpoll
    
       - Don't WARN for racy path_noexec check in exec code
    
       - Handle OOM on mnt_warn_timestamp_expiry()
    
       - Fix some spelling in the iomap design documentation
    
       - Fix typo in procfs comment
    
       - Fix typo in fs/namespace.c comment
    
      Cleanups:
    
       - Add the VFS git tree to the MAINTAINERS file
    
       - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
         bit in struct file bringing us to 5 free f_mode bits
    
       - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
         the wait mechanism
    
       - Remove the unused path_put_init() helper
    
       - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
         specific
    
       - Replace the unsigned long i_state member with a u32 i_state member
         in struct inode freeing up 4 bytes in struct inode. Instead of
         using the bit based wait apis we're now using the var event apis
         and using the individual bytes of the i_state member to wait on
         state changes
    
       - Explain how per-syscall AT_* flags should be allocated
    
       - Use in_group_or_capable() helper to simplify the posix acl mode
         update code
    
       - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
    
       - Remove the comment about d_rcu_to_refcount() as that function
         doesn't exist anymore
    
       - Add kernel documentation for lookup_fast()
    
       - Don't re-zero eventpoll fields
    
       - Remove outdated comment after close_fd()
    
       - Fix imprecise wording in comment about the pipe filesystem
    
       - Drop GFP_NOFAIL mode from alloc_page_buffers
    
       - Fix missing-blank-line warnings and improve the struct declaration
         in file_table
    
       - Annotate struct poll_list with __counted_by()
    
       - Remove the unused read parameter in percpu-rwsem
    
       - Remove linux/prefetch.h include from direct-io code
    
       - Use kmemdup_array instead of kmemdup for multiple allocation in
         mnt_idmapping code
    
       - Remove unused mnt_cursor_del() declaration
    
      Performance tweaks:
    
       - Dodge smp_mb in break_lease and break_deleg in the common case
    
       - Only read fops once in fops_{get,put}()
    
       - Use RCU in ilookup()
    
       - Elide smp_mb in iversion handling in the common case
    
       - Drop one lock trip in evict()"
    
    * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
      uidgid: make sure we fit into one cacheline
      proc: Fix typo in the comment
      fs/pipe: Correct imprecise wording in comment
      fhandle: expose u64 mount id to name_to_handle_at(2)
      uapi: explain how per-syscall AT_* flags should be allocated
      fs: drop GFP_NOFAIL mode from alloc_page_buffers
      writeback: Refine the show_inode_state() macro definition
      fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
      mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
      netfs: Delete subtree of 'fs/netfs' when netfs module exits
      fs: use LIST_HEAD() to simplify code
      inode: make i_state a u32
      inode: port __I_LRU_ISOLATING to var event
      vfs: fix race between evice_inodes() and find_inode()&iput()
      inode: port __I_NEW to var event
      inode: port __I_SYNC to var event
      fs: reorder i_state bits
      fs: add i_state helpers
      MAINTAINERS: add the VFS git tree
      fs: s/__u32/u32/ for s_fsnotify_mask
      ...