1. 20 Jul, 2020 1 commit
  2. 10 Jun, 2020 39 commits
    • Tomeu Vizoso's avatar
      panfrost: Add compatible string for bifrost · 6f9441f8
      Tomeu Vizoso authored
      
      
      Mesa now supports some Bifrost devices, so enable it.
      Signed-off-by: Tomeu Vizoso's avatarTomeu Vizoso <tomeu.vizoso@collabora.com>
      6f9441f8
    • Tomeu Vizoso's avatar
      panfrost: Make sure GPU is powered on when reading GPU_LATEST_FLUSH_ID · 025731ee
      Tomeu Vizoso authored
      
      
      Bifrost devices do support the flush reduction feature, so on first job
      submit we were trying to read the register while still powered off.
      
      If the GPU is powered off, the feature doesn't bring any benefit, so
      don't try to read.
      Signed-off-by: Tomeu Vizoso's avatarTomeu Vizoso <tomeu.vizoso@collabora.com>
      025731ee
    • Miklos Szeredi's avatar
      fuse: copy_file_range should truncate cache · 7d9c5c14
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      After the copy operation completes the cache is not up-to-date.  Truncate
      all pages in the interval that has successfully been copied.
      
      Truncating completely copied dirty pages is okay, since the data has been
      overwritten anyway.  Truncating partially copied dirty pages is not okay;
      add a comment for now.
      
      Fixes: 88bc7d50
      
       ("fuse: add support for copy_file_range()")
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      7d9c5c14
    • Miklos Szeredi's avatar
      fuse: fix copy_file_range cache issues · 650aa849
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      a) Dirty cache needs to be written back not just in the writeback_cache
      case, since the dirty pages may come from memory maps.
      
      b) The fuse_writeback_range() helper takes an inclusive interval, so the
      end position needs to be pos+len-1 instead of pos+len.
      
      Fixes: 88bc7d50
      
       ("fuse: add support for copy_file_range()")
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      650aa849
    • Maxim Patlasov's avatar
      fuse: optimize writepages search · 58b9f8f5
      Maxim Patlasov authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Re-work fi->writepages, replacing list with rb-tree.  This improves
      performance because kernel fuse iterates through fi->writepages for each
      writeback page and typical number of entries is about 800 (for 100MB of
      fuse writeback).
      
      Before patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 41.3473 s, 260 MB/s
      
       2  1      0 57445400  40416 6323676    0    0    33 374743 8633 19210  1  8 88  3  0
      
        29.86%  [kernel]               [k] _raw_spin_lock
        26.62%  [fuse]                 [k] fuse_page_is_writeback
      
      After patch:
      
      10240+0 records in
      10240+0 records out
      10737418240 bytes (11 GB) copied, 21.4954 s, 500 MB/s
      
       2  9      0 53676040  31744 10265984    0    0    64 854790 10956 48387  1  6 88  6  0
      
        23.55%  [kernel]             [k] copy_user_enhanced_fast_string
         9.87%  [kernel]             [k] __memcpy
         3.10%  [kernel]             [k] _raw_spin_lock
      Signed-off-by: default avatarMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      58b9f8f5
    • Miklos Szeredi's avatar
      fuse: update attr_version counter on fuse_notify_inval_inode() · 28212fed
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the
      attribute cache being updated with stale information after the
      invalidation.
      
      Fix this by bumping the attribute version in fuse_reverse_inval_inode().
      Reported-by: default avatarKrzysztof Rusek <rusek@9livesdata.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      28212fed
    • Miklos Szeredi's avatar
      fuse: don't check refcount after stealing page · 7d5f9f95
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      page_count() is unstable.  Unless there has been an RCU grace period
      between when the page was removed from the page cache and now, a
      speculative reference may exist from the page cache.
      Reported-by: default avatarMatthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      7d5f9f95
    • Miklos Szeredi's avatar
      fuse: fix weird page warning · 72c71463
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      When PageWaiters was added, updating this check was missed.
      Reported-by: default avatarNikolaus Rath <Nikolaus@rath.org>
      Reported-by: default avatarHugh Dickins <hughd@google.com>
      Fixes: 62906027
      
       ("mm: add PageWaiters indicating tasks are waiting for a page bit")
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      72c71463
    • Miklos Szeredi's avatar
      fuse: use dump_page · 4e08e055
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Instead of custom page dumping, use the standard helper.
      Reported-by: default avatarMatthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      4e08e055
    • Vivek Goyal's avatar
      virtiofs: do not use fuse_fill_super_common() for device installation · f2dd4a96
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      fuse_fill_super_common() allocates and installs one fuse_device.  Hence
      virtiofs allocates and install all fuse devices by itself except one.
      
      This makes logic little twisted.  There does not seem to be any real need
      that why virtiofs can't allocate and install all fuse devices itself.
      
      So opt out of fuse device allocation and installation while calling
      fuse_fill_super_common().
      
      Regular fuse still wants fuse_fill_super_common() to install fuse_device.
      It needs to prevent against races where two mounters are trying to mount
      fuse using same fd.  In that case one will succeed while other will get
      -EINVAL.
      
      virtiofs does not have this issue because sget_fc() resolves the race
      w.r.t multiple mounters and only one instance of virtio_fs_fill_super()
      should be in progress for same filesystem.
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      f2dd4a96
    • Miklos Szeredi's avatar
      fuse: always allow query of st_dev · 99d3552d
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Fuse mounts without "allow_other" are off-limits to all non-owners.  Yet it
      makes sense to allow querying st_dev on the root, since this value is
      provided by the kernel, not the userspace filesystem.
      
      Allow statx(2) with a zero request mask to succeed on a fuse mounts for all
      users.
      Reported-by: default avatarNikolaus Rath <Nikolaus@rath.org>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      99d3552d
    • Miklos Szeredi's avatar
      fuse: always flush dirty data on close(2) · 77f20f9e
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      We want cached data to synced with the userspace filesystem on close(), for
      example to allow getting correct st_blocks value.  Do this regardless of
      whether the userspace filesystem implements a FLUSH method or not.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      77f20f9e
    • Eryu Guan's avatar
      fuse: invalidate inode attr in writeback cache mode · 9be06313
      Eryu Guan authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Under writeback mode, inode->i_blocks is not updated, making utils du
      read st.blocks as 0.
      
      For example, when using virtiofs (cache=always & nondax mode) with
      writeback_cache enabled, writing a new file and check its disk usage
      with du, du reports 0 usage.
      
        # uname -r
        5.6.0-rc6+
        # mount -t virtiofs virtiofs /mnt/virtiofs
        # rm -f /mnt/virtiofs/testfile
      
        # create new file and do extend write
        # xfs_io -fc "pwrite 0 4k" /mnt/virtiofs/testfile
        wrote 4096/4096 bytes at offset 0
        4 KiB, 1 ops; 0.0001 sec (28.103 MiB/sec and 7194.2446 ops/sec)
        # du -k /mnt/virtiofs/testfile
        0               <==== disk usage is 0
        # stat -c %s,%b /mnt/virtiofs/testfile
        4096,0          <==== i_size is correct, but st_blocks is 0
      
      Fix it by invalidating attr in fuse_flush(), so we get up-to-date attr
      from server on next getattr.
      Signed-off-by: default avatarEryu Guan <eguan@linux.alibaba.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      9be06313
    • Kirill Tkhai's avatar
      fuse: Update stale comment in queue_interrupt() · 8621e830
      Kirill Tkhai authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      Fixes: 04ec5af0
      
       "fuse: export fuse_end_request()"
      Signed-off-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      8621e830
    • Vasily Averin's avatar
      fuse: BUG_ON correction in fuse_dev_splice_write() · 40a0e2c5
      Vasily Averin authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      commit 96354535 ("fuse: reduce allocation size for splice_write")
      changed size of bufs array, so BUG_ON which checks the index of the array
      shold also be fixed.
      
      [SzM: turn BUG_ON into WARN_ON]
      
      Fixes: 96354535
      
       ("fuse: reduce allocation size for splice_write")
      Signed-off-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      40a0e2c5
    • Masayoshi Mizuma's avatar
      virtiofs: Add mount option and atime behavior to the doc · 5ef032f6
      Masayoshi Mizuma authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Add a section to show the mount option and a subsection to show
      the atime behavior.
      Signed-off-by: default avatarMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      5ef032f6
    • Vivek Goyal's avatar
      virtiofs: schedule blocking async replies in separate worker · 1c0645eb
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      In virtiofs (unlike in regular fuse) processing of async replies is
      serialized.  This can result in a deadlock in rare corner cases when
      there's a circular dependency between the completion of two or more async
      replies.
      
      Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR ==
      SCRATCH_MNT (which is a misconfiguration):
      
       - Process A is waiting for page lock in worker thread context and blocked
         (virtio_fs_requests_done_work()).
       - Process B is holding page lock and waiting for pending writes to
         finish (fuse_wait_on_page_writeback()).
       - Write requests are waiting in virtqueue and can't complete because
         worker thread is blocked on page lock (process A).
      
      Fix this by creating a unique work_struct for each async reply that can
      block (O_DIRECT read).
      
      Fixes: a62a8ef9
      
       ("virtio-fs: add virtiofs filesystem")
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      1c0645eb
    • youngjun's avatar
      ovl: remove unnecessary lock check · ceb54edc
      youngjun authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Directory is always locked until "out_unlock" label.  So lock check is not
      needed.
      Signed-off-by: default avataryoungjun <her0gyugyu@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      ceb54edc
    • Miklos Szeredi's avatar
      ovl: make oip->index bool · b19e4608
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      ovl_get_inode() uses oip->index as a bool value, not as a pointer.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      b19e4608
    • Miklos Szeredi's avatar
      ovl: only pass ->ki_flags to ovl_iocb_to_rwf() · 543fbed3
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Next patch will want to pass a modified set of flags, so...
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      543fbed3
    • Miklos Szeredi's avatar
      ovl: make private mounts longterm · 6422a01e
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Overlayfs is using clone_private_mount() to create internal mounts for
      underlying layers.  These are used for operations requiring a path, such as
      dentry_open().
      
      Since these private mounts are not in any namespace they are treated as
      short term, "detached" mounts and mntput() involves taking the global
      mount_lock, which can result in serious cacheline pingpong.
      
      Make these private mounts longterm instead, which trade the penalty on
      mntput() for a slightly longer shutdown time due to an added RCU grace
      period when putting these mounts.
      
      Introduce a new helper kern_unmount_many() that can take care of multiple
      longterm mounts with a single RCU grace period.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      6422a01e
    • Miklos Szeredi's avatar
      ovl: get rid of redundant members in struct ovl_fs · 131a39c8
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      ofs->upper_mnt is copied to ->layers[0].mnt and ->layers[0].trap could be
      used instead of a separate ->upperdir_trap.
      
      Split the lowerdir option early to get the number of layers, then allocate
      the ->layers array, and finally fill the upper and lower layers, as before.
      
      Get rid of path_put_init() in ovl_lower_dir(), since the only caller will
      take care of that.
      
      [Colin Ian King] Fix null pointer dereference on null stack pointer on
      error return found by Coverity.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      131a39c8
    • Miklos Szeredi's avatar
      ovl: add accessor for ofs->upper_mnt · bd229d8e
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Next patch will remove ofs->upper_mnt, so add an accessor function for this
      field.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      bd229d8e
    • yshui's avatar
      ovl: initialize error in ovl_copy_xattr · f2d630d0
      yshui authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      In ovl_copy_xattr, if all the xattrs to be copied are overlayfs private
      xattrs, the copy loop will terminate without assigning anything to the
      error variable, thus returning an uninitialized value.
      
      If ovl_copy_xattr is called from ovl_clear_empty, this uninitialized error
      value is put into a pointer by ERR_PTR(), causing potential invalid memory
      accesses down the line.
      
      This commit initialize error with 0. This is the correct value because when
      there's no xattr to copy, because all xattrs are private, ovl_copy_xattr
      should succeed.
      
      This bug is discovered with the help of INIT_STACK_ALL and clang.
      Signed-off-by: yshui's avatarYuxuan Shui <yshuiv7@gmail.com>
      Link: https://bugs.chromium.org/p/chromium/issues/detail?id=1050405
      Fixes: 0956254a
      
       ("ovl: don't copy up opaqueness")
      Cc: stable@vger.kernel.org # v4.8
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      f2d630d0
    • Chengguang Xu's avatar
      ovl: drop negative dentry in upper layer · c6b2aa35
      Chengguang Xu authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Negative dentries of upper layer are useless after construction of
      overlayfs' own dentry and may keep in the memory long time even after
      unmount of overlayfs instance. This patch tries to drop unnecessary
      negative dentry of upper layer to effectively reclaim memory.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      c6b2aa35
    • Miklos Szeredi's avatar
      ovl: check permission to open real file · 74885c16
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Call inode_permission() on real inode before opening regular file on one of
      the underlying layers.
      
      In some cases ovl_permission() already checks access to an underlying file,
      but it misses the metacopy case, and possibly other ones as well.
      
      Removing the redundant permission check from ovl_permission() should be
      considered later.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      74885c16
    • Miklos Szeredi's avatar
      ovl: call secutiry hook in ovl_real_ioctl() · 5e919869
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Verify LSM permissions for underlying file, since vfs_ioctl() doesn't do
      it.
      
      [Stephen Rothwell] export security_file_ioctl
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      5e919869
    • Miklos Szeredi's avatar
      ovl: verify permissions in ovl_path_open() · 6cfd9208
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Check permission before opening a real file.
      
      ovl_path_open() is used by readdir and copy-up routines.
      
      ovl_permission() theoretically already checked copy up permissions, but it
      doesn't hurt to re-do these checks during the actual copy-up.
      
      For directory reading ovl_permission() only checks access to topmost
      underlying layer.  Readdir on a merged directory accesses layers below the
      topmost one as well.  Permission wasn't checked for these layers.
      
      Note: modifying ovl_permission() to perform this check would be far more
      complex and hence more bug prone.  The result is less precise permissions
      returned in access(2).  If this turns out to be an issue, we can revisit
      this bug.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      6cfd9208
    • Miklos Szeredi's avatar
      ovl: switch to mounter creds in readdir · 6dcbca50
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      In preparation for more permission checking, override credentials for
      directory operations on the underlying filesystems.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      6dcbca50
    • Miklos Szeredi's avatar
      ovl: pass correct flags for opening real directory · 1120f780
      Miklos Szeredi authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      The three instances of ovl_path_open() in overlayfs/readdir.c do three
      different things:
      
       - pass f_flags from overlay file
       - pass O_RDONLY | O_DIRECTORY
       - pass just O_RDONLY
      
      The value of f_flags can be (other than O_RDONLY):
      
      O_WRONLY	- not possible for a directory
      O_RDWR		- not possible for a directory
      O_CREAT		- masked out by dentry_open()
      O_EXCL		- masked out by dentry_open()
      O_NOCTTY	- masked out by dentry_open()
      O_TRUNC		- masked out by dentry_open()
      O_APPEND	- no effect on directory ops
      O_NDELAY	- no effect on directory ops
      O_NONBLOCK	- no effect on directory ops
      __O_SYNC	- no effect on directory ops
      O_DSYNC		- no effect on directory ops
      FASYNC		- no effect on directory ops
      O_DIRECT	- no effect on directory ops
      O_LARGEFILE	- ?
      O_DIRECTORY	- only affects lookup
      O_NOFOLLOW	- only affects lookup
      O_NOATIME	- overlay sets this unconditionally in ovl_path_open()
      O_CLOEXEC	- only affects fd allocation
      O_PATH		- no effect on directory ops
      __O_TMPFILE	- not possible for a directory
      
      
      Fon non-merge directories we use the underlying filesystem's iterate; in
      this case honor O_LARGEFILE from the original file to make sure that open
      doesn't get rejected.
      
      For merge directories it's safe to pass O_LARGEFILE unconditionally since
      userspace will only see the artificial offsets created by overlayfs.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      1120f780
    • Vivek Goyal's avatar
      ovl: fix redirect traversal on metacopy dentries · 3c3f53e4
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Amir pointed me to metacopy test cases in unionmount-testsuite and I
      decided to run "./run --ov=10 --meta" and it failed while running test
      "rename-mass-5.py".
      
      Problem is w.r.t absolute redirect traversal on intermediate metacopy
      dentry.  We do not store intermediate metacopy dentries and also skip
      current loop/layer and move onto lookup in next layer.  But at the end of
      loop, we have logic to reset "poe" and layer index if currnently looked up
      dentry has absolute redirect.  We skip all that and that means lookup in
      next layer will fail.
      
      Following is simple test case to reproduce this.
      
      - mkdir -p lower upper work merged lower/a lower/b
      - touch lower/a/foo.txt
      - mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work,metacopy=on none merged
      
      # Following will create absolute redirect "/a/foo.txt" on upper/b/bar.txt.
      - mv merged/a/foo.txt merged/b/bar.txt
      
      # unmount overlay and use upper as lower layer (lower2) for next mount.
      - umount merged
      - mv upper lower2
      - rm -rf work; mkdir -p upper work
      - mount -t overlay -o lowerdir=lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
      
      # Force a metacopy copy-up
      - chown bin:bin merged/b/bar.txt
      
      # unmount overlay and use upper as lower layer (lower3) for next mount.
      - umount merged
      - mv upper lower3
      - rm -rf work; mkdir -p upper work
      - mount -t overlay -o lowerdir=lower3:lower2:lower,upperdir=upper,workdir=work,metacopy=on none merged
      
      # ls merged/b/bar.txt
      ls: cannot access 'bar.txt': Input/output error
      
      Intermediate lower layer (lower2) has metacopy dentry b/bar.txt with
      absolute redirect "/a/foo.txt".  We skipped redirect processing at the end
      of loop which sets poe to roe and sets the appropriate next lower layer
      index.  And that means lookup failed in next layer.
      
      Fix this by continuing the loop for any intermediate dentries.  We still do
      not save these at lower stack.  With this fix applied unionmount-testsuite,
      "./run --ov-10 --meta" now passes.
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      3c3f53e4
    • Vivek Goyal's avatar
      ovl: initialize OVL_UPPERDATA in ovl_lookup() · 9ce96a34
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Currently ovl_get_inode() initializes OVL_UPPERDATA flag and for that it
      has to call ovl_check_metacopy_xattr() and check if metacopy xattr is
      present or not.
      
      yangerkun reported sometimes underlying filesystem might return -EIO and in
      that case error handling path does not cleanup properly leading to various
      warnings.
      
      Run generic/461 with ext4 upper/lower layer sometimes may trigger the bug
      as below(linux 4.19):
      
      [  551.001349] overlayfs: failed to get metacopy (-5)
      [  551.003464] overlayfs: failed to get inode (-5)
      [  551.004243] overlayfs: cleanup of 'd44/fd51' failed (-5)
      [  551.004941] overlayfs: failed to get origin (-5)
      [  551.005199] ------------[ cut here ]------------
      [  551.006697] WARNING: CPU: 3 PID: 24674 at fs/inode.c:1528 iput+0x33b/0x400
      ...
      [  551.027219] Call Trace:
      [  551.027623]  ovl_create_object+0x13f/0x170
      [  551.028268]  ovl_create+0x27/0x30
      [  551.028799]  path_openat+0x1a35/0x1ea0
      [  551.029377]  do_filp_open+0xad/0x160
      [  551.029944]  ? vfs_writev+0xe9/0x170
      [  551.030499]  ? page_counter_try_charge+0x77/0x120
      [  551.031245]  ? __alloc_fd+0x160/0x2a0
      [  551.031832]  ? do_sys_open+0x189/0x340
      [  551.032417]  ? get_unused_fd_flags+0x34/0x40
      [  551.033081]  do_sys_open+0x189/0x340
      [  551.033632]  __x64_sys_creat+0x24/0x30
      [  551.034219]  do_syscall_64+0xd5/0x430
      [  551.034800]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      One solution is to improve error handling and call iget_failed() if error
      is encountered.  Amir thinks that this path is little intricate and there
      is not real need to check and initialize OVL_UPPERDATA in ovl_get_inode().
      Instead caller of ovl_get_inode() can initialize this state.  And this will
      avoid double checking of metacopy xattr lookup in ovl_lookup() and
      ovl_get_inode().
      
      OVL_UPPERDATA is inode flag.  So I was little concerned that initializing
      it outside ovl_get_inode() might have some races.  But this is one way
      transition.  That is once a file has been fully copied up, it can't go back
      to metacopy file again.  And that seems to help avoid races.  So as of now
      I can't see any races w.r.t OVL_UPPERDATA being set wrongly.  So move
      settingof OVL_UPPERDATA inside the callers of ovl_get_inode().
      ovl_obtain_alias() already does it.  So only two callers now left are
      ovl_lookup() and ovl_instantiate().
      Reported-by: default avataryangerkun <yangerkun@huawei.com>
      Suggested-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      9ce96a34
    • Vivek Goyal's avatar
      ovl: use only uppermetacopy state in ovl_lookup() · f390b4bd
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Currently we use a variable "metacopy" which signifies that dentry could be
      either uppermetacopy or lowermetacopy.  Amir suggested that we can move
      code around and use d.metacopy in such a way that we don't need
      lowermetacopy and just can do away with uppermetacopy.
      
      So this patch replaces "metacopy" with "uppermetacopy".
      
      It also moves some code little higher to keep reading little simpler.
      Suggested-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      f390b4bd
    • Vivek Goyal's avatar
      ovl: simplify setting of origin for index lookup · b144a351
      Vivek Goyal authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      overlayfs can keep index of copied up files and directories and it seems to
      serve two primary puroposes.  For regular files, it avoids breaking lower
      hardlinks over copy up.  For directories it seems to be used for various
      error checks.
      
      During ovl_lookup(), we lookup for index using lower dentry in many a
      cases.  That lower dentry is called "origin" and following is a summary of
      current logic.
      
      If there is no upperdentry, always lookup for index using lower dentry.
      For regular files it helps avoiding breaking hard links over copyup and for
      directories it seems to be just error checks.
      
      If there is an upperdentry, then there are 3 possible cases.
      
       - For directories, lower dentry is found using two ways.  One is regular
        path based lookup in lower layers and second is using ORIGIN xattr on
        upper dentry.  First verify that path based lookup lower dentry matches
        the one pointed by upper ORIGIN xattr.  If yes, use this verified origin
        for index lookup.
      
       - For regular files (non-metacopy), there is no path based lookup in lower
        layers as lookup stops once we find upper dentry.  So there is no origin
        verification.  If there is ORIGIN xattr present on upper, use that to
        lookup index otherwise don't.
      
       - For regular metacopy files, again lower dentry is found using path based
        lookup as well as ORIGIN xattr on upper.  Path based lookup is continued
        in this case to find lower data dentry for metacopy upper.  So like
        directories we only use verified origin.  If ORIGIN xattr is not present
        (Either because lower did not support file handles or because this is
        hardlink copied up with index=off), then don't use path lookup based
        lower dentry as origin.  This is same as regular non-metacopy file case.
      Suggested-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      b144a351
    • Amir Goldstein's avatar
      ovl: fix out of bounds access warning in ovl_check_fb_len() · da51e2cb
      Amir Goldstein authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      syzbot reported out of bounds memory access from open_by_handle_at()
      with a crafted file handle that looks like this:
      
        { .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }
      
      handle_bytes gets rounded down to 0 and we end up calling:
        ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)
      
      But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
      fh + 3 is illegal.
      
      Fixes: cbe7fba8
      
       ("ovl: make sure that real fid is 32bit aligned in memory")
      Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org> # v5.5
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      da51e2cb
    • Lubos Dolezel's avatar
      ovl: return required buffer size for file handles · 8ff6df89
      Lubos Dolezel authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Overlayfs doesn't work well with the fanotify mechanism.
      
      Fanotify first probes for the required buffer size for the file handle,
      but overlayfs currently bails out without passing the size back.
      
      That results in errors in the kernel log, such as:
      
      [527944.485384] overlayfs: failed to encode file handle (/, err=-75, buflen=0, len=29, type=1)
      [527944.485386] fanotify: failed to encode fid (fsid=ae521e68.a434d95f, type=255, bytes=0, err=-2)
      Signed-off-by: default avatarLubos Dolezel <lubos@dolezel.info>
      Signed-off-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      8ff6df89
    • Chengguang Xu's avatar
      ovl: sync dirty data when remounting to ro mode · 9089b9d0
      Chengguang Xu authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      sync_filesystem() does not sync dirty data for readonly filesystem during
      umount, so before changing to readonly filesystem we should sync dirty data
      for data integrity.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      9089b9d0
    • Chengguang Xu's avatar
      ovl: whiteout inode sharing · bbd1419a
      Chengguang Xu authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      
      
      Share inode with different whiteout files for saving inode and speeding up
      delete operation.
      
      If EMLINK is encountered when linking a shared whiteout, create a new one.
      In case of any other error, disable sharing for this super block.
      
      Note: ofs->whiteout is protected by inode lock on workdir.
      Signed-off-by: default avatarChengguang Xu <cgxu519@mykernel.net>
      Reviewed-by: default avatarAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      bbd1419a
    • Jeffle Xu's avatar
      ovl: inherit SB_NOSEC flag from upperdir · ac59a5a2
      Jeffle Xu authored and Tomeu Vizoso's avatar Tomeu Vizoso committed
      Since the stacking of regular file operations [1], the overlayfs edition of
      write_iter() is called when writing regular files.
      
      Since then, xattr lookup is needed on every write since file_remove_privs()
      is called from ovl_write_iter(), which would become the performance
      bottleneck when writing small chunks of data. In my test case,
      file_remove_privs() would consume ~15% CPU when running fstime of unixbench
      (the workload is repeadly writing 1 KB to the same file) [2].
      
      Inherit the SB_NOSEC flag from upperdir. Since then xattr lookup would be
      done only once on the first write. Unixbench fstime gets a ~20% performance
      gain with this patch.
      
      [1] https://lore.kernel.org/lkml/20180606150905.GC9426@magnolia/T/
      [2] https://www.spinics.net/lists/linux-unionfs/msg07153.html
      
      Signed-off-by: default avatarJeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      ac59a5a2