1. 17 Jan, 2019 1 commit
    • Marc Dionne's avatar
      afs: Don't set vnode->cb_s_break in afs_validate() · 4882a27c
      Marc Dionne authored
      A cb_interest record is not necessarily attached to the vnode on entry to
      afs_validate(), which can cause an oops when we try to bring the vnode's
      cb_s_break up to date in the default case (ie. no current callback promise
      and the vnode has not been deleted).
      Fix this by simply removing the line, as vnode->cb_s_break will be set when
      needed by afs_register_server_cb_interest() when we next get a callback
      promise from RPC call.
      The oops looks something like:
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          RIP: 0010:afs_validate+0x66/0x250 [kafs]
          Call Trace:
           afs_d_revalidate+0x8d/0x340 [kafs]
           ? __d_lookup+0x61/0x150
           ? lookup_dcache+0x44/0x70
      Fixes: ae3b7361 ("afs: Fix validation/callback interaction")
      Signed-off-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  2. 30 Nov, 2018 1 commit
    • David Howells's avatar
      afs: Fix validation/callback interaction · ae3b7361
      David Howells authored
      When afs_validate() is called to validate a vnode (inode), there are two
      unhandled cases in the fastpath at the top of the function:
       (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
           counters match and the data has expired, then there's an implicit case
           in which the vnode needs revalidating.
           This has no consequences since the default "valid = false" set at the
           top of the function happens to do the right thing.
       (2) If the vnode is not promised and it hasn't been deleted
           (AFS_VNODE_DELETED is not set) then there's a default case we're not
           handling in which the vnode is invalid.  If the vnode is invalid, we
           need to bring cb_s_break and cb_v_break up to date before we refetch
           the status.
           As a consequence, once the server loses track of the client
           (ie. sufficient time has passed since we last sent it an operation),
           it will send us a CB.InitCallBackState* operation when we next try to
           talk to it.  This calls afs_init_callback_state() which increments
           afs_server::cb_s_break, but this then doesn't propagate to the
           afs_vnode record.
           The result being that every afs_validate() call thereafter sends a
           status fetch operation to the server.
      Clarify and fix this by:
       (A) Setting valid in all the branches rather than initialising it at the
           top so that the compiler catches where we've missed.
       (B) Restructuring the logic in the 'promised' branch so that we set valid
           to false if the callback is due to expire (or has expired) and so that
           the final case is that the vnode is still valid.
       (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
           promised and deleted cases don't match.
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  3. 23 Oct, 2018 3 commits
  4. 14 May, 2018 2 commits
    • David Howells's avatar
      afs: Fix whole-volume callback handling · 68251f0a
      David Howells authored
      It's possible for an AFS file server to issue a whole-volume notification
      that callbacks on all the vnodes in the file have been broken.  This is
      done for R/O and backup volumes (which don't have per-file callbacks) and
      for things like a volume being taken offline.
      Fix callback handling to detect whole-volume notifications, to track it
      across operations and to check it during inode validation.
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Fix directory page locking · b61f7dcf
      David Howells authored
      The afs directory loading code (primarily afs_read_dir()) locks all the
      pages that hold a directory's content blob to defend against
      getdents/getdents races and getdents/lookup races where the competitors
      issue conflicting reads on the same data.  As the reads will complete
      consecutively, they may retrieve different versions of the data and
      one may overwrite the data that the other is busy parsing.
      Fix this by not locking the pages at all, but rather by turning the
      validation lock into an rwsem and getting an exclusive lock on it whilst
      reading the data or validating the attributes and a shared lock whilst
      parsing the data.  Sharing the attribute validation lock should be fine as
      the data fetch will retrieve the attributes also.
      The individual page locks aren't needed at all as the only place they're
      being used is to serialise data loading.
      Without this patch, the:
       	if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) {
      part of afs_read_dir() may be skipped, leaving the pages unlocked when we
      hit the success: clause - in which case we try to unlock the not-locked
      pages, leading to the following oops:
        page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0
        flags: 0xfffe000001004(referenced|private)
        raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff
        raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000
        page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
        ------------[ cut here ]------------
        kernel BUG at mm/filemap.c:1205!
        RIP: 0010:unlock_page+0x43/0x50
        Call Trace:
         afs_dir_iterate+0x789/0x8f0 [kafs]
         ? _cond_resched+0x15/0x30
         ? kmem_cache_alloc_trace+0x166/0x1d0
         ? afs_do_lookup+0x69/0x490 [kafs]
         ? afs_do_lookup+0x101/0x490 [kafs]
         ? key_default_cmp+0x20/0x20
         ? request_key+0x3c/0x80
         ? afs_lookup+0xf1/0x340 [kafs]
         ? __lookup_slow+0x97/0x150
         ? lookup_slow+0x35/0x50
         ? walk_component+0x1bf/0x490
         ? path_lookupat.isra.52+0x75/0x200
         ? filename_lookup.part.66+0xa0/0x170
         ? afs_end_vnode_operation+0x41/0x60 [kafs]
         ? __check_object_size+0x9c/0x171
         ? strncpy_from_user+0x4a/0x170
         ? vfs_statx+0x73/0xe0
         ? __do_sys_newlstat+0x39/0x70
         ? __x64_sys_getdents+0xc9/0x140
         ? __x64_sys_getdents+0x140/0x140
         ? do_syscall_64+0x5b/0x160
         ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Fixes: f3ddee8d ("afs: Fix directory handling")
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  5. 09 Apr, 2018 9 commits
    • David Howells's avatar
      afs: Trace protocol errors · 5f702c8e
      David Howells authored
      Trace protocol errors detected in afs.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Locally edit directory data for mkdir/create/unlink/... · 63a4681f
      David Howells authored
      Locally edit the contents of an AFS directory upon a successful inode
      operation that modifies that directory (such as mkdir, create and unlink)
      so that we can avoid the current practice of re-downloading the directory
      after each change.
      This is viable provided that the directory version number we get back from
      the modifying RPC op is exactly incremented by 1 from what we had
      previously.  The data in the directory contents is in a defined format that
      we have to parse locally to perform lookups and readdir, so modifying isn't
      a problem.
      If the edit fails, we just clear the VALID flag on the directory and it
      will be reloaded next time it is needed.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Fix directory handling · f3ddee8d
      David Howells authored
      AFS directories are structured blobs that are downloaded just like files
      and then parsed by the lookup and readdir code and, as such, are currently
      handled in the pagecache like any other file, with the entire directory
      content being thrown away each time the directory changes.
      However, since the blob is a known structure and since the data version
      counter on a directory increases by exactly one for each change committed
      to that directory, we can actually edit the directory locally rather than
      fetching it from the server after each locally-induced change.
      What we can't do, though, is mix data from the server and data from the
      client since the server is technically at liberty to rearrange or compress
      a directory if it sees fit, provided it updates the data version number
      when it does so and breaks the callback (ie. sends a notification).
      Further, lookup with lookup-ahead, readdir and, when it arrives, local
      editing are likely want to scan the whole of a directory.
      So directory handling needs to be improved to maintain the coherency of the
      directory blob prior to permitting local directory editing.
      To this end:
       (1) If any directory page gets discarded, invalidate and reread the entire
       (2) If readpage notes that if when it fetches a single page that the
           version number has changed, the entire directory is flagged for
       (3) Read as much of the directory in one go as we can.
      Note that this removes local caching of directories in fscache for the
      moment as we can't pass the pages to fscache_read_or_alloc_pages() since
      page->lru is in use by the LRU.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Keep track of invalid-before version for dentry coherency · a4ff7401
      David Howells authored
      Each afs dentry is tagged with the version that the parent directory was at
      last time it was validated and, currently, if this differs, the directory
      is scanned and the dentry is refreshed.
      However, this leads to an excessive amount of revalidation on directories
      that get modified on the client without conflict with another client.  We
      know there's no conflict because the parent directory's data version number
      got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore
      we can trust the current state of the unaffected dentries when we perform a
      local directory modification.
      Optimise by keeping track of the last version of the parent directory that
      was changed outside of the client in the parent directory's vnode and using
      that to validate the dentries rather than the current version.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Rearrange status mapping · dd9fbcb8
      David Howells authored
      Rearrange the AFSFetchStatus to inode attribute mapping code in a number of
       (1) Use an XDR structure rather than a series of incremented pointer
           accesses when decoding an AFSFetchStatus object.  This allows
           out-of-order decode.
       (2) Don't store the if_version value but rather just check it and abort if
           it's not something we can handle.
       (3) Store the owner and group in the status record as raw values rather
           than converting them to kuid/kgid.  Do that when they're mapped into
       (4) Validate the type and abort code up front and abort if they're wrong.
       (5) Split the inode attribute setting out into its own function from the
           XDR decode of an AFSFetchStatus object.  This allows it to be called
           from elsewhere too.
       (6) Differentiate changes to data from changes to metadata.
       (7) Use the split-out attribute mapping function from afs_iget().
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Make it possible to get the data version in readpage · 0c3a5ac2
      David Howells authored
      Store the data version number indicated by an FS.FetchData op into the read
      request structure so that it's accessible by the page reader.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Init inode before accessing cache · 5800db81
      David Howells authored
      We no longer parse symlinks when we get the inode to determine if this
      symlink is actually a mountpoint as we detect that by examining the mode
      instead (symlinks are always 0777 and mountpoints 0644).
      Access the cache after mapping the status so that we don't have to manually
      set the inode size now.
      Note that this may need adjusting if the disconnected operation is
      implemented as the file metadata may have to be obtained from the cache.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Introduce a statistics proc file · d55b4da4
      David Howells authored
      Introduce a proc file that displays a bunch of statistics for the AFS
      filesystem in the current network namespace.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Fix checker warnings · fe342cf7
      David Howells authored
      Fix warnings raised by checker, including:
       (*) Warnings raised by unequal comparison for the purposes of sorting,
           where the endianness doesn't matter:
      fs/afs/addr_list.c:246:21: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:246:30: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:248:21: warning: restricted __be32 degrades to integer
      fs/afs/addr_list.c:248:49: warning: restricted __be32 degrades to integer
      fs/afs/addr_list.c:283:21: warning: restricted __be16 degrades to integer
      fs/afs/addr_list.c:283:30: warning: restricted __be16 degrades to integer
       (*) afs_set_cb_interest() is not actually used and can be removed.
       (*) afs_cell_gc_delay() should be provided with a sysctl.
       (*) afs_cell_destroy() needs to use rcu_access_pointer() to read
       (*) afs_init_fs_cursor() should be static.
       (*) struct afs_vnode::permit_cache needs to be marked __rcu.
       (*) afs_server_rcu() needs to use rcu_access_pointer().
       (*) afs_destroy_server() should use rcu_access_pointer() on
           server->addresses as the server object is no longer accessible.
       (*) afs_find_server() casts __be16/__be32 values to int in order to
           directly compare them for the purpose of finding a match in a list,
           but is should also annotate the cast with __force to avoid checker
       (*) afs_check_permit() accesses vnode->permit_cache outside of the RCU
           readlock, though it doesn't then access the value; the extraneous
           access is deleted.
      False positives:
       (*) Conditional locking around the code in xdr_decode_AFSFetchStatus.  This
           can be dealt with in a separate patch.
      fs/afs/fsclient.c:148:9: warning: context imbalance in 'xdr_decode_AFSFetchStatus' - different lock contexts for basic block
       (*) Incorrect handling of seq-retry lock context balance:
      fs/afs/inode.c:455:38: warning: context imbalance in 'afs_getattr' - different
      lock contexts for basic block
      fs/afs/server.c:52:17: warning: context imbalance in 'afs_find_server' - different lock contexts for basic block
      fs/afs/server.c:128:17: warning: context imbalance in 'afs_find_server_by_uuid' - different lock contexts for basic block
       (*) afs_lookup_cell_rcu() needs to break out of the seq-retry loop, not go
           round again if it successfully found the workstation cell.
       (*) Fix UUID decode in afs_deliver_cb_probe_uuid().
       (*) afs_cache_permit() has a missing rcu_read_unlock() before one of the
           jumps to the someone_else_changed_it label.  Move the unlock to after
           the label.
       (*) afs_vl_get_addrs_u() is using ntohl() rather than htonl() when
           encoding to XDR.
       (*) afs_deliver_yfsvl_get_endpoints() is using htonl() rather than ntohl()
           when decoding from XDR.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  6. 06 Apr, 2018 1 commit
  7. 04 Apr, 2018 3 commits
    • David Howells's avatar
      fscache: Attach the index key and aux data to the cookie · 402cb8dd
      David Howells authored
      Attach copies of the index key and auxiliary data to the fscache cookie so
       (1) The callbacks to the netfs for this stuff can be eliminated.  This
           can simplify things in the cache as the information is still
           available, even after the cache has relinquished the cookie.
       (2) Simplifies the locking requirements of accessing the information as we
           don't have to worry about the netfs object going away on us.
       (3) The cache can do lazy updating of the coherency information on disk.
           As long as the cache is flushed before reboot/poweroff, there's no
           need to update the coherency info on disk every time it changes.
       (4) Cookies can be hashed or put in a tree as the index key is easily
           available.  This allows:
           (a) Checks for duplicate cookies can be made at the top fscache layer
           	 rather than down in the bowels of the cache backend.
           (b) Caching can be added to a netfs object that has a cookie if the
           	 cache is brought online after the netfs object is allocated.
      A certain amount of space is made in the cookie for inline copies of the
      data, but if it won't fit there, extra memory will be allocated for it.
      The downside of this is that live cache operation requires more memory.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Acked-by: default avatarAnna Schumaker <anna.schumaker@netapp.com>
      Tested-by: default avatarSteve Dickson <steved@redhat.com>
    • David Howells's avatar
      afs: Be more aggressive in retiring cached vnodes · 678edd09
      David Howells authored
      When relinquishing cookies, either due to iget failure or to inode
      eviction, retire a cookie if we think the corresponding vnode got deleted
      on the server rather than just letting it lie in the cache.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Invalidate cache on server data change · c1515999
      David Howells authored
      Invalidate any data stored in fscache for a vnode that changes on the
      server so that we don't end up with the cache in a bad state locally.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  8. 06 Feb, 2018 1 commit
    • David Howells's avatar
      afs: Support the AFS dynamic root · 4d673da1
      David Howells authored
      Support the AFS dynamic root which is a pseudo-volume that doesn't connect
      to any server resource, but rather is just a root directory that
      dynamically creates mountpoint directories where the name of such a
      directory is the name of the cell.
      Such a mount can be created thus:
      	mount -t afs none /afs -o dyn
      Dynamic root superblocks aren't shared except by bind mounts and
      propagation.  Cell root volumes can then be mounted by referring to them by
      name, e.g.:
      	ls /afs/grand.central.org/
      	ls /afs/.grand.central.org/
      The kernel will upcall to consult the DNS if the address wasn't supplied
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  9. 29 Jan, 2018 1 commit
    • Jeff Layton's avatar
      afs: convert to new i_version API · a01179e6
      Jeff Layton authored
      For AFS, it's generally treated as an opaque value, so we use the
      *_raw variants of the API here.
      Note that AFS has quite a different definition for this counter. AFS
      only increments it on changes to the data to the data in regular files
      and contents of the directories. Inode metadata changes do not result
      in a version increment.
      We'll need to reconcile that somehow if we ever want to present this to
      userspace via statx.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
  10. 02 Jan, 2018 1 commit
    • David Howells's avatar
      afs: Fix unlink · 440fbc3a
      David Howells authored
      Repeating creation and deletion of a file on an afs mount will run the box
      out of memory, e.g.:
      	dd if=/dev/zero of=/afs/scratch/m0 bs=$((1024*1024)) count=512
      	rm /afs/scratch/m0
      The problem seems to be that it's not properly decrementing the nlink count
      so that the inode can be scrapped.
      Note that this doesn't fix local creation followed by remote deletion.
      That's harder to handle and will require a separate patch as we're not told
      that the file has been deleted - only that the directory has changed.
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  11. 13 Nov, 2017 6 commits
    • David Howells's avatar
      afs: Get rid of the afs_writeback record · 4343d008
      David Howells authored
      Get rid of the afs_writeback record that kAFS is using to match keys with
      writes made by that key.
      Instead, keep a list of keys that have a file open for writing and/or
      sync'ing and iterate through those.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Introduce a file-private data record · 215804a9
      David Howells authored
      Introduce a file-private data record for kAFS and put the key into it
      rather than storing the key in file->private_data.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      David Howells authored
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      To this end, the following structural changes are made:
       (1) Server record management is overhauled:
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
           (d) Server records are now keyed by their UUID within the namespace.
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
           (g) The servers list is now in /proc/fs/afs/servers.
       (2) Volume record management is overhauled:
           (a) An RCU-replaceable server list is introduced.  This tracks both
           	 servers and their coresponding callback interests.
           (b) The superblock is now keyed on cell record and numeric volume ID.
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      and the following procedural changes are made:
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
       (2) VBUSY is retried forever for the moment at intervals of 1s.
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Overhaul permit caching · be080a6f
      David Howells authored
      Overhaul permit caching in AFS by making it per-vnode and sharing permit
      lists where possible.
      When most of the fileserver operations are called, they return a status
      structure indicating the (revised) details of the vnode or vnodes involved
      in the operation.  This includes the access mark derived from the ACL
      (named CallerAccess in the protocol definition file).  This is cacheable
      and if the ACL changes, the server will tell us that it is breaking the
      callback promise, at which point we can discard the currently cached
      With this patch, the afs_permits structure has, at the end, an array of
      { key, CallerAccess } elements, sorted by key pointer.  This is then cached
      in a hash table so that it can be shared between vnodes with the same
      access permits.
      Permit lists can only be shared if they contain the exact same set of
      key->CallerAccess mappings.
      Note that that table is global rather than being per-net_ns.  If the keys
      in a permit list cross net_ns boundaries, there is no problem sharing the
      cached permits, since the permits are just integer masks.
      Since permit lists pin keys, the permit cache also makes it easier for a
      future patch to find all occurrences of a key and remove them by means of
      setting the afs_permits::invalidated flag and then clearing the appropriate
      key pointer.  In such an event, memory barriers will need adding.
      Lastly, the permit caching is skipped if the server has sent either a
      vnode-specific or an entire-server callback since the start of the
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Overhaul the callback handling · c435ee34
      David Howells authored
      Overhaul the AFS callback handling by the following means:
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
      This has the following consequences:
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
       (2) Mass invalidation isn't seen until someone calls afs_validate().
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock() since inodes are destroyed
           under RCU conditions.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
    • David Howells's avatar
      afs: Push the net ns pointer to more places · 9ed900b1
      David Howells authored
      Push the network namespace pointer to more places in AFS, including the
      afs_server structure (which doesn't hold a ref on the netns).
      In particular, afs_put_cell() now takes requires a net ns parameter so that
      it can safely alter the netns after decrementing the cell usage count - the
      cell will be deallocated by a background thread after being cached for a
      period, which means that it's not safe to access it after reducing its
      usage count.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  12. 09 Jul, 2017 1 commit
    • David Howells's avatar
      afs: Add metadata xattrs · d3e3b7ea
      David Howells authored
      Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
      available.  The following xattrs are now available:
       - "afs.cell"
         The name of the cell in which the vnode's volume resides.
       - "afs.fid"
         The volume ID, vnode ID and vnode uniquifier of the file as three hex
         numbers separated by colons.
       - "afs.volume"
         The name of the volume in which the vnode resides.
      For example:
      	# getfattr -d -m ".*" /mnt/scratch
      	getfattr: Removing leading '/' from absolute path names
      	# file: mnt/scratch
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  13. 16 Mar, 2017 5 commits
  14. 03 Mar, 2017 1 commit
    • David Howells's avatar
      statx: Add a system call to make enhanced file info available · a528d35e
      David Howells authored
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      A number of requests were gathered for features to be included.  The
      following have been included:
       (1) Make the fields a consistent size on all arches and make them large.
       (2) Spare space, request flags and information flags are provided for
           future expansion.
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
      And the following have been left out for future extension:
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
           (Deferred to fsinfo).
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
           (Deferred to fsinfo).
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were to ext[234]-specific and shouldn't
           be exposed through statx this way).
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabal might require extra filesystem operations).
      (16) Femtosecond-resolution timestamps [Dave Chinner].
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      (17) A set multiple attributes syscall to go with this.
      The new system call is:
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      buffer points to the destination for the data.  This must be 256 bytes in
      The following structures are defined in which to return the main attribute
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      The defined bits in request_mask and stx_mask are:
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      Within the kernel, the supported flags are listed by:
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      New flags include:
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      Fields in struct statx come in a number of classes:
       (0) stx_dev_*, stx_blksize.
           These are local system information and are always available.
       (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
       (2) stx_rdev_*.
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
       (3) stx_btime.
           Similar to (1), except this will be set to 0 if it doesn't exist.
      The following test program can be used to test the statx system call:
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      Secondly, the result of automounting on that directory.
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  15. 09 Dec, 2015 1 commit
    • Al Viro's avatar
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro authored
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  16. 15 Apr, 2015 1 commit
  17. 19 Nov, 2014 1 commit
  18. 03 Apr, 2014 1 commit
    • Johannes Weiner's avatar
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner authored
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Reviewed-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>