1. 17 Nov, 2017 1 commit
    • afs: Fix file locking · 0fafdc9f
      David Howells authored
      Fix the AFS file locking whereby the use of the big kernel lock (which
      could be slept with) was replaced by a spinlock (which couldn't).  The
      problem is that the AFS code was doing stuff inside the critical section
      that might call schedule(), so this is a broken transformation.
      Fix this by the following means:
       (1) Use a state machine with a proper state that can only be changed under
           the spinlock rather than using a collection of bit flags.
       (2) Cache the key used for the lock and the lock type in the afs_vnode
           struct so that the manager work function doesn't have to refer to a
           file_lock struct that's been dequeued.  This makes signal handling
       (4) Move the unlock from afs_do_unlk() to afs_fl_release_private() which
           means that unlock is achieved in other circumstances too.
       (5) Unlock the file on the server before taking the next conflicting lock.
      Also change:
       (1) Check the permits on a file before actually trying the lock.
       (2) fsync the file before effecting an explicit unlock operation.  We
           don't fsync if the lock is erased otherwise as we might not be in a
           context where we can actually do that.
      Further fixes:
       (1) Fixed-fileserver address rotation is made to work.  It's only used by
           the locking functions, so couldn't be tested before.
      Fixes: 72f98e72 ("locks: turn lock_flocks into a spinlock")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: jlayton@redhat.com
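      Fix (1) above can be pictured with a minimal userspace sketch.  It is not
      the kernel code: the state names, the demo_* helpers and the toy
      atomic_flag spinlock are all assumptions made for illustration.  The
      point it shows is that the state only changes inside the short critical
      section, while the call that may sleep runs with no lock held.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock states; the real afs_vnode states may differ. */
enum demo_lock_state {
	DEMO_LOCK_NONE,     /* no server lock held */
	DEMO_LOCK_SETTING,  /* request sent to the server, reply pending */
	DEMO_LOCK_GRANTED,  /* server lock held */
};

struct demo_vnode {
	atomic_flag lock;                 /* toy stand-in for a kernel spinlock */
	enum demo_lock_state lock_state;  /* only changed with ->lock held */
};

static void demo_vnode_init(struct demo_vnode *v)
{
	atomic_flag_clear(&v->lock);
	v->lock_state = DEMO_LOCK_NONE;
}

static void demo_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;  /* busy-wait: nothing inside the section may sleep */
}

static void demo_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Move from 'from' to 'to' under the lock; fail if the state is not 'from'. */
static bool demo_transition(struct demo_vnode *v,
			    enum demo_lock_state from,
			    enum demo_lock_state to)
{
	bool ok = false;

	demo_spin_lock(&v->lock);
	if (v->lock_state == from) {
		v->lock_state = to;
		ok = true;
	}
	demo_spin_unlock(&v->lock);
	return ok;
}

/* Stand-ins for a server round trip that may block. */
static bool demo_server_grants(void)  { return true; }
static bool demo_server_refuses(void) { return false; }

/* Take a lock: the possibly-sleeping call happens outside the spinlock. */
static bool demo_take_lock(struct demo_vnode *v, bool (*server_call)(void))
{
	if (!demo_transition(v, DEMO_LOCK_NONE, DEMO_LOCK_SETTING))
		return false;  /* another caller is already at work */
	if (server_call())
		return demo_transition(v, DEMO_LOCK_SETTING, DEMO_LOCK_GRANTED);
	demo_transition(v, DEMO_LOCK_SETTING, DEMO_LOCK_NONE);
	return false;
}
```

      A caller would initialise the vnode with demo_vnode_init() and then call
      demo_take_lock(&v, demo_server_grants); a second attempt fails because
      the state is no longer DEMO_LOCK_NONE.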
  2. 13 Nov, 2017 3 commits
    • afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      David Howells authored
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      servers.
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      To this end, the following structural changes are made:
       (1) Server record management is overhauled:
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
           (d) Server records are now keyed by their UUID within the namespace.
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
           (g) The servers list is now in /proc/fs/afs/servers.
       (2) Volume record management is overhauled:
           (a) An RCU-replaceable server list is introduced.  This tracks both
           	 servers and their corresponding callback interests.
           (b) The superblock is now keyed on cell record and numeric volume ID.
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      and the following procedural changes are made:
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
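      The abort handling in procedural change (4) can be sketched as a small
      classifier.  This is an illustration only: the demo_* names and the
      action enum are hypothetical, the numeric abort values are assumptions,
      and treating VOFFLINE exactly like VBUSY and applying the statfs() rule
      only to the sleeping cases are simplifications of the text above.

```c
#include <stdbool.h>

/* Abort codes named in the text; the numeric values are assumptions. */
enum demo_abort {
	DEMO_VRESTARTING = -100,
	DEMO_VNOVOL      = 103,
	DEMO_VOFFLINE    = 106,
	DEMO_VBUSY       = 110,
	DEMO_VMOVED      = 111,
	DEMO_VSALVAGING  = 113,
};

/* What the iteration loop should do next (hypothetical names). */
enum demo_action {
	DEMO_RETURN_ERROR,    /* give up and report the error */
	DEMO_RECHECK_VOLUME,  /* recheck the volume, restart the iteration */
	DEMO_SLEEP_AND_RETRY, /* wait briefly, retry or try another server */
	DEMO_RECHECK_VLDB,    /* see whether the volume was deleted */
};

/* Decide on an action from the saved abort code. */
static enum demo_action demo_classify_abort(int abort_code, bool is_statfs)
{
	switch (abort_code) {
	case DEMO_VMOVED:
		return DEMO_RECHECK_VOLUME;
	case DEMO_VBUSY:
	case DEMO_VRESTARTING:
	case DEMO_VSALVAGING:
	case DEMO_VOFFLINE:  /* simplification: handled like VBUSY */
		/* statfs() must not sleep, so as not to block umount. */
		return is_statfs ? DEMO_RETURN_ERROR : DEMO_SLEEP_AND_RETRY;
	case DEMO_VNOVOL:
		return DEMO_RECHECK_VLDB;
	default:
		return DEMO_RETURN_ERROR;
	}
}
```

      Keeping the raw abort code in the cursor, rather than translating it to
      an errno value straight away, is what makes a dispatch like this
      possible.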
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
       (2) VBUSY is retried forever for the moment at intervals of 1s.
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      Signed-off-by: David Howells <dhowells@redhat.com>
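      Structural changes (1)(c) and (1)(d) in this commit describe a flat list
      of server records keyed by UUID.  A minimal sketch of such a lookup
      follows; the demo_* types are hypothetical stand-ins, and locking and
      reference counting are left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 16-byte UUID and server record. */
struct demo_uuid {
	unsigned char b[16];
};

struct demo_server {
	struct demo_uuid uuid;     /* the key: not tied to any one address */
	struct demo_server *next;  /* flat, unsorted list */
};

/* Walk the flat list and return the record with a matching UUID, if any. */
static struct demo_server *demo_find_server(struct demo_server *head,
					    const struct demo_uuid *uuid)
{
	for (struct demo_server *s = head; s; s = s->next)
		if (memcmp(s->uuid.b, uuid->b, sizeof(uuid->b)) == 0)
			return s;
	return NULL;
}
```

      Because the key is the UUID rather than an address, the same record can
      be found no matter which cell or which of the server's addresses led to
      it.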
    • afs: Overhaul permit caching · be080a6f
      David Howells authored
      Overhaul permit caching in AFS by making it per-vnode and sharing permit
      lists where possible.
      When most of the fileserver operations are called, they return a status
      structure indicating the (revised) details of the vnode or vnodes involved
      in the operation.  This includes the access mask derived from the ACL
      (named CallerAccess in the protocol definition file).  This is cacheable
      and if the ACL changes, the server will tell us that it is breaking the
      callback promise, at which point we can discard the currently cached
      permits.
      With this patch, the afs_permits structure has, at the end, an array of
      { key, CallerAccess } elements, sorted by key pointer.  This is then cached
      in a hash table so that it can be shared between vnodes with the same
      access permits.
      Permit lists can only be shared if they contain the exact same set of
      key->CallerAccess mappings.
      Note that that table is global rather than being per-net_ns.  If the keys
      in a permit list cross net_ns boundaries, there is no problem sharing the
      cached permits, since the permits are just integer masks.
      Since permit lists pin keys, the permit cache also makes it easier for a
      future patch to find all occurrences of a key and remove them by means of
      setting the afs_permits::invalidated flag and then clearing the appropriate
      key pointer.  In such an event, memory barriers will need adding.
      Lastly, the permit caching is skipped if the server has sent either a
      vnode-specific or an entire-server callback since the start of the
      operation.
      Signed-off-by: David Howells <dhowells@redhat.com>
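      The permit list described in this commit, an array of
      { key, CallerAccess } pairs sorted by key pointer, can be sketched as
      below.  The demo_* names are hypothetical and the hash table,
      refcounting and invalidation flag are omitted; the sketch only shows
      the sorted lookup and the "exact same set" test for sharing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a key; only its address matters here. */
struct demo_key {
	int serial;
};

/* One { key, CallerAccess } element. */
struct demo_permit {
	const struct demo_key *key;
	uint32_t access;  /* CallerAccess bit mask */
};

/* Binary search an array that is sorted by key pointer value. */
static bool demo_find_access(const struct demo_permit *p, size_t nr,
			     const struct demo_key *key, uint32_t *access)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (p[mid].key == key) {
			*access = p[mid].access;
			return true;
		}
		if ((uintptr_t)p[mid].key < (uintptr_t)key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/* Two lists may be shared only if every key->access mapping is identical. */
static bool demo_permits_equal(const struct demo_permit *a, size_t na,
			       const struct demo_permit *b, size_t nb)
{
	if (na != nb)
		return false;
	for (size_t i = 0; i < na; i++)
		if (a[i].key != b[i].key || a[i].access != b[i].access)
			return false;
	return true;
}
```

      Sorting by key pointer gives each set of mappings one canonical layout,
      so the equality test is a single linear pass.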
    • afs: Overhaul the callback handling · c435ee34
      David Howells authored
      Overhaul the AFS callback handling by the following means:
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
      This has the following consequences:
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
       (2) Mass invalidation isn't seen until someone calls afs_validate().
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock(), since inodes are destroyed
           under RCU conditions?
      Signed-off-by: David Howells <dhowells@redhat.com>
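      Point (4) of this commit replaces a walk over every callback with a
      counter comparison.  A minimal sketch of that idea follows; cb_seq is
      the counter named in the message, while the demo_* structures, the
      cb_seq_copy and cb_promised fields and the helper names are
      assumptions, and the locking the message discusses is left out.

```c
#include <stdbool.h>

/* Hypothetical server record: counts mass callback breaks. */
struct demo_server {
	unsigned int cb_seq;
};

/* Hypothetical vnode: remembers the counter seen when the promise was made. */
struct demo_vnode {
	struct demo_server *server;
	unsigned int cb_seq_copy;
	bool cb_promised;
};

/* The server broke all callbacks: one increment, no list walk. */
static void demo_mass_break(struct demo_server *s)
{
	s->cb_seq++;
}

/* An operation returned a callback promise for this vnode. */
static void demo_note_callback(struct demo_vnode *v)
{
	v->cb_seq_copy = v->server->cb_seq;
	v->cb_promised = true;
}

/* Validation: the promise holds only if no mass break happened since. */
static bool demo_validate(struct demo_vnode *v)
{
	if (v->cb_promised && v->cb_seq_copy == v->server->cb_seq)
		return true;
	v->cb_promised = false;  /* caller must refetch status from the server */
	return false;
}
```

      The cost of a mass break moves from the break itself to the next
      validation of each vnode, which matches consequence (2) listed above.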
  3. 09 Jul, 2017 1 commit
    • afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directories · fd249821
      Marc Dionne authored
      The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not
      be used to make access decisions for the directory itself.  They
      are meant to control access for the objects contained in that
      directory.
      Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set.
      Requiring AFS_ACE_READ as well would cause an incorrect access denied
      error for a directory with AFS_ACE_LOOKUP but not AFS_ACE_READ.
      The AFS_ACE_WRITE bit does not allow operations that modify the
      directory.  For a directory with AFS_ACE_WRITE but neither
      AFS_ACE_INSERT nor AFS_ACE_DELETE, honouring AFS_ACE_WRITE would result
      in trying operations that would ultimately be denied by the server.
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
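      The rule in this commit can be sketched as two small predicates on the
      cached access mask.  This is not the patched kernel function: the
      demo_* names are hypothetical and the bit values are assumptions chosen
      for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* ACE bits named in the text; the numeric values are assumptions. */
#define DEMO_ACE_READ   0x01u
#define DEMO_ACE_WRITE  0x02u
#define DEMO_ACE_INSERT 0x04u
#define DEMO_ACE_LOOKUP 0x08u
#define DEMO_ACE_DELETE 0x10u

/* Reading a directory depends on LOOKUP only; READ is ignored. */
static bool demo_dir_may_read(uint32_t access)
{
	return (access & DEMO_ACE_LOOKUP) != 0;
}

/* Modifying a directory needs INSERT or DELETE; WRITE is ignored. */
static bool demo_dir_may_modify(uint32_t access)
{
	return (access & (DEMO_ACE_INSERT | DEMO_ACE_DELETE)) != 0;
}
```

      With these predicates a directory carrying only DEMO_ACE_LOOKUP is
      readable, and one carrying only DEMO_ACE_WRITE is not treated as
      modifiable, so the client does not attempt operations the server would
      refuse.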
  4. 16 Mar, 2017 2 commits
  5. 20 Jul, 2011 3 commits
  6. 07 Jan, 2011 1 commit
  7. 22 Mar, 2010 1 commit
  8. 27 Jul, 2008 1 commit
    • [PATCH] sanitize ->permission() prototype · e6305c43
      Al Viro authored
      * kill nameidata * argument; map the 3 bits in ->flags anybody cares
        about to new MAY_... ones and pass with the mask.
      * kill redundant gfs2_iop_permission()
      * sanitize ecryptfs_permission()
      * fix remaining places where ->permission() instances might barf on new
        MAY_... found in mask.
      The obvious next target in that direction is permission(9)
      folded fix for nfs_permission() breakage from Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 08 Feb, 2008 1 commit
  10. 07 Feb, 2008 1 commit
  11. 21 May, 2007 1 commit
    • Detach sched.h from mm.h · e8edc6e0
      Alexey Dobriyan authored
      The first thing mm.h does is include sched.h, solely for the can_do_mlock()
      inline function, which dereferences "current". By dealing with can_do_mlock(),
      mm.h can be detached from sched.h, which is good. See below for why.
      This patch
      a) removes unconditional inclusion of sched.h from mm.h
      b) makes can_do_mlock() normal function in mm/mlock.c
      c) exports can_do_mlock() to not break compilation
      d) adds sched.h inclusions back to files that were getting it indirectly.
      e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
         getting them indirectly
      Net result is:
      a) mm.h users would get less code to open, read, preprocess, parse, ... if
         they don't need sched.h
      b) sched.h stops being a dependency for a significant number of files:
         on x86_64 allmodconfig, touching sched.h results in a recompile of 4083
         files; after the patch it's only 3744 (-8.3%).
      Cross-compile tested on
      	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
      	alpha alpha-up
      	i386 i386-up i386-defconfig i386-allnoconfig
      	ia64 ia64-up
      	parisc parisc-up
      	powerpc powerpc-up
      	s390 s390-up
      	sparc sparc-up
      	sparc64 sparc64-up
      	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
      as well as my two usual configs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 09 May, 2007 1 commit
    • AFS: AFS fixups · 416351f2
      David Howells authored
      Make some miscellaneous changes to the AFS filesystem:
       (1) Assert RCU barriers on module exit to make sure RCU has finished with
           callbacks in this module.
       (2) Correctly handle the AFS server returning a zero-length read.
       (3) Split out data zapping calls into one function (afs_zap_data).
       (4) Rename some afs_file_*() functions to afs_*() where they apply to
           non-regular files too.
       (5) Be consistent about the presentation of volume ID:vnode ID in debugging
           messages.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 26 Apr, 2007 2 commits