  1. Nov 24, 2022
  2. Nov 21, 2022
  3. Oct 28, 2022
  4. Oct 25, 2022
  5. Oct 03, 2022
  6. Sep 29, 2022
  7. Sep 28, 2022
  8. Sep 27, 2022
    • Remove duplicate words inside documentation · d2bef8e1
      Akhil Raj authored
      
      I have removed a repeated `the` inside the documentation
      
      Signed-off-by: Akhil Raj <lf32.dev@gmail.com>
      Link: https://lore.kernel.org/r/20220827145359.32599-1-lf32.dev@gmail.com
      
      
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      d2bef8e1
    • Maple Tree: add new data structure · 54a611b6
      Liam R. Howlett authored
      Patch series "Introducing the Maple Tree"
      
      The maple tree is an RCU-safe range-based B-tree designed to use modern
      processor caches efficiently.  There are a number of places in the kernel
      where a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance, or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      Davidlohr said
      
      : Yes I like the maple tree, and at this stage I don't think we can ask for
      : more from this series wrt the MM - albeit there seems to still be some
      : folks reporting breakage.  Fundamentally I see Liam's work to (re)move
      : complexity out of the MM (not to say that the actual maple tree is not
      : complex) by consolidating the three complementary data structures very
      : much worth it considering performance does not take a hit.  This was very
      : much a turn off with the range locking approach, which worst case scenario
      : incurred in prohibitive overhead.  Also as Liam and Matthew have
      : mentioned, RCU opens up a lot of nice performance opportunities, and in
      : addition academia[1] has shown outstanding scalability of address spaces
      : with the foundation of replacing the locked rbtree with RCU aware trees.
      
      Similar work has been discovered in the academic press
      
      	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
      
      Sheer coincidence.  We designed our tree with the intention of solving the
      hardest problem first.  Upon settling on a b-tree variant and a rough
      outline, we researched range-based b-trees and RCU b-trees and did find
      that article.  So it was nice to find reassurances that we were on the
      right path, but our design choice of using ranges made that paper unusable
      for us.
      
      This patch (of 70):
      
      The maple tree is an RCU-safe range-based B-tree designed to use modern
      processor caches efficiently.  There are a number of places in the kernel
      where a non-overlapping range-based tree would be beneficial, especially
      one with a simple interface.  If you use an rbtree with other data
      structures to improve performance, or an interval tree to track
      non-overlapping ranges, then this is for you.
      
      The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
      nodes.  With the increased branching factor, it is significantly shorter
      than the rbtree so it has fewer cache misses.  The removal of the linked
      list between subsequent entries also reduces the cache misses and the need
      to pull in the previous and next VMA during many tree alterations.
      
      The first user that is covered in this patch set is the vm_area_struct,
      where three data structures are replaced by the maple tree: the augmented
      rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
      long term goal is to reduce or remove the mmap_lock contention.
      
      The plan is to get to the point where we use the maple tree in RCU mode.
      Readers will not block for writers.  A single write operation will be
      allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
      would be RCU enabled and this mode would be entered once multiple tasks
      are using the mm_struct.
      
      There are additional BUG_ON() calls added within the tree, most of which
      are in debug code.  These will be replaced with WARN_ON() calls in the
      future.  There are also additional BUG_ON() calls elsewhere in the code;
      these will also be reduced in number at a later date.  They exist to catch
      things such as out-of-range accesses which would crash anyway.
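      
      As a hedged illustration of the interface this series introduces, the
      sketch below stores one entry over a range of indices and reads it back.
      It relies on the mtree_store_range()/mtree_load()/mt_for_each() entry
      points added by the series; the tag structure and the index values are
      made up for the example.
      
        #include <linux/kernel.h>
        #include <linux/gfp.h>
        #include <linux/maple_tree.h>
        
        /* Statically initialized tree; mt_init() is the dynamic equivalent. */
        static DEFINE_MTREE(ranges);
        
        /* Hypothetical payload: entries are pointers, so store the address of
         * a real object rather than an arbitrary integer. */
        static struct range_tag { int id; } tag = { .id = 1 };
        
        static int maple_tree_example(void)
        {
                struct range_tag *entry;
                unsigned long index = 0;
                int ret;
        
                /* One entry covers the whole range [0x1000, 0x1fff]. */
                ret = mtree_store_range(&ranges, 0x1000, 0x1fff, &tag, GFP_KERNEL);
                if (ret)
                        return ret;
        
                /* Any index inside the range resolves to the same entry. */
                entry = mtree_load(&ranges, 0x1234);
                if (entry)
                        pr_info("index 0x1234 -> tag %d\n", entry->id);
        
                /* Iterate over all entries up to ULONG_MAX. */
                mt_for_each(&ranges, entry, index, ULONG_MAX)
                        pr_info("found tag %d\n", entry->id);
        
                mtree_erase(&ranges, 0x1000);
                return 0;
        }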
      
      Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
      
      
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: David Howells <dhowells@redhat.com>
      Tested-by: Sven Schnelle <svens@linux.ibm.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      54a611b6
  9. Aug 18, 2022
  10. Jul 19, 2022
  11. Jul 15, 2022
    • headers/deps: mm: align MAINTAINERS and Docs with new gfp.h structure · 7343f2b0
      Yury Norov authored
      
      After moving gfp types out of gfp.h, we have to align MAINTAINERS
      and Docs, to avoid warnings like this:
      
      >> include/linux/gfp.h:1: warning: 'Page mobility and placement hints' not found
      >> include/linux/gfp.h:1: warning: 'Watermark modifiers' not found
      >> include/linux/gfp.h:1: warning: 'Reclaim modifiers' not found
      >> include/linux/gfp.h:1: warning: 'Useful GFP flag combinations' not found
      
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      7343f2b0
  12. Jul 11, 2022
  13. Jul 01, 2022
  14. Jun 28, 2022
    • arch/*/: remove CONFIG_VIRT_TO_BUS · 4313a249
      Arnd Bergmann authored
      
      All architecture-independent users of virt_to_bus() and bus_to_virt()
      have now been fixed to use the dma mapping interfaces or have been
      removed.  This means the definitions on most architectures, as well as
      the CONFIG_VIRT_TO_BUS symbol, are now obsolete and can be removed.
      
      The only exceptions to this are a few network and scsi drivers for m68k
      Amiga and VME machines and ppc32 Macintosh. These drivers work correctly
      with the old interfaces and are probably not worth changing.
      
      On alpha and parisc, virt_to_bus() was still used in asm/floppy.h.
      alpha can use isa_virt_to_bus() like x86 does, and parisc can just
      open-code virt_to_phys() here, as this is architecture-specific
      code.
      
      I tried updating the bus-virt-phys-mapping.rst documentation, which
      started as an email from Linus to explain some details of the Linux-2.0
      driver interfaces. The bits about virt_to_bus() were declared obsolete
      back in 2000, and the rest is not all that relevant any more, so in the
      end I just decided to remove the file completely.
      
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      4313a249
  15. Jun 27, 2022
  16. Jun 07, 2022
  17. Apr 22, 2022
  18. Apr 14, 2022
    • timekeeping: Introduce fast accessor to clock tai · 3dc6ffae
      Kurt Kanzenbach authored
      
      Introduce a fast/NMI-safe accessor to clock tai for tracing. The Linux kernel
      tracing infrastructure has support for using different clocks to generate
      timestamps for trace events. Especially in TSN networks it is useful to have
      TAI as the trace clock, because application scheduling is done in accordance
      with the network time, which is based on TAI. With a tai trace_clock in place,
      it becomes very convenient to correlate network activity with Linux kernel
      application traces.
      
      Use the same implementation as ktime_get_boot_fast_ns() does, by reading the
      monotonic time and adding the TAI offset. The same limitations as for the fast
      boot implementation apply. The TAI offset may change at run time, e.g. by
      setting the time or using adjtimex() with an offset. However, these kinds of
      offset changes are rare events. Nevertheless, the user has to be aware of them
      and deal with them in post-processing.
      
      An alternative approach would be to use the same implementation as
      ktime_get_real_fast_ns() does. However, this would require adding an additional
      u64 member to the tk_read_base struct. This struct, together with a seqcount, is
      designed to fit into a single cache line on 64-bit architectures. Adding a new
      member would violate this constraint.
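      
      As a hedged sketch (not necessarily the exact patch), the accessor could look
      like the following, sitting next to ktime_get_boot_fast_ns() in
      kernel/time/timekeeping.c, where the tk_core timekeeper and its offs_tai field
      are visible:
      
        /* Fast/NMI-safe access to clock TAI: fast monotonic time plus the
         * (rarely changing) TAI offset, mirroring ktime_get_boot_fast_ns(). */
        u64 notrace ktime_get_tai_fast_ns(void)
        {
                struct timekeeper *tk = &tk_core.timekeeper;
        
                return ktime_get_mono_fast_ns() +
                       ktime_to_ns(data_race(tk->offs_tai));
        }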
      
      Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de
      3dc6ffae
  19. Apr 13, 2022
  20. Mar 28, 2022
  21. Mar 26, 2022
    • Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE"" · bddac7c1
      Linus Torvalds authored
      
      This reverts commit aa6f8dcb.
      
      It turns out this breaks at least the ath9k wireless driver, and
      possibly others.
      
      What the ath9k driver does on packet receive is to set up the DMA
      transfer with:
      
        int ath_rx_init(..)
        ..
                      bf->bf_buf_addr = dma_map_single(sc->dev, skb->data,
                                                       common->rx_bufsize,
                                                       DMA_FROM_DEVICE);
      
      and then the receive logic (through ath_rx_tasklet()) will fetch
      incoming packets
      
        static bool ath_edma_get_buffers(..)
        ..
              dma_sync_single_for_cpu(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
      
              ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
              if (ret == -EINPROGRESS) {
                      /*let device gain the buffer again*/
                      dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
                                      common->rx_bufsize, DMA_FROM_DEVICE);
                      return false;
              }
      
      and it's worth noting how that first DMA sync:
      
          dma_sync_single_for_cpu(..DMA_FROM_DEVICE);
      
      is there to make sure the CPU can read the DMA buffer (possibly by
      copying it from the bounce buffer area, or by doing some cache flush).
      The iommu correctly turns that into a "copy from bounce buffer" so that
      the driver can look at the state of the packets.
      
      In the meantime, the device may continue to write to the DMA buffer, but
      we at least have a snapshot of the state due to that first DMA sync.
      
      But that _second_ DMA sync:
      
          dma_sync_single_for_device(..DMA_FROM_DEVICE);
      
      is telling the DMA mapping that the CPU wasn't interested in the area
      because the packet wasn't there.  In the case of a DMA bounce buffer,
      that is a no-op.
      
      Note how it's not a sync for the CPU (the "for_device()" part), and it's
      not a sync for data written by the CPU (the "DMA_FROM_DEVICE" part).
      
      Or rather, it _should_ be a no-op.  That's what commit aa6f8dcb
      broke: it made the code bounce the buffer unconditionally, and changed
      the DMA_FROM_DEVICE to just unconditionally and illogically be
      DMA_TO_DEVICE.
      
      [ Side note: purely within the confines of the swiotlb driver it wasn't
        entirely illogical: The reason it did that odd DMA_FROM_DEVICE ->
        DMA_TO_DEVICE conversion thing is because inside the swiotlb driver,
        it uses just a swiotlb_bounce() helper that doesn't care about the
        whole distinction of who the sync is for - only which direction to
        bounce.
      
        So it took the "sync for device" to mean that the CPU must have been
        the one writing, and thought it meant DMA_TO_DEVICE. ]
      
      Also note how the commentary in that commit was wrong, probably due to
      that whole confusion, claiming that the commit makes the swiotlb code
      
                                        "bounce unconditionally (that is, also
          when dir == DMA_TO_DEVICE) in order to avoid synchronising back stale
          data from the swiotlb buffer"
      
      which is nonsensical for two reasons:
      
       - that "also when dir == DMA_TO_DEVICE" is nonsensical, as that was
         exactly when it always did - and should do - the bounce.
      
       - since this is a sync for the device (not for the CPU), we're clearly
         fundamentally not copying back stale data from the bounce buffers at
         all, because we'd be copying *to* the bounce buffers.
      
      So that commit was just very confused.  It confused the direction of the
      synchronization (to the device, not the cpu) with the direction of the
      DMA (from the device).
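      
      As a hedged sketch of the semantics described above (not the actual swiotlb
      code), this is which way data should move between the original buffer and the
      bounce buffer for each sync call; the helper names are made up for
      illustration:
      
        #include <linux/dma-direction.h>
        #include <linux/string.h>
        #include <linux/types.h>
        
        /* Sync for the CPU: the CPU is about to look at the buffer, so pull
         * whatever the device wrote out of the bounce buffer. */
        static void sketch_sync_for_cpu(void *orig, void *bounce, size_t size,
                                        enum dma_data_direction dir)
        {
                if (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)
                        memcpy(orig, bounce, size);
        }
        
        /* Sync for the device: the device is about to look at the buffer, so
         * push the CPU's writes into the bounce buffer.  For DMA_FROM_DEVICE
         * the CPU wrote nothing, so there is nothing to push. */
        static void sketch_sync_for_device(void *orig, void *bounce, size_t size,
                                           enum dma_data_direction dir)
        {
                if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                        memcpy(bounce, orig, size);
        }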
      
      Reported-and-bisected-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Olha Cherevyk <olha.cherevyk@gmail.com>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kalle Valo <kvalo@kernel.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Toke Høiland-Jørgensen <toke@toke.dk>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bddac7c1
  22. Mar 22, 2022
    • mm: document and polish read-ahead code · 84dacdbd
      NeilBrown authored
      Add some "big-picture" documentation for read-ahead and polish the code
      to make it fit this documentation.
      
      The meaning of ->async_size is clarified to match its name, i.e. any
      request to ->readahead() has a sync part and an async part.  The caller
      will wait for the sync pages to complete, but will not wait for the
      async pages.  The first async page is still marked PG_readahead.
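      
      As a hedged sketch of the ->async_size semantics (modelled on, but not
      copied from, the readahead code), the boundary between the sync and async
      parts of a request of ra->size pages starting at 'start' is where
      PG_readahead gets set:
      
        #include <linux/pagemap.h>
        
        /* Mark the first page of the async tail so that a later access to it
         * kicks off the next batch of readahead. */
        static void mark_readahead_boundary(struct folio *folio,
                                            unsigned long index,
                                            struct file_ra_state *ra,
                                            unsigned long start)
        {
                if (index == start + ra->size - ra->async_size)
                        folio_set_readahead(folio);
        }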
      
      Note that the current function names page_cache_sync_ra() and
      page_cache_async_ra() are misleading.  All ra requests are partly sync
      and partly async, so either part can be empty.  A page_cache_sync_ra()
      request will usually set ->async_size non-zero, implying it is not all
      synchronous.
      
      When a non-zero req_count is passed to page_cache_async_ra(), the
      implication is that some prefix of the request is synchronous, though
      the calculation made there is incorrect - I haven't tried to fix it.
      
      Link: https://lkml.kernel.org/r/164549983734.9187.11586890887006601405.stgit@noble.brown
      
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Chao Yu <chao@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jaegeuk Kim <jaegeuk@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84dacdbd
  23. Mar 21, 2022
  24. Mar 07, 2022
    • swiotlb: rework "fix info leak with DMA_FROM_DEVICE" · aa6f8dcb
      Halil Pasic authored
      
      Unfortunately, we ended up merging an old version of the patch "fix info
      leak with DMA_FROM_DEVICE" instead of merging the latest one. Christoph
      (the swiotlb maintainer) asked me to create an incremental fix (after I
      pointed out the mix-up and asked him for guidance). So here we go.
      
      The main differences between what we got and what was agreed are:
      * swiotlb_sync_single_for_device is also required to do an extra bounce
      * We decided not to introduce DMA_ATTR_OVERWRITE until we have exploiters
      * The implementation of DMA_ATTR_OVERWRITE is flawed: DMA_ATTR_OVERWRITE
        must take precedence over DMA_ATTR_SKIP_CPU_SYNC
      
      Thus this patch removes DMA_ATTR_OVERWRITE, and makes
      swiotlb_sync_single_for_device() bounce unconditionally (that is, also
      when dir == DMA_TO_DEVICE) in order to avoid synchronising back stale
      data from the swiotlb buffer.
      
      Let me note that if the size used with the dma_sync_* API is less than the
      size used with dma_[un]map_*, under certain circumstances we may still
      end up with swiotlb not being transparent. In that sense, this is not a
      perfect fix either.
      
      To make this bulletproof, we would have to bounce the entire
      mapping/bounce buffer. For that we would have to figure out the starting
      address and the size of the mapping in
      swiotlb_sync_single_for_device(). While this does seem possible, there
      seems to be no firm consensus on how things are supposed to work.
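      
      To illustrate that caveat with a hedged, hypothetical driver fragment (the
      device, buffer, and sizes stand in for a real driver's): the mapping covers
      the full buffer, but the sync only covers a prefix of it, so only that
      prefix is bounced back.
      
        #include <linux/dma-mapping.h>
        
        static void partial_sync_example(struct device *dev, void *buf)
        {
                dma_addr_t handle;
        
                /* Map the full 4096-byte buffer for device-to-memory DMA. */
                handle = dma_map_single(dev, buf, 4096, DMA_FROM_DEVICE);
                if (dma_mapping_error(dev, handle))
                        return;
        
                /* ... the device writes some data ... */
        
                /* Only the first 512 bytes are copied back from the swiotlb
                 * bounce buffer; bytes 512..4095 of buf keep whatever was
                 * there before. */
                dma_sync_single_for_cpu(dev, handle, 512, DMA_FROM_DEVICE);
        
                dma_unmap_single(dev, handle, 4096, DMA_FROM_DEVICE);
        }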
      
      Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
      Fixes: ddbd89de ("swiotlb: fix info leak with DMA_FROM_DEVICE")
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa6f8dcb
  25. Feb 14, 2022
    • swiotlb: fix info leak with DMA_FROM_DEVICE · ddbd89de
      Halil Pasic authored
      
      The problem I'm addressing was discovered by the LTP test covering
      cve-2018-1000204.
      
      A short description of what happens follows:
      1) The test case issues a command code 00 (TEST UNIT READY) via the SG_IO
         interface with: dxfer_len == 524288, dxfer_direction == SG_DXFER_FROM_DEV
         and a corresponding dxferp (see the sketch after this description). The
         peculiar thing about this is that TUR is not reading from the device.
      2) In sg_start_req() the invocation of blk_rq_map_user() effectively
         bounces the user-space buffer. As if the device was to transfer into
         it. Since commit a45b599a ("scsi: sg: allocate with __GFP_ZERO in
         sg_build_indirect()") we make sure this first bounce buffer is
         allocated with GFP_ZERO.
      3) For the rest of the story we keep ignoring that we have a TUR, so the
         device won't touch the buffer we prepare, as if we had a
         DMA_FROM_DEVICE type of situation. My setup uses a virtio-scsi device
         and the buffer allocated by SG is mapped by the function
         virtqueue_add_split(), which uses DMA_FROM_DEVICE for the "in" sgs (here
         scatter-gather and not scsi generics). This mapping involves bouncing
         via the swiotlb (we need swiotlb to do virtio in protected guests like
         s390 Secure Execution, or AMD SEV).
      4) When the SCSI TUR is done, we first copy back the content of the second
         (that is swiotlb) bounce buffer (which most likely contains some
         previous IO data), to the first bounce buffer, which contains all
         zeros.  Then we copy back the content of the first bounce buffer to
         the user-space buffer.
      5) The test case detects that the buffer, which it zero-initialized, is
         not all zeros and fails.
      
      One can argue that this is a swiotlb problem, because without swiotlb
      we leak all zeros, and swiotlb should be transparent in the sense that
      it does not affect the outcome (if all other participants are well
      behaved).
      
      Copying the content of the original buffer into the swiotlb buffer is
      the only way I can think of to make swiotlb transparent in such
      scenarios. So let's do just that if in doubt, but allow the driver
      to tell us that the whole mapped buffer is going to be overwritten,
      in which case we can preserve the old behavior and avoid the performance
      impact of the extra bounce.
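      
      For reference, here is a hedged userspace sketch of the scenario in step 1
      above (this is not the LTP test itself; the device path, timeout, and helper
      name are placeholders):
      
        #include <fcntl.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <scsi/sg.h>
        
        #define BUF_LEN 524288
        
        static int issue_tur(const char *sg_dev)
        {
                static unsigned char buf[BUF_LEN]; /* zero-initialized (static) */
                unsigned char cdb[6] = { 0x00 };   /* TEST UNIT READY */
                unsigned char sense[32];
                struct sg_io_hdr io;
                int fd = open(sg_dev, O_RDWR);
        
                if (fd < 0)
                        return -1;
        
                memset(&io, 0, sizeof(io));
                io.interface_id = 'S';
                io.cmd_len = sizeof(cdb);
                io.cmdp = cdb;
                io.dxfer_direction = SG_DXFER_FROM_DEV;
                io.dxfer_len = BUF_LEN;
                io.dxferp = buf;
                io.mx_sb_len = sizeof(sense);
                io.sbp = sense;
                io.timeout = 5000;
        
                ioctl(fd, SG_IO, &io);
                close(fd);
        
                /* The leak: buf may now contain stale swiotlb bounce-buffer
                 * contents instead of the zeros it started with. */
                return 0;
        }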
      
      Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      ddbd89de
  26. Feb 03, 2022
  27. Jan 27, 2022
  28. Jan 07, 2022
  29. Dec 28, 2021
  30. Dec 27, 2021
  31. Dec 16, 2021
  32. Nov 29, 2021
  33. Nov 06, 2021
  34. Nov 01, 2021
  35. Oct 26, 2021
    • irq: remove handle_domain_{irq,nmi}() · 0953fb26
      Mark Rutland authored
      
      Now that entry code handles IRQ entry (including setting the IRQ regs)
      before calling irqchip code, irqchip code can safely call
      generic_handle_domain_irq(), and there's no functional reason for it to
      call handle_domain_irq().
      
      Let's cement this split of responsibility and remove handle_domain_irq()
      entirely, updating irqchip drivers to call generic_handle_domain_irq().
      
      For consistency, handle_domain_nmi() is similarly removed and replaced
      with a generic_handle_domain_nmi() function which also does not perform
      any entry logic.
      
      Previously handle_domain_{irq,nmi}() had a WARN_ON() which would fire
      when they were called in an inappropriate context. So that we can
      identify similar issues going forward, similar WARN_ON_ONCE() logic is
      added to the generic_handle_*() functions, and comments are updated for
      clarity and consistency.
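      
      As a hedged illustration of the call-site change in an irqchip driver (the
      chip_domain pointer and chip_read_pending() helper are hypothetical, not
      taken from a real driver):
      
        #include <linux/irq.h>
        #include <linux/irqdesc.h>
        #include <linux/irqdomain.h>
        
        static struct irq_domain *chip_domain;
        
        /* Hypothetical: read the pending hwirq number from chip registers. */
        static u32 chip_read_pending(void)
        {
                return 0;
        }
        
        /* Root handler installed via set_handle_irq(); the architecture entry
         * code has already set the IRQ regs before this runs. */
        static void chip_handle_irq(struct pt_regs *regs)
        {
                u32 hwirq = chip_read_pending();
        
                /* Previously: handle_domain_irq(chip_domain, hwirq, regs); */
                generic_handle_domain_irq(chip_domain, hwirq);
        }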
      
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      0953fb26
  36. Oct 25, 2021