  20. Dec 03, 2020
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Roman Gushchin authored
      
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: the
      PageKmemcg flag is defined as a page type (like buddy, offline, etc.), so
      it takes a bit from the page->_mapcount counter.  Pages with a page type
      set can't be mapped to userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps are mapping the vmalloc-based memory to userspace, and their
      memory can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
      accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
      adds several checks and removes a couple of obsolete functions.  As a
      result, the code becomes more robust, with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers, which are barely used.
      
      This is an obstacle to reusing some bits of the pointer for storing
      additional information.  In fact, we already do this for
      slab pages, where the last bit indicates that a pointer has an attached
      vector of objcg pointers instead of a regular memcg pointer.
      
      This commit uses two existing helpers and introduces a new one,
      converting all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended for cases where the page can be a slab
      page whose memcg pointer actually points at an objcg vector.  It checks
      the lowest bit and, if it is set, returns NULL.  page_memcg(), by
      contrast, contains a VM_BUG_ON_PAGE() check that the page is not a slab
      page.
      
      To make sure nobody uses a direct access, struct page's
      mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
  22. Oct 18, 2020
    • mm, memcg: rework remote charging API to support nesting · b87d8cef
      Roman Gushchin authored
      Currently the remote memcg charging API consists of two functions:
      memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
      active memcg value, overriding the memcg of the current task:
      
        memalloc_use_memcg(target_memcg);
        <...>
        memalloc_unuse_memcg();
      
      This works perfectly for allocations performed from a normal context;
      however, an attempt to call it from an interrupt context, or simply
      nesting two remote charging blocks, leads to incorrect accounting.  On
      exit from the inner block the active memcg is cleared instead of being
      restored:
      
        memalloc_use_memcg(target_memcg);
      
        memalloc_use_memcg(target_memcg_2);
          <...>
          memalloc_unuse_memcg();
      
          Error: allocations here are charged to the memcg of the current
          process instead of target_memcg.
      
        memalloc_unuse_memcg();
      
      This patch extends the remote charging API by switching to a single
      function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
      which sets the new value and returns the old one.  So a remote charging
      block will look like:
      
        old_memcg = set_active_memcg(target_memcg);
        <...>
        set_active_memcg(old_memcg);
      
      This patch is heavily based on a patch by Johannes Weiner, which can be
      found here: https://lkml.org/lkml/2020/5/28/806
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Schatzberg <dschatzberg@fb.com>
      Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. Sep 07, 2020
    • fs: Don't invalidate page buffers in block_write_full_page() · 6dbf7bb5
      Jan Kara authored
      
      If block_write_full_page() is called for a page that is beyond the
      current inode size, it truncates the page's buffers and returns 0.  This
      logic was added in 2.5.62 by commit 81eb69062588 ("fix ext3 BUG due to
      race with truncate") in the history.git tree to fix a problem with ext3
      in data=ordered mode.  That particular problem no longer exists, because
      ext3 is long gone and ext4 handles ordered data differently.  Also,
      buffers are normally invalidated by the truncate code, so there's no
      need to handle this specially in ->writepage() code.
      
      This invalidation of page buffers in block_write_full_page() causes
      issues for filesystems (e.g. ext4 or ocfs2) when the block device is
      shrunk under the filesystem's feet and metadata buffers are discarded
      while still being tracked by the journalling layer.  Although this is
      obviously "not supported", it can cause kernel crashes like:
      
      [ 7986.689400] BUG: unable to handle kernel NULL pointer dereference at
      +0000000000000008
      [ 7986.697197] PGD 0 P4D 0
      [ 7986.699724] Oops: 0002 [#1] SMP PTI
      [ 7986.703200] CPU: 4 PID: 203778 Comm: jbd2/dm-3-8 Kdump: loaded Tainted: G
      +O     --------- -  - 4.18.0-147.5.0.5.h126.eulerosv2r9.x86_64 #1
      [ 7986.716438] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
      [ 7986.723462] RIP: 0010:jbd2_journal_grab_journal_head+0x1b/0x40 [jbd2]
      ...
      [ 7986.810150] Call Trace:
      [ 7986.812595]  __jbd2_journal_insert_checkpoint+0x23/0x70 [jbd2]
      [ 7986.818408]  jbd2_journal_commit_transaction+0x155f/0x1b60 [jbd2]
      [ 7986.836467]  kjournald2+0xbd/0x270 [jbd2]
      
      which is not great.  The crash happens because bh->b_private is suddenly
      NULL although the BH_JBD flag is still set (this is because
      block_invalidatepage() cleared the BH_Mapped flag, and a subsequent bh
      lookup found a buffer without BH_Mapped set and called
      init_page_buffers(), which rewrote bh->b_private).  So just remove the
      invalidation in block_write_full_page().
      
      Note that the buffer cache invalidation performed when a block device
      changes size is already careful to avoid similar problems: it uses
      invalidate_mapping_pages(), which skips busy buffers, so it was only
      this odd block_write_full_page() behavior that could tear down bdev
      buffers under the filesystem's feet.
      
      Reported-by: Ye Bin <yebin10@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      CC: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. Aug 07, 2020
    • fs: prevent BUG_ON in submit_bh_wbc() · 377254b2
      Xianting Tian authored
      
      If a device is hot-removed --- for example, when a physical device is
      unplugged from a PCIe slot or an nbd device's network is shut down ---
      this can result in a BUG_ON() crash in submit_bh_wbc().  This is
      because when the block device dies, the buffer heads have their
      Buffer_Mapped flag cleared, leading to the crash in submit_bh_wbc().
      
      We had attempted to work around this problem in commit a17712c8
      ("ext4: check superblock mapped prior to committing").  Unfortunately,
      it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the device
      dies between the work-around check in ext4_commit_super() and the
      eventual call to submit_bh_wbc():
      
      Code path:
      ext4_commit_super
          judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
                lock_buffer(sbh)
                ...
                unlock_buffer(sbh)
                     __sync_dirty_buffer(sbh,...
                          lock_buffer(sbh)
                              judge if 'buffer_mapped(sbh))' is false, return <== added by this patch
                                  submit_bh(...,sbh)
                                      submit_bh_wbc(...,sbh,...)
      
      [100722.966497] kernel BUG at fs/buffer.c:3095! <== BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
      [100722.966503] invalid opcode: 0000 [#1] SMP
      [100722.966566] task: ffff8817e15a9e40 task.stack: ffffc90024744000
      [100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
      [100722.966575] RSP: 0018:ffffc90024747a90 EFLAGS: 00010246
      [100722.966576] RAX: 0000000000620005 RBX: ffff8818a80603a8 RCX: 0000000000000000
      [100722.966576] RDX: ffff8818a80603a8 RSI: 0000000000020800 RDI: 0000000000000001
      [100722.966577] RBP: ffffc90024747ac0 R08: 0000000000000000 R09: ffff88207f94170d
      [100722.966578] R10: 00000000000437c8 R11: 0000000000000001 R12: 0000000000020800
      [100722.966578] R13: 0000000000000001 R14: 000000000bf9a438 R15: ffff88195f333000
      [100722.966580] FS:  00007fa2eee27700(0000) GS:ffff88203d840000(0000) knlGS:0000000000000000
      [100722.966580] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [100722.966581] CR2: 0000000000f0b008 CR3: 000000201a622003 CR4: 00000000007606e0
      [100722.966582] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [100722.966583] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [100722.966583] PKRU: 55555554
      [100722.966583] Call Trace:
      [100722.966588]  __sync_dirty_buffer+0x6e/0xd0
      [100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
      [100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
      [100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
      [100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
      [100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
      [100722.966663]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
      [100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
      [100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
      [100722.966682]  __mark_inode_dirty+0x17f/0x350
      [100722.966686]  generic_update_time+0x87/0xd0
      [100722.966687]  touch_atime+0xa9/0xd0
      [100722.966690]  generic_file_read_iter+0xa09/0xcd0
      [100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
      [100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
      [100722.966707]  ? __inode_security_revalidate+0x4f/0x60
      [100722.966709]  __vfs_read+0xec/0x160
      [100722.966711]  vfs_read+0x8c/0x130
      [100722.966712]  SyS_pread64+0x87/0xb0
      [100722.966716]  do_syscall_64+0x67/0x1b0
      [100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25
      
      To address this, add a check of buffer_mapped(bh) to
      __sync_dirty_buffer().  This has the additional benefit of fixing the
      problem for other file systems as well.
      
      With this addition, we can drop the workaround in ext4_commit_super().
      
      [ Commit description rewritten by tytso. ]
      
      Signed-off-by: Xianting Tian <xianting_tian@126.com>
      Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_tian@126.com
      
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>