Skip to content
Snippets Groups Projects
  1. Dec 31, 2021
  2. Nov 17, 2021
  3. Nov 10, 2021
  4. Nov 06, 2021
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim when no progress is being made · 69392a40
      Mel Gorman authored
      Memcg reclaim throttles on congestion if no reclaim progress is made.
      This makes little sense, it might be due to writeback or a host of other
      factors.
      
      For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
      in the page allocator if it is failing to make progress.  Kswapd
      throttles if too many pages are under writeback and marked for immediate
      reclaim.
      
      This patch explicitly throttles if reclaim is failing to make progress.
      
      [vbabka@suse.cz: Remove redundant code]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      69392a40
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim and compaction when too may pages are isolated · d818fca1
      Mel Gorman authored
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense, excessive
      parallelisation has nothing to do with writeback or congestion.
      
      This patch creates an additional workqueue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some false
      positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
      throttle again if necessary.
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d818fca1
    • Mel Gorman's avatar
      mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Mel Gorman authored
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series that removes all calls to congestion_wait in mm/ and deletes
      wait_iff_congested.  It's not a clever implementation but
      congestion_wait has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU is one
      possibility that reclaim may be slow, there is also the problem of too
      many pages being isolated and reclaim failing for other reasons
      (elevated references, too many pages isolated, excessive LRU contention
      etc).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      timeout of ssh if executing remotely as ssh itself can get throttled and
      the connection may timeout.
      
      stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
      to check the impact as the number of direct reclaimers increase.  It has
      four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
       - X file writers which is fio randomly writing X files where the total
         size of the files add up to the allowed dirty_ratio. fio is allowed
         to run for a warmup period to allow some file-backed pages to
         accumulate. The duration of the warmup is based on the best-case
         linear write speed of the storage.
      
       - Y file readers which is fio randomly reading small files
      
       - Z anon memory hogs which continually map (100-dirty_ratio)% of memory
      
       - Total estimated WSS = (100+dirty_ration) percentage of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
      First the results of the "anon latency" latency
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
      For most thread counts, the time to mmap() is unfortunately increased.
      In earlier versions of the series, this was lower but a large number of
      throttling events were reaching their timeout increasing the amount of
      inefficient scanning of the LRU.  There is no prioritisation of reclaim
      tasks making progress based on each tasks rate of page allocation versus
      progress of reclaim.  The variance is also impacted for high worker
      counts but in all cases, the differences in latency are not
      statistically significant due to very large maximum outliers.  Max-90
      shows that 90% of the stalls are comparable but the Max results show the
      massive outliers which are increased to to stalling.
      
      It is expected that this will be very machine dependant.  Due to the
      test design, reclaim is difficult so allocations stall and there are
      variances depending on whether THPs can be allocated or not.  The amount
      of memory will affect exactly how bad the corner cases are and how often
      they trigger.  The warmup period calculation is not ideal as it's based
      on linear writes where as fio is randomly writing multiple files from
      multiple tasks so the start state of the test is variable.  For example,
      these are the latencies on a single-socket machine that had more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
      The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
      stalling.
      
      The high-level /proc/vmstats show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
      Kswapd scanned less pages but the detailed pattern is different.  The
      vanilla kernel scans slowly over time where as the patches exhibits
      burst patterns of scan activity.  Direct reclaim scanning is reduced by
      52% due to stalling.
      
      The pattern for stealing pages is also slightly different.  Both kernels
      exhibit spikes but the vanilla kernel when reclaiming shows pages being
      reclaimed over a period of time where as the patches tend to reclaim in
      spikes.  The difference is that vanilla is not throttling and instead
      scanning constantly finding some pages over time where as the patched
      kernel throttles and reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
      For direct reclaim, vanilla scanned 90.59% of pages where as with the
      patches, 77.37% were direct reclaim due to throttling
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
      The fast majority of wait_iff_congested calls do not stall at all.  What
      is likely happening is that cond_resched() reschedules the task for a
      short period when the BDI is not registering congestion (which it never
      will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
      congestion_wait if called always exceeds the timeout as there is no
      trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority of events did not stall or stalled for a short period.
      Roughly 16% of stalls reached the timeout before expiry.  For direct
      reclaim, the number of times stalled for each reason were
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The most common reason to stall was due to excessive pages tagged for
      immediate reclaim at the tail of the LRU followed by a failure to make
      forward.  A relatively small number were due to too many pages isolated
      from the LRU by parallel threads
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
      The majority hit the timeout in direct reclaim context although a
      sizable number did not stall at all.  This is very different to kswapd
      where only a tiny percentage of stalls due to writeback reached the
      timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
      wait_iff_congested is almost completely broken with few exceptions.
      This patch adds a new node-based workqueue and tracks the number of
      throttled tasks and pages written back since throttling started.  If
      enough pages belonging to the node are written back then the throttled
      tasks will wake early.  If not, the throttled tasks sleeps until the
      timeout expires.
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J . Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8cd7c588
    • Gang Li's avatar
      mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN · 627ae828
      Gang Li authored
      By using DECLARE_EVENT_CLASS and TRACE_EVENT_FN, we can save a lot of
      space from duplicate code.
      
      Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
      
      
      Signed-off-by: default avatarGang Li <ligang.bdlg@bytedance.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      627ae828
    • Gang Li's avatar
      mm: mmap_lock: remove redundant newline in TP_printk · f595e341
      Gang Li authored
      Ftrace core will add newline automatically on printing, so using it in
      TP_printkcreates a blank line.
      
      Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
      
      
      Signed-off-by: default avatarGang Li <ligang.bdlg@bytedance.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f595e341
  5. Nov 02, 2021
  6. Oct 26, 2021
    • Chao Yu's avatar
      f2fs: multidevice: support direct IO · 71f2c820
      Chao Yu authored
      
      Commit 3c62be17 ("f2fs: support multiple devices") missed
      to support direct IO for multiple device feature, this patch
      adds to support the missing part of multidevice feature.
      
      In addition, for multiple device image, we should be aware of
      any issued direct write IO rather than just buffered write IO,
      so that fsync and syncfs can issue a preflush command to the
      device where direct write IO goes, to persist user data for
      posix compliant.
      
      Signed-off-by: default avatarChao Yu <chao@kernel.org>
      Signed-off-by: default avatarJaegeuk Kim <jaegeuk@kernel.org>
      71f2c820
  7. Oct 20, 2021
  8. Oct 19, 2021
  9. Oct 18, 2021
  10. Oct 17, 2021
    • Gao Xiang's avatar
      erofs: get compression algorithms directly on mapping · 8f899262
      Gao Xiang authored
      Currently, z_erofs_map_blocks_iter() returns whether extents are
      compressed or not, and the decompression frontend gets the specific
      algorithms then.
      
      It works but not quite well in many aspests, for example:
       - The decompression frontend has to deal with whether extents are
         compressed or not again and lookup the algorithms if compressed.
         It's duplicated and too detailed about the on-disk mapping.
      
       - A new secondary compression head will be introduced later so that
         each file can have 2 compression algorithms at most for different
         type of data. It could increase the complexity of the decompression
         frontend if still handled in this way;
      
       - A new readmore decompression strategy will be introduced to get
         better performance for much bigger pcluster and lzma, which needs
         the specific algorithm in advance as well.
      
      Let's look up compression algorithms in z_erofs_map_blocks_iter()
      directly instead.
      
      Link: https://lore.kernel.org/r/20211008200839.24541-2-xiang@kernel.org
      
      
      Reviewed-by: default avatarChao Yu <chao@kernel.org>
      Reviewed-by: default avatarYue Hu <huyue2@yulong.com>
      Signed-off-by: default avatarGao Xiang <hsiangkao@linux.alibaba.com>
      8f899262
  11. Oct 16, 2021
  12. Oct 12, 2021
  13. Oct 08, 2021
  14. Oct 05, 2021
    • Dave Wysochanski's avatar
      cachefiles: Fix oops with cachefiles_cull() due to NULL object · a0e25f0a
      Dave Wysochanski authored
      
      When cachefiles_cull() calls cachefiles_bury_object(), it passes
      a NULL object.  When this occurs, either trace_cachefiles_unlink()
      or trace_cachefiles_rename() may oops due to the NULL object.
      Check for NULL object in the tracepoint and if so, set debug_id
      to MAX_UINT as was done in 2908f5e1.
      
      The following oops was seen with xfstests generic/100.
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      ...
      RIP: 0010:trace_event_raw_event_cachefiles_unlink+0x4e/0xa0 [cachefiles]
      ...
       Call Trace:
         cachefiles_bury_object+0x242/0x430 [cachefiles]
         ? __vfs_removexattr_locked+0x10f/0x150
         ? vfs_removexattr+0x51/0xd0
         cachefiles_cull+0x84/0x120 [cachefiles]
         cachefiles_daemon_cull+0xd1/0x120 [cachefiles]
         cachefiles_daemon_write+0x158/0x190 [cachefiles]
         vfs_write+0xbc/0x260
         ksys_write+0x4f/0xc0
         do_syscall_64+0x3b/0x90
      
      The following oops was seen with xfstests generic/290.
      BUG: kernel NULL pointer dereference, address: 0000000000000010
      ...
      RIP: 0010:trace_event_raw_event_cachefiles_rename+0x54/0xa0 [cachefiles]
      ...
      Call Trace:
        cachefiles_bury_object+0x35c/0x430 [cachefiles]
        cachefiles_cull+0x84/0x120 [cachefiles]
        cachefiles_daemon_cull+0xd1/0x120 [cachefiles]
        cachefiles_daemon_write+0x158/0x190 [cachefiles]
        vfs_write+0xbc/0x260
        ksys_write+0x4f/0xc0
        do_syscall_64+0x3b/0x90
      
      Fixes: 2908f5e1 ("fscache: Add a cookie debug ID and use that in traces")
      Signed-off-by: default avatarDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Link: https://listman.redhat.com/archives/linux-cachefs/2021-October/msg00009.html
      a0e25f0a
  15. Oct 04, 2021
  16. Oct 02, 2021
    • Dave Wysochanski's avatar
      cachefiles: Fix oops in trace_cachefiles_mark_buried due to NULL object · 6e9bfdcf
      Dave Wysochanski authored
      
      In cachefiles_mark_object_buried, the dentry in question may not have an
      owner, and thus our cachefiles_object pointer may be NULL when calling
      the tracepoint, in which case we will also not have a valid debug_id to
      print in the tracepoint.
      
      Check for NULL object in the tracepoint and if so, just set debug_id to
      MAX_UINT as was done in 2908f5e1 ("fscache: Add a cookie debug ID
      and use that in traces").
      
      This fixes the following oops:
      
          FS-Cache: Cache "mycache" added (type cachefiles)
          CacheFiles: File cache on vdc registered
          ...
          Workqueue: fscache_object fscache_object_work_func [fscache]
          RIP: 0010:trace_event_raw_event_cachefiles_mark_buried+0x4e/0xa0 [cachefiles]
          ....
          Call Trace:
           cachefiles_mark_object_buried+0xa5/0xb0 [cachefiles]
           cachefiles_bury_object+0x270/0x430 [cachefiles]
           cachefiles_walk_to_object+0x195/0x9c0 [cachefiles]
           cachefiles_lookup_object+0x5a/0xc0 [cachefiles]
           fscache_look_up_object+0xd7/0x160 [fscache]
           fscache_object_work_func+0xb2/0x340 [fscache]
           process_one_work+0x1f1/0x390
           worker_thread+0x53/0x3e0
           kthread+0x127/0x150
      
      Fixes: 2908f5e1 ("fscache: Add a cookie debug ID and use that in traces")
      Signed-off-by: default avatarDave Wysochanski <dwysocha@redhat.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: linux-cachefs@redhat.com
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6e9bfdcf
  17. Sep 29, 2021
  18. Sep 28, 2021
  19. Sep 27, 2021
  20. Sep 23, 2021
  21. Sep 13, 2021
    • David Howells's avatar
      afs: Try to avoid taking RCU read lock when checking vnode validity · 4fe6a946
      David Howells authored
      
      Try to avoid taking the RCU read lock when checking the validity of a
      vnode's callback state.  The only thing it's needed for is to pin the
      parent volume's server list whilst we search it to find the record of the
      server we're currently using to see if it has been reinitialised (ie. it
      sent us a CB.InitCallBackState* RPC).
      
      Do this by the following means:
      
       (1) Keep an additional per-cell counter (fs_s_break) that's incremented
           each time any of the fileservers in the cell reinitialises.
      
           Since the new counter can be accessed without RCU from the vnode, we
           can check that first - and only if it differs, get the RCU read lock
           and check the volume's server list.
      
       (2) Replace afs_get_s_break_rcu() with afs_check_server_good() which now
           indicates whether the callback promise is still expected to be present
           on the server.  This does the checks as described in (1).
      
       (3) Restructure afs_check_validity() to take account of the change in (2).
      
           We can also get rid of the valid variable and just use the need_clear
           variable with the addition of the afs_cb_break_no_promise reason.
      
       (4) afs_check_validity() probably shouldn't be altering vnode->cb_v_break
           and vnode->cb_s_break when it doesn't have cb_lock exclusively locked.
      
           Move the change to vnode->cb_v_break to __afs_break_callback().
      
           Delegate the change to vnode->cb_s_break to afs_select_fileserver()
           and set vnode->cb_fs_s_break there also.
      
       (5) afs_validate() no longer needs to get the RCU read lock around its
           call to afs_check_validity() - and can skip the call entirely if we
           don't have a promise.
      
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarMarkus Suvanto <markus.suvanto@gmail.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/163111669583.283156.1397603105683094563.stgit@warthog.procyon.org.uk/
      4fe6a946
  22. Sep 08, 2021
Loading