Commit 051f3ca0 authored by Suravee Suthikulpanit's avatar Suravee Suthikulpanit Committed by Ingo Molnar

sched/topology: Introduce NUMA identity node sched domain

On AMD Family17h-based (EPYC) system, a logical NUMA node can contain
upto 8 cores (16 threads) with the following topology.

         C0  | T0 T1 |    ||    | T0 T1 | C4
             --------|    ||    |--------
         C1  | T0 T1 | L3 || L3 | T0 T1 | C5
             --------|    ||    |--------
         C2  | T0 T1 | #0 || #1 | T0 T1 | C6
             --------|    ||    |--------
         C3  | T0 T1 |    ||    | T0 T1 | C7

Here, there are 2 last-level (L3) caches per logical NUMA node.
A socket can contain upto 4 NUMA nodes, and a system can support
upto 2 sockets. With full system configuration, current scheduler
creates 4 sched domains:

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NUMA      (span a socket: 4 nodes)
  domain3 NUMA      (span a system: 8 nodes)

Note that there is no domain to represent cpus spaning a logical
NUMA node.  With this hierarchy of sched domains, the scheduler does
not balance properly in the following cases:


 When running 8 tasks, a properly balanced system should
 schedule a task per logical NUMA node. This is not the case for
 the current scheduler.


 In some cases, threads are scheduled on the same cpu, while other
 cpus are idle. This results in run-to-run inconsistency. For example:

  taskset -c 0-7 sysbench --num-threads=8 --test=cpu \
                          --cpu-max-prime=100000 run

Total execution time ranges from 25.1s to 33.5s depending on threads
placement, where 25.1s is when all 8 threads are balanced properly
on 8 cpus.

Introducing NUMA identity node sched domain, which is based on how
SRAT/SLIT table define a logical NUMA node. This results in the following
hierarchy of sched domains on the same system described above.

  domain0 SMT       (span a core)
  domain1 MC        (span a last-level-cache)
  domain2 NODE      (span a logical NUMA node)
  domain3 NUMA      (span a socket: 4 nodes)
  domain4 NUMA      (span a system: 8 nodes)

This fixes the improper load balancing cases mentioned above.
Signed-off-by: default avatarSuravee Suthikulpanit <>
Signed-off-by: default avatarPeter Zijlstra (Intel) <>
Cc: Linus Torvalds <>
Cc: Mike Galbraith <>
Cc: Peter Zijlstra <>
Cc: Thomas Gleixner <>
Link: Ingo Molnar's avatarIngo Molnar <>
parent ed4ad1ca
......@@ -1332,6 +1332,10 @@ void sched_init_numa(void)
if (!sched_domains_numa_distance)
/* Includes NUMA identity node at level 0. */
sched_domains_numa_distance[level++] = curr_distance;
sched_domains_numa_levels = level;
* O(nr_nodes^2) deduplicating selection sort -- in order to find the
* unique distances in the node_distance() table.
......@@ -1379,8 +1383,7 @@ void sched_init_numa(void)
* 'level' contains the number of unique distances, excluding the
* identity distance node_distance(i,i).
* 'level' contains the number of unique distances
* The sched_domains_numa_distance[] array includes the actual distance
* numbers.
......@@ -1441,10 +1444,19 @@ void sched_init_numa(void)
for (i = 0; sched_domain_topology[i].mask; i++)
tl[i] = sched_domain_topology[i];
* Add the NUMA identity distance, aka single NODE.
tl[i++] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
.numa_level = 0,
* .. and append 'j' levels of NUMA goodness.
for (j = 0; j < level; i++, j++) {
for (j = 1; j < level; i++, j++) {
tl[i] = (struct sched_domain_topology_level){
.mask = sd_numa_mask,
.sd_flags = cpu_numa_flags,
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment