
Tuna Micro Benchmark

Environment

  • Operating System: Ubuntu 22.04.3 LTS (Jammy Jellyfish)
  • Kernel: 5.13.0-rc6 with TPP patch
  • g++ version: g++-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  • Python3 version: Python 3.10.12
  • Make version: GNU Make 4.3
  • Additional dependencies: libnuma-dev
  • Machine requirement: at least 2 NUMA nodes, see APPENDIX C for more information

Note: The application may also run on older versions. Users can check compatibility with their existing versions and adjust if needed.

Kernel Requirement

This application was originally intended for Linux kernel 5.13-rc6 with the TPP patch.

Please ensure that the kernel parameter /proc/sys/vm/demote_scale_factor exists, as this is the key parameter the application uses to auto-adjust the amount of fast memory available on the machine.
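For a quick scripted check, a minimal sketch (the helper name and its optional path argument are mine, for illustration and testing; normally call it with no argument):

```shell
# Minimal check for the TPP demotion knob. The optional argument overrides
# the path (only so the helper can be exercised off-box).
has_demote_scale_factor() {
  [ -r "${1:-/proc/sys/vm/demote_scale_factor}" ]
}

if has_demote_scale_factor; then
  cat /proc/sys/vm/demote_scale_factor
else
  echo "demote_scale_factor not found: is the TPP-patched kernel installed?" >&2
fi
```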

If it does not, please install a kernel with the TPP patch, then execute the commands below to enable page promotion and demotion.

How to Enable Promotion and Demotion on a 2-socket Machine without Tiered Memory

To simulate an environment with tiered memory, the following kernel source changes are needed.

The following example configures node 0 as the top-tier node and node 1 as the slow-memory node.

For Kernel Version 6.0 and earlier:

  • In mm/migrate.c, find the function int next_demotion_node(int node) and change return target; to return (node == 0) ? 1 : NUMA_NO_NODE;

For Kernel Version 6.1 and later:

  • In mm/memory-tiers.c, find the function int next_demotion_node(int node) and change return target; to return (node == 0) ? 1 : NUMA_NO_NODE;
  • In mm/memory-tiers.c, find the function bool node_is_toptier(int node) and change its return statement to return 0;
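The next_demotion_node edit can be scripted; a minimal sed sketch (the helper name is mine, and it assumes return target; occurs only inside next_demotion_node, so verify the result with git diff before building):

```shell
# Hypothetical helper (not part of the repo): rewrite the return statement
# of next_demotion_node so node 0 always demotes to node 1.
patch_next_demotion_node() {   # usage: patch_next_demotion_node <file>
  sed -i 's/return target;/return (node == 0) ? 1 : NUMA_NO_NODE;/' "$1"
}

# Example (run from the kernel source tree; use mm/memory-tiers.c on >= 6.1):
# patch_next_demotion_node mm/migrate.c
```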

Pre-run Steps

echo 1 | sudo tee /sys/kernel/mm/numa/demotion_enabled

echo 2 | sudo tee /proc/sys/kernel/numa_balancing

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Also, REMEMBER TO DISABLE ALL THE CPUS IN THE NODE THAT ACTS AS THE SLOW TIER.
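A sketch of offlining every CPU in the slow-tier node by walking sysfs (the node_cpus helper and the SYSFS_NODE_DIR override are mine; node 1 is assumed to be the slow tier, and note that cpu0 typically cannot be offlined):

```shell
# Sysfs node directory (parameterized only so the helper can be tested off-box).
SYSFS_NODE_DIR=${SYSFS_NODE_DIR:-/sys/devices/system/node}

node_cpus() {   # print the CPU ids belonging to NUMA node $1, one per line
  for d in "$SYSFS_NODE_DIR/node$1"/cpu[0-9]*; do
    [ -e "$d" ] && echo "${d##*/cpu}"
  done
}

# Offline every CPU in the slow-tier node. Guarded so it only runs when
# explicitly requested: OFFLINE_SLOW_TIER=1 sh this_script.sh
if [ "${OFFLINE_SLOW_TIER:-0}" = 1 ]; then
  for cpu in $(node_cpus 1); do
    echo 0 | sudo tee "/sys/devices/system/cpu/cpu$cpu/online"
  done
fi
```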

How to run the application

Since the application uses system calls such as mlock and relies on scripts that require sudo permission, it is recommended to temporarily configure passwordless sudo access.

Limitations

This application has only been verified on two-socket machines, with all CPUs on one socket shut down to simulate a CPU-less NUMA node. Detailed information can be found in APPENDIX A.

This application relies on setting the kernel parameter /proc/sys/vm/demote_scale_factor in the TPP kernel (see APPENDIX B) to achieve promotion and demotion. Kernel tiering-0.72 also provides a similar kernel parameter, /proc/sys/vm/toptier_scale_factor. However, due to device limitations, I have not been able to test whether the application works on that kernel. (Edit: consider applying the kernel code changes mentioned above to find out.)

Work to Be Done

The behavior of demotion in Linux 5.13-rc6 is very different from that of recent versions such as Linux 6.13. In recent versions, kswapd is occasionally put to sleep for long periods while a memory-heavy application is running, whereas in Linux 5.13-rc6 kswapd keeps running throughout. This prevents the application from achieving the desired number of demotions, which in turn affects promotion. (If you change the application to promote only, it works just fine, meaning promotion still behaves as expected.)

One might argue that this behavior is intentional. Indeed, with less aggressive demotion, the overall performance of memory-bound applications is much better.

Previously, I only observed this phenomenon; I did not have time to measure how the demotion rate changes from second to second.

If one wants to study why this is happening, I would suggest:

  • Write a program that logs the counters beginning with pgpromote, pgdemote, pgscan, and pgsteal in /proc/vmstat at a short interval (maybe every 0.1 s?), and use a plotting library to visualize the change curves of those counters, which could give important information
  • Bisect kernel versions to find which one introduced the change that causes the issue
  • Use kernel probes to find the functions that call wakeup_kswapd, kswapd_run, kswapd_stop, etc.
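The first suggestion can be sketched as a small shell logger (the log_vmstat helper and its arguments are my assumption, not part of the repo; the vmstat-file argument exists only for testing):

```shell
# Hypothetical logger: sample the promotion/demotion counters from
# /proc/vmstat at a fixed interval, printing one timestamped line per sample.
log_vmstat() {   # usage: log_vmstat <samples> <interval_seconds> [vmstat_file]
  i=0
  while [ "$i" -lt "$1" ]; do
    printf '%s %s\n' "$(date +%s.%N)" \
      "$(grep -E '^(pgpromote|pgdemote|pgscan|pgsteal)' "${3:-/proc/vmstat}" | tr '\n' ' ')"
    sleep "$2"
    i=$((i + 1))
  done
}

# e.g. ten minutes at 10 Hz, then plot the columns with any plotting tool:
# log_vmstat 6000 0.1 >> vmstat.log
```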

Note

If you decide to keep running the process on Linux 5.13-rc6, or have discovered a solution that enables the application to run on other kernel versions, please keep in mind that this is a memory-bound application when the computation is not configured to be dense enough. In that case, enabling too many threads will saturate memory bandwidth rather than improve performance.

My GRUB Config

See here

APPENDIX A: Lab Machine Information

➜  ~ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
node 0 size: 29667 MB
node 0 free: 20714 MB
node 1 cpus:
node 1 size: 96733 MB
node 1 free: 90148 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
➜  ~ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
  Off-line CPU(s) list:  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            4
    CPU max MHz:         3700.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            5200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs b
                         ts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_
                         deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept
                         vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsav
                         ec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   384 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    12 MiB (12 instances)
  L3:                    19.3 MiB (1 instance)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
  NUMA node1 CPU(s):
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT vulnerable

APPENDIX B: Kernel Information

Linux 5.13-rc6 with TPP Patch

APPENDIX C: Machine Configuration

It is highly recommended to run this application on machines with CXL or PMem. However, to see the application's effect, a machine with two NUMA nodes can also be used, though this is not recommended since the latency and bandwidth will differ. If you are using a machine with two NUMA nodes, you will need to follow the configuration below to make the application work.

First, when building the kernel, locate int next_demotion_node(int node) in the patched source file mm/migrate.c and change its return value to a fixed node ID (either 0 or 1 is acceptable). Then compile and install the kernel.

After booting, use the command numactl -H to find the CPU core IDs. If you set node 0 as the demotion target, shut down all the CPUs in node 0 with echo 0 | sudo tee /sys/devices/system/cpu/cpu$i/online, replacing $i with the appropriate CPU number; if you set node 1 as the demotion target, shut down all the CPUs in node 1.

Reference

How to limit the available memory on one node
