Tuna Micro Benchmark
Environment
- Operating System: Ubuntu 22.04.3 LTS (Jammy Jellyfish)
- Kernel: 5.13.0-rc6 with TPP patch
- g++ version: g++-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
- Python3 version: Python 3.10.12
- Make version: GNU Make 4.3
- Additional dependencies: libnuma-dev
- Machine requirement: At least 2 NUMA nodes; see APPENDIX C for more information
Note: The application may also run on older versions. Users can check compatibility with their existing versions and adjust if needed.
Kernel Requirement
This application was originally intended for Linux kernel 5.13-rc6 with the TPP patch.
Please make sure the kernel parameter /proc/sys/vm/demote_scale_factor exists,
as it is the key parameter the application uses to auto-adjust the amount of fast memory
available on the machine.
If it does not exist, please install a kernel with the TPP patch. Then run the commands under Pre-run Steps to enable page promotion and demotion.
How to Enable Promotion and Demotion on A 2-socket Machine without Tiered Memory
To simulate a tiered-memory environment, the following kernel changes are needed.
The example below configures node 0 as the top-tier node and node 1
as the slow-memory node.
For Kernel Version 6.0 and earlier:
- In mm/migrate.c, find the function int next_demotion_node(int node) and change return target; to return (node == 0) ? 1 : NUMA_NO_NODE;
For Kernel Versions later than 6.0:
- In mm/memory-tiers.c, find the function int next_demotion_node(int node) and change return target; to return (node == 0) ? 1 : NUMA_NO_NODE;
- In mm/memory-tiers.c, find the function bool node_is_toptier(int node) and change return target; to return 0;
Pre-run Steps
echo 1 | sudo tee /sys/kernel/mm/numa/demotion_enabled
echo 2 | sudo tee /proc/sys/kernel/numa_balancing
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Also, REMEMBER TO DISABLE ALL THE CPUS IN THE NODE THAT ACTS AS THE SLOW TIER.
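To confirm that the pre-run steps took effect, the settings can be read back. The following is a minimal sketch, not part of the repository: the helper names read_setting, report, and governors are hypothetical, and the paths are copied from the commands above. It only reports values and does not judge whether they are correct, since the exact text a kernel prints for each file may vary between versions.

```python
import glob
from pathlib import Path

# Paths copied from the pre-run commands above.
SETTINGS = [
    "/sys/kernel/mm/numa/demotion_enabled",
    "/proc/sys/kernel/numa_balancing",
]
GOVERNOR_GLOB = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"


def read_setting(path):
    """Return the stripped contents of a settings file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report(paths=SETTINGS):
    """Map each settings path to its current value (None if missing)."""
    return {path: read_setting(path) for path in paths}


def governors(pattern=GOVERNOR_GLOB):
    """Return the set of distinct CPU frequency governors currently in use."""
    values = (read_setting(path) for path in glob.glob(pattern))
    return {value for value in values if value is not None}
```

Printing report() and governors() after running the commands shows the current values; a None entry means the file does not exist on the running kernel.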
How to run the application
Since the application uses system calls such as mlock and relies on scripts
that require sudo permission, it is recommended to configure passwordless sudo
access temporarily.
Limitations
This application has only been verified on two-socket machines where all CPUs of one socket are shut down so that it acts as a CPU-less NUMA node. Detailed information can be found in APPENDIX A.
This application relies on setting the kernel parameter
/proc/sys/vm/demote_scale_factor in the TPP kernel (see
APPENDIX B) to achieve promotion and demotion.
Kernel tiering-0.72
also provides a similar kernel parameter, /proc/sys/vm/toptier_scale_factor. However, due to hardware limitations, I have
not yet been able to test whether the application works with that kernel.
(Edit: consider applying the kernel code changes mentioned above to find out.)
Work to Be Done
Demotion in Linux 5.13-rc6 behaves very differently from demotion in the recent Linux 6.13.
In the recent version, kswapd is occasionally put to sleep for long periods while a memory-heavy
application is running, whereas in Linux 5.13-rc6 kswapd keeps running throughout the process.
As a result, the application cannot reach the desired number of demotions, which in turn affects
promotion (if you change the application to promote only, it works fine, meaning promotion
still works as expected).
One might argue that this is intentional. Indeed, with less aggressive demotion, the overall performance of memory-bound applications is much better.
So far I have only observed this phenomenon; I have not had the time to measure how the demotion rate changes from second to second.
If you want to study why this is happening, I would suggest:
- Write a program that logs the counters in /proc/vmstat whose names start with pgpromote, pgdemote, pgscan, or pgsteal at a short interval (maybe every 0.1 s?), and use a plotting library to visualize how those counters change over time, which could give important information
- Use binary search over kernel versions to find out which version introduced the change that causes the issue
- Use kernel probes to find out which functions call wakeup_kswapd, kswapd_run, kswapd_stop, etc.
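The first suggestion above could be sketched roughly as follows. This is an illustrative script, not part of the repository: the function names parse_vmstat, counter_deltas, and log_vmstat are hypothetical, and it assumes the usual "name value" line format of /proc/vmstat. It prints per-interval deltas as CSV so the output can be plotted with any tool.

```python
import re
import time

# Counter-name prefixes of interest, taken from the suggestion above.
PATTERN = re.compile(r"^(pgpromote|pgdemote|pgscan|pgsteal)")


def parse_vmstat(text):
    """Return {counter_name: value} for lines whose name matches PATTERN."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and PATTERN.match(parts[0]):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(previous, current):
    """Change of each counter between two samples (missing ones count as 0)."""
    return {name: value - previous.get(name, 0) for name, value in current.items()}


def log_vmstat(interval=0.1, samples=50, path="/proc/vmstat"):
    """Print a CSV header, then one row of counter deltas per sample."""
    with open(path) as f:
        previous = parse_vmstat(f.read())
    names = sorted(previous)
    print(",".join(["time_s"] + names))
    start = time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_vmstat(f.read())
        deltas = counter_deltas(previous, current)
        elapsed = time.monotonic() - start
        print(",".join([f"{elapsed:.2f}"] + [str(deltas.get(n, 0)) for n in names]))
        previous = current
```

Calling log_vmstat(interval=0.1, samples=600) while the benchmark is running and redirecting the output to a file gives about one minute of data. Which counters appear depends on the kernel version, so the column set may differ between 5.13-rc6 and newer kernels.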
Note
If you decide to continue running the process with Linux 5.13-rc6, or have discovered a solution
that enables the application to run on other kernel versions, please keep in mind that this
is a memory-bound application when the computation is not configured to be dense enough. In
such case, enabling too many threads
My GRUB Config
APPENDIX A: Lab Machine Information
➜  ~ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
node 0 size: 29667 MB
node 0 free: 20714 MB
node 1 cpus:
node 1 size: 96733 MB
node 1 free: 90148 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
➜  ~ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
  Off-line CPU(s) list:  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           1
    Stepping:            4
    CPU max MHz:         3700.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            5200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs
                         bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
                         tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept
                         vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt
                         xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   384 KiB (12 instances)
  L1i:                   384 KiB (12 instances)
  L2:                    12 MiB (12 instances)
  L3:                    19.3 MiB (1 instance)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
  NUMA node1 CPU(s):
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT vulnerable
APPENDIX B: Kernel Information
Linux 5.13-rc6 with TPP Patch
APPENDIX C: Machine Configuration
Running this application on machines with CXL or PMem is highly recommended. However, to see the effect of this application, a machine with 2 NUMA nodes can be used, though this is not recommended since the latency and bandwidth would be different. If you are using a machine with 2 NUMA nodes, you will need to follow the configuration below to make this application work.
First, before building the kernel, open the patched source
file mm/migrate.c and change the return value of
int next_demotion_node(int node) to a specific node ID. Either 0 or 1 is
acceptable. Then compile and install the kernel.
After booting, use the command numactl -H to find the CPU core IDs. If you set node
0 as next_demotion_node, you will need to shut down all the CPUs in node 0
using the command echo 0 | sudo tee /sys/devices/system/cpu/cpu$i/online, replacing
$i with the appropriate CPU number. If you set node 1 as next_demotion_node, you
will need to shut down all the CPUs in node 1.
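Typing the offline command once per CPU is tedious, so the list of commands can be generated instead. The sketch below is an illustration, not part of the repository: parse_cpulist and offline_commands are hypothetical helper names. It assumes the node's CPUs are listed in /sys/devices/system/node/node<N>/cpulist in the kernel's usual list format (for example 1,3,5-7), and it only builds the command strings rather than executing them, so they can be reviewed first.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel CPU list such as '1,3,5-7' into [1, 3, 5, 6, 7]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list (node with no online CPUs) yields []
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def offline_commands(node, sysfs="/sys/devices/system"):
    """Build one 'echo 0 | sudo tee .../online' command per CPU of the node."""
    text = Path(f"{sysfs}/node/node{node}/cpulist").read_text()
    return [
        f"echo 0 | sudo tee {sysfs}/cpu/cpu{cpu}/online"
        for cpu in parse_cpulist(text)
    ]
```

For example, print("\n".join(offline_commands(1))) lists the commands for node 1. Run it before any CPUs of that node are taken offline, because the list is read from the node's current state.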