# COMP SCI 557 Assignment 1
## Environment

- CPU: AMD Ryzen 7 7735HS with Radeon Graphics
- Memory: Configured Memory Speed: 4800 MT/s, dual channel
- Compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
- OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
- Compile commands:

  ```sh
  g++ -O2 -fopenmp -c Laplacian.cpp -o Laplacian.o
  g++ -O2 -fopenmp -c main.cpp -o main.o
  g++ -O2 -fopenmp -o laplacian_app Laplacian.o main.o
  ```
## Sample run of benchmark
```sh
$ export OMP_NUM_THREADS=1
$ ./laplacian_app
Running test iteration 1 [Elapsed time : 649.355ms]
Running test iteration 2 [Elapsed time : 269.287ms]
Running test iteration 3 [Elapsed time : 194.759ms]
Running test iteration 4 [Elapsed time : 171.439ms]
Running test iteration 5 [Elapsed time : 172.44ms]
...
Running test iteration 29 [Elapsed time : 171.601ms]
Running test iteration 30 [Elapsed time : 172.106ms]
Avg. time : 175.922ms
```

```sh
$ export OMP_NUM_THREADS=16
$ ./laplacian_app
Running test iteration 1 [Elapsed time : 85.1133ms]
Running test iteration 2 [Elapsed time : 36.7016ms]
Running test iteration 3 [Elapsed time : 35.8156ms]
Running test iteration 4 [Elapsed time : 36.3234ms]
Running test iteration 5 [Elapsed time : 35.7808ms]
...
Running test iteration 29 [Elapsed time : 37.4332ms]
Running test iteration 30 [Elapsed time : 33.5464ms]
Avg. time : 35.7082ms
```
Using all 16 hardware threads instead of 1 cuts the average runtime from 175.9 ms to 35.7 ms, a speedup of roughly 4.9x.
## Performance & Bandwidth
### Theoretical (Peak) Memory Bandwidth

4800 MT/s * 8 bytes * 2 channels = 76.8 GB/s
### Effective Bandwidth

- At a minimum, we need to read each entry `u[i][j][k]` and write each entry `Lu[i][j][k]`.
- Each entry is a `float` (4 bytes), and there are `N = 512 * 512 * 512 = 134,217,728` elements.
- This gives us `2 * (4 * N) = 1,073,741,824` bytes moved, which is 1 GiB.
- With the `-O2` flag, the benchmark finished at an average time of 0.0357 s.
- This gives us the effective bandwidth: `1 GiB / 0.0357 s ≈ 28.01 GiB/s`.
- The bandwidth is slightly smaller than that observed in `LaplacianStencil_0_0`, which is `2 GiB / 0.0663 s ≈ 30.17 GiB/s`.
- According to the `stream` benchmark, the practical bandwidth should be around 25-28 GB/s (see the table below), which is very close to the calculated bandwidth.
| Stream | Best Rate MB/s | Best Rate GB/s |
| --- | --- | --- |
| Copy  | 40700.2 | 39.7462891 |
| Scale | 25994.3 | 25.3850586 |
| Add   | 28723.9 | 28.0506836 |
| Triad | 28736.0 | 28.0625 |
## Swapping the order of the for loops

### Z -> Y -> X
- Runtime:

  ```sh
  $ export OMP_NUM_THREADS=16
  $ ./laplacian_app
  Running test iteration 1 [Elapsed time : 946.327ms]
  Running test iteration 2 [Elapsed time : 493.813ms]
  Running test iteration 3 [Elapsed time : 502.126ms]
  Running test iteration 4 [Elapsed time : 488.225ms]
  Running test iteration 5 [Elapsed time : 485.701ms]
  ...
  Running test iteration 29 [Elapsed time : 497.446ms]
  Running test iteration 30 [Elapsed time : 490.027ms]
  Avg. time : 493.369ms
  ```
- Commentary:

  > Does it make sense if things got much slower (if they did ... or faster if they did, too!)? Does it make sense that a given change had a certain impact, but maybe not as severe an impact as a different one had?
  This access pattern is the slowest of all six, which is expected. The linear offset is

  `Offset = (512 * 512) * i + 512 * j + k`

  Since `i` increments in the inner loop, each iteration skips `512 * 512 = 262,144` floats (1 MiB) in memory. This is bad for both the caches and memory pages because of poor locality: almost none of each fetched cache line is reused, and because every access lands on a different page, the hardware prefetcher (which typically does not cross page boundaries) cannot hide the cost of fetching the next page from memory.
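For illustration only, here is a sketch of what the Z -> Y -> X nest looks like (this is not the actual `Laplacian.cpp` code; the layout and names are assumed from the offset formula above):

```cpp
#include <cstddef>

// Sketch of the Z -> Y -> X loop order, assuming the row-major layout implied
// by Offset = (512*512)*i + 512*j + k. The innermost loop increments i, so
// for X = Y = Z = 512 consecutive iterations are Y*Z = 262,144 floats
// (1 MiB) apart in memory.
void laplacian_zyx(const float* u, float* Lu,
                   std::size_t X, std::size_t Y, std::size_t Z) {
    for (std::size_t k = 1; k < Z - 1; ++k)            // outer: z
        for (std::size_t j = 1; j < Y - 1; ++j)        // middle: y
            for (std::size_t i = 1; i < X - 1; ++i) {  // inner: x (huge stride)
                const std::size_t c = i * Y * Z + j * Z + k;
                Lu[c] = u[c + Y * Z] + u[c - Y * Z]    // x neighbors
                      + u[c + Z]     + u[c - Z]        // y neighbors
                      + u[c + 1]     + u[c - 1]        // z neighbors
                      - 6.0f * u[c];
            }
}
```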
### X -> Z -> Y
- Runtime:

  ```sh
  $ export OMP_NUM_THREADS=16
  $ ./laplacian_app
  Running test iteration 1 [Elapsed time : 241.125ms]
  Running test iteration 2 [Elapsed time : 217.377ms]
  Running test iteration 3 [Elapsed time : 211.232ms]
  Running test iteration 4 [Elapsed time : 230.067ms]
  Running test iteration 5 [Elapsed time : 204.675ms]
  ...
  Running test iteration 29 [Elapsed time : 173.025ms]
  Running test iteration 30 [Elapsed time : 177.916ms]
  Avg. time : 188.969ms
  ```
- Commentary:

  > Does it make sense if things got much slower (if they did ... or faster if they did, too!)? Does it make sense that a given change had a certain impact, but maybe not as severe an impact as a different one had?
  This access pattern is slower than the baseline implementation, but faster than Z -> Y -> X. The linear offset is

  `Offset = (512 * 512) * i + 512 * j + k`

  Since `j` increments in the inner loop, each iteration skips 512 floats (2 KiB) in memory. This is still bad for the caches, because only one float of each 64-byte cache line is used per visit. However, it is still a memory-page-friendly access pattern: the pages are touched sequentially, which makes prefetching possible and reduces the time spent fetching from memory. And since the 2 KiB stride puts two accessed floats in each 4 KiB page, the TLB has a higher hit rate than in Z -> Y -> X.