COMP SCI 557 Assignment 1

Environment

  • CPU: AMD Ryzen 7 7735HS with Radeon Graphics
  • Memory: Configured Memory Speed: 4800 MT/s, dual channel
  • Compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
  • Compile commands:
    g++ -O2 -fopenmp -c Laplacian.cpp -o Laplacian.o
    g++ -O2 -fopenmp -c main.cpp -o main.o
g++ -O2 -fopenmp -o laplacian_app Laplacian.o main.o
    

Sample run of the benchmark

$ export OMP_NUM_THREADS=1
$ ./laplacian_app
Running test iteration  1 [Elapsed time : 649.355ms]
Running test iteration  2 [Elapsed time : 269.287ms]
Running test iteration  3 [Elapsed time : 194.759ms]
Running test iteration  4 [Elapsed time : 171.439ms]
Running test iteration  5 [Elapsed time : 172.44ms]
...
Running test iteration 29 [Elapsed time : 171.601ms]
Running test iteration 30 [Elapsed time : 172.106ms]
Avg. time : 175.922ms
$ export OMP_NUM_THREADS=16
$ ./laplacian_app
Running test iteration  1 [Elapsed time : 85.1133ms]
Running test iteration  2 [Elapsed time : 36.7016ms]
Running test iteration  3 [Elapsed time : 35.8156ms]
Running test iteration  4 [Elapsed time : 36.3234ms]
Running test iteration  5 [Elapsed time : 35.7808ms]
...
Running test iteration 29 [Elapsed time : 37.4332ms]
Running test iteration 30 [Elapsed time : 33.5464ms]
Avg. time : 35.7082ms

Going from 1 thread to all 16 hardware threads reduces the average runtime from 175.9 ms to 35.7 ms, a speedup of roughly 4.9x, so the kernel clearly benefits from using all available cores/threads.
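
For context, here is a minimal sketch of what the timed kernel in Laplacian.cpp plausibly looks like: a 7-point Laplacian stencil over a 512^3 float grid, parallelized with OpenMP. The function name, array shapes, and stencil coefficients are illustrative assumptions, not copied from the source.

#include <omp.h>  // compiled with -fopenmp, as in the commands above

constexpr int XDIM = 512, YDIM = 512, ZDIM = 512;

// Baseline loop order X -> Y -> Z: k (Z) is the innermost index,
// which is contiguous in memory (unit stride).
void ComputeLaplacian(const float (&u)[XDIM][YDIM][ZDIM],
                      float (&Lu)[XDIM][YDIM][ZDIM])
{
#pragma omp parallel for
    for (int i = 1; i < XDIM - 1; i++)
        for (int j = 1; j < YDIM - 1; j++)
            for (int k = 1; k < ZDIM - 1; k++)
                Lu[i][j][k] = -6.f * u[i][j][k]
                    + u[i+1][j][k] + u[i-1][j][k]
                    + u[i][j+1][k] + u[i][j-1][k]
                    + u[i][j][k+1] + u[i][j][k-1];
}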

Performance & Bandwidth

Theoretical (Peak) Memory Bandwidth

4800 MT/s * 8 bytes per transfer (64-bit channel width) * 2 channels = 76.8 GB/s

Effective Bandwidth

  • At a minimum, we need to read each entry u[i][j][k] and write each entry Lu[i][j][k]
  • Each entry is a float (4 bytes), and there are N = 512 * 512 * 512 = 134,217,728 elements
  • This gives 2 * (4 * N) = 1,073,741,824 bytes of traffic, i.e. 1 GiB
  • With -O2, the benchmark finished with an average time of 0.0357 s
  • This gives an effective bandwidth of 1 GiB / 0.0357 s ≈ 28.01 GiB/s

  • This is slightly lower than the bandwidth observed in LaplacianStencil_0_0, which is 2 GiB / 0.0663 s ≈ 30.17 GiB/s
  • According to the STREAM benchmark, the practical bandwidth should be around 25-28 GB/s, which is very close to the calculated value:
Stream   Best Rate MB/s   Best Rate GB/s
Copy            40700.2       39.7462891
Scale           25994.3       25.3850586
Add             28723.9       28.0506836
Triad           28736.0       28.0625
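
The bandwidth arithmetic above, spelled out as a standalone sketch (the average time is hard-coded from the benchmark output; nothing is measured here):

#include <cstdio>

int main()
{
    const double N     = 512.0 * 512.0 * 512.0;  // grid points
    const double bytes = 2.0 * 4.0 * N;          // read u + write Lu, 4-byte floats
    const double t_avg = 0.0357;                 // measured average time [s]

    std::printf("traffic  : %.0f bytes = %.0f GiB\n", bytes, bytes / (1 << 30));
    std::printf("effective: %.4f GiB/s\n", bytes / t_avg / (1 << 30));
    std::printf("peak     : %.1f GB/s\n", 4800e6 * 8 * 2 / 1e9);  // DDR5-4800, 2 channels
}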

Swapping order of the for loops

Z -> Y -> X

  • Runtime:

    $ export OMP_NUM_THREADS=16
    $ ./laplacian_app
    Running test iteration  1 [Elapsed time : 946.327ms]
    Running test iteration  2 [Elapsed time : 493.813ms]
    Running test iteration  3 [Elapsed time : 502.126ms]
    Running test iteration  4 [Elapsed time : 488.225ms]
    Running test iteration  5 [Elapsed time : 485.701ms]
    ...
    Running test iteration 29 [Elapsed time : 497.446ms]
    Running test iteration 30 [Elapsed time : 490.027ms]
    Avg. time : 493.369ms
    
  • Commentary

    This access pattern is the slowest of all six loop orderings, which is expected.

    Offset = (512 * 512) * i + 512 * j + k

    Since i increments in the inner loop, each step jumps 512 * 512 = 262,144 floats (1 MiB) through memory. Locality is as bad as it gets: every access touches a different cache line and a different 4 KiB page, so the cache is effectively useless, and hardware prefetchers (which typically do not cross page boundaries) cannot hide the memory latency. A sketch of this nest follows.
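
    In sketch form (same assumed array names and stencil as the kernel sketch above, not the actual Laplacian.cpp code):

    #pragma omp parallel for
    for (int k = 1; k < ZDIM - 1; k++)          // Z outer
        for (int j = 1; j < YDIM - 1; j++)      // Y middle
            for (int i = 1; i < XDIM - 1; i++)  // X inner: 1 MiB jump per step
                Lu[i][j][k] = -6.f * u[i][j][k]
                    + u[i+1][j][k] + u[i-1][j][k]
                    + u[i][j+1][k] + u[i][j-1][k]
                    + u[i][j][k+1] + u[i][j][k-1];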

X -> Z -> Y

  • Runtime:

    $ export OMP_NUM_THREADS=16
    $ ./laplacian_app
    Running test iteration  1 [Elapsed time : 241.125ms]
    Running test iteration  2 [Elapsed time : 217.377ms]
    Running test iteration  3 [Elapsed time : 211.232ms]
    Running test iteration  4 [Elapsed time : 230.067ms]
    Running test iteration  5 [Elapsed time : 204.675ms]
    ...
    Running test iteration 29 [Elapsed time : 173.025ms]
    Running test iteration 30 [Elapsed time : 177.916ms]
    Avg. time : 188.969ms
    
  • Commentary

    This access pattern is slower than the baseline implementation, but faster than Z -> Y -> X.

    Offset = (512 * 512) * i + 512 * j + k

    Since j increments in the inner loop, each step jumps 512 floats (2 KiB) through memory. This is still bad for the cache, because only a small part of each 64-byte cache line is used before it is evicted. However, the pattern remains page-friendly: pages are visited in order, so prefetching can hide some of the memory latency, and because the 2 KiB stride puts two consecutive accesses into every 4 KiB page, the TLB hit rate is higher than in Z -> Y -> X. A sketch of this nest follows.
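
    In sketch form (same assumed names as above):

    #pragma omp parallel for
    for (int i = 1; i < XDIM - 1; i++)          // X outer
        for (int k = 1; k < ZDIM - 1; k++)      // Z middle
            for (int j = 1; j < YDIM - 1; j++)  // Y inner: 2 KiB jump per step
                Lu[i][j][k] = -6.f * u[i][j][k]
                    + u[i+1][j][k] + u[i-1][j][k]
                    + u[i][j+1][k] + u[i][j-1][k]
                    + u[i][j][k+1] + u[i][j][k-1];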
