README.md

COMP SCI 557 Assignment 2

Environment

  • CPU: AMD Ryzen 7 7735HS with Radeon Graphics
  • Memory: Configured Memory Speed: 4800 MT/s, dual channel
  • Compiler: g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
  • OS: Ubuntu 24.04.2 LTS (Noble Numbat)
  • Compile commands:
    $ make -j
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c main.cpp -o main.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c ConjugateGradients.cpp -o ConjugateGradients.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c Laplacian.cpp -o Laplacian.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c PointwiseOps.cpp -o PointwiseOps.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c Reductions.cpp -o Reductions.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -c Utilities.cpp -o Utilities.o
    g++ -O3 -g -std=c++20 -Wall -fopenmp -fno-strict-aliasing -c MergedOp.cpp -o MergedOp.o
    g++ main.o ConjugateGradients.o Laplacian.o PointwiseOps.o Reductions.o Utilities.o MergedOp.o -o solver -fopenmp
    

Task 1

Changes Made to Collect Timing Info

  1. In ConjugateGradients.cpp, I did the following
    • Declared the Timer objects used by the kernels as extern
    • Wrapped each kernel call in its own independent Timer
    • Disabled the WriteAsImage function to remove image output
  2. In main.cpp, I did the following
    • Declared the Timer objects in the global scope
    • Pushed the address of each Timer into a std::vector<Timer *> for centralized management
    • Created a parallel std::vector<std::string> that stores a label for each Timer pointer in the std::vector<Timer *>
    • Reset all timers before calling the conjugate gradients algorithm
    • Wrapped the ConjugateGradients call in a timer to measure total runtime
    • Printed the per-kernel cumulative runtime
    • Printed the per-kernel average runtime

As can be seen in the table below, the per-kernel sum is almost equal to the total runtime, verifying that the instrumentation is done correctly.

Timing Info

| Line | Operation | 1-Thread Cumulative (ms) | 1-Thread Avg. (ms) | 16-Thread Cumulative (ms) | 16-Thread Avg. (ms) |
|------|-----------|--------------------------|--------------------|---------------------------|---------------------|
| 2 | ComputeLaplacian(x, z) | 46.5141 | 0.181696 | 13.634 | 0.053258 |
| 6 | ComputeLaplacian(p, z) | 2349.49 | 9.17771 | 1765.05 | 6.89471 |
| 4 | Copy(r, p) | 37.8532 | 0.147864 | 6.09127 | 0.023794 |
| 13 | Copy(r, z) | 1600.19 | 6.25075 | 1320.03 | 5.15637 |
| 4 | InnerProduct(p, r) | 16.4926 | 0.0644242 | 3.56605 | 0.0139299 |
| 6 | InnerProduct(p, z) | 2883.99 | 11.2656 | 989.452 | 3.86505 |
| 13 | InnerProduct(z, r) | 2854.33 | 11.1497 | 952.401 | 3.72031 |
| 2 | Norm(r) | 6.58686 | 0.0257299 | 2.02624 | 0.00791502 |
| 8 | Norm(r) | 1153.37 | 4.50534 | 516.433 | 2.01731 |
| 2 | Saxpy(z, f, r, -1) | 40.7673 | 0.159247 | 47.1377 | 0.184132 |
| 8 | Saxpy(z, r, r, -alpha) | 1664.2 | 6.50079 | 1717.37 | 6.70848 |
| 9-12 | Saxpy(p, x, x, alpha) | 6.85812 | 0.0267895 | 7.11839 | 0.0278062 |
| 16 | Saxpy(p, x, x, alpha) | 1772.35 | 6.92325 | 1761.47 | 6.88073 |
| 16 | Saxpy(p, r, p, beta) | 1673.02 | 6.53522 | 1724.05 | 6.73458 |
| 1-18 | Conjugate Gradients Sum | 16106.01218 | - | 10825.82965 | - |
| 1-18 | Conjugate Gradients Total | 16112.5 | - | 10833 | - |

Task 2

Implementation of Kernel Function

Merging Line 6

MergedComputeLaplacianInnerProduct

Implementation can be found at ./task_2/MergedOp.cpp at line 16-28

As can be seen in the timing info above and below, with a single thread, before merging line 6 it takes 2349.49 + 2883.99 = 5233.48 ms to complete; after merging, it takes 5062.44 ms. This is a 3.27% reduction in runtime.

With all 16 threads, before merging line 6 it takes 1765.05 + 989.452 = 2754.50 ms to complete; after merging, it takes 1895.96 ms. This is a 31.17% reduction in runtime.

Merging Line 16

MergedSaxpy

Implementation can be found at ./task_2/MergedOp.cpp at line 3-14

As can be seen in the timing info above and below, with a single thread, before merging line 16 it takes 1772.35 + 1673.02 = 3445.37 ms to complete; after merging, it takes 2724.3 ms. This is a 20.93% reduction in runtime.

With all 16 threads, before merging line 16 it takes 1761.47 + 1724.05 = 3485.52 ms to complete; after merging, it takes 2829.22 ms. This is an 18.83% reduction in runtime.

How I Linked the File

I included the header file MergedOp.h in main.cpp. In the Makefile, MergedOp.cpp is compiled into an object file MergedOp.o, which is then linked into the final executable solver.
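The relevant Makefile rules might look like the sketch below, reconstructed from the compile commands listed in the Environment section. The variable names and rule layout are assumptions; only the compiler flags are taken from the actual build log.

```make
CXX      := g++
CXXFLAGS := -O3 -g -std=c++20 -Wall -fopenmp
OBJS     := main.o ConjugateGradients.o Laplacian.o PointwiseOps.o \
            Reductions.o Utilities.o MergedOp.o

solver: $(OBJS)
	$(CXX) $(OBJS) -o solver -fopenmp

# MergedOp.cpp gets the extra -fno-strict-aliasing flag seen in the build log.
MergedOp.o: MergedOp.cpp MergedOp.h
	$(CXX) $(CXXFLAGS) -fno-strict-aliasing -c $< -o $@

%.o: %.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@
```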

Timing Info with Line 6 and 16 merged

| Line | Operation | 1-Thread Cumulative (ms) | 1-Thread Avg. (ms) | 16-Thread Cumulative (ms) | 16-Thread Avg. (ms) |
|------|-----------|--------------------------|--------------------|---------------------------|---------------------|
| 2 | ComputeLaplacian(x, z) | 46.8384 | 0.182963 | 14.2677 | 0.0557334 |
| 6 | MergedComputeLaplacianInnerProduct(p, z) | 5062.44 | 19.7752 | 1895.96 | 7.4061 |
| 4 | Copy(r, p) | 36.7851 | 0.143692 | 5.0267 | 0.0196355 |
| 13 | Copy(r, z) | 1644.48 | 6.42376 | 1331.44 | 5.20093 |
| 4 | InnerProduct(p, r) | 16.4903 | 0.0644151 | 3.60943 | 0.0140993 |
| 13 | InnerProduct(z, r) | 2848.12 | 11.1255 | 997.232 | 3.89544 |
| 2 | Norm(r) | 6.72545 | 0.0262713 | 2.77454 | 0.010838 |
| 8 | Norm(r) | 1153.96 | 4.50766 | 618.402 | 2.41563 |
| 2 | Saxpy(z, f, r, -1) | 39.0947 | 0.152714 | 43.7447 | 0.170878 |
| 8 | Saxpy(z, r, r, -alpha) | 1742.07 | 6.80495 | 1930.35 | 7.54042 |
| 9-12 | Saxpy(p, x, x, alpha) | 6.69216 | 0.0261413 | 8.07225 | 0.0315322 |
| 16 | MergedSaxpy(p, x, r, x, alpha, beta) | 2724.3 | 10.6418 | 2829.22 | 11.0516 |
| 1-18 | Conjugate Gradients Total | 15334.5 | - | 9687.6 | - |

Comments

During implementation, I ensured that the merged version behaves exactly like the original version by comparing their output image files with the following command:

for file in task_1/x.*.pgm; do
    fname=$(basename "$file")
    cmp "$file" "task_2/$fname"
done

With lines 6 and 16 both merged, the single-threaded and 16-threaded runtimes of the application improve by 4.8% and 10.6%, respectively. Merging only line 6 or only line 16 also shows a performance boost.

Performance might improve further if OpenMP were used inside the Saxpy function on line 16. Also, some calls to Saxpy pass aliased pointers, which limits how aggressively the compiler can optimize them; I did build the merged version with the -fno-strict-aliasing flag and did NOT observe any performance difference.

In summary, the performance boost is most likely due to fewer function calls, fewer memory accesses, and better cache locality: a merged kernel does NOT need to stream the same data through memory twice, because it is already in the cache.
