| _assets | Loading last commit info... | |
| task_1 | ||
| task_2 | ||
| task_2_line_16 | ||
| task_2_line_6 | ||
| .clang-format | ||
| .gitignore | ||
| README.md | ||
| report.pdf |
COMP SCI 557 Assignment 2
Environment
- CPU:
AMD Ryzen 7 7735HS with Radeon Graphics - Memory:
Configured Memory Speed: 4800 MT/s, dual channel - Compiler:
g++ (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 - OS:
Ubuntu 24.04.2 LTS (Noble Numbat) - Compile commands:
$ make -j g++ -O3 -g -std=c++20 -Wall -fopenmp -c main.cpp -o main.o g++ -O3 -g -std=c++20 -Wall -fopenmp -c ConjugateGradients.cpp -o ConjugateGradients.o g++ -O3 -g -std=c++20 -Wall -fopenmp -c Laplacian.cpp -o Laplacian.o g++ -O3 -g -std=c++20 -Wall -fopenmp -c PointwiseOps.cpp -o PointwiseOps.o g++ -O3 -g -std=c++20 -Wall -fopenmp -c Reductions.cpp -o Reductions.o g++ -O3 -g -std=c++20 -Wall -fopenmp -c Utilities.cpp -o Utilities.o g++ -O3 -g -std=c++20 -Wall -fopenmp -fno-strict-aliasing -c MergedOp.cpp -o MergedOp.o g++ main.o ConjugateGradients.o Laplacian.o PointwiseOps.o Reductions.o Utilities.o MergedOp.o -o solver -fopenmp
Task 1
Changes Made to Collect Timing Info
- In
ConjugateGradients.cpp, I did the following- Declared lots of
extern Timer - Wrapped each kernel function around a independent
Timer - Disabled the
WriteAsImagefunction to remove image output
- Declared lots of
- In
main.cpp, I did the following- Declared lots of
Timerin global section - For each of the
Timer, their address is pushed to astd::vector<Timer *>for management - Created a
std::vector<std::string>to store information about eachTimerpointer in thestd::vector<Timer *> - Reset the timer before calling the conjugate gradients algorithm
- Wrapped
ConjugateGradientsfunction around a timer to get total runtime - Print per kernel cumulative runtime
- Print per kernel average runtime
- Declared lots of
As can been seen in below diagram, the sum is almost equal to the total time taken, verifying that this step is done correctly.
Timing Info
| Line | Operation | 1-Thread Cumulative (ms) | 1-Thread Per Avg. (ms) | 16-Thread Cumulative (ms) | 16-Thread Avg. (ms) |
|---|---|---|---|---|---|
| 2 | ComputeLaplacian(x, z) | 46.5141 | 0.181696 | 13.634 | 0.053258 |
| 6 | ComputeLaplacian(p, z) | 2349.49 | 9.17771 | 1765.05 | 6.89471 |
| 4 | Copy(r, p) | 37.8532 | 0.147864 | 6.09127 | 0.023794 |
| 13 | Copy(r, z) | 1600.19 | 6.25075 | 1320.03 | 5.15637 |
| 4 | InnerProduct(p, r) | 16.4926 | 0.0644242 | 3.56605 | 0.0139299 |
| 6 | InnerProduct(p, z) | 2883.99 | 11.2656 | 989.452 | 3.86505 |
| 13 | InnerProduct(z, r) | 2854.33 | 11.1497 | 952.401 | 3.72031 |
| 2 | Norm(r) | 6.58686 | 0.0257299 | 2.02624 | 0.00791502 |
| 8 | Norm(r) | 1153.37 | 4.50534 | 516.433 | 2.01731 |
| 2 | Saxpy(z, f, r, -1) | 40.7673 | 0.159247 | 47.1377 | 0.184132 |
| 8 | Saxpy(z, r, r, -alpha) | 1664.2 | 6.50079 | 1717.37 | 6.70848 |
| 9-12 | Saxpy(p, x, x, alpha) | 6.85812 | 0.0267895 | 7.11839 | 0.0278062 |
| 16 | Saxpy(p, x, x, alpha) | 1772.35 | 6.92325 | 1761.47 | 6.88073 |
| 16 | Saxpy(p, r, p, beta) | 1673.02 | 6.53522 | 1724.05 | 6.73458 |
| 1-18 | Conjugate Gradients Sum | 16106.01218 | - | 10825.82965 | - |
| 1-18 | Conjugate Gradients Total | 16112.5 | - | 10833 | - |
Task 2
Implementation of Kernel Function
Merging Line 6

Implementation can be found at ./task_2/MergedOp.cpp at line 16-28
As can be seen in the above and below timing info, for running with a single thread, before merging line 6, it takes
2349.49 + 2883.99 = 5233.48 ms to complete; after merging, it takes 5062.44 ms to complete.
We can see a boost of 3.27%.
For running with all threads (16 threads), before merging line 6, it takes 1765.05 + 989.452 = 2754.502 ms to complete;
after merging, it takes 1895.96 ms to complete. We can see a boost of 36.17%.
Merging Line 16

Implementation can be found at ./task_2/MergedOp.cpp at line 3-14
As can be seen in the above and below timing info, for running with a single thread, before merging line 16, it takes
1772.35 + 1673.02 = 3445.37 ms to complete; after merging, it takes 2724.3 ms to complete.
We can see a boost of 20.93%.
For running with all threads (16 threads), before merging line 16, it takes 1761.47 + 1724.05 = 3485.52 ms to complete;
after merging, it takes 2829.22 to complete. We can see a boost of 18.83%.
How I Linked the File
I included the header file MergedOp.h in main.cpp, and using the Makefile, MergedOp.cpp is
compiled into an object file MergedOp.o. Then, MergedOp.o is linked to the final executable solver
Timing Info with Line 6 and 16 merged
| Line | Operation | 1-Thread Cumulative (ms) | 1-Thread Per Avg. (ms) | 16-Thread Cumulative (ms) | 16-Thread Avg. (ms) |
|---|---|---|---|---|---|
| 2 | ComputeLaplacian(x, z) | 46.8384 | 0.182963 | 14.2677 | 0.0557334 |
| 6 | MergedComputeLaplacianInnerProduct(p, z) | 5062.44 | 19.7752 | 1895.96 | 7.4061 |
| 4 | Copy(r, p) | 36.7851 | 0.143692 | 5.0267 | 0.0196355 |
| 13 | Copy(r, z) | 1644.48 | 6.42376 | 1331.44 | 5.20093 |
| 4 | InnerProduct(p, r) | 16.4903 | 0.0644151 | 3.60943 | 0.0140993 |
| 13 | InnerProduct(z, r) | 2848.12 | 11.1255 | 997.232 | 3.89544 |
| 2 | Norm(r) | 6.72545 | 0.0262713 | 2.77454 | 0.010838 |
| 8 | Norm(r) | 1153.96 | 4.50766 | 618.402 | 2.41563 |
| 2 | Saxpy(z, f, r, -1) | 39.0947 | 0.152714 | 43.7447 | 0.170878 |
| 8 | Saxpy(z, r, r, -alpha) | 1742.07 | 6.80495 | 1930.35 | 7.54042 |
| 9-12 | Saxpy(p, x, x, alpha) | 6.69216 | 0.0261413 | 8.07225 | 0.0315322 |
| 16 | MergedSaxpy(p, x, r, x, alpha, beta) | 2724.3 | 10.6418 | 2829.22 | 11.0516 |
| 1-18 | Conjugate Gradients Total | 15334.5 | - | 9687.6 | - |
Comments
In the implementation process, I ensure the merged version behaves like the original version by comparing their output image file using the following command:
for file in task_1/x.*.pgm; do
fname=$(basename "$file")
cmp "$file" "task_2/$fname"
done
With line 6 and 16 being merged, the performance of the application with single-core and multi-core both have an upgrade of 4.8% and 10.6%,
respectively. Having only line 6 or only line 16 merged also shows a performance boost.
The performance might be able to enhance if OpenMP is used in the Saxpy function, or in line 16. Also, some
call of Saxpy has its pointer aliased, which prevents us from adding -fno-strict-aliasing flags to the files for
more aggressive optimization, even though in the Merged version, I have tried to use -fno-strict-aliasing flag
and do NOT see any performance improvements.
In summary, the performance boost is likely due to less function calls, less memory access and better cache locality. We do NOT need to access the same place twice because they're already in the cache.