CS 557 Assignment 3
Compile Environment
- CPU:
11th Gen Intel(R) Core(TM) i5-11500 @ 2.70GHz
- Memory:
16 GB
- Compiler:
icc (ICC) 19.0.5.281 20190815
- OS:
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
- MKL:
/s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl
- Compile commands:
$ make -j
icc ConjugateGradients.cpp Laplacian.cpp main.cpp MatVecMultiply.cpp PointwiseOps.cpp Reductions.cpp Substitutions.cpp Utilities.cpp -Wall -O2 -qopenmp -Wl,--start-group /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_lp64.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_core.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o main
icc ConjugateGradients.cpp Laplacian.cpp main.cpp MatVecMultiply.cpp PointwiseOps.cpp Reductions.cpp Substitutions.cpp Utilities.cpp -Wall -O2 -qopenmp -DDO_NOT_USE_MKL -o main_no_mkl
Runtime Environment
- CPU:
AMD Ryzen 7 7735HS with Radeon Graphics
- Memory:
Configured Memory Speed: 4800 MT/s, dual channel
- OS:
Ubuntu 24.04.2 LTS (Noble Numbat)
- Required Libraries:
libomp.so.5
- Install command:
sudo apt install libomp-dev
- Runtime Environment Selection Explanation: Since the CSL machines could not produce stable timing results, I chose to run the program on my own server. Since I only have machines with AMD CPUs, the improvement from replacing the hand-coded kernel calls with MKL library calls may be less pronounced than it would be on an Intel CPU.
MKL Library Call Replacement for Hand-Coded Kernels
The InnerProduct() Call
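The hand-coded inner-product reduction is replaced with cblas_dsdot, which computes a single-precision dot product accumulated in double precision. The exact kernel signature from the assignment code is not reproduced here; the following is a minimal sketch, assuming a flat float array of length n and the same DO_NOT_USE_MKL macro used for the main_no_mkl build.

```cpp
#ifndef DO_NOT_USE_MKL
#include <mkl.h>
#endif

// Sketch only: flat-array signature assumed for illustration; the real kernel
// operates on the solver's grid-sized arrays.
double InnerProduct(const float *x, const float *y, const int n)
{
#ifdef DO_NOT_USE_MKL
    // Hand-coded path: accumulate the dot product in double precision.
    double result = 0.;
#pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; i++)
        result += (double)x[i] * (double)y[i];
    return result;
#else
    // MKL path: cblas_dsdot performs the same double-precision accumulation.
    return cblas_dsdot(n, x, 1, y, 1);
#endif
}
```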
The Copy() Call
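The hand-coded copy kernel is replaced with cblas_scopy. A minimal sketch under the same assumptions (flat arrays, DO_NOT_USE_MKL switch) is shown below.

```cpp
#ifndef DO_NOT_USE_MKL
#include <mkl.h>
#endif

// Sketch only: flat-array signature assumed for illustration.
void Copy(float *dst, const float *src, const int n)
{
#ifdef DO_NOT_USE_MKL
    // Hand-coded path: element-wise copy.
#pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
#else
    // MKL path: cblas_scopy(n, x, incx, y, incy) copies x into y with unit stride.
    cblas_scopy(n, src, 1, dst, 1);
#endif
}
```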
The Norm() Call
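The hand-coded norm kernel is replaced with cblas_isamax. Since isamax only returns the (0-based) index of the element with the largest magnitude, the maximum-magnitude (infinity) norm is recovered by taking the absolute value of that element. A minimal sketch under the same assumptions:

```cpp
#include <algorithm>
#include <cmath>
#ifndef DO_NOT_USE_MKL
#include <mkl.h>
#endif

// Sketch only: flat-array signature assumed; Norm() is taken to be the
// maximum absolute value (infinity norm) of the residual.
float Norm(const float *x, const int n)
{
#ifdef DO_NOT_USE_MKL
    // Hand-coded path: max-magnitude reduction.
    float result = 0.f;
#pragma omp parallel for reduction(max:result)
    for (int i = 0; i < n; i++)
        result = std::max(result, std::abs(x[i]));
    return result;
#else
    // MKL path: cblas_isamax finds the index of the largest-magnitude element;
    // its absolute value is the infinity norm.
    const auto idx = cblas_isamax(n, x, 1);
    return std::abs(x[idx]);
#endif
}
```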
Correctness of the Replacement
I have compared the Residual Norm before and after the modification, and the Residual Norm is identical, which indicates that the modification is correct.
(The output can be found at ./timing_result.)
Timing Info
In addition to replacing the library kernel calls, I also added a timer to measure the total runtime of each kernel call, as shown in the following table (a sketch of such a timer is given after the table):
Line | Operation | Hand-coded (16T, ms) | MKL (16T, ms) | Speedup (16T) | Hand-coded (1T, ms) | MKL (1T, ms) | Speedup (1T) |
---|---|---|---|---|---|---|---|
2 | ComputeLaplacian(matrix, x, z) | 35.0611 | 35.8844 | 0.9768 | 219.433 | 93.7438 | 2.3390 |
6 | ComputeLaplacian(matrix, p, z) | 840.431 | 790.769 | 1.0629 | 1571.34 | 1418.71 | 1.1076 |
4* | Copy(r, p) [MKL: cblas_scopy] | 5.33071 | 6.50866 | 0.8187 | 31.1384 | 27.0205 | 1.1525 |
13* | Copy(r, z) [MKL: cblas_scopy] | 114.262 | 113.528 | 1.0065 | 243.892 | 146.515 | 1.6640 |
4* | InnerProduct(p, r) [MKL: cblas_dsdot] | 3.41286 | 3.38309 | 1.0089 | 4.58645 | 5.43184 | 0.8447 |
6* | InnerProduct(p, z) [MKL: cblas_dsdot] | 79.7061 | 77.0881 | 1.0341 | 103.526 | 129.673 | 0.7981 |
13* | InnerProduct(z, r) [MKL: cblas_dsdot] | 79.057 | 77.836 | 1.0152 | 105.877 | 126.313 | 0.8389 |
2* | Norm(r) [MKL: cblas_isamax] | 1.80554 | 1.78372 | 1.0122 | 3.02873 | 4.77013 | 0.6350 |
8* | Norm(r) [MKL: cblas_isamax] | 46.1479 | 44.3714 | 1.0400 | 75.2152 | 115.743 | 0.6490 |
2 | Saxpy(z, f, r, -1) | 7.90098 | 7.57096 | 1.0434 | 32.4382 | 30.9701 | 1.0472 |
8 | Saxpy(z, r, -alpha) | 119.908 | 121.781 | 0.9843 | 177.957 | 154.394 | 1.1527 |
9-12 | Saxpy(p, x, alpha) | 5.02616 | 4.90975 | 1.0233 | 7.56614 | 6.53355 | 1.1583 |
16 | Saxpy(p, x, alpha) | 115.961 | 114.024 | 1.0173 | 172.288 | 149.927 | 1.1487 |
16 | Saxpy(p, z, p, beta) | 123.083 | 124.779 | 0.9872 | 182.337 | 179.83 | 1.0139 |
1-18 | Conjugate Gradients Sum | 1577.09 | 1524.22 | 1.0350 | 2930.62 | 2589.57 | 1.1317 |
1-18 | Conjugate Gradients Total | 7551.19 | 7344 | 1.0282 | 7894.79 | 7566.05 | 1.0435 |
NOTE: A line number marked with a * indicates a call that was replaced with an Intel MKL library call in this assignment.
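The per-call totals above were collected by accumulating wall-clock time around each kernel call. A minimal sketch of such an accumulating timer (a hypothetical KernelTimer, not the actual timer code used for these measurements):

```cpp
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical accumulating timer, for illustration only.
struct KernelTimer {
    std::map<std::string, double> totalMs;            // accumulated time per kernel label
    std::chrono::steady_clock::time_point start;

    void Start() { start = std::chrono::steady_clock::now(); }
    void Stop(const std::string& label) {
        const auto end = std::chrono::steady_clock::now();
        totalMs[label] += std::chrono::duration<double, std::milli>(end - start).count();
    }
    void Report() const {
        for (const auto& entry : totalMs)
            std::printf("%-40s %10.4f ms\n", entry.first.c_str(), entry.second);
    }
};

// Usage around a kernel call, e.g.:
//   timer.Start(); rho = InnerProduct(p, r); timer.Stop("InnerProduct(p, r)");
```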
Comment on the runtime
For the 1-thread scenario:
The overall performance improvement is about 13%, which is higher than in the 16-thread scenario. However, among the individual components, InnerProduct and Norm slow down significantly, while the others see speedups of varying degrees.
For the 16-thread scenario:
Except for the Copy operation on line 4, replacing the hand-coded kernel calls with MKL library calls slightly improves the performance of Conjugate Gradients, with a maximum improvement of 4% on line 8.
Since the MKL library is more heavily optimized for Intel CPUs, we can expect a larger performance boost if the runtime environment were a machine with an Intel CPU.