
CS 557 Assignment 3

Compile Environment

  • CPU: 11th Gen Intel(R) Core(TM) i5-11500 @ 2.70GHz

  • Memory: 16 GB

  • Compiler: icc (ICC) 19.0.5.281 20190815

  • OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)

  • MKL: /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl

  • Compile commands:

    $ make -j
    icc ConjugateGradients.cpp Laplacian.cpp main.cpp MatVecMultiply.cpp PointwiseOps.cpp Reductions.cpp Substitutions.cpp Utilities.cpp -Wall -O2 -qopenmp -Wl,--start-group /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_lp64.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_core.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o main
    icc ConjugateGradients.cpp Laplacian.cpp main.cpp MatVecMultiply.cpp PointwiseOps.cpp Reductions.cpp Substitutions.cpp Utilities.cpp -Wall -O2 -qopenmp -DDO_NOT_USE_MKL -o main_no_mkl
    

Runtime Environment

  • CPU: AMD Ryzen 7 7735HS with Radeon Graphics
  • Memory: Configured Memory Speed: 4800 MT/s, dual channel
  • OS: Ubuntu 24.04.2 LTS (Noble Numbat)
  • Required Libraries: libomp.so.5
    • Install command: sudo apt install libomp-dev
  • Runtime Environment Selection Explanation: Since the CSL machines cannot produce steady timing results, I chose to run the program on my own server. Because I only have machines with AMD CPUs, the improvement from replacing hand-coded kernel calls with MKL library calls may be less pronounced than on an Intel CPU.

MKL Library Call Replacement for Hand-Coded Kernel

The InnerProduct() Call

(Figure: the InnerProduct() call and its cblas_dsdot replacement)

The Copy() Call

(Figure: the Copy() call and its cblas_scopy replacement)

The Norm() Call

(Figure: the Norm() call and its cblas_isamax replacement)

Correctness of the Replacement

I compared the residual norm before and after the modification, and it is identical, which indicates that the replacement is correct.

(The output can be found at ./timing_result)

Timing Info

In addition to replacing the hand-coded kernels with library calls, I also added a timer to measure the total runtime of each kernel call, as shown in the following table:

| Line | Operation | Hardcoded (16T, ms) | MKL (16T, ms) | Speedup (16T) | Hardcoded (1T, ms) | MKL (1T, ms) | Speedup (1T) |
|------|-----------|---------------------|---------------|---------------|--------------------|--------------|--------------|
| 2    | ComputeLaplacian(matrix, x, z) | 35.0611 | 35.8844 | 0.9768 | 219.433 | 93.7438 | 2.3390 |
| 6    | ComputeLaplacian(matrix, p, z) | 840.431 | 790.769 | 1.0629 | 1571.34 | 1418.71 | 1.1076 |
| 4*   | Copy(r, p) [MKL: cblas_scopy] | 5.33071 | 6.50866 | 0.8187 | 31.1384 | 27.0205 | 1.1525 |
| 13*  | Copy(r, z) [MKL: cblas_scopy] | 114.262 | 113.528 | 1.0065 | 243.892 | 146.515 | 1.6640 |
| 4*   | InnerProduct(p, r) [MKL: cblas_dsdot] | 3.41286 | 3.38309 | 1.0089 | 4.58645 | 5.43184 | 0.8447 |
| 6*   | InnerProduct(p, z) [MKL: cblas_dsdot] | 79.7061 | 77.0881 | 1.0341 | 103.526 | 129.673 | 0.7981 |
| 13*  | InnerProduct(z, r) [MKL: cblas_dsdot] | 79.057 | 77.836 | 1.0152 | 105.877 | 126.313 | 0.8389 |
| 2*   | Norm(r) [MKL: cblas_isamax] | 1.80554 | 1.78372 | 1.0122 | 3.02873 | 4.77013 | 0.6350 |
| 8*   | Norm(r) [MKL: cblas_isamax] | 46.1479 | 44.3714 | 1.0400 | 75.2152 | 115.743 | 0.6490 |
| 2    | Saxpy(z, f, r, -1) | 7.90098 | 7.57096 | 1.0434 | 32.4382 | 30.9701 | 1.0472 |
| 8    | Saxpy(z, r, -alpha) | 119.908 | 121.781 | 0.9843 | 177.957 | 154.394 | 1.1527 |
| 9-12 | Saxpy(p, x, alpha) | 5.02616 | 4.90975 | 1.0233 | 7.56614 | 6.53355 | 1.1583 |
| 16   | Saxpy(p, x, alpha) | 115.961 | 114.024 | 1.0173 | 172.288 | 149.927 | 1.1487 |
| 16   | Saxpy(p, z, p, beta) | 123.083 | 124.779 | 0.9872 | 182.337 | 179.83 | 1.0139 |
| 1-18 | Conjugate Gradients Sum | 1577.09 | 1524.22 | 1.0350 | 2930.62 | 2589.57 | 1.1317 |
| 1-18 | Conjugate Gradients Total | 7551.19 | 7344 | 1.0282 | 7894.79 | 7566.05 | 1.0435 |

NOTE: A line number marked with * indicates a line whose kernel call was replaced with the Intel MKL library in this assignment.

Comments on the Runtime

For the 1-thread scenario: the overall performance improvement is about 13%, which is higher than in the 16-thread scenario. However, looking at the individual components, InnerProduct and Norm slow down significantly, while the others see boosts of varying degrees.

For the 16-thread scenario: except for the Copy operation on line 4, replacing the hand-coded kernel calls with MKL library calls slightly improves the performance of Conjugate Gradients, with a maximum improvement of about 4% on line 8. Since the MKL library is more heavily optimized for Intel CPUs, we can expect a larger performance boost if the runtime environment were a machine with an Intel CPU.
