# COMP SCI 557 Homework 5
## Compile Environment

- CPU: 11th Gen Intel(R) Core(TM) i5-11500 @ 2.70GHz
- Memory: 16 GB
- Compiler: icc (ICC) 19.0.5.281 20190815
- OS: Ubuntu 22.04.5 LTS (Jammy Jellyfish)
- MKL: /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl
- Compile commands (see `Makefile` for details):

```sh
$ make -j
icc Laplacian.cpp main.cpp DirectSolver.cpp Utilities.cpp -Wall -O2 -qopenmp -Wl,--start-group /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_lp64.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_core.a /s/intelcompilers-2019/amd64_rhel6/compilers_and_libraries_2019.5.281/linux/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -liomp5 -lpthread -lm -ldl -o main
```
## Runtime Environment

- CPU: AMD Ryzen 7 5800H with Radeon Graphics
- CPU Cache:
  - L1d: 256 KiB (8 instances)
  - L1i: 256 KiB (8 instances)
  - L2: 4 MiB (8 instances)
  - L3: 16 MiB (1 instance)
- Memory: Configured Memory Speed: 3200 MT/s, dual channel
- OS: Ubuntu 24.04.2 LTS (Noble Numbat)
- Required Libraries: libomp.so.5
  - Install command: `sudo apt install libomp-dev`
- Runtime Environment Selection Explanation: Since the CSL machines could not produce steady timing results, I chose to run the program on my own server. Since the only machines I have use AMD CPUs, the improvement from replacing the hand-coded kernel call with an MKL library call might be less pronounced than it would be on Intel hardware.
## Task A
What was the resulting sparsity of the L factor (PARDISO reports it as the "number of non-zeros in L")?

- Default (Nested Dissection): 92,983,874 non-zeros
- Minimum Degree: 187,686,049 non-zeros
- No Permutation: 901,928,763 non-zeros

How many operations are needed for the factorization itself (PARDISO reports it as the "gflops for the numerical factorization")?

- Default (Nested Dissection): 241.855850 GFLOP
- Minimum Degree: 981.144287 GFLOP
- No Permutation: 3449.176270 GFLOP

What was the bandwidth achieved during factorization (PARDISO reports it as the "gflops/s for the numerical factorization")?

- Default (Nested Dissection): 165.799622 GFLOP/s
- Minimum Degree: 187.779114 GFLOP/s
- No Permutation: 11.031159 GFLOP/s

What was the run-time (in seconds) for the solve step (forward/backward substitution)? PARDISO reports this as "Time spent in direct solver at solve step (solve)" in Phase 3.

- Default (Nested Dissection): 0.021083 seconds
- Minimum Degree: 0.059186 seconds
- No Permutation: 0.830340 seconds
| Matrix (64 × 64 × 64) | Sparsity of the L Factor (non-zeros) | Operations for Factorization (GFLOP) | Bandwidth during Factorization (GFLOP/s) | Runtime for the Solve Step (s) |
|---|---|---|---|---|
| Default (Nested Dissection) | 92,983,874 | 241.855850 | 165.799622 | 0.021083 |
| Minimum Degree Algorithm | 187,686,049 | 981.144287 | 187.779114 | 0.059186 |
| No Permutation | 901,928,763 | 3449.176270 | 11.031159 | 0.830340 |
## Commentary on Task A
As we can see from the table, nested dissection partitions the domain to keep each pivot block as localized as possible, so it dramatically reduces fill.
Minimum degree tries to pick pivots that minimally increase the degree (connectivity) of the remaining matrix, reducing fill over naive methods.
No permutation leaves the domain points in a naive order, so any pivot can “connect” distant rows/columns during factorization, causing large amounts of fill.
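For reference, the sketch below shows one way these three orderings could be selected through MKL PARDISO's `iparm` array. This is an illustrative fragment based on my reading of the MKL documentation, not the actual homework code; the helper name and `mode` flag are hypothetical.

```cpp
#include <mkl_pardiso.h>
#include <mkl_types.h>
#include <vector>

// Hypothetical helper: choose the fill-reducing ordering before the
// analysis phase (phase 11). iparm is 0-indexed here (C convention),
// so iparm[1] corresponds to iparm(2) in the Fortran documentation.
void selectOrdering(MKL_INT *iparm, std::vector<MKL_INT> &perm,
                    MKL_INT n, int mode)
{
    if (mode == 0)          // default: nested dissection (METIS)
        iparm[1] = 2;
    else if (mode == 1)     // minimum degree algorithm
        iparm[1] = 0;
    else {                  // "no permutation": pass the identity ordering
        iparm[4] = 1;       // iparm(5) = 1: use the user-supplied perm array
        perm.resize(n);
        for (MKL_INT i = 0; i < n; ++i)
            perm[i] = i + 1;  // 1-based indexing when iparm[34] == 0
    }
    // Setting msglvl = 1 in the pardiso() call prints the statistics quoted
    // above ("number of non-zeros in L", "gflops for the numerical
    // factorization", and so on).
}
```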
## Task B
| | Runtime (s) |
|---|---|
| k = 1 | 0.020741 |
| k = 10 | 0.109536 |
| k = 100 | 0.463952 |
## Commentary on Task B
In this example, the cost is NOT increasing sub-linearly; this is likely due to the overhead of starting the timer. However, if we increase k to 1000, we can easily see that the growth in cost is indeed sub-linear.
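For context, below is a minimal sketch of the kind of timing harness this measurement suggests, assuming Task B repeats PARDISO's solve phase (phase 33) k times against a single factorization. The function and variable names are illustrative, not the actual homework code; the setup from the analysis (phase 11) and factorization (phase 22) calls is assumed to already exist.

```cpp
#include <mkl_pardiso.h>
#include <mkl_types.h>
#include <chrono>
#include <cstdio>

// Hypothetical fragment: time k back-substitution solves that reuse one
// factorization. pt, iparm, mtype, n, a, ia, ja, perm, b, x are assumed
// to have been filled in by the earlier analysis/factorization phases.
void timeSolves(void **pt, MKL_INT *iparm, MKL_INT mtype, MKL_INT n,
                double *a, MKL_INT *ia, MKL_INT *ja, MKL_INT *perm,
                double *b, double *x, int k)
{
    MKL_INT maxfct = 1, mnum = 1, phase = 33, nrhs = 1, msglvl = 0, error = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < k; ++i)
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
                a, ia, ja, perm, &nrhs, iparm, &msglvl, b, x, &error);
    auto stop = std::chrono::steady_clock::now();

    std::printf("k = %d : %f s\n", k,
                std::chrono::duration<double>(stop - start).count());
}
```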
## Verifying Correctness

The correctness of my implementation is verified by checking that the output image `./x.0000.pgm` is identical to the one produced by the original version from the course repository.
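For example, a byte-wise comparison of the two images can be done with `cmp`; the `reference/` directory here is hypothetical, so substitute wherever the unmodified build writes its output:

```sh
$ cmp x.0000.pgm reference/x.0000.pgm && echo "outputs match"
```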