Yiwei Shao
Experiment Machine:
Processor: AMD EPYC 7452 32-Core Processor @ 1.5 GHz
Architecture: x86_64
g++ version: 11.3.0
The outputs of different optimization flags are in directory problem2/opt
The outputs of different BLOCK_SIZE are in directory problem2/block
When the program is optimized with O0, O1, and O2, the block version is always about 2 times faster than the original version, which matches our expectation.
But when switching to the O3 optimization flag, the original version is faster than the block version. More specifically, the running time of the block version does not decrease, while the original version speeds up by about 4X. According to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html, O3 performs more loop optimizations than O2. That is likely the reason: the original version contains only simple loops and is easier for the compiler to optimize.
The running time of both versions increases dramatically after dimension 1024. That might be because the L3 cache of this machine is 16 MB, which can accommodate the working set only up to about that dimension: two 1024x1024 double matrices already occupy 16 MB.
When changing BLOCK_SIZE with optimization flag O3, the running time for the largest matrix decreases until BLOCK_SIZE = 32, so 32 is an optimal value for this matrix multiplication. A sketch of the blocked loop structure is given below.
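For reference, here is a minimal sketch of a blocked (tiled) multiplication with the BLOCK_SIZE parameter discussed above. This is not the exact code in problem2: the function name, row-major layout, and loop order are assumptions.

```cpp
// Minimal sketch of blocked matrix multiplication (not the actual problem2 code).
// A, B, C are N x N row-major double matrices; C is assumed zero-initialized.
#include <algorithm>

#define BLOCK_SIZE 32

void matmul_blocked(const double* A, const double* B, double* C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE)
        for (int kk = 0; kk < N; kk += BLOCK_SIZE)
            for (int jj = 0; jj < N; jj += BLOCK_SIZE)
                // work on one BLOCK_SIZE x BLOCK_SIZE tile while it is hot in cache
                for (int i = ii; i < std::min(ii + BLOCK_SIZE, N); i++)
                    for (int k = kk; k < std::min(kk + BLOCK_SIZE, N); k++) {
                        const double aik = A[i * N + k];
                        for (int j = jj; j < std::min(jj + BLOCK_SIZE, N); j++)
                            C[i * N + j] += aik * B[k * N + j];
                    }
}
```

With BLOCK_SIZE = 32, the three active tiles take roughly 3 * 32 * 32 * 8 bytes = 24 KB, which fits in the lower cache levels; that is one plausible reason larger blocks stop helping.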
The inner product is implemented in inner_prod.cpp.
- compute_fn01(): naive algorithm.
- compute_fn02(): unroll 2, pipelined.
- compute_fn03(): unroll 2, pipelined, with index optimization.
- compute_fn04(): unroll 2, pipelined, with index optimization and disentangled accumulators.
- compute_fn05(): unroll 4, pipelined, with index optimization and disentangled accumulators (see the sketch after this list).
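A minimal sketch of the "unroll 4 + disentangle" style follows. It is not the actual compute_fn05(); the function name and remainder handling are assumptions.

```cpp
// Sketch: unroll by 4 with four independent ("disentangled") accumulators,
// so the floating-point additions do not form one long dependency chain
// and can be pipelined. Not the actual inner_prod.cpp code.
double inner_prod_unroll4(const double* a, const double* b, long n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          // remainder elements
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```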
The outputs of different implementations are in directory Problem4
According to the output files, the solving time increases dramatically once the vector size grows beyond a certain point.
- compute.cpp
multiply-add

Optimization Flag | Running Time (s) | Cycles per Evaluation | GFLOPS |
---|---|---|---|
O3 | 1.497518 | 2.246359 | 1.335493 |
O2 | 1.497929 | 2.246976 | 1.335126 |
O1 | 3.891597 | 5.837482 | 0.513920 |
O0 | 4.248807 | 6.373301 | 0.470714 |

division

Optimization Flag | Running Time (s) | Cycles per Evaluation | GFLOPS |
---|---|---|---|
O3 | 3.903214 | 5.854896 | 0.512392 |
O2 | 3.894030 | 5.841128 | 0.513599 |
O1 | 6.000244 | 9.000456 | 0.333316 |
O0 | 6.313590 | 9.470529 | 0.316772 |

sqrt

Optimization Flag | Running Time (s) | Cycles per Evaluation | GFLOPS |
---|---|---|---|
O3 | 6.004123 | 9.006265 | 0.333101 |
O2 | 6.010781 | 9.016312 | 0.332730 |
O1 | 8.402271 | 12.603510 | 0.238029 |
O0 | 10.192865 | 15.289367 | 0.196215 |

sin

Optimization Flag | Running Time (s) | Cycles per Evaluation | GFLOPS |
---|---|---|---|
O3 | 11.656103 | 17.484359 | 0.171582 |
O2 | 11.652059 | 17.478183 | 0.171643 |
O1 | 13.251104 | 19.876739 | 0.150930 |
O0 | 15.095802 | 22.643796 | 0.132487 |
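For context, the two derived metrics follow directly from the measured time. The sketch below shows the arithmetic; the evaluation count, flops per evaluation, and the use of the nominal 1.5 GHz clock are assumptions and are not taken from compute.cpp.

```cpp
// Sketch of how "Cycles per Evaluation" and "GFLOPS" can be derived from the
// measured running time. All constants below are assumptions for illustration.
#include <cstdio>

int main() {
    const double n_evals   = 1e9;       // assumed number of evaluations
    const double flops_per = 2.0;       // assumed flops counted per evaluation
    const double clock_hz  = 1.5e9;     // nominal 1.5 GHz clock of the EPYC 7452
    const double time_s    = 1.497518;  // measured time (multiply-add at O3)

    printf("cycles per evaluation = %f\n", time_s * clock_hz / n_evals);
    printf("GFLOPS = %f\n", n_evals * flops_per / time_s / 1e9);
    return 0;
}
```

With these assumed constants, the computed values approximately reproduce the O3 multiply-add row of the table.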
- compute-vec.cpp
Output with OpenMP
time = 1.510941
flop-rate = 5.294521 Gflop/s
time = 1.497440
flop-rate = 5.342419 Gflop/s
time = 1.498995
flop-rate = 5.336869 Gflop/s
Output with #pragma unroll
compute-vec.cpp:16:21: optimized: loop vectorized using 32 byte vectors
compute-vec.cpp:16:21: optimized: loop versioned for vectorization because of possible aliasing
compute-vec.cpp:52:21: optimized: loop vectorized using 16 byte vectors
compute-vec.cpp:46:5: optimized: basic block part vectorized using 32 byte vectors
time = 1.499241
flop-rate = 5.335839 Gflop/s
time = 1.498447
flop-rate = 5.338831 Gflop/s
time = 1.498376
flop-rate = 5.339084 Gflop/s
Output with #pragma GCC ivdep
compute-vec.cpp:16:17: optimized: loop vectorized using 32 byte vectors
compute-vec.cpp:52:21: optimized: loop vectorized using 16 byte vectors
compute-vec.cpp:46:5: optimized: basic block part vectorized using 32 byte vectors
time = 1.501182
flop-rate = 5.328966 Gflop/s
time = 1.497545
flop-rate = 5.342031 Gflop/s
time = 1.499832
flop-rate = 5.333898 Gflop/s
#pragma unroll lets the compiler decide whether (and how far) to unroll the loop, while #pragma GCC ivdep tells the compiler to ignore possible loop-carried dependencies. The optimization reports above show that the compiler vectorized the loops in all three builds, and the two pragmas gained a similar improvement for this problem. OpenMP performs a similar optimization using AVX. A minimal sketch of the three variants is given below.
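The sketch applies the three directives to a generic multiply-add loop; it is not the actual compute-vec.cpp, and the function names and loop body are assumptions. The OpenMP variant assumes the omp simd directive and needs -fopenmp (or -fopenmp-simd) to take effect.

```cpp
// Minimal sketch of the three variants compared above (not compute-vec.cpp).
void maxpy_omp(double* c, const double* a, const double* b, long n) {
#pragma omp simd    // OpenMP: request SIMD vectorization of this loop
    for (long i = 0; i < n; i++) c[i] += a[i] * b[i];
}

void maxpy_unroll(double* c, const double* a, const double* b, long n) {
#pragma unroll      // unroll hint as used in the report (GCC's documented form is "#pragma GCC unroll N")
    for (long i = 0; i < n; i++) c[i] += a[i] * b[i];
}

void maxpy_ivdep(double* c, const double* a, const double* b, long n) {
#pragma GCC ivdep   // promise no loop-carried dependences, so the loop can be vectorized despite possible aliasing
    for (long i = 0; i < n; i++) c[i] += a[i] * b[i];
}
```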
- compute-vec-pipe.cpp
Running time (in seconds) for different values of M and the three functions
OpenMP
M | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
fn0 | 1.498873 | 1.513063 | 1.513183 | 2.861379 | 2.949068 | 11.432868 |
fn1 | 1.501146 | 1.505750 | 1.497094 | 2.399404 | 5.425581 | 10.017032 |
fn2 | 1.498717 | 1.500135 | 1.497746 | 2.398358 | 5.448774 | 9.979667 |
#pragma unroll
M | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
fn0 | 1.512504 | 1.497755 | 1.511739 | 1.512873 | 2.868334 | 10.196745 |
fn1 | 1.499327 | 1.496293 | 1.500039 | 1.512967 | 5.201772 | 10.215604 |
fn2 | 1.498232 | 1.496820 | 1.499235 | 1.518770 | 5.221702 | 10.359698 |
#pragma GCC ivdep
M | 1 | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|
fn0 | 1.510875 | 1.499915 | 1.500936 | 1.512461 | 2.852350 | 10.292027 |
fn1 | 1.497899 | 1.499319 | 1.497990 | 1.543855 | 5.195912 | 10.216272 |
fn2 | 1.500909 | 1.499360 | 1.497704 | 1.513994 | 5.303215 | 10.365627 |
As we can see, the running time increases only after M is bigger than 8. That might be because an AVX vector packs 4 doubles and each core has 2 FMA units, so 8 multiply-adds can be performed at once. When the vector size is bigger than 8, more AVX operations are needed per evaluation and the cost is no longer constant. OpenMP takes much more time already when the vector size is 8; that might be because OpenMP cannot fully utilize the 2 FMA units, since it is cross-platform. A sketch of how two independent FMA chains keep both units busy is given below.
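The sketch below illustrates the reasoning: a 256-bit AVX register holds 4 doubles, and with 2 FMA units per core, two independent accumulator chains can keep 8 multiply-adds in flight. The function name and loop layout are assumptions, not the actual compute-vec-pipe.cpp; it requires compiling with -mfma -mavx2 (or -march=native).

```cpp
// Two independent FMA accumulator chains over 4-wide AVX double vectors
// (8 multiply-adds in flight per iteration). Not the actual assignment code.
#include <immintrin.h>

double dot_fma(const double* a, const double* b, long n) {  // n assumed a multiple of 8
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (long i = 0; i < n; i += 8) {
        // two independent chains -> both FMA pipelines can stay busy
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),     _mm256_loadu_pd(b + i),     acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4), _mm256_loadu_pd(b + i + 4), acc1);
    }
    __m256d acc = _mm256_add_pd(acc0, acc1);
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);          // horizontal reduction of the 4 lanes
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```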