Multicore and GPU Programming – Matrix Multiplication
Due – 5/21 (Sun) 23:59
Introduction
In this project, you will implement matrix multiplication with GPU programming.
The specification covers the following topics. All you need to do is write CUDA code for the given framework.
1. Environment Setup: About our environment
2. Framework: About the structure and usage of the framework
3. Submission: How to validate your submission before the deadline
4. Criteria: What you have to do
5. Report: How to report the result
6. Grading: How we will grade the result
Environment Setup
Same as in the previous homework.
You don’t need to set up the environment; we have prepared everything you need to experiment with GPU programming. Check the GNU GCC/G++ 9.4.0, GNU Make, and CUDA 11.8 documentation for more information.
Matrix Multiplication Framework
1. Structure of template code
/HW4/matmul$ tree .
.
|-- Makefile
|-- obj
|   `-- driver.o
`-- src
    |-- matmul.h
    `-- matmul_kernel.cu
2. Implement your optimized solution
...
#include "matmul.h"

using namespace std;

void allocateDeviceMemory(void** M, int size) {
    cudaError_t err = cudaMalloc(M, size);
    assert(err == cudaSuccess);
}

void deallocateDeviceMemory(void* M) {
    cudaError_t err = cudaFree(M);
    assert(err == cudaSuccess);
}

void matmul_ref(const int* const matrixA, const int* const matrixB,
                int* const matrixC, const int n) {
    // You can assume matrixC is initialized with zero
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                matrixC[i * n + j] += matrixA[i * n + k] * matrixB[k * n + j];
}

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    // TODO: Implement your CUDA code
}
As you can see, the given code is almost the same as in HW3. This time, however, you are supposed to use the GPU to push performance even lower. The target is 0.20 sec (which is easy). Your memcpy time must be included in the measurement; the time for allocating device memory, however, is not (because we did not include CPU memory allocation either). In case you want to do something about the allocation (you probably don’t), we have provided separate alloc/dealloc functions. In driver.cpp, which is provided as a pre-compiled binary (obj/driver.o), it looks something like this:
int main() {
    ...
    A = new int[N*N];
    B = new int[N*N];
    C = new int[N*N];
    allocateDeviceMemory(&d_A);
    allocateDeviceMemory(&d_B);
    allocateDeviceMemory(&d_C);
    ...
    free A, B, C;
    ...
}
If you want to allocate anything else, you should do it within matmul_optimized(). Of course, you’re not allowed to change or rewrite main(). If you think you need to, please contact us.
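For orientation, here is a minimal sketch of how matmul_optimized() could be structured. The kernel name matmul_naive, the 16x16 block shape, and the one-thread-per-output-element strategy are illustrative choices of ours, not part of the template. Note that the cudaMemcpy calls sit inside the function, because the measurement includes them.

// Hypothetical naive kernel: one thread computes one element of C.
__global__ void matmul_naive(const int* A, const int* B, int* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        int sum = 0;  // accumulate in a register, write to global memory once
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    size_t bytes = (size_t)n * n * sizeof(int);
    // Host-to-device copies are part of the measured time, so they go here.
    cudaMemcpy((void*)d_A, matrixA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_B, matrixB, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // illustrative block shape
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul_naive<<<grid, block>>>(d_A, d_B, d_C, n);

    // Copy the result back; this is also inside the measured region.
    cudaMemcpy(matrixC, d_C, bytes, cudaMemcpyDeviceToHost);
}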
3. Compile and run matmul
The ‘matmul’ program takes two arguments:
- inputPath: path to the input file
- outputPath: path to the output file
You can simply run it with the commands “make 256”, “make 512”, …, “make 4096”.
# Compile
mkdir -p cuobj
nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!

# Local Run
/HW4/matmul$ make 4096
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
matmul_optimal took X.XXX sec
Correct
Submission
There are five submit functions: submit_64, submit_128, submit_1024, submit_2048, and submit_4096. Each number indicates the size of the matrix.
/HW4/matmul$ make remote
mkdir -p cuobj
/usr/local/cuda/bin/nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat | tee result/eval_output.txt
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
cpyA size:67108864 no error
cpyA dA:0x7f2a02000000 hA:0x7f2a4a67f010
kernel N=4096 took X.XXX sec
matmul_optimal took X.XXX sec
Correct
We check only the runtime of “make remote” with the size of 4096x4096.
Criteria
1. Requirements
A. General
i. You must implement matrix multiplication with CUDA. We do not allow any matmul computation on the CPU.
ii. No external libraries; you may only use the STL. (No cuBLAS.)
iii. Do not overwrite driver.o.
B. Matrix Multiplication
i. Problem size: 4096 x 4096
ii. Performance requirement: 0.2 sec (for 4096 x 4096; this time includes the memcpy() calls)
iii. Correctness: must work correctly for any square matrix input between sizes 256 x 256 and 4096 x 4096
2. Measurements
We measure performance on a real machine, which involves a certain amount of luck. To reduce this effect as much as possible, we will run your program at least five times and use the minimum execution time.
3. Guideline
This assignment is designed to be very easy, in that most of the code is already provided in the lecture slides. You can dive directly into the final version, but we suggest that you do the following steps in sequence, so that you gradually see how each technique improves your performance.
A. The naive version, which directly updates the partial sums in global memory (from slides p. 7)
B. Replace the memory update with a local variable in a register (p. 10)
C. Blocked matmul with shared memory (p. 26). This will already get you to the target performance (see the sketch after this list).
D. Try to further use the remaining shared memory to get the extra credit.
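To make step C concrete, here is a minimal sketch of a blocked matmul with shared memory, under our own assumptions: the names matmul_tiled and TILE and the tile size of 32 are our choices, not the code from the slides. Each thread block stages one TILE x TILE tile of A and one of B in shared memory at a time, so every global-memory value is loaded once per tile rather than once per multiply-add. The edge guards keep it correct even when n is not a multiple of TILE.

#define TILE 32  // 32x32 = 1024 threads, the per-block maximum

__global__ void matmul_tiled(const int* A, const int* B, int* C, int n) {
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int sum = 0;  // register accumulator (step B)

    for (int t = 0; t < (n + TILE - 1) / TILE; t++) {
        // Cooperatively load one tile of A and one tile of B,
        // zero-padding past the matrix edges.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0;
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite tiles other threads still read
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}

A launch such as matmul_tiled<<<dim3((n + TILE - 1) / TILE, (n + TILE - 1) / TILE), dim3(TILE, TILE)>>>(d_A, d_B, d_C, n) would replace the naive launch above. Two int tiles occupy only 8 KB of shared memory per block, which is the headroom step D invites you to exploit further.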
Report
No reports. Just code.
Grading
1. You get 50 pts if your program legally runs within 0.2 sec.
A. To get the full score, your program must not only run correctly within 0.2 sec for input size 4096x4096, but also run correctly for any square matrix input between sizes 64x64 and 4096x4096.
2. 25 pts for being a correct program.
A. Compile error or incorrect results: 0 pts
B. If the program can be fixed with only the Makefile: 5 pts penalty (20 pts)
3. +25 performance points for being correct and running faster than 0.2 sec.
A. If your program is correct on all test cases but runs slower than 0.2 sec:
i. perf score = max(25 × (1/t − 1/0.25), 0)
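For example (our arithmetic, not from the original handout): a correct program finishing at t = 0.22 sec would earn max(25 × (1/0.22 − 1/0.25), 0) ≈ 13.6 performance points, and any t ≥ 0.25 sec earns 0.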
4. +5 extra points if your program runs faster than 0.14 sec.