Multicore and GPU Programming – Matrix Multiplication
Due – 5/21 (Sun) 23:59
Introduction
In this project, you will implement matrix multiplication with GPU Programming.
The specification has the following contents. You just need to use CUDA on it.
1. Environment Setup: About our environment
2. Framework: About structure and usage of the framework
3. Submission: How can you validate the submission before the deadline
4. Criteria: What you have to do
5. Report: How can you report the result
6. Grading: How can we grade the result
Environment Setup
The environment is the same as in the previous homework.
You don’t need to set up anything. We have prepared everything you need to experiment with GPU programming. Check the GNU GCC/G++ 9.4.0, GNU make, and CUDA 11.8 documentation for more information.
Matrix Multiplication Framework
1. Structure of template code
/HW4/matmul$ tree .
.
|-- Makefile
|-- obj
|   `-- driver.o
`-- src
    |-- matmul.h
    `-- matmul_kernel.cu
2. Implement your optimized solution
#include "matmul.h"

using namespace std;

void allocateDeviceMemory(void** M, int size) {
    cudaError_t err = cudaMalloc(M, size);
    assert(err == cudaSuccess);
}

void deallocateDeviceMemory(void* M) {
    cudaError_t err = cudaFree(M);
    assert(err == cudaSuccess);
}

void matmul_ref(const int* const matrixA, const int* const matrixB,
                int* const matrixC, const int n) {
    // You can assume matrixC is initialized with zero
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                matrixC[i * n + j] += matrixA[i * n + k] * matrixB[k * n + j];
}

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    // TODO: Implement your CUDA code
}
As you can see, the given code is almost the same as in HW3. However, this time you are supposed to use the GPU to push the runtime even lower. The target is 0.20 sec (which is easy). Your memcpy calls are included in the measurement; however, the time for allocating device memory is not (because we didn’t include CPU memory allocation either). In case you want to do something about the allocation (you probably don’t), we have provided separate alloc/dealloc functions. In driver.cpp, which is provided as a pre-compiled binary, it looks something like this:
int main() {
    ...
    A = new int[N*N];
    B = new int[N*N];
    C = new int[N*N];
    allocateDeviceMemory(&d_A);
    allocateDeviceMemory(&d_B);
    allocateDeviceMemory(&d_C);
    ...
    // free A, B, C and deallocate device memory
    ...
}
If you want to allocate anything else, do it within matmul_optimized(). Of course, you are not allowed to change or rewrite main(). If you think you need to, please contact us.
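To illustrate how the pieces fit together, here is a minimal host-side sketch of matmul_optimized(). The kernel name matmul_kernel, the tile width, and the grid shape are illustrative assumptions, not part of the template; memcpy time counts toward the measurement, as noted above.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel name; replace with your own __global__ function.
__global__ void matmul_kernel(const int* A, const int* B, int* C, int n);

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    const size_t bytes = (size_t)n * n * sizeof(int);

    // Host-to-device copies: these are part of the measured time.
    cudaMemcpy((void*)d_A, matrixA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_B, matrixB, bytes, cudaMemcpyHostToDevice);

    const int TILE = 32;  // assumed tile width
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, n);

    // Device-to-host copy of the result; cudaMemcpy synchronizes implicitly.
    cudaMemcpy(matrixC, d_C, bytes, cudaMemcpyDeviceToHost);
}
```

Note the casts on d_A and d_B: the template declares them const on the host side, but cudaMemcpy needs a non-const destination pointer.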
3. Compile and run matmul
The ‘matmul’ program takes two arguments:
- inputPath: path to the input file
- outputPath: path to the output file
You can simply run it with the commands “make 256”, “make 512”, … , “make 4096”.
# Compile
mkdir -p cuobj
nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!

# Local Run
/HW4/matmul$ make 4096
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) . . .
Read output file(/HW4_data/output_4096.dat) . . .
Run your solution. . .
matmul_optimal took X.XXX sec
Correct
Submission
There are five submit functions: submit_64, submit_128, submit_1024, submit_2048, and submit_4096. Each number is the size of the matrix.
/HW4/matmul$ make remote
mkdir -p cuobj
/usr/local/cuda/bin/nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat | tee result/eval_output.txt
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) . . .
Read output file(/HW4_data/output_4096.dat) . . .
Run your solution. . .
cpyA size:67108864 no error
cpyAdA:0x7f2a02000000 hA:0x7f2a4a67f010
kernel N=4096 took X.XXX sec
matmul_optimal took X.XXX sec
Correct
We check only the runtime of “make remote” with size 4096x4096.
Criteria
1. Requirements
A. General
i. You must implement matrix multiplication with CUDA. We do not allow any matmul computation on the CPU.
ii. No external libraries. You can only use STL. (No cuBLAS)
iii. Do not override driver.o
B. Matrix Multiplication
i. Problem size: 4096 x 4096
ii. Performance requirement: 0.2 sec for 4096 x 4096 (this time includes memcpy()).
iii. Correctness: should work correctly for any square matrix input between sizes 256 x 256 and 4096 x 4096.
2. Measurements
We measure performance on a real machine, which involves a certain amount of luck. To reduce this effect as much as possible, we will run your program at least five times and use the minimum execution time.
3. Guideline
This assignment is designed to be super-easy, in that most of the code is already provided in the lecture slides. You can directly dive into the final version, but we suggest that you do the following in sequence, such that you will gradually see how each technique improves your performance.
A. The naive version, which directly updates the partial sums to global memory (from slides p.7)
B. Replace the global-memory update with a local variable held in a register (p.10)
C. Blocked matmul with shared memory (p.26). This will already get you to the target performance.
D. Try to further use the remaining shared memory to get the extra credit.
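The shared-memory blocked version (step C) could be sketched roughly as follows. This is not the slide code; the 32x32 tile width and the bounds checks are assumptions, and you should follow the lecture slides for the authoritative version.

```cuda
#define TILE 32  // assumed tile width; must match the thread-block dimensions

__global__ void matmul_tiled(const int* A, const int* B, int* C, int n) {
    __shared__ int As[TILE][TILE];
    __shared__ int Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int acc = 0;  // partial sum kept in a register (step B)

    for (int t = 0; t < (n + TILE - 1) / TILE; t++) {
        // Each thread stages one element of A and one of B into shared memory.
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0;
        __syncthreads();

        // Multiply the two tiles out of fast shared memory.
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }

    if (row < n && col < n)
        C[row * n + col] = acc;  // single write to global memory (step A fixed)
}
```

Each thread block reads its tiles of A and B from global memory once and reuses them TILE times from shared memory, which is where the speedup over the naive version comes from.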
Report
No report. Just code.
Grading
1. You get 50 pts if your program runs legally within 0.2sec.
A. To get the full score, your program must not only run correctly within 0.2 sec for input size 4096x4096, but also run correctly for any square matrix input between sizes 64x64 ~ 4096x4096.
2. 25pts for being a correct program.
A. Compile error or incorrect results: 0pts
B. If the program can be fixed with the Makefile only: 5 pts penalty (20 pts)
3. +25 performance points for being correct and running faster than 0.2sec
A. If your program is correct on all test cases but runs slower than 0.2sec:
i. perf score = max(25 × (1/t − 1/0.25), 0), where t is your runtime in seconds. (For example, t = 0.22 gives max(25 × (1/0.22 − 4), 0) ≈ 13.6 pts; at t = 0.25 or slower the score is 0.)
4. +5 extra points if your program runs faster than 0.14 sec.
2023-07-19