
Multicore and GPU Programming – Matrix Multiplication

Due – 5/21 (Sun) 23:59

Introduction

In this project, you will implement matrix multiplication with GPU programming.

The specification has the following contents. You only need to use CUDA for this assignment.

1.     Environment Setup: About our environment

2.     Framework: Structure and usage of the framework

3.     Submission: How to validate your submission before the deadline

4.     Criteria: What you have to do

5.     Report: How to report your results

6.     Grading: How we grade the result

Environment Setup

Same as in the previous homework.

You don't need to set up the environment. We have prepared everything you need to experiment with GPU programming. Check the GNU GCC/G++ 9.4.0, GNU make, and CUDA 11.8 documentation for more information.
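For example, you can confirm the installed toolchain on the provided machine with "nvcc --version" and "g++ --version".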

Matrix Multiplication Framework

1.     Structure of template code

/HW4/matmul$ tree .
.
|-- Makefile
|-- obj
|   `-- driver.o
`-- src
    |-- matmul.h
    `-- matmul_kernel.cu

2.     Implement your optimized solution


#include "matmul.h"

using namespace std;

void allocateDeviceMemory(void** M, int size)
{
    cudaError_t err = cudaMalloc(M, size);
    assert(err == cudaSuccess);
}

void deallocateDeviceMemory(void* M)
{
    cudaError_t err = cudaFree(M);
    assert(err == cudaSuccess);
}

void matmul_ref(const int* const matrixA, const int* const matrixB,
                int* const matrixC, const int n) {
    // You can assume matrixC is initialized with zero
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                matrixC[i * n + j] += matrixA[i * n + k] * matrixB[k * n + j];
}

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    // TODO: Implement your CUDA code
}


As you see, the given code is almost the same as in HW3. However, you are supposed to use the GPU to push the execution time even lower. The target is 0.20 sec (which is easy). Your memcpy calls are included in the measurement. However, the time for allocating device memory is not included (because we did not include CPU memory allocation either). In case you want to do something about the allocation (you probably don't), we have provided separate alloc/dealloc functions. In driver.cpp, which is provided as a pre-compiled binary, it would look something like this:

int main()
{
    int* A = new int[N * N];
    int* B = new int[N * N];
    int* C = new int[N * N];

    int *d_A, *d_B, *d_C;
    allocateDeviceMemory((void**)&d_A, N * N * sizeof(int));
    allocateDeviceMemory((void**)&d_B, N * N * sizeof(int));
    allocateDeviceMemory((void**)&d_C, N * N * sizeof(int));

    // ... read input, run and time matmul_optimized(), check the result ...

    deallocateDeviceMemory(d_A);
    deallocateDeviceMemory(d_B);
    deallocateDeviceMemory(d_C);

    delete[] A;
    delete[] B;
    delete[] C;
}

If you want to allocate anything else, you should do it within matmul_optimized(). Of course, you are not allowed to change or rewrite main(). If you think you need to, please contact us.
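For reference, here is a minimal sketch of what the body of matmul_optimized() might look like. It assumes a hypothetical __global__ kernel named matmul_kernel defined in the same file; the launch shape is an illustrative choice, not a requirement:

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    size_t bytes = (size_t)n * n * sizeof(int);

    // Host-to-device copies: these are part of the measured time.
    // The casts drop const; d_A and d_B point to writable device memory.
    cudaMemcpy((void*)d_A, matrixA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_B, matrixB, bytes, cudaMemcpyHostToDevice);

    // Illustrative launch: one thread per output element, 32x32 blocks.
    dim3 block(32, 32);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, n);

    // Device-to-host copy of the result; cudaMemcpy also waits for the
    // kernel to finish, so no explicit synchronization is needed here.
    cudaMemcpy(matrixC, d_C, bytes, cudaMemcpyDeviceToHost);
}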

3.     Compile and run matmul

The matmul program takes two arguments:

-       inputPath: path to the input file

-       outputPath: path to the output file

You can simply run it with the commands "make 256", "make 512", ..., "make 4096".

# Compile
mkdir -p cuobj
nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!

# Local Run
/HW4/matmul$ make 4096
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
matmul_optimal took X.XXX sec
Correct


Submission

There are five submit functions: submit_64, submit_128, submit_1024, submit_2048, and submit_4096. Each number indicates the size of the matrix.

/HW4/matmul$ make remote
mkdir -p cuobj
/usr/local/cuda/bin/nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat | tee result/eval_output.txt
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
cpyA size:67108864 no error
cpyA dA:0x7f2a02000000 hA:0x7f2a4a67f010
kernel N=4096 took X.XXX sec
matmul_optimal took X.XXX sec
Correct

We check only the runtime of "make remote" with matrix size 4096x4096.

Criteria

1.     Requirements

A.      General

i.           You should implement matrix multiplication with CUDA. We do not allow any matmul techniques on the CPU.

ii.          No external libraries. You may only use the STL. (No cuBLAS.)

iii.         Do not overwrite driver.o.

B.      Matrix Multiplication

i.           Problem size: 4096 x 4096

ii.          Performance requirement: 0.2 sec for 4096 x 4096 (this time includes the memcpy() calls)

iii.         Correctness: Must work correctly for any square matrix input from 256 x 256 to 4096 x 4096

2.     Measurements

We measure performance on a real machine, which involves a certain amount of luck. To reduce the luck effect as much as possible, we will run your program at least five times and use the minimum execution time.

3.     Guideline

This assignment is designed to be super-easy, in that most of the code is already provided in the lecture slides. You can dive directly into the final version, but we suggest that you follow the steps below in sequence, so that you gradually see how each technique improves your performance.

A.      The naive version, which directly updates the partial sums in global memory (from slides p. 7)

B.      Replace the memory update with a local variable in a register (p. 10)

C.      Blocked matmul with shared memory (p. 26). This will already get you to the target performance; a sketch follows this list.

D.      Try to further use the remaining shared memory to get the extra credit.
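As a reference for step C, here is a minimal sketch of a blocked matmul kernel using shared memory. The kernel name matmul_kernel, the 32x32 tile, and the assumption that n is a multiple of the tile size are illustrative choices, not the slides' exact code:

#define TILE 32

// Sketch: blocked matmul with shared memory (step C). The partial
// sum is kept in a register (step B) and written to global memory
// once at the end. Assumes n is a multiple of TILE; if arbitrary
// sizes must be supported, add bounds checks on the loads and store.
__global__ void matmul_kernel(const int* A, const int* B, int* C, int n) {
    __shared__ int sA[TILE][TILE];
    __shared__ int sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    int sum = 0;
    for (int t = 0; t < n / TILE; t++) {
        // Each thread loads one element of the current A tile and B tile.
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}

A matching launch would be matmul_kernel<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(d_A, d_B, d_C, n), as in the host-side sketch earlier.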

Report

No reports. Just code.

Grading

1.     You get 50 pts if your program runs legally within 0.2 sec.

A.      To get the full score, your program must not only run correctly within 0.2 sec for input size 4096x4096, but also run correctly for any square matrix input from 64x64 to 4096x4096.

2.     25 pts for being a correct program.

A.      Compile error or incorrect results: 0 pts

B.      If the program can be fixed only by modifying the Makefile: 5 pts penalty (20 pts)

3.     +25 performance points for being correct and running faster than 0.2 sec.

A.      If your program is correct on all test cases but runs slower than 0.2 sec:

i.           perf score = max(25 * (1/t - 1/0.25), 0), where t is your execution time in seconds (see the example below)

4.     +5 extra points if your program runs faster than 0.14 sec.
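For example, under the formula above a correct run at t = 0.21 sec earns max(25 * (1/0.21 - 1/0.25), 0) ≈ 19 pts, while any t at or above 0.25 sec earns 0.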