
Multicore and GPU Programming – Matrix Multiplication

Due – 5/21 (Sun) 23:59

Introduction

In this project, you will implement matrix multiplication with GPU programming.

The specification has the following contents. You only need to use CUDA for this assignment.

1.     Environment Setup: About our environment

2.     Framework: Structure and usage of the framework

3.     Submission: How to validate your submission before the deadline

4.     Criteria: What you have to do

5.     Report: How to report your results

6.     Grading: How we grade the result

Environment Setup

Same as in the previous homework.

You don't need to set up the environment. We have prepared everything you need to experiment with GPU programming. Check the GNU GCC/G++ 9.4.0, GNU make, and CUDA 11.8 documentation for more information.
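For example, you can confirm the installed toolchain on the provided machine with "nvcc --version" and "g++ --version".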

Matrix Multiplication Framework

1.     Structure of template code

/HW4/matmul$ tree .
.
|-- Makefile
|-- obj
|   `-- driver.o
`-- src
    |-- matmul.h
    `-- matmul_kernel.cu

2.     Implement your optimized solution


#include "matmul.h"

using namespace std;

void allocateDeviceMemory(void** M, int size)
{
    cudaError_t err = cudaMalloc(M, size);
    assert(err == cudaSuccess);
}

void deallocateDeviceMemory(void* M)
{
    cudaError_t err = cudaFree(M);
    assert(err == cudaSuccess);
}

void matmul_ref(const int* const matrixA, const int* const matrixB,
                int* const matrixC, const int n) {
    // You can assume matrixC is initialized with zero
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                matrixC[i * n + j] += matrixA[i * n + k] * matrixB[k * n + j];
}

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    // TODO: Implement your CUDA code
}


As you see, the given code is almost the same as in HW3. However, you are supposed to use the GPU to push the execution time even lower. The target is 0.20 sec (which is easy). Your memcpy calls are included in the measurement. However, the time for allocating device memory is not included (because we did not include CPU memory allocation either). In case you want to do something about the allocation (you probably don't), we have provided separate alloc/dealloc functions. In driver.cpp, which is provided as a pre-compiled binary, it would look something like this:

int main()
{
    int* A = new int[N * N];
    int* B = new int[N * N];
    int* C = new int[N * N];

    int *d_A, *d_B, *d_C;
    allocateDeviceMemory((void**)&d_A, N * N * sizeof(int));
    allocateDeviceMemory((void**)&d_B, N * N * sizeof(int));
    allocateDeviceMemory((void**)&d_C, N * N * sizeof(int));

    // ... read input, run and time matmul_optimized(), check the result ...

    deallocateDeviceMemory(d_A);
    deallocateDeviceMemory(d_B);
    deallocateDeviceMemory(d_C);

    delete[] A;
    delete[] B;
    delete[] C;
}

If you want to allocate anything else, you should do it within matmul_optimized(). Of course, you are not allowed to change or rewrite main(). If you think you need to, please contact us.
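For reference, here is a minimal sketch of what the body of matmul_optimized() might look like. It assumes a hypothetical __global__ kernel named matmul_kernel defined in the same file; the launch shape is an illustrative choice, not a requirement:

void matmul_optimized(const int* const matrixA, const int* const matrixB,
                      int* const matrixC, const int* d_A, const int* d_B,
                      int* const d_C, const int n) {
    size_t bytes = (size_t)n * n * sizeof(int);

    // Host-to-device copies: these are part of the measured time.
    // The casts drop const; d_A and d_B point to writable device memory.
    cudaMemcpy((void*)d_A, matrixA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy((void*)d_B, matrixB, bytes, cudaMemcpyHostToDevice);

    // Illustrative launch: one thread per output element, 32x32 blocks.
    dim3 block(32, 32);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, n);

    // Device-to-host copy of the result; cudaMemcpy also waits for the
    // kernel to finish, so no explicit synchronization is needed here.
    cudaMemcpy(matrixC, d_C, bytes, cudaMemcpyDeviceToHost);
}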

3.     Compile and run matmul

The matmul program takes two arguments:

-       inputPath: path to the input file

-       outputPath: path to the output file

You can simply run it with the commands "make 256", "make 512", ..., "make 4096".

# Compile
mkdir -p cuobj
nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!

# Local Run
/HW4/matmul$ make 4096
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
matmul_optimal took X.XXX sec
Correct


Submission

There are five submit functions: submit_64, submit_128, submit_1024, submit_2048, and submit_4096. Each number indicates the size of the matrix.

/HW4/matmul$ make remote
mkdir -p cuobj
/usr/local/cuda/bin/nvcc -g -std=c++11 -arch=sm_86 -O3 -Iinclude -c src/matmul_kernel.cu -o cuobj/matmul_kernel.o
CUDA Compiled src/matmul_kernel.cu successfully!
mkdir -p bin
OBJ: obj/driver.o
CUOBJ: cuobj/matmul_kernel.o
g++ obj/driver.o cuobj/matmul_kernel.o -o bin/matmul -g -std=c++11 -Wall -Wno-sign-compare -O3 -L/usr/local/cuda/lib64 -lcudart -Iinclude -I/usr/local/cuda/include
Compiled obj/driver.o successfully!
./bin/matmul /HW4_data/input_4096.dat /HW4_data/output_4096.dat | tee result/eval_output.txt
=====================================
Matrix Multiplication
=====================================
The size of Matrix: 4096
=====================================
Read input file(/HW4_data/input_4096.dat) ...
Read output file(/HW4_data/output_4096.dat) ...
Run your solution ...
cpyA size:67108864 no error
cpyA dA:0x7f2a02000000 hA:0x7f2a4a67f010
kernel N=4096 took X.XXX sec
matmul_optimal took X.XXX sec
Correct

We check only the runtime of "make remote" with matrix size 4096x4096.

Criteria

1.     Requirements

A.      General

i.           You should implement matrix multiplication with CUDA. We do not allow any matmul techniques on the CPU.

ii.          No external libraries. You may only use the STL. (No cuBLAS.)

iii.         Do not overwrite driver.o.

B.      Matrix Multiplication

i.           Problem size: 4096 x 4096

ii.          Performance requirement: 0.2 sec for 4096 x 4096 (this time includes the memcpy() calls)

iii.         Correctness: Must work correctly for any square matrix input from 256 x 256 to 4096 x 4096

2.     Measurements

We measure performance on a real machine, which involves a certain amount of luck. To reduce the luck effect as much as possible, we will run your program at least five times and use the minimum execution time.

3.     Guideline

This assignment is designed to be super-easy, in that most of the code is already provided in the lecture slides. You can dive directly into the final version, but we suggest that you follow the steps below in sequence, so that you gradually see how each technique improves your performance.

A.      The naive version, which directly updates the partial sums in global memory (from slides p. 7)

B.      Replace the memory update with a local variable in a register (p. 10)

C.      Blocked matmul with shared memory (p. 26). This will already get you to the target performance; a sketch follows this list.

D.      Try to further use the remaining shared memory to get the extra credit.
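As a reference for step C, here is a minimal sketch of a blocked matmul kernel using shared memory. The kernel name matmul_kernel, the 32x32 tile, and the assumption that n is a multiple of the tile size are illustrative choices, not the slides' exact code:

#define TILE 32

// Sketch: blocked matmul with shared memory (step C). The partial
// sum is kept in a register (step B) and written to global memory
// once at the end. Assumes n is a multiple of TILE; if arbitrary
// sizes must be supported, add bounds checks on the loads and store.
__global__ void matmul_kernel(const int* A, const int* B, int* C, int n) {
    __shared__ int sA[TILE][TILE];
    __shared__ int sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    int sum = 0;
    for (int t = 0; t < n / TILE; t++) {
        // Each thread loads one element of the current A tile and B tile.
        sA[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}

A matching launch would be matmul_kernel<<<dim3(n / TILE, n / TILE), dim3(TILE, TILE)>>>(d_A, d_B, d_C, n), as in the host-side sketch earlier.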

Report

No reports. Just code.

Grading

1.     You get 50 pts if your program runs legally within 0.2 sec.

A.      To get the full score, your program must not only run correctly within 0.2 sec for input size 4096x4096, but also run correctly for any square matrix input from 64x64 to 4096x4096.

2.     25 pts for being a correct program.

A.      Compile error or incorrect results: 0 pts

B.      If the program can be fixed only by modifying the Makefile: 5 pts penalty (20 pts)

3.     +25 performance points for being correct and running faster than 0.2 sec.

A.      If your program is correct on all test cases but runs slower than 0.2 sec:

i.           perf score = max(25 * (1/t - 1/0.25), 0), where t is your execution time in seconds (see the example below)

4.     +5 extra points if your program runs faster than 0.14 sec.
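For example, under the formula above a correct run at t = 0.21 sec earns max(25 * (1/0.21 - 1/0.25), 0) ≈ 19 pts, while any t at or above 0.25 sec earns 0.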