
ELEC 873 Cluster Computing

Fall 2023

Assignment 3

Posted: November 14, 2023

Total marks = 100 (8% of the course grade) + 20 bonus marks

Due: December 4, 2023 at 11:30am

The objective of this assignment is to gain some exposure to interconnection networks, collective communication algorithms, and GPU programming.

This assignment is to be done individually, and you are not allowed to use any online code or generative AI tools. Students must abide by the academic integrity expectations, work through the assignment on their own, include their own solutions and analyses, and turn in their own reports. Your program for the Bonus part of Q12 must be run on one of the Smith Engineering remote GPU machines or on the Alienware machines in the ECE Bainlab. For information on accessing the Smith Engineering remote GPU machines and the CUDA environment, please consult the "GPU Servers and CUDA Environment Tutorial" under Assignments, or the GPU Computing and Clusters lecture slides in onQ.

Submission instructions can be found at the end of this document.

1. Consider a packet format with 10 bytes of routing and control information and 6 bytes of CRC and other trailer information. The payload contains a 64-byte cache block along with 8 bytes of command and address information. If the raw link bandwidth is 500 MB/s, what is the effective data bandwidth on cache block transfer using this format? How would this change with 128-byte blocks? 4-KB page transfers? [5]
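Hint: one common accounting, assumed here, treats only the cache block as useful data, so every transfer also pays for the header, the command/address bytes, and the trailer:

    effective bandwidth = (data bytes / total packet bytes) × raw bandwidth
                        = 64 / (10 + 8 + 64 + 6) × 500 MB/s ≈ 364 MB/s

The 128-byte and 4-KB cases follow once you decide how the larger payload maps onto packets.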

2. Draw a de Bruijn network for 16 nodes in base 2. [5]

3. Draw a Butterfly MIN with eight nodes and 2 × 2 switches. Repeat for Cube and Baseline MINs. [7]

4. Suppose the links are 1 byte wide and operate at 300 MHz in a network where the average routing distance between nodes is … for p nodes. Compare the unloaded latency for 80-byte packets under circuit switching, store-and-forward, virtual cut-through, and wormhole switching, assuming 4 cycles of delay per hop to make the routing decision, one cycle of switching delay in the router, and p equal to 16 and 1024 nodes. [8]
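Hint: the standard unloaded-latency models apply (a sketch; the exact placement of the switch-delay term varies by text). With h the average routing distance in hops, L = 80 bytes, b the link bandwidth (1 byte per cycle at 300 MHz), Δr = 4 cycles of routing delay per hop, and Δs = 1 cycle of switch delay:

    store-and-forward:              T_sf ≈ h × (Δr + Δs + L/b)
    virtual cut-through / wormhole: T_ct ≈ h × (Δr + Δs) + L/b
    circuit switching:              T_cs ≈ round-trip probe setup over h hops + L/b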

5. Consider the following interconnection network with four nodes n0, n1, n2, and n3, with the minimal routing function R. Prove R is not deadlock-free. Then, using virtual channel(s) devise a new routing function R’, and prove R’ is deadlock-free. [10]

6. Write a minimal north-last routing algorithm for 2-D meshes. [8]
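Hint (a sketch of one possible shape of the answer, not the required form): a minimal north-last router can be written as a function returning the set of permissible minimal output directions, where north (+y in this sketch, an assumption) may be chosen only when it is the last remaining direction:

#include <stdio.h>

/* Sketch: permissible minimal outputs under a north-last turn restriction.
 * Axes are an assumption of this sketch: x grows east, y grows north. */
enum { EAST = 1, WEST = 2, SOUTH = 4, NORTH = 8 };

int north_last_outputs(int cx, int cy, int dx, int dy)
{
    int dirs = 0;
    if (dx > cx) dirs |= EAST;               /* east/west hops are unrestricted */
    if (dx < cx) dirs |= WEST;
    if (dy < cy) dirs |= SOUTH;              /* south mixes freely with east/west */
    if (dy > cy && dirs == 0) dirs |= NORTH; /* north only when nothing else remains */
    return dirs;                             /* 0 means the packet has arrived */
}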

7. Consider a 7 × 7 mesh and a multicast communication, where node (1, 3) is the source and nodes (0, 0), (0, 6), (6, 0), (6, 6), (3, 3), and (3, 0) are the six destinations. Assuming a single-port communication model:

(a) Show the corresponding U-mesh tree and the path followed by each message. [4]

(b) Show the paths used by dual-path routing. [4]

(c) Show the paths used by multipath routing. [4]

8. Show how broadcasting can be done in a 5-cube rooted at node 0 using the Spanning Binomial Tree algorithm. Assume a single-port communication model. [5]
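Hint: the schedule can be cross-checked against this sketch of the doubling pattern (a C/MPI sketch under assumed names; p = 32 for a 5-cube). In step i, every node with rank below 2^i forwards the message across dimension i, so each node uses at most one port per step:

#include <mpi.h>

/* Sketch: spanning-binomial-tree broadcast from rank 0 on p = 2^d nodes. */
static void binomial_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);             /* assumed a power of two */

    for (int mask = 1; mask < p; mask <<= 1) {
        if (rank < mask)
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        else if (rank < 2 * mask)
            MPI_Recv(buf, count, type, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        /* ranks >= 2*mask have not yet received the message and stay idle */
    }
}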

9. Using the Recursive Doubling algorithm, show how an all-to-all broadcast operation, MPI_Allgather, is performed on a binary 3-cube (hypercube). Clearly identify the messages involved in each iteration between any two nodes. Assume a single-port communication model. [10]
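Hint: the messages to identify follow this hedged C/MPI sketch (function name and buffer layout are assumptions): in iteration i, each node exchanges everything gathered so far with the partner whose rank differs in bit i, so the gathered data doubles each step and a 3-cube finishes in 3 steps.

#include <mpi.h>
#include <string.h>

/* Sketch: recursive-doubling allgather on p = 2^d ranks (8 for a 3-cube).
 * buf must hold p * blocksize bytes; myblock is this rank's contribution. */
static void allgather_recursive_doubling(const void *myblock, int blocksize,
                                         void *buf, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);                 /* assumed a power of two */

    char *out = (char *)buf;
    memcpy(out + (size_t)rank * blocksize, myblock, blocksize);

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;           /* neighbor across dimension log2(mask) */
        int mine   = rank    & ~(mask - 1);  /* start of my current run of `mask` blocks */
        int theirs = partner & ~(mask - 1);  /* start of the run I am about to receive */
        MPI_Sendrecv(out + (size_t)mine   * blocksize, mask * blocksize, MPI_BYTE, partner, 0,
                     out + (size_t)theirs * blocksize, mask * blocksize, MPI_BYTE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}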

10. Using the Standard Exchange algorithm, show how an all-to-all personalized exchange operation, MPI_Alltoall, is performed on a binary 3-cube. Clearly identify the messages involved in each iteration between any two nodes. Assume a single-port communication model. [10]
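Hint: a hedged sketch of the pattern to trace (identifiers and layout assumed): in step i, each node exchanges with its neighbor across dimension i the half of its buffer whose slot index differs from its own rank in bit i, so each of the log2 p steps moves p/2 blocks.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: standard-exchange (hypercube) alltoall on p = 2^d ranks, block size bs.
 * buf holds p blocks; slot j starts as the block this rank addresses to rank j
 * and ends as the block rank j addressed to this rank. */
static void alltoall_standard_exchange(char *buf, int bs, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *snd = (char *)malloc((size_t)(p / 2) * bs);
    char *rcv = (char *)malloc((size_t)(p / 2) * bs);

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;   /* neighbor across dimension log2(mask) */
        int k = 0;
        for (int j = 0; j < p; ++j)  /* pack slots whose bit differs from ours */
            if ((j & mask) != (rank & mask))
                memcpy(snd + (size_t)k++ * bs, buf + (size_t)j * bs, bs);
        MPI_Sendrecv(snd, k * bs, MPI_BYTE, partner, 0,
                     rcv, k * bs, MPI_BYTE, partner, 0, comm, MPI_STATUS_IGNORE);
        k = 0;
        for (int j = 0; j < p; ++j)  /* unpack the received blocks into the same slots */
            if ((j & mask) != (rank & mask))
                memcpy(buf + (size_t)j * bs, rcv + (size_t)k++ * bs, bs);
    }
    free(snd);
    free(rcv);
}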

11. Show how the Bruck algorithm works for MPI_Allgather and MPI_Alltoall collectives with 5 processes. Assume a single-port communication model. [10]
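Hint: for the allgather half, this hedged C sketch shows why 5 processes take ceil(log2 5) = 3 steps, the last one partial, followed by a local rotation. The alltoall variant additionally starts with a local rotation and selects blocks by the bits of the step index (not shown here).

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Sketch: Bruck allgather for any p (here p = 5), block size bs bytes. */
static void allgather_bruck(const void *myblock, int bs, void *out, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    char *tmp = (char *)malloc((size_t)p * bs);
    memcpy(tmp, myblock, bs);                /* position 0: my own block */

    for (int have = 1; have < p; ) {
        int cnt  = imin(have, p - have);     /* the last step may be partial */
        int to   = (rank - have + p) % p;    /* one send and one receive per step */
        int from = (rank + have) % p;
        MPI_Sendrecv(tmp, cnt * bs, MPI_BYTE, to, 0,
                     tmp + (size_t)have * bs, cnt * bs, MPI_BYTE, from, 0,
                     comm, MPI_STATUS_IGNORE);
        have += cnt;
    }
    /* tmp[j] holds the block of rank (rank + j) mod p: rotate into rank order */
    for (int j = 0; j < p; ++j)
        memcpy((char *)out + (size_t)((rank + j) % p) * bs,
               tmp + (size_t)j * bs, bs);
    free(tmp);
}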

12. A matrix-vector multiplication takes an input matrix B and an input vector C and produces an output vector A. Each element of the output vector A is the dot product of one row of the input matrix B with vector C. Consider only square matrices. Write a single-precision matrix-vector multiplication GPU kernel and a host function that can be called with four parameters: a pointer to the output vector A, a pointer to the input matrix B, a pointer to the input vector C, and the number of elements in each dimension. After the device matrix-vector multiplication is invoked, compute the correct output vector using the CPU and compare that solution with the device-computed solution. If the two match (within a certain tolerance), the program will display "Test PASSED" on the screen before exiting.

Create a randomly initialized input matrix B and vector C. Write the host code: allocate memory for the input and output vectors and the input matrix, transfer the input data to the device, launch the kernel, transfer the output data back to the host, and free the device memory for the input and output data. Write a kernel in which each thread produces one output vector element, and set the execution configuration parameters in your host code accordingly. [10]
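A minimal structural sketch of the kernel and host wrapper being asked for (identifier names, the 256-thread block size, and row-major storage of B are assumptions of this sketch; error checking and the CPU verification are omitted):

#include <cuda_runtime.h>

/* Sketch: one thread computes one element of A (one row of B dotted with C). */
__global__ void matVecKernel(float *A, const float *B, const float *C, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += B[row * n + j] * C[j];     /* B assumed row-major */
        A[row] = sum;
    }
}

/* Host wrapper with the four required parameters (error checks omitted). */
void matVec(float *A, const float *B, const float *C, int n)
{
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, (size_t)n * n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dB, B, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                        /* an assumed block size */
    int blocks = (n + threads - 1) / threads; /* ceil(n / threads) covers all rows */
    matVecKernel<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(A, dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}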

Bonus:

(a) Run your CUDA code on one of the Smith Engineering remote GPU machines, or on one of the Alienware machines in the ECE Bainlab, with different matrix/vector sizes, and analyze your matrix-vector multiplication kernel's performance. Experiment with different numbers of blocks and threads per block. [10]

(b) Consider a multi-GPU node/cluster and write an MPI+CUDA program that implements the matrix-vector multiplication by allocating a part of the input matrix B on each GPU. Use one process per GPU, and allocate matrix B statically. The input vector C is copied to all GPUs. Each GPU computes its own part of the result vector A and sends the result back to the root process. You do not need to run this code on any of the GPU machines. [10]
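One possible skeleton for part (b) (hedged: the matrix size, identifier names, MPI_Scatter-based distribution, and the one-GPU-per-rank mapping are assumptions of this sketch; the statically allocated B the question asks for can replace the scatter step):

#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch: each rank computes one row block of A on its own GPU. */
__global__ void matVecRows(float *Apart, const float *Bpart, const float *C,
                           int rows, int n)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per local row */
    if (r < rows) {
        float s = 0.0f;
        for (int j = 0; j < n; ++j)
            s += Bpart[r * n + j] * C[j];
        Apart[r] = s;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int n = 1024;                /* assumed size, divisible by p */
    const int rows = n / p;
    cudaSetDevice(rank);               /* assumed: one visible GPU per rank */

    float *B = NULL, *A = NULL;
    float *C     = (float *)malloc(n * sizeof(float));
    float *Bpart = (float *)malloc((size_t)rows * n * sizeof(float));
    float *Apart = (float *)malloc(rows * sizeof(float));
    if (rank == 0) {                   /* root creates the random inputs */
        B = (float *)malloc((size_t)n * n * sizeof(float));
        A = (float *)malloc(n * sizeof(float));
        for (size_t i = 0; i < (size_t)n * n; ++i) B[i] = rand() / (float)RAND_MAX;
        for (int i = 0; i < n; ++i)                C[i] = rand() / (float)RAND_MAX;
    }
    MPI_Bcast(C, n, MPI_FLOAT, 0, MPI_COMM_WORLD);            /* C on every GPU */
    MPI_Scatter(B, rows * n, MPI_FLOAT, Bpart, rows * n, MPI_FLOAT,
                0, MPI_COMM_WORLD);                           /* row block of B */

    float *dB, *dC, *dA;
    cudaMalloc(&dB, (size_t)rows * n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMalloc(&dA, rows * sizeof(float));
    cudaMemcpy(dB, Bpart, (size_t)rows * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256, blocks = (rows + threads - 1) / threads;
    matVecRows<<<blocks, threads>>>(dA, dB, dC, rows, n);
    cudaMemcpy(Apart, dA, rows * sizeof(float), cudaMemcpyDeviceToHost);

    MPI_Gather(Apart, rows, MPI_FLOAT, A, rows, MPI_FLOAT, 0, MPI_COMM_WORLD);
    /* rank 0 now holds the complete result vector A */

    cudaFree(dB); cudaFree(dC); cudaFree(dA);
    free(C); free(Bpart); free(Apart); free(B); free(A);
    MPI_Finalize();
    return 0;
}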

Submission Instructions:

For Assignment 3:

• Submit a single zip file, “Assignment_3_YourLastName.zip”, containing your CUDA code (and the optional MPI+CUDA code) for Q12, with your name and student ID at the top of each code file.

• Submit a single report in PDF format, “Assignment_3_YourLastName.pdf”, where you present your solution for each question and include a description of your work, analyses of the results, and discussion of any outstanding issues. Include a printout of your code for Q12 and, optionally, output screenshots of your program running on one of the Smith Engineering remote GPU machines or one of the Alienware machines in the ECE Bainlab.