COMP5426 Parallel and Distributed Computing 2018
Semester 1 - Main, 2018
Question 1
(This section contains 4 short-answer questions.)
a. Distinguish between implicit and explicit parallelism.
b. Classify parallel computers based on Flynn’s taxonomy, in terms of instruction and data streams, giving examples.
c. Describe at least FOUR characteristics of shared-memory multiprocessors that distinguish them from distributed-memory multicomputers on the basis of architecture, data sharing and communication/synchronization.
d. Distinguish between SIMD and SPMD.
Question 2 (16 Marks)
(This section contains 5 short-answer questions)
a. What are the major performance metrics for characterizing an interconnection network, and what are the main purposes of using these metrics?
b. What, in general, is the execution time of the standard all-to-one reduction operation on a two-dimensional torus with a total of p compute nodes?
d. What is the Arc connectivity of a two-dimensional torus?
e. What is the cost (i.e., the number of links) of a hypercube network?
Question 3 (27 Marks)
(This section contains 5 short-answer questions.)
a. Provide concrete definitions and examples of the following terms:
i. Parallel speedup
ii. Parallel efficiency
iii. Total overhead
b. Provide concrete definitions and examples of Amdahl’s Law and Gustafson-Barsis’s law. What do these two laws tell us?
c. Suppose 90% of the code of a program is found to be parallel, so that it can be executed simultaneously by 16 homogeneous processors; the rest of the code is sequential. The execution speed of each processor is 4 GFLOPS (Giga Floating Point Operations Per Second). Calculate the effective execution speed, in GFLOPS, of this program execution.
d. Consider a matrix of size 9×11, which is partitioned into blocks, each being of size 2×2, and distributed onto a 2×2 processor mesh. Label each matrix element with the processor numbers 1, 2, 3 and 4 that the element is mapped to by the two-dimensional block cyclic layout.
e. Consider performing an even-odd transposition sort of a sequence of n numbers on a linear array of p processors. What are the speedup, efficiency and isoefficiency function? Show your work. (You may assume that a local sequential sort of m numbers takes O(m log m) time and a local sequential merge of two sorted sequences of m numbers takes O(m) time.)
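One standard cost model for part (e) (a sketch, not the official solution): with n/p numbers per processor, the initial local sort costs O((n/p) log(n/p)), and each of the p compare-split phases exchanges and merges O(n/p) numbers, for a phase total of O(n). Hence

T_p = O((n/p) log(n/p)) + O(n),
S = O(n log n) / T_p,  E = S / p.

The total overhead p*T_p - n log n is dominated by the O(np) term, so constant efficiency requires n log n = O(np), i.e. n grows as 2^(O(p)), giving an isoefficiency function of O(p * 2^p).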
Question 4 (19 Marks)
(This section contains 3 questions)
a. List SIX very basic MPI functions which are used in almost every parallel program that uses MPI.
b. Explain clearly what each one of the following MPI commands does. You need to specify the input and output parameters and what they store before and after the call.
i. int MPI_Comm_size(MPI_Comm comm, int *size)
ii. MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
iii. MPI_Scatterv (const void *sendbuf, const int sendcounts[], const int displs[], MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
c. Suppose that MPI_COMM_WORLD consists of three processes 0, 1, and 2, and suppose the following code is executed:
int x, y, z;
switch(my_rank) {
case 0: x=0; y=1; z=2;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Send(&y, 1, MPI_INT, 2, 2, MPI_COMM_WORLD);
MPI_Bcast(&y, 1, MPI_INT, 1, MPI_COMM_WORLD);
break;
case 1: x=3; y=8; z=5;
MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Send(&z, 1, MPI_INT, 2, 3, MPI_COMM_WORLD);
MPI_Bcast(&y, 1, MPI_INT, 1, MPI_COMM_WORLD);
break;
case 2: x=6; y=7; z=8;
MPI_Bcast(&z, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Recv(&x, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(&y, 1, MPI_INT, 1, 3, MPI_COMM_WORLD, &status);
MPI_Bcast(&z, 1, MPI_INT, 1, MPI_COMM_WORLD);
break;
}
What are the values of x, y, and z on each process after the code has been executed?
Question 5
(This section contains 5 questions.)
b. Explain the mutex and the condition variable, and the operations associated with each.
c. Using mutex and condition variable, write a simple synchronization structure barrier which is used to coordinate the execution of a set of threads. When a thread reaches a synchronization point by calling barrier, its execution is stopped until all other threads in the set reach the synchronization point.
A barrier type is defined as
typedef struct {
pthread_mutex_t count_lock; /* mutex */
pthread_cond_t ok_to_proceed; /* condition variable */
int num_thrds; /* the total number of threads in the set */
int count; /* the number of threads arrived */
} mylib_barrier_t;
You must write the following TWO functions:
mylib_barrier_init(mylib_barrier_t *b, int n_thrds), and
mylib_barrier(mylib_barrier_t *b)
/* all threads are awakened when the total number of blocked threads is equal to n_thrds. */
d. In the context of a GPU programmed using CUDA, what are (i) threads; (ii) blocks; and (iii) grids?
e. Outline how you would allocate and free memory for the arrays on the GPU, and how you would transfer data to and from the host CPU.
f. Consider the CUDA kernel routine plus_reduce() below. This routine performs a parallel add reduction, i.e., a parallel summation of an array of N integers.
__global__
void plus_reduce(int *input, int N, int *total)
{
    int tid = threadIdx.x;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ int x[blockDim.x];
    x[tid] = (i < N) ? input[i] : 0;
    __syncthreads();
    for (int s = blockDim.x/2; s > 0; s = s/2)
    {
        if (tid < s) x[tid] += x[tid+s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(total, x[tid]);
}
Answer the following questions after analyzing the routine.
i. Describe how the parallel algorithm adopted by this routine works.
ii. Discuss if there are any shortcomings in this algorithm.
iii. Based on your discussion, modify the routine to make it work more efficiently.
iv. Justify your modifications, i.e., discuss why your modifications can enhance the performance of the original routine.