闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP5426 Parallel and Distributed Computing

Final Exam

Semester 1 2021

Question 1 (50 Marks)

(This section contains 15 short-answer questions.)

1) Give three methods for increasing the speed of a computer system. [3]

2) Classify MIMD computers based on memory organization and communication methods. [3]

3) Describe at least FOUR characteristics of shared-memory multiprocessors that distinguish them from distributed-memory multicomputers on the basis of hardware architecture, data sharing and communication/synchronization. [4]

4) Give a definition of cloud computing and briefly describe its main characteristics. [4]

5) What are the major performance metrics for characterizing an interconnection network, and what are the main purposes of using these metrics? [4]

6) What is the execution time, in general, relate to the standard all-to-all reduction operation without broadcasting on a two-dimensional torus with a total number of compute nodes p. [4]

7) What is a non-blocking network? Give two examples. [3]

8) Parallel paths are a number of distinct paths collecting nodes A and B and these paths have no common nodes other than A and B. What is the maximum number of possible parallel paths between any two nodes in a d-dimensional hypercube? [4]

9) Will it take longer time to perform a one-to-all broadcast operation on a one- dimensional array of size than that on a two-dimensional mash of size if per-hop time is ignored? You must give reasons to justify your answer. [4]

10) Prove that if total overhead for a given problem size, then the parallel execution time will continue to decrease as is increased and asymptotically approach a constant value. [4]

11) What is race condition? Explain race condition with two examples. [3]

12) In the context of a GPU program using CUDA, what are (i) threads; (ii) blocks; and (iii) grids? [3]

14) Can functions annotated with global__be executed on the host? [2]

15) What does coalesced / un-coalesced memory access mean? [3]

Question 2 (10 Marks)

Suppose that MPI_COMM_WORLD consists of three processes with rank 0, 1, and 2, respectively. Suppose the following code is executed:

int x, y, z;

switch(my_rank) {

case 0: x=1; y=4; z=2;

MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);

MPI_Send(&y, 1, MPI_INT, 2, 2, MPI_COMM_WORLD);

MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status); MPI_Allreduce (&y, &x, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;

case 1: x=3; y=8; z=6;

MPI_Bcast(&y, 1, MPI_INT, 0, MPI_COMM_WORLD);

MPI_Send(&z, 1, MPI_INT, 2, 1, MPI_COMM_WORLD);

MPI_Send(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);

MPI_Allreduce (&z, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;

case 2: x=6; y=7; z=8;

MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD); MPI_Recv(&y, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);

MPI_Recv(&z, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status); MPI_Allreduce (&x, &y, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD); break;

}

What are the values of x, y, and z on each process after the code has been executed?

QUESTION 3 (10 Marks)

A counting semaphore is a synchronization mechanism widely used in practice. It has an integer variable used for signaling among threads. Only three atomic operations may be performed on a semaphore:

i. sem_Init: The integer variable is initially assigned a zero or positive value.

ii. sem_ Wait: The integer value is decremented by 1; if the value is smaller than zero, the calling thread will be blocked.

iii. sem_Signal: The integer value is incremented by 1; if the value is smaller than or equal to zero, unblock one of the waiting thread.

You are asked to implement a simple counting semaphore using pthread mutex and condition variable.

A counting semaphore type may be defined as

typedef struct {

int count; /* the integer variable */

pthread_mutex_t s_m; /* mutex */

pthread_cond_t s_cv; /* condition variable */

} mylib_sem_t;

You must implement the following THREE functions:

mylib_sem_init(mylib_sem_t *s, int value)/*sem_Init as discussed above*/ mylib_sem_wait(mylib_sem_t *s) /*sem_Wait operation as discussed above*/ mylib_sem_signal(mylib_sem_t *s) /* sem_Signal as discussed above */

Question 4 (10 Marks)

Consider a CUDA kernel routine reduceOp() below. This routine performs a parallel reduction operation, i.e., parallel summation of an array of integers with n elements. The routine does not make use of shared memory in each SM (or Stream Multiprocessor) and the reduction is done in-place, which means that the values in global memory are replaced by partial sums at each step.

__global__ void reduceOp(int *g_idata, int *g_odata, unsigned int n) {

// set thread ID

unsigned int tid = threadIdx.x;

unsigned int idx = blockIdx.x * blockDim.x + threadIdx .x;

// convert global data pointer to the local pointer of this block int *idata = g_idata + blockIdx.x*blockDim.x;

// boundary check

if(idx >= n) return;

// in-place reduction in global memory

for (int stride = 1; stride < blockDim.x; stride *= 2) {

int index = 2 * stride * tid;

if (index < blockDim.x) {

idata[index] += idata[index + stride];

}

// synchronize within threadblock

__syncthreads();

}

// write result for this block to global mem

if (tid == 0) g_odata[blockIdx.x] = idata[0];

}

You need to answer the following questions after analyzing the routine.

1) Describe how the parallel algorithm adopted by this routine works.

2) Discuss if there are any shortcomings in this algorithm (in addition to the use of shared memory).

3) Based on your discussion, modify the routine to make it work more efficiently still for in-place reduction in global memory without making use of shared memory in each SM.

4) Justify your modifications, i.e., discuss why your modifications can enhance the performance of the original routine.

Question 5 (20 Marks)

Relational operator “ .>” performs element-wise comparisons between two arrays of equal length and the results is a logical array of the same length indicating the locations where

the relation “>” (i.e., greater than) is true. Given two integer arrays and

, for example, . The total number of locations

where the relation is true is in this example. (Note in general

1) Given a set of arrays, each being of length , you are asked to design a parallel algorithm to find the array pairs which have the largest value for in a distributed memory system of processing elements organized as a one-dimensional ring for . (Note there may be multiple such array pairs.)

In your design you must seriously consider how to balance the workloads across the processing elements and minimize the communication cost.

You need to discuss how both data and computation are partitioned, how many computation-and-communication steps are needed, and in each step what computation is performed and how data are sent/received by each processing element. You may draw figures and/or write pseudocode to simplify the description of your parallel algorithm.

2) Calculate the speedup, efficiency and isoefficiency function for your designed parallel algorithm. Show your work.