闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP528

First Semester Examinations 2019/20

Multi-Core and Multiprocessor Programming

QUESTION 1

1a) Explain Amdahl’s Law, either in words or via a set of equations (taking care to define

each variable used). Apply this to determine the expected speed-up using 100 cores for a

code that takes 180 seconds on 1 core and 100 seconds on 2 cores.

1b) Rewrite the “DAXPY” code fragment below (where there are N elements of x, y and res) as a 1D CUDA kernel, ensuring that it is portable for invocation on any number of threads larger than N. Explain where the parallelism occurs.

void daxpy(int N, double *res, double A, double *x, double *y)

{

res[i] = A * x[i] + y[i];

}

1c) According to the MPI standard, what is required by a receiving process in order to successfully receive a message from a given sending process? If every MPI process of a parallel program implementation first attempts to send a message before each attempts to receive a message, what is likely to occur and how could you prevent it?

QUESTION 2

2a) Given the requirements for ever increasing amounts of computation (in a fixed time), what approaches (in terms of hardware) are available given the limitations of a single core processor (e.g. no further increase in clock frequency)?

2b) When running a communications-intensive MPI code on a number of compute nodes, why is the inter-node interconnect important and explain the two key terms used to compare performance of HPC interconnects.

2c) With the aid of a diagram, explain the OpenMP “fork-join” model. Using this, explain why coarse grained parallelism is generally preferential to fine grained parallelism.

2d) Explain the acronym “SPMD”. Explain how an MPI implementation of SPMD can result in a faster time to solution than the corresponding serial implementation, and what decisions need to be made about the data being processed.

QUESTION 3

3a) For a CPU-based cluster, the programmer has a choice of approaches, including programming for shared memory or for distributed memory. Explain these terms, highlighting their physical differences, and then discuss & compare programming options available for each (or both).

3b) What are the names of the two patterns of MPI communications covered during the course? What are the key differences between these two patterns? If you wished to ensure all MPI processes (in the named communicator) were synchronised, what explicit MPI call would you make and on which MPI processes? (You need only name the MPI function, you do not need to give the parameters of the function call.)

3c) Name one of the directives based approaches to programming a GPU. How does the directives approach compare to the use of CUDA, and where might the use of such directives be more portable than the use of CUDA?

QUESTION 4

4a) Considering data dependencies within the following 3 examples, say for each whether the code as it is currently written would give correct results when run in parallel (naming any particular conditions if necessary). Variables A,B,C are vectors (of size N) with B,C initialised appropriately, final is a scalar, i,k,N are constants, and no aliasing exists.

i. #pragma omp parallel for default(none) shared(A,B,C)

for (i=0; i

A[i] = B[i] + C[i];

}

ii. A[0] = rand(); // set to a random value

#pragma omp parallel for default(none) shared(A,k)

for (i=1; i

A[i] = A[i-1] * k ;

}

iii. final = 0.0;

#pragma omp parallel for shared(B,C) private(final)

for (i=0; i

final = final + B[i] / C[i] ;

}

printf(“The sums of B & C elements is %f\n”, final);

4b) Explain for what values of stride that the following is parallelisable (presuming all elements of A have been set previously).

#pragma omp parallel default(none) shared(A,k,stride)

{

#pragma omp for schedule(static,1)

for (i=0; i

A[i+2] = A[i+1] + k* A[i];

}

In terms of performance, what would you change and why?

4c) In each OpenMP parallel region, it is necessary to determine the “scope” of each variable and define (e.g.) as PRIVATE or SHARED. By using an example and, optionally, a diagram, explain the difference for variable x between if it is defined as PRIVATE(x) or as SHARED(x), and discuss which you think has less overhead at the entry to a parallel region.

4d) For programming a GPU using CUDA, the NVIDIA terminology for the various levels of hardware of a “GPU” includes “CUDA cores” and “streaming multiprocessors”. Explain each of the software concepts of threads, blocks, grids and how they relate to the hardware concepts. What is the importance of “warps” in terms of efficient CUDA programming?

QUESTION 5

5a) Give two motivating examples of why companies and governments invest heavily in high performance computing (HPC). How can the “Top500” supercomputer listing be used for an institution to evaluate its computing capacity compared to its rival? Give 2 limitations of the “Top500” list.

5b) Give an example of where locks, explicitly or an equivalent locking mechanism, are required within parallelised code. OpenMP supports explicitly locking via low-level calls but also provides the “critical” and “atomic” constructs. Explain clearly the difference between these and discuss whether either still allows some concurrency.

5c) Having written a functionally correct implementation to solve a given problem, it is common to then follow a few steps to improve the serial performance of the code, particularly to maximise the capabilities of today’s High Performance Computing hardware. What compilation time options might you change to increase performance (explaining the meaning of any compiler flags you may state), and which tools/methods could you use to explore where time is being spent within your code?