CSCI-UA.0480-051: Parallel Computing

Final Exam (Dec 21st, 2021)

Total: 100 points

Problem 1

a. [10 points] There is a problem with the following code. Please explain the problem in 1-2 lines. Can we fix this problem? If yes, show how (either by writing a few lines of code or by explaining what you will do). If no, explain why not.

#pragma omp parallel for
for (int i = 1; i < N; i++)
{
    A[i] = B[i] - A[i - 1]; // A and B are arrays of int
}

b. [5] A hyperthreading core can execute several instructions at the same time. A superscalar core can also execute several instructions at the same time. What, then, is the main difference between the two? The answer must not exceed 1-2 lines.

c. [5] Does a hyperthreading core also need to be superscalar? Explain.

d. [5] Does a superscalar core need to have branch prediction? Explain.

Problem 2

Assume the following three MPI processes and the operations they perform at each time step of the program. Also assume that the reduction operation is MPI_MIN. The ranks of the three processes (0, 1, and 2) are their ranks in MPI_COMM_WORLD, and all the APIs called in the table below use MPI_COMM_WORLD.

Time | Process 0                          | Process 1                         | Process 2
-----|------------------------------------|-----------------------------------|-----------------------------------
0    | x = 6; y = 7;                      | x = 8; y = 9;                     | x = 5; y = 6;
1    | MPI_Allreduce(&x, &y, …)           | MPI_Allreduce(&y, &x, …)          | MPI_Allreduce(&x, &y, …)
2    | non-MPI code not affecting x and y | MPI_Send(&y, 1, MPI_INT, 2, 0, …) | MPI_Recv(&y, 1, MPI_INT, 1, 0, …)
3    | MPI_Allreduce(&y, &x, …)           | MPI_Allreduce(&x, &y, …)          | MPI_Allreduce(&x, &y, …)
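
For reference, a minimal C/MPI sketch of how process 1's column of the table above might be laid out in code; the surrounding boilerplate and the count argument of the calls are assumptions for illustration, not given in the table.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int x, y, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                                      // process 1's column of the table
        x = 8; y = 9;                                     // time 0
        MPI_Allreduce(&y, &x, 1, MPI_INT, MPI_MIN,
                      MPI_COMM_WORLD);                    // time 1
        MPI_Send(&y, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);   // time 2
        MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_MIN,
                      MPI_COMM_WORLD);                    // time 3
    }
    // Processes 0 and 2 execute their own columns (elided here); every rank in
    // MPI_COMM_WORLD must take part in both MPI_Allreduce calls.

    MPI_Finalize();
    return 0;
}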

Fill in the following tables with the values of x and y in each process:

[6] After Step 1:

   | Process 0 | Process 1 | Process 2
---|-----------|-----------|----------
 x |           |           |
 y |           |           |

[6] After Step 2:

   | Process 0 | Process 1 | Process 2
---|-----------|-----------|----------
 x |           |           |
 y |           |           |

[6] After Step 3:

   | Process 0 | Process 1 | Process 2
---|-----------|-----------|----------
 x |           |           |
 y |           |           |

a. [5] Suppose we have three communicators, including MPI_COMM_WORLD. How many ranks does each of the above three processes have? Explain in no more than two lines.

b. [7] If we have a quad-core processor, can these three processes use all the cores at the same time? Justify in 1-2 lines. Assume each of the four cores has superscalar capability but no hyperthreading capability. Of course, the code shown above for each process is just a small part of that process's full code (i.e., the full code of each process is bigger than what is shown in this problem).

Problem 3

Suppose we have the following code snippet. Assume all variables and arrays have been declared and initialized earlier.

1.  #pragma omp parallel
2.  {
3.
4.      for (i = 1; i < M; i += 2)
5.      {
6.          D[i] = x * A[i] + x * B[i];
7.      }
8.
9.      #pragma omp for
10.     for (i = 0; i < N; i++)
11.     {
12.         C[i] = k * D[i];
13.     }
14. } // end omp parallel

a. [10] Assume M = N. What do we have to write in line 3 above to get as much performance as possible from the above code? You must not have a race condition and at the same time get the best performance. No need to write any explanations, just the line of code.

b. [10] Repeat problem a, but assume M = 2N.

c. [5] Based on your answer in b, do we need to write anything in line 8 to make the code even faster? If no, then just say no. If yes, then write the line you will include in line 8.

d. [5] Suppose M = 16, N = 8, and the multicore processor on which we run the code has 16 cores. How many threads do you think the OpenMP runtime will create when executing line 1? Justify in no more than two lines. Assume no processes/threads other than the above code are using the multicore.

Problem 4

[15] Choose only one answer for each of the following multiple-choice questions. If you pick more than one answer, you will lose the credit for that question.

1. Suppose a block, when launching a GPU kernel, is composed of 33 threads. How many warps will be created for this block?

a. 1      b. 2      c. 4      d. 5      e. Depends on the total number of blocks.

2. Which of the following is NOT an advantage of warps?

a. reduces complexity of the GPU hardware

b. reduces penalty of thread divergence

c. decouples the number of threads in a block from the number of SPs in the SM.

3. If we have a local variable in the GPU kernel, it will surely go to registers.

a. True             b. False            c. Depends on the compute capability of the GPU.

4. More than one block from two different kernels can be assigned to the same SM at the same time.

a. True             b. False            c. Depends on the compute capability of the GPU.

5. If a critical section in OpenMP looks like { x += b; } what is the best mutual exclusion technique? Assume no other critical sections exist in the program. (A syntax sketch of the three options appears after question 15.)

a. lock             b. critical section         c. atomic

6. If a critical section in OpenMP looks like { x += c; y += b; } what is the best mutual exclusion technique? Assume no other critical sections exist in the program.

a. lock             b. critical section         c. atomic

7. Sometimes, in OpenMP, we cannot know the number of critical sections in a program by just looking at the source code.

a. True             b. False

8. Suppose we launch a kernel as follows:    kernelname <<<4,4>>>(argument list …);

And suppose the kernel has a global variable x. How many instances of x will be created?

a. 16    b. 4      c. 8       d. 1   e. 2

9. An algorithm of O(n) can be faster than an algorithm of O(n²).

a. True             b. False

10. An algorithm of O(n²) can be faster than an algorithm of O(n).

a. True             b. False

11. Even though MPI is a distributed memory programming model, it can still run on a multicore processor, which is a shared memory hardware.

a. True             b. False

12. Even though OpenMP is a shared memory programming model, it can still run on a distributed memory hardware system.

a. True             b. False

13. Even though locks are faster than critical sections in OpenMP, sometimes we must use critical sections.

a. True             b. False

14. Non-determinism exists in MPI, OpenMP, and CUDA.

a. True             b. False

15. The fewer the synchronization points in your parallel program, the less sensitive your program will be to load imbalance.

a. True             b. False
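
For reference (questions 5 and 6), the three mutual-exclusion options look roughly like this in OpenMP. This is only a syntax sketch; the variables, their initial values, and the enclosing parallel region are assumptions made for illustration.

#include <omp.h>

void mutual_exclusion_sketch(void) {
    int x = 0, y = 0, b = 1, c = 2;   // illustrative variables, not from the exam
    omp_lock_t lk;
    omp_init_lock(&lk);

    #pragma omp parallel
    {
        // (a) explicit lock around the update
        omp_set_lock(&lk);
        x += b;
        omp_unset_lock(&lk);

        // (b) critical section: can protect a multi-statement block
        #pragma omp critical
        {
            x += c;
            y += b;
        }

        // (c) atomic: a single simple update of one memory location
        #pragma omp atomic
        x += b;
    }

    omp_destroy_lock(&lk);
}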