
CS450:  Introduction to Parallel Programming

Quiz 6

2022

Submission Instructions

Please submit the answers to the questions on BBLearn as a PDF (no .doc, .docx, .odt, .txt, etc.). You should turn in a PDF called CS450 quizzX NAME.pdf, where “NAME” is your NAU username and “X” is the assignment number (e.g., CS450 quizz6 ab1234.pdf). Assignments need to be turned in via BBLearn.

 

Question 1: Processes

You write some computation-intensive code, such as an array reduction, which is for the moment only sequential. Because it takes some time to execute, you decide that it is time to parallelize it. After carefully considering your options, you settle on processes. You will create n processes, and will thus divide your array into n chunks. You are lucky: n evenly divides your array! Each process will then compute the partial reduction on its respective chunk, and the main process of your application (i.e., the one that created the other processes) will then add up the n partial reductions from the n processes.

After parallelizing and running your code, you do not obtain the expected result. Even after countless hours of debugging, you realize that the code is free of race conditions, deadlocks, etc. The code should work. But it does not; the result is wrong. As if you were not accumulating all the elements of the array. . . Question:   Why do you not get the right result?

 

Question 2: Pthreads Performance

You parallelize a piece of code using Pthreads, and you are very proud of what you wrote, which looks something like this:

// Global variables
float* result;
float* arrayA;
float* arrayB;

pthread_mutex_t mutex;

int main()
{
    /* Declares and initializes the global arrays, threads, mutexes, etc. */
    /* Basically everything you need for your code to work */
    /* Threads run the doWork() function below */
}

void* doWork(void* args)
{
    int tid = (int)(long) args;
    int localSize = ARRAY_SIZE / NB_THREADS;

    pthread_mutex_lock(&mutex);
    for (int i = 0; i < localSize; ++i)
    {
        result[tid * localSize + i] = arrayA[tid * localSize + i] * arrayB[tid * localSize + i];
    }
    pthread_mutex_unlock(&mutex);
    return NULL;
}

The code that you wrote basically splits a vector multiplication evenly between all your threads, and you are sure that it is free of bugs/race conditions/deadlocks/etc. As a good programmer who cares about the performance of their program, you decide to add some timers to measure how long it takes to execute. You are pretty confident that you will reach a high, if not perfect, speedup. However, that does not seem to be the case. Actually, it is quite far from that. . . The performance is barely better than when the code was sequential!

Question:   Why do you not achieve good performance, and what could you change?

 

Question 3: Is it actually good?

You change your code above, and you finally achieve pretty good performance. Or at least, so it seems. You are not entirely sure. You decide that in addition to the execution time, you should also look at some other metrics.

Your code was taking 33 seconds to execute sequentially. You parallelized it to use 6 threads (which is less than the number of cores your machine has), and it now runs in 9 seconds.

Question:   What is the speedup, and the parallel efficiency, of your code?
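As a reminder, with $T_1$ the sequential execution time and $T_p$ the execution time on $p$ threads, the usual definitions are:

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}$$

where $S_p$ is the speedup and $E_p$ the parallel efficiency (a value of 1, i.e., 100%, corresponds to perfect linear speedup).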

Question:   Your machine actually has 8 cores. What execution time could you expect if you were using 8 threads instead of 6?