Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

CSE 560 – Practice Problem Set 5 Solution

1.    In this question, you will investigate how the compiler can increase the amount of ILP via the    scheduling of instructions on a single-issue, in-order pipeline.  Our code uses a simple loop that adds a scalar value to an array in memory.  The source code (in C) looks like this:

for (i=1000; i > 0; i=i-1)

x[i] = x[i] + s;

We can see that this loop is parallel by noticing that the body of each iteration is independent.   The first step is to translate the above code segment into assembly language.  In the following   code segment, r1 is initially the address of the element of the array with the highest address,     and f2 contains the scalar value, s.  Register r2 is pre-computed, so that 8(r2) is the last element to operate on.  Straightforward assembly language code, not scheduled for the pipeline, looks   like this:

(1)         loop:     load      f0, 0(r1)                             ;f0

(2)                        addf      f4

(3)                        store     0(r1), f4                             ;store result

(4)                        addi       r1

(5)                        bneq     r1, r2, loop                       ;branch if r1 != r2

Assume floating point additions take 4 cycles, and the 5-stage pipeline has full bypassing paths available.  Assume the branch predicts “not-taken” and miss-predicted branches flush the        pipeline.

(a)  Show the timing of this instruction sequence (i.e., draw a pipeline diagram) without any

code transformations.  How many clock cycles are required per iteration?  For the entire code snippet?

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

(1)

F

D

X

M

W

(2)

F

D

d*

X1

X2

X3

X4

M

W

(3)

F

p*

D

d*

d*

d*

X

M

W

(4)

F

p*

p*

p*

D

X

M

W

(5)

F

D

X

M

W

(6)

F

D

f*

(7)

F

f*

(1)

F

D

X

M

W

This code takes 11 cycles per iteration, or 11,000 cycles total (ignoring the pipeline fill time).

(b)  Re-schedule the code (make sure it still performs the required computation) to diminish the

time required per iteration.  Show the timing of this revised instruction sequence.  How many clock cycles are required per iteration? For the entire code snippet?

(1)         loop:     load      f0, 0(r1)                             ;f0

(2)                        addi       r1

(3)                        addf      f4

(4)                        store     8(r1), f4                             ;store result (note address transformation)

(5)                        bneq     r1, r2, loop                       ;branch if r1 != r2

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

(1)

F

D

X

M

W