CSE 560 – Practice Problem Set 5 Solution
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
CSE 560 – Practice Problem Set 5 Solution
1. In this question, you will investigate how the compiler can increase the amount of ILP via the scheduling of instructions on a single-issue, in-order pipeline. Our code uses a simple loop that adds a scalar value to an array in memory. The source code (in C) looks like this:
for (i=1000; i > 0; i=i-1)
x[i] = x[i] + s;
We can see that this loop is parallel by noticing that the body of each iteration is independent. The first step is to translate the above code segment into assembly language. In the following code segment, r1 is initially the address of the element of the array with the highest address, and f2 contains the scalar value, s. Register r2 is pre-computed, so that 8(r2) is the last element to operate on. Straightforward assembly language code, not scheduled for the pipeline, looks like this:
(1) loop: load f0, 0(r1) ;f0
(2) addf f4
(3) store 0(r1), f4 ;store result
(4) addi r1
(5) bneq r1, r2, loop ;branch if r1 != r2
Assume floating point additions take 4 cycles, and the 5-stage pipeline has full bypassing paths available. Assume the branch predicts “not-taken” and miss-predicted branches flush the pipeline.
(a) Show the timing of this instruction sequence (i.e., draw a pipeline diagram) without any
code transformations. How many clock cycles are required per iteration? For the entire code snippet?
|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
(1) |
F |
D |
X |
M |
W |
|
|
|
|
|
|
|
|
|
|
|
|
|
(2) |
|
F |
D |
d* |
X1 |
X2 |
X3 |
X4 |
M |
W |
|
|
|
|
|
|
|
|
(3) |
|
|
F |
p* |
D |
d* |
d* |
d* |
X |
M |
W |
|
|
|
|
|
|
|
(4) |
|
|
|
|
F |
p* |
p* |
p* |
D |
X |
M |
W |
|
|
|
|
|
|
(5) |
|
|
|
|
|
|
|
|
F |
D |
X |
M |
W |
|
|
|
|
|
(6) |
|
|
|
|
|
|
|
|
|
F |
D |
f* |
|
|
|
|
|
|
(7) |
|
|
|
|
|
|
|
|
|
|
F |
f* |
|
|
|
|
|
|
(1) |
|
|
|
|
|
|
|
|
|
|
|
F |
D |
X |
M |
W |
|
|
This code takes 11 cycles per iteration, or 11,000 cycles total (ignoring the pipeline fill time).
(b) Re-schedule the code (make sure it still performs the required computation) to diminish the
time required per iteration. Show the timing of this revised instruction sequence. How many clock cycles are required per iteration? For the entire code snippet?
(1) loop: load f0, 0(r1) ;f0
(2) addi r1
(3) addf f4
(4) store 8(r1), f4 ;store result (note address transformation)
(5) bneq r1, r2, loop ;branch if r1 != r2
|
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
(1) |
F |
D |
X |
M |
W |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2022-10-17