
Homework 4 CSE120 Solutions

Q1 (10 pts)

You are tasked with improving the performance of a functional unit. The computation for the functional unit has 4 steps (A-D), with latencies of 5 ns, 8 ns, 4 ns, and 3 ns respectively (per the datapath provided), and each step is indivisible. Assume there is no dependency between successive computations.

 

A)   What is the greatest clock rate speedup possible with pipelining? You do not need to worry about the register timing constraints (e.g. delay, setup, hold). Explain your reasoning.

Without pipelining, the functional unit will produce a result every 20 ns (5+8+4+3) because it goes through all of the steps, and only one operation is in flight at a time.

Since the steps are indivisible, Step B (the longest step) will become the critical path with pipelining. Because the clock period can be no less than 8 ns (Step B's latency), the greatest possible clock rate speedup from pipelining is (original clock period) / (new clock period) = 20 ns / 8 ns = 2.5x.
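The arithmetic can be checked with a short sketch (the step latencies of 5, 8, 4, and 3 ns are taken from the answer above; register delay, setup, and hold are ignored, as the question allows):

# Python sketch: clock rate speedup from pipelining indivisible steps
step_ns = [5, 8, 4, 3]                 # latencies of Steps A-D in ns
unpipelined_period = sum(step_ns)      # 20 ns: one result per full pass
pipelined_period = max(step_ns)        # 8 ns: limited by the longest step (B)
print(unpipelined_period / pipelined_period)   # 2.5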

B)   For maximizing the clock rate, what is the minimum number of pipeline registers you would use? Where would you insert the registers (draw or describe) into the datapath provided for this functional unit? Why not use fewer or more pipeline stages?

Since the steps are indivisible, it is natural to consider placing registers between the steps. From part A, we know the greatest clock rate comes from an 8 ns clock period.

If the clock period is 8 ns and cannot be made any smaller, splitting up Steps C and D provides no benefit, since the two of them combined (7 ns) still fit within the clock period. Adding a register between C and D would increase the latency of a result (from 3 cycles to 4) for no gain in throughput. So the minimum is two pipeline registers: one between Steps A and B and one between Steps B and C, with C and D together forming the final stage.
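As a quick check of the latency/throughput trade-off (same assumed step latencies as above):

# Python sketch: 3 stages (A | B | C+D) vs. 4 stages (A | B | C | D) at an 8 ns clock
period_ns = 8                     # set by Step B in either arrangement
latency_3_stage = 3 * period_ns   # 24 ns per result, 2 pipeline registers
latency_4_stage = 4 * period_ns   # 32 ns per result, 3 pipeline registers
print(latency_3_stage, latency_4_stage)   # throughput is one result per 8 ns either way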

 

Q2 (12 pts)

For the following questions, it will be helpful to draw the pipeline progression, such as F D X M W (which stand for the different pipeline stages), for each instruction and trace in which clock cycle the computation for an instruction actually takes place. That way, you will know the correct values of the operands in the computation.

A)   (4 pts) Assume that x11 is initialized to 11 and x12 is initialized to 22. Suppose you executed the code below on a version of the pipeline that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). Also, the register file does not have the feature of reading and writing in the same clock cycle. That is, when a read and a write to the same register occur in the same clock cycle, the read returns the old value. E.g., if the old value of x1 = 10 and we are both writing the new value 20 to x1 and also reading x1 in the same clock cycle, the output value from the read would be 10, NOT 20. In this context, what would the final values of registers x13 and x14 be?

addi x11, x12, 5

add x13, x11, x12

addi x14, x11, 15

Based on the clock-cycle pipeline diagram below, the final values are: x13 = 33, x14 = 26.
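A sketch of that diagram (single-issue, one stage per cycle, no stalls inserted):

Cycle:              1  2  3  4  5  6  7
addi x11, x12, 5    F  D  X  M  W
add  x13, x11, x12     F  D  X  M  W
addi x14, x11, 15         F  D  X  M  W

The add reads x11 in its D stage in cycle 3 and the second addi reads x11 in cycle 4, both before the first addi writes x11 = 27 back in cycle 5. Both therefore see the old x11 = 11, giving x13 = 11 + 22 = 33 and x14 = 11 + 15 = 26.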

 

B)   (4 pts) Assume that x11 is initialized to 11 and x12 is initialized to 22. Suppose you executed the code below on a version of the pipeline that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). What would the final value of register x15 be? Assume the register file is written at the beginning of the cycle and read at the end of a cycle. Therefore, an ID stage will return the result of a WB stage occurring during the same cycle.

addi x11, x12, 5

add x13, x11, x12

addi x14, x11, 15

add x15, x11, x11

Based on the clock-cycle pipeline diagram below, the final value is: x15 = 54.
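A sketch of that diagram (single-issue, no stalls inserted):

Cycle:              1  2  3  4  5  6  7  8
addi x11, x12, 5    F  D  X  M  W
add  x13, x11, x12     F  D  X  M  W
addi x14, x11, 15         F  D  X  M  W
add  x15, x11, x11           F  D  X  M  W

The final add reads x11 in its D stage in cycle 5, the same cycle in which the first addi writes x11 = 27 back. Because the register file is written at the beginning of the cycle and read at the end, the read returns the new value, so x15 = 27 + 27 = 54.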

 

C)   (4 pts) Add the minimum number of NOP instructions to the code below so that it will run correctly on a pipeline that does not handle data hazards. You may assume that the register file operates as described in 2.B above.

Make sure it is the minimum number of NOPs (it might be 0 between some instructions)! For example, without even looking at the code below, one can state with confidence that adding 10 NOPs between each pair of instructions will solve all data hazards! The problem is that you would then be heavily penalizing your energy expenditure and throughput by executing these excess NOPs.

addi x11, x12, 5

add x13, x11, x12

addi x14, x11, 15

add x15, x13, x12
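Working through the timing with the register-file behavior from 2.B (a value written in WB can be read by a decode in the same cycle), the minimum is 3 NOPs; one valid placement is:

addi x11, x12, 5
nop
nop
add x13, x11, x12
addi x14, x11, 15
nop
add x15, x13, x12

The two NOPs delay add x13 so that its decode (cycle 5) coincides with the addi's write back of x11; addi x14 then decodes in cycle 6 and needs no NOPs; the single NOP before add x15 delays its decode to cycle 8, the cycle in which add x13 writes x13 back.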

 

Q3 (10 pts)

We will be working with the code snippet below for this problem as it passes through a 5-stage (F D X M W) processor that resolves branches in the decode stage.

I1 ld x3, 8(x1)
I2 add x4, x3, x2
I3 add x5, x5, x4
I4 ld x1, 0(x1)
I5 bne x5, x1, exit

A)   (5 pts) For the code snippet, list all possible types of data hazards (RAW, WAR, WAW). E.g. “WAR on x2 between I1-I3”.

RAW on x3 from I1 -> I2

RAW on x4 from I2 -> I3

RAW on x5 from I3 -> I5

RAW on x1 from I4 -> I5

WAR on x1 from I1 -> I4

B)   (5 pts) Assume the processor has full forwarding paths (including half-cycle write back) as well as stall capability in the event of general data hazard detection not resolved by forwarding. Fill in the diagram to show how the code progresses through the pipeline:

 

The entire breakdown of the pipeline is shown in the multiple clock cycle diagram below. From that diagram, the stage occupied by each instruction in each clock cycle is then mapped onto the table provided with this question.

 

Q4 (10 pts)

A)    (7 pts) Assume the processor has full forwarding paths (including half-cycle write back), has stall capability based on non-forwardable data hazard detection, and resolves branches in the decode stage.

Also assume that if the decode stage does not yet have the data available to determine branch taken/not taken, it will stall the branch instruction at that stage until the data becomes available from the previous instruction. Assume the branch is taken. The table already correctly shows the next instruction (addi) after the beq instruction. Fill in the table to show how the pipeline progresses:

 

The entire breakdown of the pipeline is shown in the multiple clock cycle diagram below. From that diagram, the stage occupied by each instruction in each clock cycle is then mapped onto the table provided with this question.

 

B)   (3 pts) You are considering removing the forwarding paths from your processor because they are too expensive. How will removing the forwarding paths qualitatively affect the 3 terms in the CPU Performance Iron Law equation? For each term below, explain why it will increase, stay the same, or decrease. If it depends on other factors, explain those too. You can assume your processor is pipelined with full forwarding (initially) and it is executing a generic workload.
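For reference, the Iron Law is: CPU time = (instruction count) × (cycles per instruction) × (clock period). Removing the forwarding paths can only change performance through these three terms.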

(i) Instruction count

The instruction count will not be affected by the existence or absence of forwarding paths. The forwarding paths affect the number of stalls for data hazards, but that does not require changes to the program.

(ii) CPI

The removal of forwarding makes it likely that CPI will increase due to data hazards, which will now require stalls or NOPs to be handled. A smart compiler could reschedule instructions to reduce these stalls, but it cannot remove all of them.

(iii) Clock period

If the forwarding paths are on the critical path (the pipeline stage with the maximum latency, which constrains the clock period), removing them will decrease the clock period. If the forwarding paths are not on the critical path, removing them will not change the clock period.