Homework 4 CSE120 solutions
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit
Homework 4 CSE120
solutions
Q1(10 pts)
You are tasked with improving the performance of a functional unit. The computation for the functional unit has 4 steps (A-D), and each step is indivisible. Assume there is no dependency between successive computations.
A) What is the greatest possible clock rate speedup possible with pipelining? You do not need to worry about the register timing constraints (e.g. delay, setup, hold). Explain your reasoning.
Without pipelining, the functional unit will produce a result every 20 ns (5+8+4+3) because it goes through all of the steps, and only one operation is in flight at a time.
Since the steps are indivisible, Step B (the longest step) will become the critical path with pipelining. Since the clock period could be no less than 8 ns (Step B’s latency), the greatest possible clock rate speedup from pipelining would be (original clock period) / (new clock period) = 20ns /8ns = 2.5x.
B) For maximizing the clock rate, what is the minimum number of pipeline registers you would use? Where would you insert the registers (draw or describe) into the datapath provided for this functional unit? Why not use fewer or more pipeline stages?
Since the steps are indivisible, it is natural to consider placing registers between the steps. From part A, we know the greatest clock rate comes from a 8ns clock.
If the clock period is 8ns, and can’t be made any smaller, splitting up Steps C & D will not provide any benefit, since the two of them combined (7 ns) would not violate the clock period. Adding a register between C & D would increase the latency of a result (from 3 cycles to 4) for no gain in throughput.
Q2 (12 pts)
For the following questions, it will be helpful for you to draw the Pipeline progression such as F D X M W (which stand for the different pipeline stages) for each instruction and trace in which clock cycle the computation for an instruction is actually taking place. That way, you will know the correct value of the operands in the computation.
A) (4 pts) Assume that x11 is initialized to 11 and x12 is initialized to 22. Suppose you executed the code below on a version of the pipeline that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). Also, the register file does not have the feature of reading and writing in the same clock cycle. That is, in the conflict of reading and writing to the same data location, the data read would be the old value. E.g., if the old value of x1=10 and we are both writing the new value 20 to x1 and also
reading x1 in the same clock cycle, the output value from read would be 10, NOT 20. In this context, what would the final values of registers x13 and x14 be?
addi x11, x12, 5
add x13, x11, x12
addi x14, x11, 15
Based on clock cycle pipeline diagram below, final values are: x13=33, x14=26
B) (4 pts) Assume that x11 is initialized to 11 and x12 is initialized to22. Suppose you executed the code below on a version of the pipeline that does not handle data hazards (i.e., the programmer is responsible for addressing data hazards by inserting NOP instructions where necessary). What would the final values of register x15 be? Assume the register file is written at the beginning of the cycle and read at the end of a cycle. Therefore, an ID stage will return the results of a WB state occurring during the same cycle.
addi x11, x12, 5
add x13, x11, 12
addi x14, x11, 15
add x15, x11,x11
Based on clock cycle pipeline diagram below, final values are: x15=54
C) (4 pts) Add the minimum number of NOP instructions to the code below so that it will run correctly on a pipeline that does not handle data hazards. You may assume that the register file operates as described in 2.B above.
Make sure it is the minimum number of NOPs (might be 0 in between some instructions)! For example, without even thinking twice and not even looking at the code content below, I can state with confidence that adding 10 NOPs in between each instruction will solve all data hazards! The problem is then you are heavily penalizing your energy expenditure and throughput in catering to these excess NOPs in your code.
addi x11, x12, 5
add x13, x11, x12
addi x14, x11, 15
add x15, x13, x12
Q3 (10 pts)
We will be working with the code snippet below for this problem as it passes through a 5 stage (F D X M W) processor. It resolves branches in the decode stage.
I1 ld x3, 8(x1) I2
add x4, x3, x2 I3
add x5, x5, x4 I4
ld x1, 0(x1)
I5 bne x5, x1, exit
A) (5 pts) For the code snippet, list all possible types of data hazards (RAW, WAR, WAW). E.g. “WAR on x2 between I1-I3”.
RAW on x3 from I1 -> I2
RAW on x4 from I2 -> I3
RAW on x5 from I3 -> I5
RAW on x1 from I4 -> I5
WAR on x1 from I1 -> I4
B) (5 pts) Assume the processor has full forwarding paths (including half-cycle write back) as well as stall capability in the event of general data hazard detection not resolved by forwarding. Fill in the diagram to show how the code progresses through the pipeline:
The entire breakdown of the pipeline is shown below in the multiple clock cycle diagram. Once this is done, the stages are mapped to the instruction in a particular clock cycle following the pipeline diagram onto the table provided with this question.
Q4 (10 pts)
A) (7 pts) Assume the processor has full forwarding paths (including half-cycle write back), has stall capability based on non-forwardable data hazard detection, and resolves branches in the decode stage.
Also assume that if the decode stage does not have the data available yet to determine branch taken/not taken, it will stall the branch instruction at that stage until the data becomes available from previous instruction. Assume the branch is taken. The table correctly shows next instruction (addi) being taken after the beq instruction. Fill in the table to show how the pipeline progresses:
The entire breakdown of the pipeline is shown below in the multiple clock cycle diagram. Once this is done, the stages are mapped to the instruction in a particular clock cycle following the pipeline diagram onto the table provided with this question.
B) (3 pts) You are considering removing the forwarding paths from your processor because they are too expensive. How will removing the forwarding paths qualitatively effect the 3 terms in the CPU Performance Iron Law equation? For each term explain term below, explain why it will increase, stay the same, or decrease. If it depends on other factors, explain those too. You can assume your processor is pipelined with full forwarding (initially) and it is executing a generic workload.
(i) Instruction count
The instruction count will not be affected by the existence of absence of forwarding paths. The forwarding paths will affect the number of stalls for data hazards, but that will not require changes to the programs.
CPI
The removal of forwarding makes it likely that CPI will increase due to data hazards. Those data hazards will require stalls or NOPs to be handled. A smart compiler could reschedule, but it will not remove all possible stalls.
Clock Period
If the forwarding paths are on the critical path (the pipeline stage with the maximum latency constraining the clock period), removing them will decrease the clock period. If the forwarding paths are not on the critical path, removing them will not change the clock period.
2022-05-30