
Homework 5 CSE120

Q1(10 pts)

In this exercise, assume that the breakdown of the dynamic instruction count of a program into various instruction categories is as follows:

Branch instructions: 25%; all other (non-branch) instructions: 75%

Also, assume the following branch predictor accuracies:

Always-taken: 45%; always-not-taken: 55%; 2-bit: 85%

Assumptions: If the prediction is correct, no stall cycles are incurred. If the prediction is incorrect, 1 stall cycle is incurred, for all three branch predictor schemes mentioned above. Note: here you do not need to know the pipeline stage at which the branch is resolved (ID, EX, or MEM), since the number of penalty cycles is already provided.

Also, there are no data hazards in the program. CPI=1 for all instructions where no extra stall cycle is incurred.

A)   What is the overall CPI accounting for mispredicted branches with the always-taken predictor? (2.5 pts)

For the 75% non-branch instructions, CPI = 1. Of the 25% branch instructions, 45% are predicted correctly and incur no penalty; the remaining 55% incur a 1-cycle penalty.

CPI= 0.75*1 + 0.25*0.45*1 + 0.25*0.55*(1 + 1) = 1.1375

B)   What is the overall CPI accounting for mispredicted branches with the always-not-taken predictor? (2.5 pts)

For the 75% non-branch instructions, CPI = 1. Of the 25% branch instructions, 55% are predicted correctly and incur no penalty; the remaining 45% incur a 1-cycle penalty.

CPI=0.75*1 + 0.25*0.55*1 + 0.25*0.45*(1 + 1) = 1.1125

C)   What is the overall CPI accounting for mispredicted branches with the 2-bit predictor? (2.5 pts)

For the 75% non-branch instructions, CPI = 1. Of the 25% branch instructions, 85% are predicted correctly and incur no penalty; the remaining 15% incur a 1-cycle penalty.

0.75*1 + 0.25*0.85*1 + 0.25*0.15*(1 + 1) = 1.0375
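As a quick check, the three CPI calculations above can be reproduced with a short Python sketch. The function name `overall_cpi` is ours, not from the course; the inputs (25% branch fraction, 1-cycle penalty) come from the problem statement.

```python
# Overall CPI given a branch predictor's accuracy.
# Assumed inputs from the problem: 25% of instructions are branches,
# base CPI = 1, and a mispredicted branch costs 1 extra stall cycle.
branch_frac = 0.25
penalty = 1

def overall_cpi(accuracy):
    """Weighted sum: non-branches at CPI 1, correct branches at CPI 1,
    mispredicted branches at CPI (1 + penalty)."""
    non_branch = (1 - branch_frac) * 1
    correct = branch_frac * accuracy * 1
    wrong = branch_frac * (1 - accuracy) * (1 + penalty)
    return non_branch + correct + wrong

print(overall_cpi(0.45))  # always-taken:      1.1375
print(overall_cpi(0.55))  # always-not-taken:  1.1125
print(overall_cpi(0.85))  # 2-bit:             1.0375
```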

D)   With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions such that each replaced branch becomes two ALU instructions? Assume that correctly and incorrectly predicted branch instructions have the same chance of being replaced. (2.5 pts)

Replacing half of the branch instructions with ALU instructions reduces the number of branches from 25 to 12.5 per 100 original instructions, while adding 2 × 12.5 = 25 ALU instructions. Because correctly and incorrectly predicted branches are replaced equally, the misprediction rate remains 15%.

New branch fraction = 12.5 / (100 + 12.5) = 12.5 / 112.5 ≈ 0.1111

CPI = 0.8889*1 + 0.1111*0.85*1 + 0.1111*0.15*(1 + 1) ≈ 1.0167

Speedup = Original_CPI / New_CPI = 1.0375 / 1.0167 ≈ 1.0205
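The part-D arithmetic can be checked the same way. This sketch assumes a batch of 100 original instructions (so 25 branches), and, following the solution's approach, takes the speedup as the ratio of the two CPIs; the variable names are ours.

```python
# Part D: half the branches are each replaced by two ALU instructions.
old_cpi = 1.0375                       # 2-bit predictor CPI from part C
replaced = 25 / 2                      # 12.5 branches removed per 100 instructions
new_total = 100 - replaced + 2 * replaced  # 112.5 instructions after replacement
branch_frac = replaced / new_total     # 12.5 / 112.5 ~ 0.1111
miss = 0.15                            # 2-bit predictor misprediction rate

new_cpi = (1 - branch_frac) + branch_frac * (1 - miss) + branch_frac * miss * 2
speedup = old_cpi / new_cpi            # CPI ratio, as in the solution above
print(round(new_cpi, 4), round(speedup, 4))  # 1.0167 1.0205
```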

Q2 (10 pts)

This exercise examines the accuracy of various branch predictors for the following repeating pattern (e.g., in a loop) of branch outcomes: T, NT, T, T, NT. (T=branch taken, NT=branch not taken)

A)   What is the accuracy ("score card") of the always-taken and always-not-taken predictors for this sequence of branch outcomes, provided we encounter this branch outcome sequence only once? (2 pts)

In one pass of this branch sequence, the branch is taken 3 out of 5 times. Therefore the score card for the always-taken predictor is 3/5, and for the always-not-taken predictor it is 2/5.

B)   What is the accuracy ("score card" in lec 15-16) of the 2-bit predictor for this sequence of branch outcomes, provided we encounter this branch outcome sequence only once, assuming that the predictor starts off in the "Strongly Predict Not Taken" state ("SNT")? (3 pts)

Following the acronym convention from slide 12 of lec16_postZoom.pdf, we build the following table for the first pass of the branch sequence.

Branch Prediction | Actual Branch Behavior | Prediction Accuracy
SNT (predict NT)  | T                      | Incorrect
WNT (predict NT)  | NT                     | Correct
SNT (predict NT)  | T                      | Incorrect
WNT (predict NT)  | T                      | Incorrect
WT (predict T)    | NT                     | Incorrect

Thus, the predictor score card is 1/5 for the first pass.

C)   What is the accuracy ("score card" in lec 15-16) of the 2-bit predictor if this pattern is repeated forever, i.e. T, NT, T, T, NT, T, NT, T, T, NT, ... ad infinitum? See the end slide of lec16_postZoom.pdf for hints.

(5 pts)

In part B, the predictor ended in state "Weakly Predict Taken" (WT), which did not match the actual branch behavior NT. Thus, the first Branch Prediction entry for the second pass of the branch sequence is "Weakly Predict Not Taken" (WNT); refer to the state diagram. We fill in the table for the second pass accordingly.

Branch Prediction | Actual Branch Behavior | Prediction Accuracy
WNT (predict NT)  | T                      | Incorrect
WT (predict T)    | NT                     | Incorrect
WNT (predict NT)  | T                      | Incorrect
WT (predict T)    | T                      | Correct
ST (predict T)    | NT                     | Incorrect

Thus, the predictor score card is again 1/5 for the second pass. The predictor ended in state "Strongly Predict Taken" (ST), which did not match the actual branch behavior NT, so the first Branch Prediction entry for the third pass of the branch sequence is "Weakly Predict Taken" (WT). We fill in the table for the third pass accordingly.

Branch Prediction | Actual Branch Behavior | Prediction Accuracy
WT (predict T)    | T                      | Correct
ST (predict T)    | NT                     | Incorrect
WT (predict T)    | T                      | Correct
ST (predict T)    | T                      | Correct
ST (predict T)    | NT                     | Incorrect

Thus, the predictor score card is 3/5 for the third pass. The predictor again ended in state "Strongly Predict Taken" (ST), which did not match the actual branch behavior NT, so the first Branch Prediction entry for the fourth pass is "Weakly Predict Taken" (WT). This is identical to the first Branch Prediction entry of the third pass, so every subsequent pass repeats the third-pass table exactly and yields the same score card: 3/5.

Thus, if we run this branch sequence n times (n ≥ 2), the overall score card is:

Total correct predictions / total predictions = [1 + 1 + (n − 2) × 3] / (5n) = (3n − 4) / (5n)

Taking the limit as n goes to infinity, the score card approaches 3/5.
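The per-pass scores can be verified by simulating the 2-bit saturating counter directly. This is a sketch assuming the standard state encoding 0 = SNT, 1 = WNT, 2 = WT, 3 = ST (predict taken when the counter is 2 or more); the function name `simulate` is ours.

```python
# Simulate a 2-bit saturating-counter predictor on a repeating outcome pattern.
def simulate(pattern, passes, state=0):
    """Return the number of correct predictions in each pass.
    state starts at 0 (Strongly Not Taken) by default."""
    per_pass = []
    for _ in range(passes):
        correct = 0
        for taken in pattern:
            predict_taken = state >= 2          # WT or ST predict taken
            if predict_taken == taken:
                correct += 1
            # Saturating update: increment on taken, decrement on not taken.
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        per_pass.append(correct)
    return per_pass

pattern = [True, False, True, True, False]      # T, NT, T, T, NT
print(simulate(pattern, 4))                     # [1, 1, 3, 3]
```

Passes 1 and 2 score 1/5, and every pass from the third onward scores 3/5, matching the tables above.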

Q3 (10 pts)

For a direct-mapped cache design with a 64-bit address and byte-addressable memory, the following bits of the address are used to access the cache:

Tag      Index     Offset

a.  63-9       8-5        4-0

b.  63-12    11-6      5-0

For each configuration (a and b):

What is the block size (in bytes)?

(5 pts)

How many entries (blocks) does the cache have?

(5 pts)

a.    Offset = 5 bits => block size = 2^5 bytes = 32 B;    Index = 4 bits => #cache entries = 2^4 = 16 blocks

b.    Offset = 6 bits => block size = 2^6 bytes = 64 B;    Index = 6 bits => #cache entries = 2^6 = 64 blocks
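The two answers follow mechanically from the field widths; a tiny sketch (the helper name `cache_params` is ours):

```python
# Block size and entry count of a direct-mapped cache from its address fields.
def cache_params(offset_bits, index_bits):
    """Return (block size in bytes, number of cache entries)."""
    return 2 ** offset_bits, 2 ** index_bits

print(cache_params(5, 4))  # config a: (32, 16)
print(cache_params(6, 6))  # config b: (64, 64)
```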

Q4 (16 pts)

In lec 20-21, we went through an example of filling in a set-associative cache. However, the addresses were block addresses instead of byte addresses (we are used to dealing with byte addresses in RISC-V memory references), so there was no need for offset bits in the class example. The following exercise gives a more practical example of memory references in a set-associative cache. However, as in the class example, only the tag and set field values are sufficient to verify that the memory reference exists in the cache.

Suppose you have a 2-way set-associative cache that is 8 KiB, with a 16-byte block size. Assume we implement an additional MRU bit to indicate which of the 2 blocks/ways in a set was most recently referenced. You send a sequence of memory references to the cache for reads; each is a 64-bit byte-addressable memory address in hexadecimal: 0x1000, 0x1004, 0x1010, 0x11c0, 0x2000, 0x21c0, 0x2006

1.    For each of the byte addressable memory references, fill in the corresponding set index number for lookup, the tag number, the byte offset and whether the reference is a hit or a miss in the table below. You may enter the values in hexadecimal.

Hint: Because of the particular widths of the offset and set index fields in this exercise, you do not necessarily have to convert the addresses to binary to extract the relevant byte offset or set index number. Less work for you!

(14 pts)

Offset: 4 bits (for a 16 B block size)

Index: 8 bits (8 KiB / (2 ways × 16 B) = 256 sets)

Tag: 64 − 8 − 4 = 52 bits. Thus the offset is simply the least significant hex digit, the set index is the next two hex digits, and the tag is whatever hex digits remain.

Memory address | Set index | Tag value | Offset value | Hit/Miss? (M/H)
0x1000         | 00        | 1         | 0            | M
0x1004         | 00        | 1         | 4            | H
0x1010         | 01        | 1         | 0            | M
0x11c0         | 1c        | 1         | 0            | M
0x2000         | 00        | 2         | 0            | M
0x21c0         | 1c        | 2         | 0            | M
0x2006         | 00        | 2         | 6            | H

How many times did we have to invoke the LRU replacement policy when accessing a memory reference from the cache?

(2 pts)

Hint: Think very carefully before you opt to solve this by laboriously drawing the cache diagram sequence containing the MRU bit

For each set that was accessed, note that there were no more than 2 unique tags. Thus there was always a free way available within a set each time a new block was brought in, so we needed to invoke the LRU replacement policy 0 times.
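Both the hit/miss column and the zero-eviction answer can be checked with a short simulation. This is a sketch of our own; it does not model the MRU bit, since no set ever needs a replacement (the eviction branch simply counts how often one would be required).

```python
# Simulate the 2-way set-associative cache (8 KiB, 16 B blocks, 256 sets)
# on the reference sequence from the problem.
refs = [0x1000, 0x1004, 0x1010, 0x11C0, 0x2000, 0x21C0, 0x2006]

sets = {}                # set index -> list of resident tags (at most 2 ways)
results, evictions = [], 0
for addr in refs:
    offset = addr & 0xF          # low 4 bits: byte offset
    index = (addr >> 4) & 0xFF   # next 8 bits: set index
    tag = addr >> 12             # remaining 52 bits: tag
    ways = sets.setdefault(index, [])
    if tag in ways:
        results.append("H")
    else:
        results.append("M")
        if len(ways) == 2:       # set full: an LRU eviction would be needed
            evictions += 1
            ways.pop(0)
        ways.append(tag)

print(results)    # ['M', 'H', 'M', 'M', 'M', 'M', 'H']
print(evictions)  # 0
```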