
CS 61C

Spring 2023

Final

Q1   RISC-y Array Architecture                                                                                     (10 points)

Writing code to access integer arrays can be really annoying in RISC-V! Suppose we come up with new instructions, readArr to read from integer arrays and writeArr to write to integer arrays. For this   question, you may assume integers are 32 bits.

readArr  rd,  rs1,  rs2 will read the array that rs1 points to at the index stored in rs2, and put that value in register rd. In C pseudocode: rd  =  ((int  *)  rs1)[rs2].

Q1.1  (3.5 points)  What changes would we need to make to our datapath in order for us to implement the readArr instruction with as few changes as possible? Select all that apply.

  Add a new immediate type for ImmGen

  Add a new output to Regfile for a third register value

  Add a new input to the AMux and update any relevant selector/control logic 

  Add a new input to the BMux and update any relevant selector/control logic

  Add a new ALU operation and update any relevant selector/control logic

  Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic

  Add a new input to WBMux and update any relevant selector/control logic

  None of the above

writeArr  rs3,  rs1,  rs2 will take the value in register rs3, and write that value to the array that rs1 points to at index rs2. In C pseudocode: ((int  *)  rs1)[rs2]  =  rs3.

Q1.2  (3.5 points) Assume that the changes, if any, for readArr have not been implemented for this subpart. What changes would we need to make to our datapath in order for us to implement the writeArr instruction with as few changes as possible? Select all that apply.

  Add a new immediate type for ImmGen

  Add a new output to Regfile for a third register value

  Add a new input to the AMux and update any relevant selector/control logic

  Add a new input to the BMux and update any relevant selector/control logic

  Add a new ALU operation and update any relevant selector/control logic

  Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic

  Add a new input to WBMux and update any relevant selector/control logic

  None of the above

Q1.3  (3 points)  Eddy noticed that the structure of writeArr is similar to an R-type instruction. However, when he tried to use the control signals for an R-type instruction, it didn’t work. Which of the following control signals does he need to change to correctly implement writeArr? Select all that apply.

  PCSel

  ASel

  BSel

  RegWEn

  MemRW

  None of the above

Q2   IF Only ID Pipelined Better                                                                                         (10 points)

In Project 3, we implemented a RISC-V CPU with two stages; stage 1 included IF and stage 2 included ID/EX/MEM/WB. For this question, imagine instead that we implement a two-stage pipeline with a    different split; stage 1 will include IF/ID and stage 2 will include EX/MEM/WB (IF/ID/EX/MEM/WB are defined equivalently to the pipelined CPU on the reference card).

For Q2.1 and Q2.2, assume the following delays for each component. Any component not listed is assumed to have a negligible delay.

Component             Delay
---------------------------
τclk-to-q             35ps
τsetup                20ps
Mux                   75ps
Regfile Setup         20ps
Regfile Read          175ps
Immediate Generator   150ps
Branch Comparator     200ps
ALU                   200ps
Memory Read           300ps

Q2.1  (3 points) What is the minimum clock period of this circuit, in picoseconds, to achieve correct behavior?

Q2.2  (2 points)  Which component in stage 2 can we move to stage 1 to decrease the minimum clock period of this circuit the most, while maintaining the same behavior? If a decrease is not possible, write “Not Possible”.

For the remainder of this question, assume that the changes made in Q2.2, if any, have not been implemented.

Q2.3  (3.5 points)  In the CPU, which of the following values must have a pipeline register? Select all that apply.

  Instruction

  Program Counter

  RegReadData1

  RegReadData2

  Immediate

  ALUOut

  MemReadData

  None of the above

Q2.4  (1.5 points) Assume that the pipeline has been correctly implemented. Which types of hazards could a program experience? Assume that you cannot read from and write to the Regfile in the same clock cycle.

  Control                      Data                           Structural                   None


Q3   Faster Than Average                                                                                                  (23 points)

In this question, you will parallelize a function to compute the average of all values in a matrix. Below is a single-threaded implementation of this function.

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    for (int i = 0; i < num_rows; i++) {
        for (int j = 0; j < num_cols; j++) {
            global_sum += matrix[i][j];
        }
    }
    return global_sum / (num_rows * num_cols);
}

Using the SIMD operations provided, optimize matrix_average. You have access to the following SIMD operations. A vector is a 256-bit vector register capable of holding 4 doubles.

• vector  vec_load(double*  A): Loads 4 doubles at memory address A into a vector

• void  vec_store(double*  A,  vector  B): Stores the 4 doubles in vector B at memory address A

• vector  vec_set0(): Puts all 0s into a vector

•  double  vec_sum(vector  A): Adds all elements of the vector together: return  A[0]  +  A[1] +  A[2]  +  A[3]

• vector  vec_add(vector  A,  vector  B): Adds A and B together elementwise

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    vector sum_vec = __Q3.1__;
    for (int i = 0; i < num_rows; i++) {
        for (int j = 0; j < __Q3.2__; __Q3.3__) {
            vector values = __Q3.4__;
            sum_vec = __Q3.5__;
        }
        for (int j = __Q3.6__; j < __Q3.7__; __Q3.8__) {
            global_sum += matrix[i][j];
        }
    }
    global_sum += __Q3.9__;
    return global_sum / (num_rows * num_cols);
}

Parallelize matrix_average using OpenMP without using #pragma  omp  parallel  for or reduction. Each thread should work on one or more rows of the matrix. Assume num_rows is a multiple of num_threads.

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    __Q3.10__
    {
        int num_threads = omp_get_num_threads();
        int thread_num = omp_get_thread_num();
        int chunk_size = __Q3.11__;
        int start_row = __Q3.12__;
        int end_row = __Q3.13__;

        for (int i = start_row; i < end_row; i++) {
            double row_sum = 0.0;
            for (int j = 0; j < num_cols; j++) {
                row_sum += matrix[i][j];
            }
            __Q3.14__
            __Q3.15__
        }
    }
    return global_sum / (num_rows * num_cols);
}


Q4   Convoluted Caching                                                                                                 (13 points)

Consider the following function that takes in two integer arrays, a (of length a_len) and b (of length b_len), and returns the 1D convolution of a and b. Assume results is properly allocated.

Let a=0x1000, b=0x2000, results=0x3030, a_len=4, and b_len=2.

void convolve_1d(int* a, int a_len, int* b, int b_len, int* results) {
    for (int i = 0; i < a_len - b_len + 1; i++) {
        register int sum = 0;
        for (int j = 0; j < b_len; j++) {
            sum += b[j] * a[i + j];
        }
        results[i] = sum;
    }
}

For Q4.1 and Q4.2, we have a single-level, direct-mapped 64B cache with 16B blocks and 16-bit addresses.

Q4.1  (3 points)  What are the tag, index, and offset bits of the address 0x3037?

Q4.2  (2.5 points)  What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.


Q4.3  (2.5 points)  We change to a 2-way set associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.

Q4.4  (2.5 points)  We change to a fully associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.

Q4.5  (2.5 points)  We discover that accessing physical memory will take 400 cycles, so we decide to add an L2 cache. The hit rate of the L1 cache is 75%, and the hit rate of the L2 cache is 99%. With an access time of 6 cycles to fetch from the L1 cache, and an access time of 36 cycles to fetch from the L2 cache, what would our memory access time be for this system, on average?

Q5    The Lookup Box                                                                                                         (10 points)

Consider a system with a 32-bit virtual address space, 256B pages, and 16 MiB of DRAM as main memory. Provided below are the TLB and a portion of the page table. The TLB is fully associative and there is no data cache. The next two free physical pages in main memory start at physical addresses 0x61DE00 and 0x61EF00, respectively.

Each PTE is 32 bits. Bit 31 is the valid bit, bit 30 is the dirty bit, bits 16 through 29 hold other metadata (not relevant for this question), and bits 0 through 15 hold the PPN.

Initial TLB State:

Tag (VPN)   PPN      Valid   Dirty
0x000000    0x23EF   1       0
0x000001    0xFFFF   0       0

Page Table:

Index   PTE
0x0     0x80AB23EF
0x1     0x80EE00C0
0x2     0x8123200A
0x3     0x3561CBA8
...     ...
0xA     0xCAFFEEE0

For each question, determine what the memory address access results in, and calculate its physical address. Note that the memory accesses are executed in sequence, so they are not independent of each other.

Q5.1  (2.5 points)  Virtual Address: 0x000000FF

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.2  (2.5 points)  Virtual Address: 0x00000283

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.3  (2.5 points)  Virtual Address: 0x00000AAA

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.4  (2.5 points)  Virtual Address: 0x00000360

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q6   Cumulative: Potpourri                    (5 points)

Q6.1  (1 point)  The OS running on a cluster of computers in a datacenter allows a single machine to read and write the memory and local disk of a remote machine in the same rack or array.

According to lecture, which of the following are true? Select all that apply.

(“farther away” means that the distance the data travels increases in steps, first to our local machine, then to a machine in our same rack, then to a machine in our same array.)

  As our CPU sends data to DRAM farther away, bandwidth increases

  As our CPU sends data to Disk farther away, bandwidth increases

  As our CPU sends data to DRAM farther away, latency increases

  As our CPU sends data to Disk farther away, latency increases

  We have higher latency to DRAM on an Array computer than to our own disk

  We have higher bandwidth to DRAM on an Array computer than to our own disk

  None of the above

Q6.2  (1 point)  After a single machine M finishes its assigned portion of a map task in a MapReduce cluster, which of the following can happen immediately, regardless of the overall program state? Select all that apply.

  M can shut down without risking the success of the overall computation

  M can be assigned a new map task

  Data shuffling of the workload M just finished can begin

  The reduce task for the workload M just finished can begin

  None of the above

Q6.3  (1 point) We want to send one bit using a Hamming error correcting code. What are the valid bit patterns you could send that correspond to 0b0 and 0b1?

0b0:   0b1:

Your network card just received a packet with an incorrect checksum.

Q6.4  (1 point) According to lecture, which of the following is true?

  There was a guaranteed error in the payload but not the checksum

  There was a guaranteed error in the checksum but not the payload

  There was a guaranteed error in either the payload or checksum

  None of the above

Q6.5  (1 point)  According to lecture, what should you do?

  Send back a traditional data packet with information about which packet had the problem

  Send back an “ACK”

  Send back a “NO-ACK”

  Send back nothing and delete the packet