
CS 61C

Spring 2023

Final

Q1   RISC-y Array Architecture                                                                                     (10 points)

Writing code to access integer arrays can be really annoying in RISC-V! Suppose we come up with new instructions, readArr to read from integer arrays and writeArr to write to integer arrays. For this   question, you may assume integers are 32 bits.

readArr  rd,  rs1,  rs2 will read the array that rs1 points to at the index stored in rs2, and put that value in register rd. In C pseudocode: rd  =  ((int  *)  rs1)[rs2].

Q1.1  (3.5 points)  What changes would we need to make to our datapath in order for us to implement the readArr instruction with as few changes as possible? Select all that apply.

  Add a new immediate type for ImmGen

  Add a new output to Regfile for a third register value

  Add a new input to the AMux and update any relevant selector/control logic 

  Add a new input to the BMux and update any relevant selector/control logic

  Add a new ALU operation and update any relevant selector/control logic

  Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic

  Add a new input to WBMux and update any relevant selector/control logic

  None of the above

writeArr  rs3,  rs1,  rs2 will take the value in register rs3, and write that value to the array that rs1 points to at index rs2. In C pseudocode: ((int  *)  rs1)[rs2]  =  rs3.

Q1.2  (3.5 points) Assume that the changes, if any, for readArr have not been implemented for this subpart. What changes would we need to make to our datapath in order for us to implement the writeArr instruction with as few changes as possible? Select all that apply.

  Add a new immediate type for ImmGen

  Add a new output to Regfile for a third register value

  Add a new input to the AMux and update any relevant selector/control logic

  Add a new input to the BMux and update any relevant selector/control logic

  Add a new ALU operation and update any relevant selector/control logic

  Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic

  Add a new input to WBMux and update any relevant selector/control logic

  None of the above

Q1.3  (3 points)  Eddy noticed that the structure of writeArr is similar to an R-type instruction. However, when he tried to use the control signals for an R-type instruction, it didn’t work. Which of the following control signals does he need to change to correctly implement writeArr? Select all that apply.

  PCSel

  ASel

  BSel

  RegWEn

  MemRW

  None of the above

Q2   IF Only ID Pipelined Better                                                                                         (10 points)

In Project 3, we implemented a RISC-V CPU with two stages; stage 1 included IF and stage 2 included ID/EX/MEM/WB. For this question, imagine instead that we implement a two-stage pipeline with a    different split; stage 1 will include IF/ID and stage 2 will include EX/MEM/WB (IF/ID/EX/MEM/WB are defined equivalently to the pipelined CPU on the reference card).

For Q2.1 and Q2.2, assume the following delays for each component. Any component not listed is assumed to have a negligible delay.

Component             Delay
---------------------------
τclk-to-q             35ps
τsetup                20ps
Mux                   75ps
Regfile Setup         20ps
Regfile Read          175ps
Immediate Generator   150ps
Branch Comparator     200ps
ALU                   200ps
Memory Read           300ps

Q2.1  (3 points) What is the minimum clock period of this circuit, in picoseconds, to achieve correct behavior?

Q2.2  (2 points)  Which component in stage 2 can we move to stage 1 to decrease the minimum clock period of this circuit the most, while maintaining the same behavior? If a decrease is not possible, write “Not Possible”.

For the remainder of this question, assume that the changes made in Q2.2, if any, have not been implemented.

Q2.3  (3.5 points)  In the CPU, which of the following values must have a pipeline register? Select all that apply.

  Instruction

  Program Counter

  RegReadData1

  RegReadData2

  Immediate

  ALUOut

  MemReadData

  None of the above

Q2.4  (1.5 points) Assume that the pipeline has been correctly implemented. Which types of hazards could a program experience? Assume that you cannot read from and write to the Regfile in the same clock cycle.

  Control                      Data                           Structural                   None


Q3   Faster Than Average                                                                                                  (23 points)

In this question, you will parallelize a function to compute the average of all values in a matrix. Below is a single-threaded implementation of this function.

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    for (int i = 0; i < num_rows; i++) {
        for (int j = 0; j < num_cols; j++) {
            global_sum += matrix[i][j];
        }
    }
    return global_sum / (num_rows * num_cols);
}

Using the SIMD operations provided, optimize matrix_average. You have access to the following SIMD operations. A vector is a 256-bit vector register capable of holding 4 doubles.

• vector  vec_load(double*  A): Loads 4 doubles at memory address A into a vector

• void  vec_store(double*  A,  vector  B): Stores the 4 doubles in vector B at memory address A

• vector  vec_set0(): Puts all 0s into a vector

•  double  vec_sum(vector  A): Adds all elements of the vector together: return  A[0]  +  A[1] +  A[2]  +  A[3]

• vector  vec_add(vector  A,  vector  B): Adds A and B together elementwise

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    vector sum_vec = __Q3.1__;
    for (int i = 0; i < num_rows; i++) {
        for (int j = 0; j < __Q3.2__; __Q3.3__) {
            vector values = __Q3.4__;
            sum_vec = __Q3.5__;
        }
        for (int j = __Q3.6__; j < __Q3.7__; __Q3.8__) {
            global_sum += matrix[i][j];
        }
    }
    global_sum += __Q3.9__;
    return global_sum / (num_rows * num_cols);
}

Parallelize matrix_average using OpenMP without using #pragma  omp  parallel  for or reduction. Each thread should work on one or more rows of the matrix. Assume num_rows is a multiple of num_threads.

double matrix_average(double** matrix, int num_rows, int num_cols) {
    double global_sum = 0.0;
    __Q3.10__
    {
        int num_threads = omp_get_num_threads();
        int thread_num = omp_get_thread_num();
        int chunk_size = __Q3.11__;
        int start_row = __Q3.12__;
        int end_row = __Q3.13__;

        for (int i = start_row; i < end_row; i++) {
            double row_sum = 0.0;
            for (int j = 0; j < num_cols; j++) {
                row_sum += matrix[i][j];
            }
            __Q3.14__
            __Q3.15__
        }
    }
    return global_sum / (num_rows * num_cols);
}


Q4   Convoluted Caching                                                                                                 (13 points)

Consider the following function that takes in two integer arrays, a (of length a_len) and b (of length b_len), and returns the 1D convolution of a and b. Assume results is properly allocated.

Let a=0x1000, b=0x2000, results=0x3030, a_len=4, and b_len=2.

void convolve_1d(int* a, int a_len, int* b, int b_len, int* results) {
    for (int i = 0; i < a_len - b_len + 1; i++) {
        register int sum = 0;
        for (int j = 0; j < b_len; j++) {
            sum += b[j] * a[i + j];
        }
        results[i] = sum;
    }
}

For Q4.1 and Q4.2, we have a single-level, direct-mapped 64B cache with 16B blocks and 16-bit addresses.

Q4.1  (3 points)  What are the tag, index, and offset bits of the address 0x3037?

Q4.2  (2.5 points)  What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.


Q4.3  (2.5 points)  We change to a 2-way set associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.

Q4.4  (2.5 points)  We change to a fully associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.

Q4.5  (2.5 points)  We discover that accessing physical memory will take 400 cycles, so we decide to add an L2 cache. The hit rate of the L1 cache is 75%, and the hit rate of the L2 cache is 99%. With an access time of 6 cycles to fetch from the L1 cache, and an access time of 36 cycles to fetch from the L2 cache, what would our memory access time be for this system, on average?

Q5    The Lookup Box                                                                                                         (10 points)

Consider a system with a 32-bit virtual address space, 256B pages, and 16 MiB of DRAM as main memory. Provided below are the TLB and a portion of the page table. The TLB is fully associative and there is no data cache. The next two free physical pages in main memory start at physical addresses 0x61DE00 and 0x61EF00, respectively.

Each PTE is 32 bits. Bit 31 is the valid bit, bit 30 is the dirty bit, bits 16 through 29 hold other metadata (not relevant for this question), and bits 0 through 15 hold the PPN.

Initial TLB State:

Tag (VPN)   PPN      Valid   Dirty
0x000000    0x23EF   1       0
0x000001    0xFFFF   0       0

Page Table:

Index   PTE
0x0     0x80AB23EF
0x1     0x80EE00C0
0x2     0x8123200A
0x3     0x3561CBA8
...     ...
0xA     0xCAFFEEE0

For each question, determine what the memory address access results in, and calculate its physical address. Note that the memory accesses are executed in sequence, so they are not independent of each other.

Q5.1  (2.5 points)  Virtual Address: 0x000000FF

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.2  (2.5 points)  Virtual Address: 0x00000283

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.3  (2.5 points)  Virtual Address: 0x00000AAA

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q5.4  (2.5 points)  Virtual Address: 0x00000360

  TLB hit                                   TLB miss, no page fault           TLB miss, page fault

Q6   Cumulative: Potpourri                    (5 points)

Q6.1  (1 point)  The OS running on a cluster of computers in a datacenter allows a single machine to read and write the memory and local disk of a remote machine in the same rack or array.

According to lecture, which of the following are true? Select all that apply.

(“farther away” means that the distance the data travels increases in steps, first to our local machine, then to a machine in our same rack, then to a machine in our same array.)

  As our CPU sends data to DRAM farther away, bandwidth increases

  As our CPU sends data to Disk farther away, bandwidth increases

  As our CPU sends data to DRAM farther away, latency increases

  As our CPU sends data to Disk farther away, latency increases

  We have higher latency to DRAM on an Array computer than to our own disk

  We have higher bandwidth to DRAM on an Array computer than to our own disk

  None of the above

Q6.2  (1 point)  After a single machine M finishes its assigned portion of a map task in a MapReduce cluster, which of the following can happen immediately, regardless of the overall program state? Select all that apply.

  M can shut down without risking the success of the overall computation

  M can be assigned a new map task

  Data shuffling of the workload M just finished can begin

  The reduce task for the workload M just finished can begin

  None of the above

Q6.3  (1 point) We want to send one bit using a Hamming error correcting code. What are the valid bit patterns you could send that correspond to 0b0 and 0b1?

0b0:   0b1:

Your network card just received a packet with an incorrect checksum.

Q6.4  (1 point) According to lecture, which of the following is true?

  There was a guaranteed error in the payload but not the checksum

  There was a guaranteed error in the checksum but not the payload

  There was a guaranteed error in either the payload or checksum

  None of the above

Q6.5  (1 point)  According to lecture, what should you do?

  Send back a traditional data packet with information about which packet had the problem

  Send back an “ACK”

  Send back a “NO-ACK”

  Send back nothing and delete the packet