CS 61C Spring 2023 Final
Q1 RISC-y Array Architecture (10 points)
Writing code to access integer arrays can be really annoying in RISC-V! Suppose we come up with new instructions, readArr to read from integer arrays and writeArr to write to integer arrays. For this question, you may assume integers are 32 bits.
readArr rd, rs1, rs2 will read the array that rs1 points to at the index stored in rs2, and put that value in register rd. In C pseudocode: rd = ((int *) rs1)[rs2].
Q1.1 (3.5 points) What changes would we need to make to our datapath in order for us to implement the readArr instruction with as few changes as possible? Select all that apply.
Add a new immediate type for ImmGen
Add a new output to Regfile for a third register value
Add a new input to the AMux and update any relevant selector/control logic
Add a new input to the BMux and update any relevant selector/control logic
Add a new ALU operation and update any relevant selector/control logic
Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic
Add a new input to WBMux and update any relevant selector/control logic
None of the above
writeArr rs3, rs1, rs2 will take the value in register rs3, and write that value to the array that rs1 points to at index rs2. In C pseudocode: ((int *) rs1)[rs2] = rs3.
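The semantics of both instructions can be sketched in plain C (hypothetical helper names; registers are modeled as ordinary C values):

```c
#include <assert.h>
#include <stdint.h>

/* readArr rd, rs1, rs2  =>  rd = ((int *) rs1)[rs2] */
int32_t read_arr(int32_t *rs1, int32_t rs2) {
    return rs1[rs2];
}

/* writeArr rs3, rs1, rs2  =>  ((int *) rs1)[rs2] = rs3 */
void write_arr(int32_t rs3, int32_t *rs1, int32_t rs2) {
    rs1[rs2] = rs3;
}
```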
Q1.2 (3.5 points) Assume that the changes, if any, for readArr have not been implemented for this subpart. What changes would we need to make to our datapath in order for us to implement the writeArr instruction with as few changes as possible? Select all that apply.
Add a new immediate type for ImmGen
Add a new output to Regfile for a third register value
Add a new input to the AMux and update any relevant selector/control logic
Add a new input to the BMux and update any relevant selector/control logic
Add a new ALU operation and update any relevant selector/control logic
Add a new DMEM mux which feeds into the data input of the DMEM, and any relevant selector/control logic
Add a new input to WBMux and update any relevant selector/control logic
None of the above
Q1.3 (3 points) Eddy noticed that the structure of writeArr is similar to an R-type instruction. However, when he tried to use the control signals for an R-type instruction, it didn’t work. Which of the following control signals does he need to change to correctly implement writeArr? Select all that apply.
PCSel
ASel
BSel
RegWEn
MemRW
None of the above
Q2 IF Only ID Pipelined Better (10 points)
In Project 3, we implemented a RISC-V CPU with two stages; stage 1 included IF and stage 2 included ID/EX/MEM/WB. For this question, imagine instead that we implement a two-stage pipeline with a different split; stage 1 will include IF/ID and stage 2 will include EX/MEM/WB (IF/ID/EX/MEM/WB are defined equivalently to the pipelined CPU on the reference card).
For Q2.1 and Q2.2, assume the following delays for each component. Any component not listed is assumed to have a negligible delay.
Component | Delay
τclk-to-q | 35ps
τsetup | 20ps
Mux | 75ps
Regfile Setup | 20ps
Regfile Read | 175ps
Immediate Generator | 150ps
Branch Comparator | 200ps
ALU | 200ps
Memory Read | 300ps
Q2.1 (3 points) What is the minimum clock period of this circuit, in picoseconds, to achieve correct behavior?
Q2.2 (2 points) Which component in stage 2 can we move to stage 1 to decrease the minimum clock period of this circuit the most, while maintaining the same behavior? If a decrease is not possible, write “Not Possible”.
For the remainder of this question, assume that the changes made in Q2.2, if any, have not been implemented.
Q2.3 (3.5 points) In the CPU, which of the following values must have a pipeline register? Select all that apply.
Instruction
Program Counter
RegReadData1
RegReadData2
Immediate
ALUOut
MemReadData
None of the above
Q2.4 (1.5 points) Assume that the pipeline has been correctly implemented. Which types of hazards could a program experience? Assume that you cannot read from and write to the Regfile in the same clock cycle.
Control
Data
Structural
None
Q3 Faster Than Average (23 points)
In this question, you will parallelize a function to compute the average of all values in a matrix. Below is a single-threaded implementation of this function.
Using the SIMD operations provided, optimize matrix_average. You have access to the following SIMD operations. A vector is a 256-bit vector register capable of holding 4 doubles.
• vector vec_load(double* A): Loads 4 doubles at memory address A into a vector
• void vec_store(double* A, vector B): Stores the 4 doubles in vector B at memory address A
• vector vec_set0(): Puts all 0s into a vector
• double vec_sum(vector A): Adds all elements of the vector together: return A[0] + A[1] + A[2] + A[3]
• vector vec_add(vector A, vector B): Adds A and B together elementwise
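As a point of reference, here is one possible vectorized matrix_average, written with plain-C stand-ins for the exam's SIMD primitives so the sketch is self-contained and runnable (on real hardware these would map to 256-bit intrinsics); the function and helper definitions are assumptions, not the official solution:

```c
#include <assert.h>

/* Plain-C stand-ins for the exam's SIMD primitives. */
typedef struct { double d[4]; } vector;

vector vec_load(double *A) {
    vector v;
    for (int i = 0; i < 4; i++) v.d[i] = A[i];
    return v;
}
vector vec_set0(void) { vector v = {{0, 0, 0, 0}}; return v; }
vector vec_add(vector a, vector b) {
    vector v;
    for (int i = 0; i < 4; i++) v.d[i] = a.d[i] + b.d[i];
    return v;
}
double vec_sum(vector a) { return a.d[0] + a.d[1] + a.d[2] + a.d[3]; }

/* Accumulate 4 doubles per iteration over the flattened matrix,
 * then finish any leftover elements with a scalar tail loop. */
double matrix_average(double *A, int num_rows, int num_cols) {
    int n = num_rows * num_cols;
    vector acc = vec_set0();
    int i;
    for (i = 0; i + 4 <= n; i += 4)
        acc = vec_add(acc, vec_load(A + i));
    double total = vec_sum(acc);
    for (; i < n; i++)          /* tail elements */
        total += A[i];
    return total / n;
}
```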
Parallelize matrix_average using OpenMP without using #pragma omp parallel for or reduction. Each thread should work on one or more rows of the matrix. Assume num_rows is a multiple of num_threads.
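One way to satisfy these constraints is to split rows across threads by thread ID inside a bare #pragma omp parallel region, accumulate per-thread partial sums, and combine them under a critical section. This is a sketch, not the official solution; the serial fallback lets it compile without OpenMP:

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Each thread handles a contiguous band of rows (num_rows is assumed
 * to be a multiple of the thread count), keeps a private partial sum,
 * and merges it into the shared total under a critical section. */
double matrix_average(double *A, int num_rows, int num_cols) {
    double total = 0.0;
#pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
#else
        int tid = 0, nthreads = 1;   /* serial fallback */
#endif
        int rows_per_thread = num_rows / nthreads;
        double local = 0.0;
        for (int i = tid * rows_per_thread; i < (tid + 1) * rows_per_thread; i++)
            for (int j = 0; j < num_cols; j++)
                local += A[i * num_cols + j];
#pragma omp critical
        total += local;
    }
    return total / (num_rows * num_cols);
}
```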
Q4 Convoluted Caching (13 points)
Consider the following function that takes in two integer arrays, a (of length a_len) and b (of length b_len), and returns the 1D convolution of a and b. Assume results is properly allocated.
Let a=0x1000, b=0x2000, results=0x3030, a_len=4, and b_len=2.
For Q4.1 and Q4.2, we have a single-level, direct-mapped 64B cache with 16B blocks and 16-bit addresses.
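For this configuration, 16B blocks give log2(16) = 4 offset bits, 64B/16B = 4 sets give 2 index bits, and the remaining 16 − 4 − 2 = 10 bits form the tag. A minimal field-extraction sketch (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* 64 B direct-mapped cache, 16 B blocks, 16-bit addresses:
 * offset = 4 bits, index = 2 bits, tag = remaining 10 bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  2

uint16_t get_offset(uint16_t addr) { return addr & ((1 << OFFSET_BITS) - 1); }
uint16_t get_index(uint16_t addr)  { return (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1); }
uint16_t get_tag(uint16_t addr)    { return addr >> (OFFSET_BITS + INDEX_BITS); }
```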
Q4.1 (3 points) What are the tag, index, and offset bits of the address 0x3037?
Q4.2 (2.5 points) What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.
Q4.3 (2.5 points) We change to a 2-way set associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.
Q4.4 (2.5 points) We change to a fully associative cache of the same size with an LRU replacement policy. What is the overall hit rate for a call to convolve_1d? No need to simplify the fraction.
Q4.5 (2.5 points) We discover that accessing physical memory will take 400 cycles, so we decide to add an L2 cache. The hit rate of the L1 cache is 75%, and the hit rate of the L2 cache is 99%. With an access time of 6 cycles to fetch from the L1 cache, and an access time of 36 cycles to fetch from the L2 cache, what would our memory access time be for this system, on average?
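The calculation follows the standard two-level AMAT formula: L1 access time, plus the L1 miss rate times (L2 access time plus the L2 miss rate times memory access time). A sketch with a hypothetical function name:

```c
#include <assert.h>

/* AMAT = L1_time + L1_miss_rate * (L2_time + L2_miss_rate * mem_time) */
double amat(double l1_time, double l1_hit,
            double l2_time, double l2_hit, double mem_time) {
    return l1_time + (1.0 - l1_hit) * (l2_time + (1.0 - l2_hit) * mem_time);
}
```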
Q5 The Lookup Box (10 points)
Consider a system with a 32-bit virtual address space, 256B pages, and 16 MiB of DRAM as main memory. Provided below is the TLB and a portion of the page table. The TLB is fully associative and there is no data cache. The next free physical pages in main memory start at physical addresses 0x61DE00 and 0x61EF00, respectively.
Each PTE is 32 bits. Bit 31 is the valid bit, bit 30 is the dirty bit, bits 16 through 29 hold other metadata (not relevant for this question), and bits 0 through 15 hold the PPN.
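With 256B pages, the page offset is log2(256) = 8 bits, leaving a 24-bit VPN in a 32-bit virtual address. Address splitting and the PTE layout just described can be sketched as follows (hypothetical helper names):

```c
#include <assert.h>
#include <stdint.h>

/* 256 B pages => 8-bit page offset, 24-bit VPN. */
uint32_t vpn(uint32_t vaddr)         { return vaddr >> 8; }
uint32_t page_offset(uint32_t vaddr) { return vaddr & 0xFF; }

/* PTE layout: bit 31 = valid, bit 30 = dirty,
 * bits 29-16 = metadata, bits 15-0 = PPN. */
int pte_valid(uint32_t pte)   { return (pte >> 31) & 1; }
int pte_dirty(uint32_t pte)   { return (pte >> 30) & 1; }
uint32_t pte_ppn(uint32_t pte) { return pte & 0xFFFF; }
```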
Initial TLB State:
Tag (VPN) | PPN | Valid | Dirty
0x000000 | 0x23EF | 1 | 0
0x000001 | 0xFFFF | 0 | 0
Page Table:
Index | PTE
0x0 | 0x80AB23EF
0x1 | 0x80EE00C0
0x2 | 0x8123200A
0x3 | 0x3561CBA8
... | ...
0xA | 0xCAFFEEE0
For each question, determine what the memory address access results in, and calculate its physical address. Note that each memory access is executed in sequence, so they are not independent of each other.
Q5.1 (2.5 points) Virtual Address: 0x000000FF
TLB hit
TLB miss, no page fault
TLB miss, page fault
Q5.2 (2.5 points) Virtual Address: 0x00000283
TLB hit
TLB miss, no page fault
TLB miss, page fault
Q5.3 (2.5 points) Virtual Address: 0x00000AAA
TLB hit
TLB miss, no page fault
TLB miss, page fault
Q5.4 (2.5 points) Virtual Address: 0x00000360
TLB hit
TLB miss, no page fault
TLB miss, page fault
Q6 Cumulative: Potpourri (5 points)
Q6.1 (1 point) The OS running on a cluster of computers in a datacenter allows a single machine to read and write the memory and local disk of a remote machine in the same rack or array.
According to lecture, which of the following are true? Select all that apply.
(“farther away” means that the distance the data travels increases in steps, first to our local machine, then to a machine in our same rack, then to a machine in our same array.)
As our CPU sends data to DRAM farther away, bandwidth increases
As our CPU sends data to Disk farther away, bandwidth increases
As our CPU sends data to DRAM farther away, latency increases
As our CPU sends data to Disk farther away, latency increases
We have higher latency to DRAM on an Array computer than to our own disk
We have higher bandwidth to DRAM on an Array computer than to our own disk
None of the above
Q6.2 (1 point) After a single machine M finishes its assigned portion of a map task in a MapReduce cluster, which of the following can happen immediately, regardless of the overall program state? Select all that apply.
M can shut down without risking the success of the overall computation
M can be assigned a new map task
Data shuffling of the workload M just finished can begin
The reduce task for the workload M just finished can begin
None of the above
Q6.3 (1 point) We want to send one bit using a Hamming error correcting code. What are the valid bit patterns you could send that correspond to 0b0 and 0b1?
0b0: 0b1:
Your network card just received a packet with an incorrect checksum.
Q6.4 (1 point) According to lecture, which of the following is true?
There was a guaranteed error in the payload but not the checksum
There was a guaranteed error in the checksum but not the payload
There was a guaranteed error in either the payload or checksum
None of the above
Q6.5 (1 point) According to lecture, what should you do?
Send back a traditional data packet with information about which packet had the problem
Send back an “ACK”
Send back a “NO-ACK”
Send back nothing and delete the packet