ECE 6913 Fall 2023
Review Problems & Solutions
1. Assume for arithmetic, load/store, and branch instructions, a processor has CPIs of 1, 12, and 5, respectively. Also assume that on a single processor a program requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store instructions, and 256 million branch instructions. Assume that each processor has a 2 GHz clock frequency.
Assume that, as the program is parallelized to run over multiple cores, the number of arithmetic and load/store instructions per processor is divided by 0.7 × p (where p is the number of processors) but the number of branch instructions per processor remains the same.
a. Find the total execution time (ET) for this program on 1, 2, 4, and 8 processors, and show the relative speedup of the 2, 4, and 8 processors result relative to the single processor result.
Instruction Count [IC]
Arithmetic IC[2] = 2.56B / [0.7 x 2] = 1.83B
Arithmetic IC[4] = 2.56B / [0.7 x 4] = 0.91B
Arithmetic IC[8] = 2.56B / [0.7 x 8] = 0.46B
Load/Store IC[2] = 1.28B / [0.7 x 2] = 0.91B
Load/Store IC[4] = 1.28B / [0.7 x 4] = 0.46B
Load/Store IC[8] = 1.28B / [0.7 x 8] = 0.23B
Branch IC[1] = Branch IC[2] = Branch IC[4] = Branch IC[8] = 0.256B
| Instruction | CPI | IC[1] | IC[2] | IC[4] | IC[8] |
|---|---|---|---|---|---|
| Arithmetic | 1 | 2.56B | 1.83B | 0.91B | 0.46B |
| Load/Store | 12 | 1.28B | 0.91B | 0.46B | 0.23B |
| Branch | 5 | 0.256B | 0.256B | 0.256B | 0.256B |
| Total Instruction Count [IC] | - | 4.096B | 2.996B | 1.626B | 0.946B |
ET (Execution Time) = IC (Instruction Count) x CPI (Cycles per Instruction) x Cycle Time, where Cycle Time = 1/[2 x 10^9 Hz] = 0.5ns

| Instruction | CPI | ET[1] | ET[2] | ET[4] | ET[8] |
|---|---|---|---|---|---|
| Arithmetic | 1 | 1 x 2.56B x 0.5ns = 1.28s | 1 x 1.83B x 0.5ns = 0.91s | 1 x 0.91B x 0.5ns = 0.46s | 1 x 0.46B x 0.5ns = 0.23s |
| Load/Store | 12 | 12 x 1.28B x 0.5ns = 7.68s | 12 x 0.91B x 0.5ns = 5.46s | 12 x 0.46B x 0.5ns = 2.76s | 12 x 0.23B x 0.5ns = 1.38s |
| Branch | 5 | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s |
| Total | - | 9.6s | 7.01s | 3.86s | 2.25s |
| Speedup vs. 1 processor | - | 1.00 | 9.6/7.01 = 1.37 | 9.6/3.86 = 2.49 | 9.6/2.25 = 4.27 |
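The part a calculation can be checked with a short script. This is a minimal sketch, not part of the original solution: the names `exec_time`, `IC_SINGLE` and `CPI` are hypothetical, and the script divides the unrounded instruction counts, so its totals can differ from the hand-rounded table by a few hundredths of a second.

```python
# Figures taken from the statement of Problem 1.
CLOCK_HZ = 2e9
IC_SINGLE = {"arith": 2.56e9, "load_store": 1.28e9, "branch": 0.256e9}
CPI = {"arith": 1, "load_store": 12, "branch": 5}


def exec_time(p, cpi=CPI):
    """Execution time in seconds on p processors (hypothetical helper)."""
    # Arithmetic and load/store counts are divided by 0.7 * p once the
    # program is parallelized; a single processor runs the full counts.
    # The branch count per processor never changes.
    scale = 1.0 if p == 1 else 0.7 * p
    cycles = (
        cpi["arith"] * IC_SINGLE["arith"] / scale
        + cpi["load_store"] * IC_SINGLE["load_store"] / scale
        + cpi["branch"] * IC_SINGLE["branch"]
    )
    return cycles / CLOCK_HZ


base = exec_time(1)
for p in (1, 2, 4, 8):
    t = exec_time(p)
    print(f"p={p}: ET={t:.2f}s, speedup={base / t:.2f}")
```

Passing a modified CPI table, for example `exec_time(4, {**CPI, "arith": 2})`, covers the doubled arithmetic CPI case.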
b. If the CPI of the arithmetic instructions was doubled, what would the impact be on the execution time of the program on 1, 2, 4, or 8 processors?
ET (Execution Time) = IC (Instruction Count) x CPI (Cycles per Instruction) x Cycle Time, where Cycle Time = 1/[2 x 10^9 Hz] = 0.5ns

| Instruction | CPI | ET[1] | ET[2] | ET[4] | ET[8] |
|---|---|---|---|---|---|
| Arithmetic | 2 | 2 x 2.56B x 0.5ns = 2.56s | 2 x 1.83B x 0.5ns = 1.83s | 2 x 0.91B x 0.5ns = 0.91s | 2 x 0.46B x 0.5ns = 0.46s |
| Load/Store | 12 | 12 x 1.28B x 0.5ns = 7.68s | 12 x 0.91B x 0.5ns = 5.46s | 12 x 0.46B x 0.5ns = 2.76s | 12 x 0.23B x 0.5ns = 1.38s |
| Branch | 5 | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s | 5 x 0.256B x 0.5ns = 0.64s |
| Total | - | 10.88s | 7.93s | 4.31s | 2.48s |
c. To what should the CPI of load/store instructions be reduced in order for a single processor to match the performance of four processors using the original CPI values?
If a single processor is to execute the same Program in 3.86 seconds (instead of 9.6 seconds), then the difference of 5.74 seconds must come from the improvement in execution time of the L/S instructions.
This improvement can be accomplished by reducing the L/S instruction CPI from 12 to CPIX.
7.68s - 5.74s = 1.94s = time now available for a single processor to process L/S instructions
CPIX x 1.28B x 0.5ns = 1.94s
So, CPIX = 1.94s / [1.28B x 0.5ns] = 1.94s / 0.64s = 3.03
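The same result can be reached by working in cycles rather than seconds. The sketch below is illustrative only; `required_ls_cpi` is a hypothetical helper, and the 3.86 s target is the four-processor time from part a.

```python
# Figures taken from the statement of Problem 1.
CLOCK_HZ = 2e9
ARITH_IC, LS_IC, BRANCH_IC = 2.56e9, 1.28e9, 0.256e9
ARITH_CPI, BRANCH_CPI = 1, 5


def required_ls_cpi(target_seconds):
    """Load/store CPI that lets one processor finish in target_seconds."""
    # Total cycle budget available at the given clock rate.
    cycle_budget = target_seconds * CLOCK_HZ
    # Cycles that arithmetic and branch instructions consume regardless.
    fixed_cycles = ARITH_CPI * ARITH_IC + BRANCH_CPI * BRANCH_IC
    # Whatever is left is shared among the load/store instructions.
    return (cycle_budget - fixed_cycles) / LS_IC


print(round(required_ls_cpi(3.86), 2))
```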
2. The results of the SPEC CPU2006 bzip2 benchmark running on an AMD Barcelona show an instruction count of 2.389E12, an execution time of 750 s, and a reference time of 9650 s.
a. Find the CPI if the clock cycle time is 0.333 ns.
IC = 2.389T, ET=750s, RefET = 9650s
CPI = (ET * clock rate)/IC = 750s x 3.00GHz / 2.389T = 0.94
b. Find the SPECratio.
SPECratio = T_ref/T = 9650s / 750s = 12.87
c. Find the increase in CPU time if the number of instructions of the benchmark is increased by 10% without affecting the CPI
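ET = CPI x IC x Tcycle, so with CPI and Tcycle unchanged ET scales linearly with IC: ET = 1.1 x 750s = 825s, an increase in ET of 10%. (Multiplying out the rounded values, 0.94 x 2.389T x 1.1 x 0.333ns, gives about 822s; the small gap is rounding error.)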
d. Find the increase in CPU time if the number of instructions of the benchmark is increased by 10% and the CPI is increased by 5%.
ET = CPI x IC x Tcycle scales with both factors: ET = 1.05 x 1.1 x 750s = 866s, an increase in ET of 15.5%
e. Find the change in the SPECratio for this change.
SPECratio = T_ref/T = 9650s / 866s = 11.14, a decrease of about 13.4% relative to part b
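Parts a through e all follow from ET = IC x CPI / clock rate, which the sketch below evaluates without intermediate rounding. It assumes that the 0.333 ns cycle time stands for a 3 GHz clock; `cpu_time` and `spec_ratio` are hypothetical helper names. Because nothing is rounded along the way, the results may differ by a few seconds from figures obtained with CPI = 0.94 and 0.333 ns.

```python
# Figures taken from the statement of Problem 2.
IC = 2.389e12      # instruction count
ET = 750.0         # measured execution time, seconds
T_REF = 9650.0     # SPEC reference time, seconds
CLOCK_HZ = 3e9     # assumption: 0.333 ns cycle time is 1 / (3 GHz)

# Part a: CPI from the measured execution time.
cpi = ET * CLOCK_HZ / IC


def cpu_time(ic_scale=1.0, cpi_scale=1.0):
    """CPU time in seconds after scaling the instruction count and the CPI."""
    return (IC * ic_scale) * (cpi * cpi_scale) / CLOCK_HZ


def spec_ratio(exec_time_s):
    """SPECratio = reference time / execution time."""
    return T_REF / exec_time_s


print(f"CPI = {cpi:.2f}")
print(f"SPECratio = {spec_ratio(ET):.2f}")
print(f"10% more instructions: {cpu_time(1.1):.1f} s")
print(f"10% more instructions, 5% higher CPI: {cpu_time(1.1, 1.05):.1f} s")
print(f"SPECratio after both changes: {spec_ratio(cpu_time(1.1, 1.05)):.2f}")
```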
f. Suppose that we are developing a new version of the AMD Barcelona processor with a 4 GHz clock rate. We have added some additional instructions to the instruction set in such a way that the number of instructions has been reduced by 15%. The execution time is reduced to 700 s and the new SPECratio is 13.7. Find the new CPI.
ETnew = 700s = CPIX x 2.389T x 0.85 x 0.250ns
So, CPIX = 700/507.66 = 1.38
g. This CPI value is larger than obtained in part a. as the clock rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the CPI is similar to that of the clock rate. If they are dissimilar, why?
CPI increased by 1.38/0.94 = 1.46
Clock rate increased by 4/3 = 1.33
Both the instruction count and the clock cycle time were lowered in the new design.
Since each instruction, on average, now accomplishes more work and each cycle is shorter, the average instruction is likely to need more cycles to complete its task, hence the increase in the average CPI of the benchmark. The CPI grew by more than the clock rate (1.46 vs. 1.33), which is why the CPU time fell by less than the 15% drop in instruction count.
h. By how much has the CPU time been reduced?
700s / 750s = 0.933 or by 6.67%
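Parts f through h can be cross-checked in a few lines. This is a sketch under the same assumption of a 3 GHz original clock; the helper name `cpi` and the constant names are hypothetical.

```python
# Figures taken from the statement of Problem 2, parts a and f.
OLD_IC, OLD_CLOCK_HZ, OLD_ET = 2.389e12, 3e9, 750.0
NEW_IC, NEW_CLOCK_HZ, NEW_ET = 0.85 * OLD_IC, 4e9, 700.0


def cpi(exec_time_s, clock_hz, instruction_count):
    """Average cycles per instruction: total cycles / instruction count."""
    return exec_time_s * clock_hz / instruction_count


old_cpi = cpi(OLD_ET, OLD_CLOCK_HZ, OLD_IC)
new_cpi = cpi(NEW_ET, NEW_CLOCK_HZ, NEW_IC)

cpi_ratio = new_cpi / old_cpi              # growth in CPI
clock_ratio = NEW_CLOCK_HZ / OLD_CLOCK_HZ  # growth in clock rate
time_reduction = 1 - NEW_ET / OLD_ET       # fractional CPU time saved

print(f"new CPI = {new_cpi:.2f}")
print(f"CPI ratio = {cpi_ratio:.2f}, clock ratio = {clock_ratio:.2f}")
print(f"CPU time reduced by {time_reduction:.2%}")
```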
3. Assume a program requires the execution of 50 × 10^6 FP instructions, 110 × 10^6 INT instructions, 80 × 10^6 L/S instructions, and 16 × 10^6 branch instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate.
a. By how much must we improve the CPI of FP instructions if we want the program to run two times faster?
Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.
= 1 × 50 × 10^6 + 1 × 110 × 10^6 + 4 × 80 × 10^6 + 2 × 16 × 10^6
= 512 × 10^6
Running two times faster means cutting the 512M clock cycles to 256M. Even if the CPI of FP instructions goes to zero, the other three instruction classes still consume 110M + 320M + 32M = 462M cycles, well above the 256M target.
So, the program cannot possibly run 2 times faster by improving the CPI of FP instructions alone.
b. By how much must we improve the CPI of L/S instructions if we want the program to run two times faster?
Number of cycles consumed by FP, INT and Branch instructions = 50M + 110M + 32M = 192M
To halve the total number of cycles (from 512M to 256M), the L/S instructions may consume at most 256M - 192M = 64M cycles.
Since the number of L/S instructions equals 80M, the L/S CPI must drop from 4 to CPIX = 64M cycles / 80M instructions = 0.8, an 80% reduction.
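The reasoning in parts a and b amounts to solving for one class's CPI given a cycle budget. In the sketch below, a negative result signals that the target cannot be reached by improving that class alone. The names `total_cycles` and `required_cpi` are hypothetical.

```python
# Figures taken from the statement of Problem 3.
COUNTS = {"fp": 50e6, "int": 110e6, "ls": 80e6, "branch": 16e6}
CPI = {"fp": 1.0, "int": 1.0, "ls": 4.0, "branch": 2.0}


def total_cycles(cpi=CPI):
    """Total clock cycles: sum of CPI x instruction count over all classes."""
    return sum(cpi[k] * COUNTS[k] for k in COUNTS)


def required_cpi(cls, speedup):
    """CPI that class `cls` alone must reach for the given overall speedup.

    A negative value means the speedup is unreachable through this class.
    """
    target = total_cycles() / speedup
    # Cycles spent in every other class stay fixed.
    others = total_cycles() - CPI[cls] * COUNTS[cls]
    return (target - others) / COUNTS[cls]


print(f"FP CPI needed for 2x:  {required_cpi('fp', 2):.2f}")
print(f"L/S CPI needed for 2x: {required_cpi('ls', 2):.2f}")
```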
c. By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch is reduced by 30%?
With the initial CPI for each instruction class:
Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.
= 1 × 50 × 10^6 + 1 × 110 × 10^6 + 4 × 80 × 10^6 + 2 × 16 × 10^6
= 512 × 10^6
With the improved CPI for FP, INT, L/S and Branch:
Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.
= 0.6 × 50 × 10^6 + 0.6 × 110 × 10^6 + 2.8 × 80 × 10^6 + 1.4 × 16 × 10^6
= 30 × 10^6 + 66 × 10^6 + 224 × 10^6 + 22.4 × 10^6
= 342.4 × 10^6
That is a speedup of 512/342.4 = 1.50; the execution time drops by 33.1% (from 0.256s to 0.171s at 2 GHz).
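Summing four products by hand is error-prone, so the totals are worth recomputing. The following is a minimal sketch with hypothetical names (`cycles`, `REDUCTION`, `improved_cpi`); it applies the stated percentage reductions to each class and compares cycle counts.

```python
# Figures taken from the statement of Problem 3.
CLOCK_HZ = 2e9
COUNTS = {"fp": 50e6, "int": 110e6, "ls": 80e6, "branch": 16e6}
BASE_CPI = {"fp": 1.0, "int": 1.0, "ls": 4.0, "branch": 2.0}
# Fractional CPI reduction per class, as given in part c.
REDUCTION = {"fp": 0.40, "int": 0.40, "ls": 0.30, "branch": 0.30}


def cycles(cpi):
    """Total clock cycles for a given per-class CPI table."""
    return sum(cpi[k] * COUNTS[k] for k in COUNTS)


improved_cpi = {k: BASE_CPI[k] * (1 - REDUCTION[k]) for k in BASE_CPI}

base_cycles = cycles(BASE_CPI)
new_cycles = cycles(improved_cpi)
speedup = base_cycles / new_cycles
time_saved = 1 - new_cycles / base_cycles

print(f"cycles: {base_cycles / 1e6:.1f}M -> {new_cycles / 1e6:.1f}M")
print(f"time: {base_cycles / CLOCK_HZ:.4f}s -> {new_cycles / CLOCK_HZ:.4f}s")
print(f"speedup = {speedup:.2f}, time reduced by {time_saved:.1%}")
```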
4. When a program is adapted to run on multiple processors in a multiprocessor system, the execution time on each processor is comprised of computing time and the overhead time required for locked critical sections and/or to send data from one processor to another. Assume a program requires t = 100 s of execution time on one processor. When run on p processors, each processor requires t/p s, as well as an additional 4 s of overhead, irrespective of the number of processors.
Compute the per-processor execution time for 2, 4, 8, 16, 32, 64, and 128 processors. For each case, list the corresponding speedup relative to a single processor and the ratio between actual speedup versus ideal speedup (speedup if there was no overhead).
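| # Processors | Exec. time/processor (s) | Time w/ overhead (s) | Speedup | Actual speedup / ideal speedup |
|---|---|---|---|---|
| 1 | 100 | - | - | - |
| 2 | 50 | 54 | 100/54 = 1.85 | 1.85/2 = 0.93 |
| 4 | 25 | 29 | 100/29 = 3.45 | 3.45/4 = 0.86 |
| 8 | 12.5 | 16.5 | 100/16.5 = 6.06 | 6.06/8 = 0.76 |
| 16 | 6.25 | 10.25 | 100/10.25 = 9.76 | 9.76/16 = 0.61 |
| 32 | 3.125 | 7.125 | 100/7.125 = 14.04 | 14.04/32 = 0.44 |
| 64 | 1.5625 | 5.5625 | 100/5.5625 = 17.98 | 17.98/64 = 0.28 |
| 128 | 0.78125 | 4.78125 | 100/4.78125 = 20.92 | 20.92/128 = 0.16 |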
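The table follows a single formula per column, so it can be generated for every processor count the problem asks about. A minimal sketch; the function names are hypothetical, and the ideal speedup is taken to be p, as in the problem statement.

```python
# Figures taken from the statement of Problem 4.
T_SINGLE = 100.0   # seconds on one processor
OVERHEAD = 4.0     # fixed per-processor overhead in seconds


def per_processor_time(p):
    """Compute time t/p plus the fixed overhead."""
    return T_SINGLE / p + OVERHEAD


def speedup(p):
    """Speedup relative to the single-processor run."""
    return T_SINGLE / per_processor_time(p)


def efficiency(p):
    """Actual speedup divided by the ideal speedup p."""
    return speedup(p) / p


for p in (2, 4, 8, 16, 32, 64, 128):
    print(f"p={p:3d}  time={per_processor_time(p):8.4f}s  "
          f"speedup={speedup(p):5.2f}  actual/ideal={efficiency(p):.2f}")
```

As p grows, the time approaches the 4 s overhead, so the speedup can never exceed 100/4 = 25.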
5. The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power.
The Core i5 Ivy Bridge, released in 2012, has a clock rate of 3.4 GHz and voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.
a. For each processor find the average capacitive loads.
½ x C x V^2 x F = ½ x C x (1.25V)^2 x 3.6GHz = 90W [Prescott] => C = 32nF
½ x C x V^2 x F = ½ x C x (0.9V)^2 x 3.4GHz = 40W [Ivy Bridge] => C = 29nF
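Rearranging the dynamic power relation P = ½ x C x V^2 x F for C gives a one-line check of both values. The helper name `capacitive_load` is hypothetical.

```python
def capacitive_load(dynamic_power_w, voltage_v, clock_hz):
    """Capacitive load in farads from P = 1/2 * C * V^2 * F."""
    return 2 * dynamic_power_w / (voltage_v ** 2 * clock_hz)


# Figures taken from the statement of Problem 5.
c_prescott = capacitive_load(90, 1.25, 3.6e9)
c_ivy_bridge = capacitive_load(40, 0.9, 3.4e9)

print(f"Prescott:   {c_prescott * 1e9:.1f} nF")
print(f"Ivy Bridge: {c_ivy_bridge * 1e9:.1f} nF")
```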
b. Find the percentage of the total dissipated power comprised by static power and the ratio of static power to dynamic power for each technology
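Prescott: fraction of total power from static dissipation = 10/100 = 10%; ratio of static to dynamic power = 10/90 = 0.11
Ivy Bridge: fraction of total power from static dissipation = 30/70 = 42.9%; ratio of static to dynamic power = 30/40 = 0.75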
c. If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current.
Total Power new / Total Power old = 0.9
or, (Dn + Sn) / (Do + So) = 0.9    (1)
Dynamic Power new = Dn = ½ x C x Vnew^2 x F    (2)
Static Power new = Sn = Vnew x IL    (3)
Static Power old = So = Vold x IL    (4)
For the Prescott:
From (2), Vnew = sqrt(2 Dn / [C x F]) = sqrt(2 Dn / [32nF x 3.6GHz]) = sqrt(Dn / 57.6)    (5)
From (1), Dn = 0.9 (So + Do) - Sn    (6)
From (3), (4), Sn = So (Vnew / Vold) = Vnew x (10W / 1.25V) = 8 Vnew    (7)
From (6), (7), Dn = 0.9 x 100W - 8 Vnew = 90 - 8 Vnew    (8)
From (5), (8), Vnew = sqrt([90 - 8 Vnew] / 57.6), i.e. 57.6 Vnew^2 + 8 Vnew - 90 = 0
Solving the above quadratic for Vnew, we get
Vnew = [-8 + sqrt(8^2 + 4 x 57.6 x 90)] / [2 x 57.6] = 1.18V
So, for the Prescott, operating voltage would scale down from 1.25V to 1.18V
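The quadratic can also be solved in terms of the voltage ratio x = Vnew/Vold, which removes the need for C entirely. The sketch below assumes that dynamic power scales with V^2 at fixed capacitance and frequency (as in the ½ x C x V^2 x F relation of part a) and that static power scales with V at fixed leakage current. `scaled_voltage` is a hypothetical helper.

```python
import math


def scaled_voltage(v_old, dynamic_w, static_w, power_fraction=0.9):
    """Voltage at which total power falls to power_fraction of its old value.

    Assumes dynamic power scales with V^2 and static power scales with V
    (constant leakage current). With x = Vnew / Vold this gives
        dynamic_w * x^2 + static_w * x - power_fraction * total = 0.
    """
    a = dynamic_w
    b = static_w
    c = -power_fraction * (dynamic_w + static_w)
    # Only the positive root is physically meaningful.
    x = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return x * v_old


# Figures taken from the statement of Problem 5.
print(f"Prescott:   {scaled_voltage(1.25, 90, 10):.2f} V")
print(f"Ivy Bridge: {scaled_voltage(0.9, 40, 30):.2f} V")
```

Substituting the result back into dynamic_w * x^2 + static_w * x should return 90% of the original total power, which is a quick way to validate any hand-derived answer.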
6. Assume that for a given program 70% of the executed instructions are arithmetic, 10% are load/store, and 20% are branch.
2023-10-24