
Fall 2023, ECE 6913

Review Problems & Solutions

1. Assume that for arithmetic, load/store, and branch instructions, a processor has CPIs of 1, 12, and 5, respectively. Also assume that on a single processor a program requires the execution of 2.56E9 arithmetic instructions, 1.28E9 load/store instructions, and 256 million branch instructions. Assume that each processor has a 2 GHz clock frequency.

Assume that, as the program is parallelized to run over multiple cores, the number of arithmetic and load/store instructions per processor is divided by 0.7 × p (where p is the number of processors) but the number of branch instructions per processor remains the same.

a. Find the total execution time (ET) for this program on 1, 2, 4, and 8 processors, and show the relative speedup of the 2, 4, and 8 processors result relative to the single processor result.

Instruction Count [IC]

Arithmetic IC[2] = 2.56B / [0.7 x 2] = 1.83B

Arithmetic IC[4] = 2.56B / [0.7 x 4] = 0.91B

Arithmetic IC[8] = 2.56B / [0.7 x 8] = 0.46B

Load/Store IC[2] = 1.28B / [0.7 x 2] = 0.91B

Load/Store IC[4] = 1.28B / [0.7 x 4] = 0.46B

Load/Store IC[8] = 1.28B / [0.7 x 8] = 0.23B

Branch IC[1] = Branch IC[2] = Branch IC[4] = Branch IC[8] = 0.256B

Instruction   CPI   IC[1]     IC[2]     IC[4]     IC[8]
Arithmetic    1     2.56B     1.83B     0.91B     0.46B
Load/Store    12    1.28B     0.91B     0.46B     0.23B
Branch        5     0.256B    0.256B    0.256B    0.256B
Total IC      -     4.096B    2.996B    1.626B    0.946B

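ET (Execution Time) = IC (Instruction Count) x CPI (Cycles per Instruction) x Cycle Time, where Cycle Time = 1/[2 x 10^9 Hz] = 0.5 ns

Instruction   CPI   ET[1]                        ET[2]                        ET[4]                        ET[8]
Arithmetic    1     1 x 2.56B x 0.5ns = 1.28s    1 x 1.83B x 0.5ns = 0.91s    1 x 0.91B x 0.5ns = 0.46s    1 x 0.46B x 0.5ns = 0.23s
Load/Store    12    12 x 1.28B x 0.5ns = 7.68s   12 x 0.91B x 0.5ns = 5.46s   12 x 0.46B x 0.5ns = 2.76s   12 x 0.23B x 0.5ns = 1.38s
Branch        5     5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s
Total         -     9.6s                         7.01s                        3.86s                        2.25s

Speedup relative to the single processor: 9.6/7.01 = 1.37 (2 processors), 9.6/3.86 = 2.49 (4 processors), 9.6/2.25 = 4.27 (8 processors).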
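The part (a) calculation can be sketched in a few lines of Python. This is only an illustrative check, not part of the original solution: the names `CPI`, `IC_SINGLE`, and `execution_time` are made up here, and the code uses unrounded instruction counts, so its totals can differ from hand-rounded table entries in the second decimal place.

```python
# Illustrative sketch of Problem 1(a); all names are hypothetical.
# The numbers come straight from the problem statement.

CLOCK_HZ = 2e9  # 2 GHz clock
CPI = {"arith": 1, "ls": 12, "branch": 5}
IC_SINGLE = {"arith": 2.56e9, "ls": 1.28e9, "branch": 256e6}


def execution_time(p, cpi=CPI):
    """Return the per-processor execution time in seconds on p processors."""
    # Arithmetic and load/store counts are divided by 0.7 * p when p > 1;
    # the branch count per processor stays the same.
    scale = 1.0 if p == 1 else 0.7 * p
    cycles = (
        cpi["arith"] * IC_SINGLE["arith"] / scale
        + cpi["ls"] * IC_SINGLE["ls"] / scale
        + cpi["branch"] * IC_SINGLE["branch"]
    )
    return cycles / CLOCK_HZ


if __name__ == "__main__":
    base = execution_time(1)
    for p in (1, 2, 4, 8):
        et = execution_time(p)
        print(f"p={p}: ET={et:.2f} s, speedup={base / et:.2f}")
```

Passing a modified CPI dictionary, for example `execution_time(4, {**CPI, "arith": 2})`, evaluates the same formula with a different arithmetic CPI.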

b. If the CPI of the arithmetic instructions was doubled, what would the impact be on the execution time of the program on 1, 2, 4, or 8 processors?

ET (Execution Time) = IC (Instruction Count) x CPI (Cycles per Instruction) x Cycle Time, where Cycle Time = 1/[2 x 10^9 Hz] = 0.5 ns

Instruction   CPI   ET[1]                        ET[2]                        ET[4]                        ET[8]
Arithmetic    2     2 x 2.56B x 0.5ns = 2.56s    2 x 1.83B x 0.5ns = 1.83s    2 x 0.91B x 0.5ns = 0.91s    2 x 0.46B x 0.5ns = 0.46s
Load/Store    12    12 x 1.28B x 0.5ns = 7.68s   12 x 0.91B x 0.5ns = 5.46s   12 x 0.46B x 0.5ns = 2.76s   12 x 0.23B x 0.5ns = 1.38s
Branch        5     5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s   5 x 0.256B x 0.5ns = 0.64s
Total         -     10.88s                       7.93s                        4.31s                        2.48s

c. To what should the CPI of load/store instructions be reduced in order for a single processor to match the performance of four processors using the original CPI values?

If a single processor is to execute the same program in 3.86 seconds (instead of 9.6 seconds), then the difference of 5.74 seconds must come from the improvement in execution time of the L/S instructions.

This improvement can be accomplished by reducing the L/S instruction CPI from 12 to CPIX.

7.68s - 5.74s = 1.94s = time now available for the single processor to process L/S instructions

CPIX x 1.28B x 0.5ns = 1.94s

So, CPIX = 1.94s / [1.28B x 0.5ns] = 1.94/0.64 = 3.03
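The same result can be obtained by solving the cycle budget directly for the load/store CPI. The sketch below is a minimal illustration under the problem's stated numbers; `required_ls_cpi` is a hypothetical helper name, not something from the original solution.

```python
# Illustrative sketch of Problem 1(c); the helper name is hypothetical.

CLOCK_HZ = 2e9
ARITH_IC, LS_IC, BRANCH_IC = 2.56e9, 1.28e9, 256e6
ARITH_CPI, BRANCH_CPI = 1, 5


def required_ls_cpi(target_seconds):
    """Load/store CPI a single processor needs to finish in target_seconds."""
    total_cycles = target_seconds * CLOCK_HZ
    # Cycles spent on arithmetic and branch instructions are unchanged.
    other_cycles = ARITH_CPI * ARITH_IC + BRANCH_CPI * BRANCH_IC
    return (total_cycles - other_cycles) / LS_IC


if __name__ == "__main__":
    # 3.86 s is the four-processor execution time from part (a).
    print(f"Required L/S CPI: {required_ls_cpi(3.86):.2f}")
```

As a sanity check, asking for the original 9.6 s returns the original CPI of 12.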

2. The results of the SPEC CPU2006 bzip2 benchmark running on an AMD Barcelona show an instruction count of 2.389E12, an execution time of 750 s, and a reference time of 9650 s.

a. Find the CPI if the clock cycle time is 0.333 ns.

IC = 2.389T, ET=750s, RefET = 9650s

CPI = (ET * clock rate)/IC = 750s x 3.00GHz / 2.389T = 0.94

b. Find the SPECratio

SPECratio = T_ref/T = 9650s / 750s = 12.87

c. Find the increase in CPU time if the number of instructions of the benchmark is increased by 10% without affecting the CPI

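ET = CPI x IC x Tcycle = 0.94 x 2.389T x 1.1 x 0.333ns = 822s, or an increase in ET of 9.67%. (The shortfall from 10% is only rounding in the CPI and cycle time: at a fixed CPI, ET is proportional to IC, so the exact increase is 10%, i.e. 825 s.)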

d. Find the increase in CPU time if the number of instructions of the benchmark is increased by 10% and the CPI is increased by 5%.

ET = CPI x IC x Tcycle = 0.94 x 1.05 x 2.389T x 1.1 x 0.333ns = 863s, or an increase in ET of 15%. (Without intermediate rounding the factor is 1.1 x 1.05 = 1.155, i.e. an increase of 15.5%.)

e. Find the change in the SPECratio for this change

SPECratio = T_ref/T = 9650s / 863s = 11.18, a decrease of about 13% from the SPECratio found in part b.
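Parts a through e all follow from ET = IC x CPI x Tcycle and SPECratio = T_ref/T, which the sketch below evaluates. The function and variable names are illustrative assumptions, not part of the original solution, and the code keeps the CPI unrounded, so its times come out slightly higher than hand calculations that use CPI = 0.94.

```python
# Illustrative sketch for Problem 2(a)-(e); names are hypothetical.

INSTR_COUNT = 2.389e12    # instructions
EXEC_TIME_S = 750.0       # measured execution time
REF_TIME_S = 9650.0       # SPEC reference time
CYCLE_TIME_S = 0.333e-9   # clock cycle time


def cpi(exec_time_s, instr_count, cycle_time_s):
    """CPI = ET / (IC * Tcycle)."""
    return exec_time_s / (instr_count * cycle_time_s)


def cpu_time(instr_count, cpi_value, cycle_time_s):
    """ET = IC * CPI * Tcycle."""
    return instr_count * cpi_value * cycle_time_s


def spec_ratio(ref_time_s, exec_time_s):
    """SPECratio = reference time / measured time."""
    return ref_time_s / exec_time_s


base_cpi = cpi(EXEC_TIME_S, INSTR_COUNT, CYCLE_TIME_S)
# Part c: 10% more instructions, same CPI.
et_more_instr = cpu_time(1.10 * INSTR_COUNT, base_cpi, CYCLE_TIME_S)
# Part d: 10% more instructions and 5% higher CPI.
et_more_both = cpu_time(1.10 * INSTR_COUNT, 1.05 * base_cpi, CYCLE_TIME_S)

print(f"CPI = {base_cpi:.2f}")
print(f"SPECratio = {spec_ratio(REF_TIME_S, EXEC_TIME_S):.2f}")
print(f"ET (+10% IC) = {et_more_instr:.1f} s")
print(f"ET (+10% IC, +5% CPI) = {et_more_both:.1f} s")
print(f"New SPECratio = {spec_ratio(REF_TIME_S, et_more_both):.2f}")
```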

f. Suppose that we are developing a new version of the AMD Barcelona processor with a 4 GHz clock rate. We have added some additional instructions to the instruction set in such a way that the number of instructions has been reduced by 15%. The execution time is reduced to 700 s and the new SPECratio is 13.7. Find the new CPI.

ETnew = 700s = CPIX x 2.389T x 0.85 x 0.250ns

So, CPIX = 700/507.66 = 1.38

g. This CPI value is larger than obtained in part a. as the clock rate was increased from 3 GHz to 4 GHz. Determine whether the increase in the CPI is similar to that of the clock rate. If they are dissimilar, why?

CPI increased by 1.38/0.94 = 1.46

Clock rate increased by 4/3 = 1.33

The two increases are dissimilar. Both the instruction count and the clock cycle time were lowered: the instruction count by adding instructions to the instruction set, and the cycle time by raising the clock rate.

Since each instruction, on average, is accomplishing more work, and each cycle is shorter, the average instruction is likely to need more cycles to complete its task, hence the increase in the average CPI of the benchmark.

h. By how much has the CPU time been reduced?

700s / 750s = 0.933, so the CPU time has been reduced by 6.67%
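The comparison in parts f through h can be checked with the short sketch below. It is a hedged illustration only: `cpi_from_rate` is a hypothetical helper, and the old clock rate is taken as 3 GHz (the rounded inverse of the 0.333 ns cycle time), as in part a.

```python
# Illustrative sketch for Problem 2(f)-(h); the helper name is hypothetical.

def cpi_from_rate(exec_time_s, clock_hz, instr_count):
    """CPI = ET * clock rate / IC."""
    return exec_time_s * clock_hz / instr_count


OLD_IC = 2.389e12
old_cpi = cpi_from_rate(750.0, 3e9, OLD_IC)
# New design: 4 GHz clock, 15% fewer instructions, 700 s execution time.
new_cpi = cpi_from_rate(700.0, 4e9, 0.85 * OLD_IC)

cpi_growth = new_cpi / old_cpi        # how much CPI grew
clock_growth = 4e9 / 3e9              # how much the clock rate grew
time_reduction = 1.0 - 700.0 / 750.0  # fractional drop in CPU time

print(f"new CPI = {new_cpi:.2f}")
print(f"CPI grew by a factor of {cpi_growth:.2f}")
print(f"clock rate grew by a factor of {clock_growth:.2f}")
print(f"CPU time reduced by {100 * time_reduction:.2f}%")
```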

3. Assume a program requires the execution of 50 × 10^6 FP instructions, 110 × 10^6 INT instructions, 80 × 10^6 L/S instructions, and 16 × 10^6 branch instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate.

a. By how much must we improve the CPI of FP instructions if we want the program to run two times faster?

Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.

= 1 × 50 × 10^6 + 1 × 110 × 10^6 + 4 × 80 × 10^6 + 2 × 16 × 10^6

= 512 × 10^6

Running two times faster means cutting this to 256M cycles. Even if the CPI of FP instructions goes to zero, the other three instruction classes (INT, L/S, and branch) still consume 110M + 320M + 32M = 462M cycles, which is more than half of the 512M total.

So, the program cannot possibly run 2 times faster by improving only the CPI of FP instructions.

b. By how much must we improve the CPI of L/S instructions if we want the program to run two times faster?

Number of cycles consumed by FP, INT and Branch instructions = 50M + 110M + 32M = 192M

To halve the total number of cycles (from 512M to 256M), the L/S instructions may consume at most 256M - 192M = 64M cycles.

Since the number of L/S instructions equals 80M, the L/S CPI must drop from 4 to CPIX = 64M cycles / 80M instructions = 0.8, i.e. an improvement by a factor of 5.
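The reasoning in parts a and b is the same calculation applied to two different instruction classes, which the sketch below makes explicit. The names `COUNTS`, `CPI`, and `cpi_for_speedup` are assumptions made for illustration; the function returns `None` when no positive CPI can reach the target.

```python
# Illustrative sketch for Problem 3(a)-(b); names are hypothetical.

COUNTS = {"fp": 50e6, "int": 110e6, "ls": 80e6, "branch": 16e6}
CPI = {"fp": 1.0, "int": 1.0, "ls": 4.0, "branch": 2.0}


def total_cycles(cpi=CPI):
    """Sum of CPI * instruction count over all classes."""
    return sum(cpi[k] * COUNTS[k] for k in COUNTS)


def cpi_for_speedup(kind, speedup):
    """CPI that class `kind` needs for the given speedup, or None if impossible."""
    target = total_cycles() / speedup
    # Cycles used by every other class are fixed.
    others = sum(CPI[k] * COUNTS[k] for k in COUNTS if k != kind)
    needed = (target - others) / COUNTS[kind]
    return needed if needed > 0 else None


if __name__ == "__main__":
    print("FP CPI for 2x:", cpi_for_speedup("fp", 2))
    print("L/S CPI for 2x:", cpi_for_speedup("ls", 2))
```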

c. By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch is reduced by 30%?

With the initial CPI for each instruction class:

Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.

= 1 × 50 × 10^6 + 1 × 110 × 10^6 + 4 × 80 × 10^6 + 2 × 16 × 10^6

= 512 × 10^6

Clock cycles with the improved CPI for FP, INT, L/S and Branch:

Clock cycles = CPIfp × #FP instr. + CPIint × #INT instr. + CPIl/s × #L/S instr. + CPIbranch × #branch instr.

= 0.6 × 50 × 10^6 + 0.6 × 110 × 10^6 + 2.8 × 80 × 10^6 + 1.4 × 16 × 10^6

= 342.4 × 10^6

That is a speedup of 512/342.4 = 1.495, i.e. the execution time drops by about 33% (from 0.256 s to 0.171 s at 2 GHz).
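Evaluating the two cycle counts directly is a quick way to check the arithmetic. The sketch below is illustrative only; the dictionary and variable names are assumptions, and the reductions are the 40% and 30% figures from the question.

```python
# Illustrative sketch for Problem 3(c); names are hypothetical.

CLOCK_HZ = 2e9
COUNTS = {"fp": 50e6, "int": 110e6, "ls": 80e6, "branch": 16e6}
CPI = {"fp": 1.0, "int": 1.0, "ls": 4.0, "branch": 2.0}
REDUCTION = {"fp": 0.40, "int": 0.40, "ls": 0.30, "branch": 0.30}


def total_cycles(cpi):
    """Sum of CPI * instruction count over all classes."""
    return sum(cpi[k] * COUNTS[k] for k in COUNTS)


# Apply each class's fractional CPI reduction.
improved_cpi = {k: CPI[k] * (1.0 - REDUCTION[k]) for k in CPI}

cycles_before = total_cycles(CPI)
cycles_after = total_cycles(improved_cpi)
speedup = cycles_before / cycles_after

print(f"before: {cycles_before / 1e6:.1f}M cycles, {cycles_before / CLOCK_HZ:.4f} s")
print(f"after:  {cycles_after / 1e6:.1f}M cycles, {cycles_after / CLOCK_HZ:.4f} s")
print(f"speedup: {speedup:.3f}")
```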

4. When a program is adapted to run on multiple processors in a multiprocessor system, the execution time on each processor is comprised of computing time and the overhead time required for locked critical sections and/or to send data from one processor to another. Assume a program requires t = 100 s of execution time on one processor. When run on p processors, each processor requires t/p s, as well as an additional 4 s of overhead, irrespective of the number of processors.

Compute the per-processor execution time for 2, 4, 8, 16, 32, 64, and 128 processors. For each case, list the corresponding speedup relative to a single processor and the ratio between actual speedup versus ideal speedup (speedup if there was no overhead).

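# Processors   Ex Time/Processor   Time w/ overhead   Speedup               Actual speedup/ideal speedup
1              100
2              50                  54                 100/54 = 1.85         1.85/2 = 0.93
4              25                  29                 100/29 = 3.45         3.45/4 = 0.86
8              12.5                16.5               100/16.5 = 6.06       6.06/8 = 0.76
16             6.25                10.25              100/10.25 = 9.76      9.76/16 = 0.61
32             3.125               7.125              100/7.125 = 14.04     14.04/32 = 0.44
64             1.5625              5.5625             100/5.5625 = 17.98    17.98/64 = 0.28
128            0.78125             4.78125            100/4.78125 = 20.92   20.92/128 = 0.16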
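Each row of the table follows the same three steps (time with overhead, speedup, ratio to the ideal speedup of p), so it can be generated for any list of processor counts. The sketch below is a minimal illustration with hypothetical function names; it assumes, as the problem states, a fixed 4 s overhead whenever more than one processor is used.

```python
# Illustrative sketch for Problem 4; function names are hypothetical.

def per_processor_time(p, t=100.0, overhead=4.0):
    """Execution time per processor: t/p plus fixed overhead when p > 1."""
    return t if p == 1 else t / p + overhead


def speedup_table(procs=(2, 4, 8, 16, 32, 64, 128), t=100.0, overhead=4.0):
    """Return (p, time, speedup, actual/ideal speedup) for each processor count."""
    rows = []
    for p in procs:
        time = per_processor_time(p, t, overhead)
        speedup = t / time
        # Ideal speedup with no overhead is simply p.
        rows.append((p, time, speedup, speedup / p))
    return rows


if __name__ == "__main__":
    for p, time, speedup, ratio in speedup_table():
        print(f"{p:>4} {time:>9.5f} {speedup:>6.2f} {ratio:>5.2f}")
```

Because the overhead does not shrink with p, the speedup can never exceed t/overhead = 25, however many processors are added.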

5. The Pentium 4 Prescott processor, released in 2004, had a clock rate of 3.6 GHz and voltage of 1.25 V. Assume that, on average, it consumed 10 W of static power and 90 W of dynamic power.

The Core i5 Ivy Bridge, released in 2012, has a clock rate of 3.4 GHz and voltage of 0.9 V. Assume that, on average, it consumed 30 W of static power and 40 W of dynamic power.

a. For each processor find the average capacitive loads.

½ C V^2 F = ½ C (1.25)^2 x 3.6 GHz = 90 W [Prescott] => C = 32 nF

½ C V^2 F = ½ C (0.9)^2 x 3.4 GHz = 40 W [Ivy Bridge] => C = 29 nF
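Rearranging the dynamic-power formula for C gives C = 2P / (V^2 x F). The sketch below just evaluates that expression for both processors; `capacitive_load` is a hypothetical helper name used for illustration.

```python
# Illustrative sketch for Problem 5(a); the helper name is hypothetical.

def capacitive_load(dynamic_w, volts, freq_hz):
    """Solve P = 0.5 * C * V^2 * F for C, in farads."""
    return 2.0 * dynamic_w / (volts ** 2 * freq_hz)


if __name__ == "__main__":
    prescott = capacitive_load(90.0, 1.25, 3.6e9)
    ivy_bridge = capacitive_load(40.0, 0.9, 3.4e9)
    print(f"Prescott:   {prescott * 1e9:.1f} nF")
    print(f"Ivy Bridge: {ivy_bridge * 1e9:.1f} nF")
```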

b. Find the percentage of the total dissipated power comprised by static power and the ratio of static power to dynamic power for each technology

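Prescott: fraction of total power from static dissipation = 10/100 = 10%; ratio of static to dynamic power = 10/90 = 0.11

Ivy Bridge: fraction of total power from static dissipation = 30/70 = 42.9%; ratio of static to dynamic power = 30/40 = 0.75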

c. If the total dissipated power is to be reduced by 10%, how much should the voltage be reduced to maintain the same leakage current? Note: power is defined as the product of voltage and current.

Total Power new / Total Power old = 0.9

or, (Dn + Sn) / (Do + So) = 0.9    (1)

Dynamic Power new = Dn = ½ C Vnew^2 x F    (2)

Static Power new = Sn = Vnew x IL    (3)

Static Power old = So = Vold x IL    (4)

For the Prescott:

From (2), Vnew = sqrt(2 Dn / [C x F]) = sqrt(2 Dn / [32 nF x 3.6 GHz]) = sqrt(Dn / 57.6)    (5)

From (1), Dn = 0.9 (So + Do) - Sn    (6)

From (3), (4), Sn = So (Vnew / Vold) = Vnew x (10 W / 1.25 V) = Vnew x 8    (7)

From (6), (7), Dn = 0.9 x 100 W - 8 Vnew = 90 - 8 Vnew    (8)

From (5), (8), Vnew = sqrt([90 - 8 Vnew] / 57.6), i.e. 57.6 Vnew^2 + 8 Vnew - 90 = 0

Solving the above quadratic for Vnew (taking the positive root), we get

Vnew = [-8 + sqrt(8^2 + 4 x 57.6 x 90)] / (2 x 57.6) = 1.18 V

So, for the Prescott, the operating voltage would scale down from 1.25 V to about 1.18 V, a reduction of roughly 5%.
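The quadratic for Vnew can be solved numerically as a check. The sketch below is a hedged illustration, not the original solution's method: `new_voltage` and its `half_factor` switch are hypothetical. With the ½ factor from part (a) kept in the dynamic-power term, the Prescott result is about 1.18 V; if the ½ is dropped (Dn = C x Vnew^2 x F with C = 32 nF), the same procedure gives about 0.85 V instead.

```python
# Illustrative sketch for Problem 5(c); names and the half_factor flag are hypothetical.
import math


def new_voltage(c_farads, freq_hz, v_old, static_w, dynamic_w,
                reduction=0.10, half_factor=True):
    """Positive root of a*V^2 + b*V - target = 0 for the new supply voltage."""
    k = 0.5 if half_factor else 1.0
    a = k * c_farads * freq_hz   # dynamic power = a * V^2
    b = static_w / v_old         # leakage current; static power = b * V
    target = (1.0 - reduction) * (static_w + dynamic_w)
    return (-b + math.sqrt(b * b + 4.0 * a * target)) / (2.0 * a)


if __name__ == "__main__":
    # Prescott: C = 32 nF, F = 3.6 GHz, Vold = 1.25 V, 10 W static, 90 W dynamic.
    v_half = new_voltage(32e-9, 3.6e9, 1.25, 10.0, 90.0)
    v_no_half = new_voltage(32e-9, 3.6e9, 1.25, 10.0, 90.0, half_factor=False)
    print(f"with the 1/2 factor:    {v_half:.2f} V")
    print(f"without the 1/2 factor: {v_no_half:.2f} V")
```

Only the version with the ½ factor reproduces the original 90 W of dynamic power at 1.25 V, which is why it is the default here.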

6. Assume that for a given program 70% of the executed instructions are arithmetic, 10% are load/store, and 20% are branch.