
CSCI-4320/6360 Parallel CUDA Program for a Carry-Lookahead Adder 2022

Published: 2022-02-07


CSCI-4320/6360 - Assignment 2:

Parallel CUDA Program for a Carry-Lookahead Adder

2022

1 Overview

You are to construct a parallel CUDA program that computes a 33,554,432-bit Carry-Lookahead Adder (CLA) using 32-bit blocks; 32 is also the CUDA warp size.

For a list of bitwise operators and the associated C language syntax, please see: https://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B. Make sure you use the older ANSI C operators and not the newer operator forms/syntax (e.g., XOR is ^ in C).
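As a minimal sketch of how these operators map onto the per-bit CLA quantities, the helpers below compute generate, propagate and sum for a single bit. The function names are hypothetical, and the choice of OR for propagate is an assumption (XOR is another common definition); the template's own definition takes precedence.

```c
/* Per-bit CLA primitives written with ANSI C bitwise operators.
   Assumption: each bit is stored in its own int and is either 0 or 1.
   All names here are illustrative, not taken from the template. */

/* Generate: this bit position produces a carry on its own. */
int gen_bit(int a, int b)
{
    return a & b;
}

/* Propagate: this bit position passes an incoming carry along.
   Assumption: OR form; a ^ b is an equally valid choice. */
int prop_bit(int a, int b)
{
    return a | b;
}

/* Sum bit: XOR of the two inputs and the incoming carry. */
int sum_bit(int a, int b, int carry_in)
{
    return a ^ b ^ carry_in;
}
```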

Recall that this specific CLA adder can be constructed as follows:

1. Calculate g_i and p_i for all 33,554,432 bits i.

2. Calculate gg_j and gp_j for all 1,048,576 groups j using g_i and p_i.

3. Calculate sg_k and sp_k for all 32,768 sections k using gg_j and gp_j.

4. Calculate ssg_l and ssp_l for all 1,024 super sections l using sg_k and sp_k.

5. Calculate sssg_m and sssp_m for all 32 super super sections m using ssg_l and ssp_l. Note that at this point we can shift to computing the top-level sectional carries, because the number of super super sections is less than or equal to the block size, which is 32 bits.

6. Calculate sssc_m using sssg_m and sssp_m for all super super sections m, with 0 for sssc_{-1}.

7. Calculate ssc_l using ssg_l and ssp_l and the correct sssc_m, m = l div 32, as the super super sectional carry-in for all super sections l.

8. Calculate sc_k using sg_k and sp_k and the correct ssc_l, l = k div 32, as the super sectional carry-in for all sections k.

9. Calculate gc_j using gg_j, gp_j and the correct sc_k, k = j div 32, as the sectional carry-in for all groups j.

10. Calculate c_i using g_i, p_i and the correct gc_j, j = i div 32, as the group carry-in for all bits i.

11. Calculate sum_i using a_i ⊕ b_i ⊕ c_{i-1} for all i, where ⊕ is the exclusive-or (XOR) operation.
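The steps above can be sketched in serial C on a scaled-down adder. This is only an illustration under stated assumptions: it uses 16 bits with a block size of 4, so a single group level already reaches the top (the assignment needs five levels of 32), it takes propagate as OR, and every name in it is hypothetical rather than taken from the template.

```c
#define BLOCK   4                /* scaled down from the assignment's 32 */
#define NBITS   16               /* scaled down from 33,554,432 */
#define NGROUPS (NBITS / BLOCK)

/* Bits are stored one per int (0 or 1), least significant bit first. */
void to_bits(unsigned v, int *bits)
{
    int i;
    for (i = 0; i < NBITS; i++) bits[i] = (int)((v >> i) & 1u);
}

unsigned from_bits(const int *bits)
{
    unsigned v = 0;
    int i;
    for (i = 0; i < NBITS; i++) v |= (unsigned)bits[i] << i;
    return v;
}

/* Reference ripple-carry adder used for comparison. */
void ripple_add(const int *a, const int *b, int *sum)
{
    int i, carry = 0;
    for (i = 0; i < NBITS; i++) {
        sum[i] = a[i] ^ b[i] ^ carry;
        carry = (a[i] & b[i]) | (carry & (a[i] | b[i]));
    }
}

void cla_add(const int *a, const int *b, int *sum)
{
    int g[NBITS], p[NBITS], c[NBITS];
    int gg[NGROUPS], gp[NGROUPS], gc[NGROUPS];
    int i, j, k;

    /* Step 1: per-bit generate and propagate (assumption: p = a | b). */
    for (i = 0; i < NBITS; i++) {
        g[i] = a[i] & b[i];
        p[i] = a[i] | b[i];
    }

    /* Step 2: group generate and propagate from the bits of each group. */
    for (j = 0; j < NGROUPS; j++) {
        int base = j * BLOCK;
        gg[j] = 0;
        gp[j] = 1;
        for (k = 0; k < BLOCK; k++) {
            gg[j] = g[base + k] | (p[base + k] & gg[j]);
            gp[j] &= p[base + k];
        }
    }

    /* Top-level carries (the role of step 6): gc[j] is the carry out of
       group j, and the carry into group 0 is 0. */
    for (j = 0; j < NGROUPS; j++) {
        int prev = (j == 0) ? 0 : gc[j - 1];
        gc[j] = gg[j] | (gp[j] & prev);
    }

    /* Step 10: c[i] is the carry out of bit i. The first bit of each group
       takes the previous group's carry; within a group this sketch simply
       chains the carries instead of expanding the lookahead terms. */
    for (i = 0; i < NBITS; i++) {
        int prev;
        j = i / BLOCK;
        if (i % BLOCK == 0) prev = (j == 0) ? 0 : gc[j - 1];
        else                prev = c[i - 1];
        c[i] = g[i] | (p[i] & prev);
    }

    /* Step 11: sum_i = a_i XOR b_i XOR c_{i-1}, with c_{-1} = 0. */
    for (i = 0; i < NBITS; i++)
        sum[i] = a[i] ^ b[i] ^ ((i == 0) ? 0 : c[i - 1]);
}
```

Comparing `cla_add` against `ripple_add` on the same inputs mirrors, at small scale, the correctness check the template performs.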

You need to construct a CUDA program that executes on AiMOS and reproduces this algorithm. Each step in the above algorithm will be implemented as a separate CUDA kernel function.

A template is provided that generates deterministic random hex input data and outputs the result of a comparison of the CLA result in binary with a Carry Ripple Adder (CRA).

More specifically, your program will do the following:

1. Convert all calloc and malloc C program memory calls to use cudaMallocManaged for the data arrays that the CUDA kernels will need to operate over.

2. Convert all CLA routines specified in the template from C to CUDA kernel calls.

3. Use the CUDA add.cu and reduction.cu programs as guides. Here, each CUDA thread will operate on a single "bit" at a time, so you are effectively unrolling the for-loops in the C serial code.

4. Note that you are free to be a little inventive and pull from your own C code in Assignment 1; you don't have to rely on or port the provided C serial code. E.g., you don't have to port the grab_slice function into CUDA.

5. Note: do not modify the timing routines or the comparison test code.
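To illustrate what "unrolling the for-loops" means, here is a plain C emulation, not CUDA code. The loop body becomes a function of a single index, and a pair of nested loops stands in for the kernel launch. In a real kernel the index would instead be computed as `blockIdx.x * blockDim.x + threadIdx.x`. All function names are hypothetical, and propagate is again assumed to be OR.

```c
/* Work done by one emulated "thread": exactly one bit index.
   The bounds check matters because the launch may create more
   threads than there are bits. */
void gp_one_bit(int idx, int n, const int *a, const int *b, int *g, int *p)
{
    if (idx < n) {
        g[idx] = a[idx] & b[idx];   /* generate */
        p[idx] = a[idx] | b[idx];   /* propagate (assumption: OR form) */
    }
}

/* Serial stand-in for launching num_blocks blocks of threads_per_block
   threads. On a GPU these iterations would run in parallel. */
void emulate_gp_launch(int num_blocks, int threads_per_block, int n,
                       const int *a, const int *b, int *g, int *p)
{
    int block, thread;
    for (block = 0; block < num_blocks; block++)
        for (thread = 0; thread < threads_per_block; thread++)
            gp_one_bit(block * threads_per_block + thread, n, a, b, g, p);
}
```

Because each index is handled independently, no iteration depends on another, which is what makes the step suitable for one thread per bit.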

2 Testing and Correctness

The testing of this program is simple and straightforward. A random input pattern is generated for two hex inputs, which are then converted to binary inputs. Next, the CUDA CLA adder will be invoked, as well as a serial Ripple Carry Adder. The results from the two adders will be compared. You must pass this test for your program to be considered correct. If the test fails, you will be informed of the first bit position that differed from the Ripple Carry Adder. The TA will run your programs by hand to make sure your code runs correctly on AiMOS.

3 One Page Performance Report

In this report, you will briefly describe how you implemented each of the CLA adder functions as CUDA kernel calls. Next, you will use the provided cycle timer function clock_now() to report the total number of clock cycles consumed by the CLA function and the Ripple Carry Adder function **IN SERIAL** using the cla-serial program provided with the template. Also, report the number of cycles for the CUDA CLA adder function in your cla-parallel program for a number of different block sizes, as noted below. Note that on AiMOS, the cycle timer has a clock rate of 512,000,000 cycles per second. So if you divide your reported number of cycles (cast to double) by this clock rate (cast to double), it will convert cycles to seconds (in double/64-bit precision).
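The conversion described above amounts to a single division. A minimal sketch follows; the macro and function names are hypothetical, and the cycle count is assumed to fit in an unsigned 64-bit integer.

```c
/* AiMOS cycle-timer rate stated in the assignment: 512,000,000 cycles/s. */
#define AIMOS_CLOCK_RATE 512000000ULL

/* Convert a cycle count to seconds. Both operands are cast to double
   before dividing, so the division is done in 64-bit floating point. */
double cycles_to_seconds(unsigned long long cycles)
{
    return (double)cycles / (double)AIMOS_CLOCK_RATE;
}
```

For example, a measured difference of 256,000,000 cycles corresponds to 0.5 seconds.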

CUDA block sizes (not the CLA block size, which is fixed at 32) are: 32, 64, 128, 256 and 512. Which CUDA block size yields the best performance? Why do you think that is the case?
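When the CUDA block size changes, the number of blocks in the launch has to change with it so that every element still gets a thread. A common way to compute this is ceiling division, sketched below with a hypothetical helper name.

```c
/* Number of CUDA blocks needed to cover n elements with
   threads_per_block threads each, rounding up. */
unsigned int blocks_needed(unsigned int n, unsigned int threads_per_block)
{
    return (n + threads_per_block - 1u) / threads_per_block;
}
```

For the 33,554,432-bit step this gives 1,048,576 blocks at a block size of 32 and 65,536 blocks at 512. The smaller arrays in the later steps need correspondingly fewer blocks.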

If you run on an x86-64 system, you need to know the clock rate of the processor in order to convert cycles to seconds.

Last, compute the speedup obtained for the CUDA CLA function (using the BEST/fastest block size configuration) relative to (1) the serial CLA function and (2) the serial Ripple Carry Adder. What do you observe? How much faster is the CUDA CLA function than the serial CLA function? Why might the Ripple Carry Adder be faster or slower than the CUDA CLA?
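Speedup here is the baseline time divided by the CUDA time. If both runs are timed with the same cycle timer, the clock rate cancels and the ratio of cycle counts can be used directly. The helper name below is hypothetical.

```c
/* Speedup of the CUDA run relative to a serial baseline.
   Assumption: both counts come from the same timer, and cuda_cycles
   is nonzero. A result above 1.0 means the CUDA version is faster. */
double speedup(unsigned long long baseline_cycles,
               unsigned long long cuda_cycles)
{
    return (double)baseline_cycles / (double)cuda_cycles;
}
```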

This report must be in PDF format!

4 HAND-IN and GRADING INSTRUCTIONS

Please submit your complete CUDA code (all *.c, *.cu, *.h and Makefile files) and your Performance Report in PDF format to the submitty.cs.rpi.edu grading system.