
CSE 30 Spring 2022 Programming Assignment #2

Assignment – Data Extraction, Pointer Practice and Modifying Code

The goals of this PA are:

1.   To get practice writing code using C pointers and strings

2.   Learn about Delimiter-Separated-Values (also known as DSV) data format (a generalized version of Comma-Separated-Values or CSV files).

3.   Get familiar with the New York City Traffic Ticket data (in CSV format) that we use with the  in-memory database assignment in PA3

4.   Practice with C library routines for reading lines of text, parsing command lines options and converting strings to integer and unsigned long values

5.   Write routines that we will reuse in PA3 to read from a CSV file and load data into the PA3 database (a data handling process called ETL - Extract Transform and Load)

6. Learn to modify code you did not write (something you will do a lot of when you graduate). Practice figuring out someone else's code and "fixing or finishing it".

A common format for moving data between systems is called a DSV file. A DSV file stores tabular data (numbers and text) in plain ASCII text. A delimiter-separated values (DSV) file uses a single-character delimiter (typically ',' or '#' or ' ' or '\t', etc.) to separate values on a line of data. A CSV file is a DSV file where the delimiter is a comma.

Each line of the file, or data record, is terminated by a newline ('\n'). Each record consists of one or more data fields (or columns) separated by the delimiter only between the fields. Or stated differently, a delimiter is found after each field except for the last field in a record, which is always followed by a newline.  In a proper DSV file, each record (line) will always have the same number of fields (columns). Typically the first record in a DSV file (the header record) is not part of the data but is documentation of the names (meaning) of each of the columns.

Fields are numbered from left to right starting with column 1 (or field 1). Fields can be empty. For example, a record that looks like "A,,C" has three fields where field 1 is A, field 2 is empty, and field 3 is C.  Records are numbered starting from 1 (the first record in the file).

In terms of C, an entire record is processed as a single string. As we have discussed in lecture, you can use pointers to break a string into tokens or fields in place, without incurring the cost of copying the string. Each token will be a string. An empty field (one where there are no characters between the adjacent delimiters) is a string where the only character is the string termination character, a '\0'. The technique to break a string into tokens, shown in class, will be at the heart of this assignment.

In this programming assignment you will build a tool to extract data from DSV-formatted data read from stdin (with different characters for delimiters), reformat it by dropping or moving fields within the record, and write it out to stdout. You will reuse the extraction routines from this PA to load the data from the New York City parking tickets CSV file into the in-memory database you will build in PA3. In the field of data analytics, the process of reading data, modifying the data, and then loading the data into a database is called Extract Transform Load (or ETL for short).

The NYC data file is in CSV format and consists of about 1.146 million records, where each record contains 43 fields of data; the total file size is 2GB. In PA3 you will use only 5 fields in each record, so you must eliminate the other unused fields from the 43. When working with datasets of this size, program runtime performance is an important consideration in the design and implementation of your solution.

To make testing easier, command line options to your program will allow a user to process only a subset of the records. This is accomplished by specifying record ranges (by line number) to extract.

Below we show an extracted subset of that file, with a reduced field count: the header record followed by four data records.

Summons Number,Plate ID,Registration State,Issue Date,Violation Code

1105232165,GLS6001,NY,07/03/2018,14

1121274900,HXM7361,NY,06/28/2018,46

1130964875,GTR7949,NY,06/08/2018,24

1130964887,HH1842,NC,06/07/2018,24

In the above, there are five fields extracted from the 43 fields in the input data. The names of the fields are Summons Number, Plate ID (the license plate number), Registration State (the state that issued the plate), Issue Date (the date of the ticket), and Violation Code (the type of the ticket, encoded as a number from 1 to 99). In PA3 you will use this extracted data to build an in-memory database where you will make queries by license plate number to get the tickets and total fines outstanding for that vehicle (and update the database, simulating when fines are paid).

DSV Format Specifications

In this programming assignment (PA2), you will write a program called extract that will process DSV files and conform to the specifications numbered below. These specifications are a modification (simplification) of the RFC 4180 "standard" (https://www.ietf.org/rfc/rfc4180.txt) for CSV files.

In the specifications given below, we use a single comma as the field separator on the input file. In your solution, this could be any input delimiter as specified with command line options (',' is the default).

Also note that in the specification examples below, a "b" denotes a single blank and a trailing "n" denotes a single newline.

1.    Each and every record is located on a separate line and terminated by a newline n.

2.    For this assignment, you can assume that every record is properly terminated with a newline n.

3.   All records in the file contain exactly the same number of fields.

4.    Delimiters are only placed between fields. The number of delimiters in a record is one less than the number of fields. The last field in each record is only (and always)  followed by a newline.

Here is a record using a comma delimiter

aaa,MMM,cccn

and the same record contents but using a colon ‘:’ delimiter

aaa:MMM:cccn

5.   The number and position of blanks in a record must be preserved. Below are some examples of normal record fields (also sometimes called columns); the last two show records where there is one empty field. Obviously, a field cannot contain a newline n

aaa,MMM,cccn

zzz,123,cccn

zzbbz,XbbbbZ,bcccbn

zzz,123,n

aa,,ccn

6.   There may be an optional header line appearing only as the first line of the file with the same format as a normal line. The header will contain names corresponding to the fields in a record. The header record has the same number of fields as the rest of the data records in the file. Header records are treated like any other record. As an example, here is a two-record file: a header record and one data record:

Summons Number,Plate ID,Registration State,Issue Date,Violation Code

1105232165,GLS6001,NY,07/03/2018,14

7.   Within a record there are two types of fields: normal fields and quoted fields. All fields in the record can be classified as either a normal field or a quoted field. Normal fields and quoted fields can be mixed within a single record. In a quoted field, the first and last character in the field must be double quotes ". In a normal field, double quotes " cannot occur anywhere within the field. If you see double quotes " in a normal field, that record has an error in it. A quoted field must have all the characters in the field between the quotes. If the first field in a record is quoted, the first character in the record is a double quote ". Quoted fields always end with a double quote " immediately followed by either a delimiter or a newline when it is the last field in a record.

"aaa","MMbM","ccc"n

"zzz",123,cccn

The following is an example that contains incorrectly quoted fields:

"aaa"c,"MMbM,b"ccc"n

8.   A quoted field can contain any character including the delimiter (exception: for this PA a newline cannot be inside a field), with one rule for double quotes. If a double quote is to be found in a quoted field, it must appear as a pair of double quotes "" (two double quotes immediately next to each other). No isolated/unaccompanied double quote " can appear in a quoted field (that field would have an error). Here are some examples of using "" and , (the comma delimiter) inside a quoted field. There are two fields in the first record and there are three fields in the second record. As a note, the number of double quotes " in a properly formatted quoted field is always an even number (two for the start and end, plus zero or more pairs of double quotes).

"Aaa""Bbb","x,yz"

"a,ha","""the Great one"",""Rules,""Now","ccc"n

9.   A record with just one field is also possible. It will have just the field and no delimiters in the record. As a special case, a single-field record cannot consist of just an empty field (no lines containing only a newline may appear in the file).

Usage

The program that you write will operate like a Linux command line utility.

1.   It will read the input records from the standard input stream, one line (one record) at a time

2.   Process each input record as specified above to create an output record

3.   Write the output record to standard output stream one line (one record) at a time

4.   Write error messages to the standard error stream (if any).

5.   If there are no errors in the input stream, the program will return EXIT_SUCCESS from main().

6.   If there are errors in either the input stream or while processing the command line options, the program will return EXIT_FAILURE from main().

Specifically, your program (which is called extract) will perform the following steps in a loop until EOF (end of file) is reached on the input stream:


1)   Read one record (one \n-terminated line) at a time using the library function getline() (see man 3 getline) from the standard input stream.

2)   Determine if the record lies within the optional specified record number range. This range is set through command line options; the default is to process all records. As a result of the record number check, the program will either process the record, skip the record, or exit the program. See the section below that describes the command line arguments -b and -e on how the record number range limits work.

3)   Break the record into null-terminated tokens, using the specified input record delimiter, one token per column/field. Each token is pointed at by an array of pointers, one pointer per input record field/column.

4)   Check that all fields/columns are valid per the specification above. Make sure that each record has exactly the specified number of fields/columns. Remember you can assume every record ends in a newline.

5)   If the record is not valid, skip processing that record and write to stderr an error message using the supplied routine dropmsg() that is located in the file misc.c.

6)   If the record is valid, output the selected columns in the order they appear on the command line using the specified output delimiter (see the description of the command line below) to the standard output stream using the library function printf().

A typical execution of your program from the shell might look like this (using an input file with three (3) fields, outputting field 1 followed by field 3, with a comma (the default) as the field delimiter on both the input and the output):

./extract -c3 1 3 < input > output 2> ERRORS