Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit


School of Computing & Mathematics

CSC-40054: Data Analytics and Databases

Coursework Assignment SQLite & dplyrin R or SQLite & dplython in Python 


Introduction: SQLite is an open source, all-inclusive SQL-based database system in a single file. Specifically, it does not require a separate server (i.e., server-less), but instead the entire database engine is integrated into an application that needs to access a database. In addition, SQLite packages the entire database into a single file, within which the database layout and the actual data held (in all the different tables and indexes) are contained. As with all RDBMS, all interaction with a SQLite based system is carried out through the SQL language. In R, both the RSQLite and sqldf packages make use of the integrated DataBase Interface to access the constructed system \ The dplyr package developed by RStudio is an R-based package that is designed to provide a highly optimised set of routines specifically for dealing with data frames. The latter is a particularly important data structure in statistics and in R , where several RDBMS such as SQLite described above also implement such a structure for data manipulations.


This coursework assignment is divided into two parts - Parts I and II. Part I concerns the use of SQLite and dplyr (dplython if you want to use Python) on a dataset available at,

https://archive.ics.uci.edu/ml/machine-leaming-databases/census-income-mld/census- income.data.gz

and perform a number of tasks as specified in the next section (under Tasks). In part II, you are required to discuss in a technical report with approximately 2000 words (figures, tables and appendix excluded) which compares and evaluates the use of the two packages based on your work in Part I.


Part I Tasks (60%)

Download the Census Income data set from the above link and unzip/extract the data file onto a directory in your own filesystem.

1. Create a SQIite database called censusjncome in R (or Python) and a table named Income defined with appropriate column (attribute) names and data types as provided in the Appendix of this document.

2. Add a column with the name SS_ID to the Income table. Fill this column with consecutive numbers starting from 1 for the first row. Make the SS_ID attribute the primary key of the Income table.

3. Construct SQL queries that provide the total number of males and females for each race group reported in the data. The result should show for example how many white females, white males, black males etc. are included into the dataset.

4. Write queries to calculate the average annual income (income = weeks worked per year * wage per hour, assuming 40 hours of work per week) of the reported individuals for each race groups, considering only those with non-zero wage per hour.

5. Create 3 other tables named: Person, Job and Pay, by extracting the following fields respectively from the Income table:



i. Id(SSJD), Age(AAGE), education(AHGA), sex (ASEX), citizenship (PRCITSHP), family

members under 18 (PARENT), previous state (GRINST), previous region (GRINREG), Hispanic origin (AREORGN), employment stat (AWKSTAT);

ii. occjd (SS_ID), Detailed Industry code (ADTIND), detailed occupation code (ADTOCC), major industry code (AMJOCC), major occupation code (AMJIND);

iii. job_id (SS_ID), Wage per hour (AHRSPAY), weeks worked per year (WKSWORK)

Write queries that extract the appropriate values from corresponding attributes in the initial Income table and insert them into the newly created tables.

6. Use these three new tables (constructed in 5.) for the tasks below.

i. Given the data in your tables, create an SQL statement to select the highest hourly wage, the

number of people residing in each state (GRINST) employed in this job, the state, the job type and major industry.

ii. Write an SQL query to determine the employment of people of Hispanic origin with BSc

(Bachelors degree), MSc (Masters degree), and PhD (Doctorate degree) showing the type of industry they are employed in, their average hourly wage and average number of weeks worked per year for each industry.

7. Repeat Tasks 1-6 above using the dplyr package (or dplython if you want to use Python).

For task 7, you should consider using the dplyr package (or dplython) to directly construct the corresponding data frames, as local in-memory data, for the respective tasks, without re/using the SQIite database tables created previously.

NB. For tasks 3, 4 and 6 above you should provide the R (or Python) coded solutions you used in Part I; seealso Report Submission below.


Part II Report Submission (40%)

The report (Part II) and the attachment containing the R (or Python) scripts for Tasks 1-7 (Part I) should be submitted electronically via the KLE as a single ZIP file that should be named using your Keele's userid. NB. As good practice, you should comment your R (or Python) code. The report should be in either Microsoft Word or PDF format, including the userid and document name in the footer. The (zip) archive must be uploaded to the KLE by the deadline. Further details and instructions, including a copy of this worksheet are available in the associated submission dropbox.


Marking Scheme

The weights applied to Parts I (Tasks 1-7) and II (Technical report) are 60% and 40% respectively.

For Part I the marks are split 30% for Tasks 1-6 (using R (or Python)) and 30% for Task 7 (N.B. Task 7 essentially involves re-doing Tasks 1-6 using dplyr (or dplython)).

For Part II, a good report (> 65%) could include a critical evaluation of the two packages (i.e, SQLite v.s. dplyr in R - or - SQLite v.s. dplython in Python) in terms of the data manipulation and analysis capabilities, ease of use, etc., with examples drawn from your work on the tasks carried out in Part I. It should also include relevant references drawn from a range of sources which support your arguments. An adequate report (~55%) will adhere to scholarly standards of academic writing and will demonstrate a clear understanding of the use of the two packages, with comparison that identifies the technical (dis)advantages of each package in terms of the tasks required in (Part I of) this assignment. A weak report will be poorly presented, with numerous typos and unclear sections. It will contain largely of basic assessments of the two packages, with little reference to the literature (other than textbook description) or material learnt from this module.

Please note that you have to either use R - or - Python for the first part of this coursework and you can't mixed them together.




Appendix (Data Specifications)

age

(AAGE)

INT

class of worker

(ACLSWKR)

TEXT

detailed industry recode

(ADTIND)

TEXT

detailed occupation recode

(ADTOCC)

TEXT

education

(AHGA)

TEXT

wage per hour

(AHRSPAY)

NUM

enroll in edu inst last wk

(AHSCOL)

TEXT

marital stat

(7XMARITL)

TEXT

major industry code

(AMJIND)

TEXT

major occupation code

(AMJOCC)

TEXT

race

(ARACE)

TEXT

Hispanic origin

(AREORGN)

TEXT

sex

(ASEX)

TEXT

member of a labor union

(AUNMEM)

TEXT

reason for unemployment

(AUNTYPE)

TEXT

full or part time employment stat

(AWKSTAT)

TEXT

capital gains

(CAPGAIN)

NUM

capital losses

(CAPLOSS)

NUM

dividends from stocks

(DIWAL)

NUM

tax filer stat

(FILESTAT)

TEXT

region of previous residence

(GRINREG)

TEXT

state of previous residence

(GRINST)

TEXT

detailed household and family stat

(HDFMX)

TEXT

detailed household summary in household

(HHDREL)

TEXT

instance weight

(MARSUPWT)

NUM

migration code-change in msa

(MIGMTR1)

TEXT

migration code-change in reg

(MIGMTR3)

TEXT

migration code-move within reg

(M GMTR4)

TEXT

live in this house 1 year ago

(MIGSAME)

TEXT

migration prev res in sunbelt

(MIGSUN)

TEXT

num persons worked for employer

(NOEMP)

NUM

family members under 18

(PARENT)

TEXT

country of birth father

(PEFNTVTY)

TEXT

country of birth mother

(PEMNTVTY)

TEXT

country of birth self

(PENATVTY)

TEXT

citizenship

(PRCITSHP)

TEXT

own business or self employed

(SEOTR)

TEXT

fill inc questionnaire for veteran's admin

(VETQVA)

TEXT

veterans benefits

(VETYN)

TEXT

weeks worked in year

(WKSWORK)

NUM

year

(YEAR)

TEXT

target

(TRGT)

TEX