闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

School of Computing & Mathematics

CSC-40054: Data Analytics and Databases

Coursework Assignment SQLite & dplyrin R or SQLite & dplython in Python

Introduction: SQLite is an open source, all-inclusive SQL-based database system in a single file. Specifically, it does not require a separate server (i.e., server-less), but instead the entire database engine is integrated into an application that needs to access a database. In addition, SQLite packages the entire database into a single file, within which the database layout and the actual data held (in all the different tables and indexes) are contained. As with all RDBMS, all interaction with a SQLite based system is carried out through the SQL language. In R, both the RSQLite and sqldf packages make use of the integrated DataBase Interface to access the constructed system \ The dplyr package developed by RStudio is an R-based package that is designed to provide a highly optimised set of routines specifically for dealing with data frames. The latter is a particularly important data structure in statistics and in R, where several RDBMS such as SQLite described above also implement such a structure for data manipulations.

This coursework assignment is divided into two parts - Parts I and II. Part I concerns the use of SQLite and dplyr (dplython if you want to use Python) on a dataset available at,

https://archive.ics.uci.edu/ml/machine-leaming-databases/census-income-mld/census- income.data.gz

and perform a number of tasks as specified in the next section (under Tasks). In part II, you are required to discuss in a technical report with approximately 2000 words (figures, tables and appendix excluded) which compares and evaluates the use of the two packages based on your work in Part I.

Part I Tasks (60%)

Download the Census Income data set from the above link and unzip/extract the data file onto a directory in your own filesystem.

1. Create a SQIite database called censusjncome in R (or Python) and a table named Income defined with appropriate column (attribute) names and data types as provided in the Appendix of this document.

2. Add a column with the name SS_ID to the Income table. Fill this column with consecutive numbers starting from 1 for the first row. Make the SS_ID attribute the primary key of the Income table.

3. Construct SQL queries that provide the total number of males and females for each race group reported in the data. The result should show for example how many white females, white males, black males etc. are included into the dataset.

4. Write queries to calculate the average annual income (income = weeks worked per year * wage per hour, assuming 40 hours of work per week) of the reported individuals for each race groups, considering only those with non-zero wage per hour.

5. Create 3 other tables named: Person, Job and Pay, by extracting the following fields respectively from the Income table:

i. Id(SSJD), Age(AAGE), education(AHGA), sex (ASEX), citizenship (PRCITSHP), family

members under 18 (PARENT), previous state (GRINST), previous region (GRINREG), Hispanic origin (AREORGN), employment stat (AWKSTAT);

ii. occjd (SS_ID), Detailed Industry code (ADTIND), detailed occupation code (ADTOCC), major industry code (AMJOCC), major occupation code (AMJIND);

iii. job_id (SS_ID), Wage per hour (AHRSPAY), weeks worked per year (WKSWORK)

Write queries that extract the appropriate values from corresponding attributes in the initial Income table and insert them into the newly created tables.

6. Use these three new tables (constructed in 5.) for the tasks below.

i. Given the data in your tables, create an SQL statement to select the highest hourly wage, the

number of people residing in each state (GRINST) employed in this job, the state, the job type and major industry.

ii. Write an SQL query to determine the employment of people of Hispanic origin with BSc

(Bachelors degree), MSc (Masters degree), and PhD (Doctorate degree) showing the type of industry they are employed in, their average hourly wage and average number of weeks worked per year for each industry.

7. Repeat Tasks 1-6 above using the dplyr package (or dplython if you want to use Python).

For task 7, you should consider using the dplyr package (or dplython) to directly construct the corresponding data frames, as local in-memory data, for the respective tasks, without re/using the SQIite database tables created previously.

NB. For tasks 3, 4 and 6 above you should provide the R (or Python) coded solutions you used in Part I; seealso Report Submission below.

Part II Report Submission (40%)

The report (Part II) and the attachment containing the R (or Python) scripts for Tasks 1-7 (Part I) should be submitted electronically via the KLE as a single ZIP file that should be named using your Keele's userid. NB. As good practice, you should comment your R (or Python) code. The report should be in either Microsoft Word or PDF format, including the userid and document name in the footer. The (zip) archive must be uploaded to the KLE by the deadline. Further details and instructions, including a copy of this worksheet are available in the associated submission dropbox.

Marking Scheme

The weights applied to Parts I (Tasks 1-7) and II (Technical report) are 60% and 40% respectively.

For Part I the marks are split 30% for Tasks 1-6 (using R (or Python)) and 30% for Task 7 (N.B. Task 7 essentially involves re-doing Tasks 1-6 using dplyr (or dplython)).

For Part II, a good report (> 65%) could include a critical evaluation of the two packages (i.e, SQLite v.s. dplyr in R - or - SQLite v.s. dplython in Python) in terms of the data manipulation and analysis capabilities, ease of use, etc., with examples drawn from your work on the tasks carried out in Part I. It should also include relevant references drawn from a range of sources which support your arguments. An adequate report (~55%) will adhere to scholarly standards of academic writing and will demonstrate a clear understanding of the use of the two packages, with comparison that identifies the technical (dis)advantages of each package in terms of the tasks required in (Part I of) this assignment. A weak report will be poorly presented, with numerous typos and unclear sections. It will contain largely of basic assessments of the two packages, with little reference to the literature (other than textbook description) or material learnt from this module.

Please note that you have to either use R - or - Python for the first part of this coursework and you can't mixed them together.

Appendix (Data Specifications)

age	(AAGE)	INT
class of worker	(ACLSWKR)	TEXT
detailed industry recode	(ADTIND)	TEXT
detailed occupation recode	(ADTOCC)	TEXT
education	(AHGA)	TEXT
wage per hour	(AHRSPAY)	NUM
enroll in edu inst last wk	(AHSCOL)	TEXT
marital stat	(7XMARITL)	TEXT
major industry code	(AMJIND)	TEXT
major occupation code	(AMJOCC)	TEXT
race	(ARACE)	TEXT
Hispanic origin	(AREORGN)	TEXT
sex	(ASEX)	TEXT
member of a labor union	(AUNMEM)	TEXT
reason for unemployment	(AUNTYPE)	TEXT
full or part time employment stat	(AWKSTAT)	TEXT
capital gains	(CAPGAIN)	NUM
capital losses	(CAPLOSS)	NUM
dividends from stocks	(DIWAL)	NUM
tax filer stat	(FILESTAT)	TEXT
region of previous residence	(GRINREG)	TEXT
state of previous residence	(GRINST)	TEXT
detailed household and family stat	(HDFMX)	TEXT
detailed household summary in household	(HHDREL)	TEXT
instance weight	(MARSUPWT)	NUM
migration code-change in msa	(MIGMTR1)	TEXT
migration code-change in reg	(MIGMTR3)	TEXT
migration code-move within reg	(M工 GMTR4)	TEXT
live in this house 1 year ago	(MIGSAME)	TEXT
migration prev res in sunbelt	(MIGSUN)	TEXT
num persons worked for employer	(NOEMP)	NUM
family members under 18	(PARENT)	TEXT
country of birth father	(PEFNTVTY)	TEXT
country of birth mother	(PEMNTVTY)	TEXT
country of birth self	(PENATVTY)	TEXT
citizenship	(PRCITSHP)	TEXT
own business or self employed	(SEOTR)	TEXT
fill inc questionnaire for veteran's admin	(VETQVA)	TEXT
veterans benefits	(VETYN)	TEXT
weeks worked in year	(WKSWORK)	NUM
year	(YEAR)	TEXT
target	(TRGT)	TEX