Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Programming in Python for Data Science

CS2PP22

ASSESSMENT CLASSIFICATIONS

This coursework assesses your ability to:

•    understand and use appropriate Python syntax and ecosystem;

•    implement common computer science algorithms and functional programming in Python;

•    understand statistical and machine learning methods for data analytics and mining in Python;

•    apply appropriate statistical and machine learning techniques for data science tasks .

In general, you will gain credit for:

•    preparing and submitting required files as requested;

•    successful implementation of the specified coding tasks;

•    writing efficient, functional code;

•    providing thoughtful, clear, well-structured written analysis.

Your assignment will be marked according to the marking scheme provided below. The scheme is designed so that  the  collectively  weighted  assignment  mark  will  correspond  to  the  following  qualitative  degree classification descriptions:

The table below shows what is typically expected of the work to obtain a given mark.

Classification Range

Typically, the work should meet these requirements:

First Class (>=70%)

Outstanding/excellent work with correct codes and results. An outstanding work should demonstrate coding proficiency with high efficiency and based on advanced techniques. Evidence of independent research into methods used and a thorough justification of applications of these methods.

Upper Second (60-69%)

Good work with few mistakes. Some minor tasks have not been carried out or are not completely correct. Coding with good efficiency. Evidence of good knowledge of the core concepts, with good explanations and justifications.

Lower Second (50-59%)

Demonstrates knowledge of core concepts but with some mistakes. Explanations and justifications of methods used are logical but limited in depth. Coding with average efficiency. Most tasks have been carried out with sufficient accuracy.

Third (40-49%)

Some parts of the assignment are missing and/or have partially correct results. Most tasks have not been carried out with sufficient accuracy. Results may not be correct or technically sound. Mistakes in application of knowledge and shows some misunderstandings. Explanations and justifications of methods used are not clear or logical. Coding might be inefficient.

Pass (35-39%)

Some significant part of the assignment is missing and/or has partially correct results. Gaps in knowledge and many mistakes, little evidence of understanding. Methods used are not well explained or justified. Coding is notably inefficient.

Fail (0-34%)

Many aspects of the assignment are missing, or there are large gaps in knowledge and significant mistakes, also showing limited understanding. Lack of logical explanations behind the methods used.

ASSIGNMENT DESCRIPTION

Major Coursework (100% of module assessment)

This assignment consists of two tasks.  Both of these will be used to assess your implementation of elements of the Data Science process, using Python as the main tool.

A detailed breakdown of theMarking Schemeis provided later in this document.

Task 1 – Data Preprocessing, Exploratory Data Analysis, and Python Classes

Using the cardata.csv file within the CS2PP22_Assessment_Task1.ipynb Jupyter notebook, you will execute several components of the data science process and design and implement a class structure that controls and compiles data about a fictional sporting event by writing Python code to perform the outlined sub-tasks detailed in the notebook. Working through this notebook, you will read, write, and manipulate data to extract specific features, design and implement functional routines, and design and implement an algorithm to select an optimal subset from a larger dataset.

Some sub-tasks will ask you to provide a written explanation of the justification behind it your coding choices. Code and written responses should be presented in a set of well-formatted code and Markdown cells at appropriate points in your Jupyter notebook. This work will require the production and submission of additional files; details about these files and how they should be submitted are provided in the notebook and the Assignment Submission Requirements.

Task 2 – Twitter Data Analysis

Using the CS2PP22_Assessment_Task2.ipynb Jupyter notebook, you will extract data from the social media platform, Twitter, and use the data as the basis for implementing components of the data science process to build and test a regression model. You will need to extract at least 300 tweets (perhaps, the 300 most recent tweets) from at least 3 Twitter accounts.

Visualise the results concisely and discuss the reasons why one might prefer the use of one of your tested methods over another. As in Task 1, written responses should be provided in a set of well-formatted Markdown cells at appropriate points in your Jupyter notebook.

Additional points of consideration and example extraction methods are provided in the notebook. Efficient extraction of the tweets will require installation of at least one new Python package. The most efficient of these, tweepy, requires that you obtain a developer account with Twitter. Instructions for gaining the appropriate access are found in the Additional Considerations section of this document.

 Project Directory and Data Description

The materials needed to complete this assessment are available in a single CS2PP22_Assessment.zip file on the CS2PP22 Blackboard space, under the Assessment heading, in the Coursework Description and Datasets item.  This is outlined below and contains a data directory with subdirectories for Task1 and Task2.

The first task relies on a file consisting of comma-separated values (CSV) with a header that briefly describes each column. This file will be used to work throughthe prompts in CS2PP22_Assessment_Task1.ipynb that guide analysis of the data.

In the second task, you are asked to source your own data from Twitter.  Use the provided Task 2 notebook, CS2PP22_Assessment_Task2.ipynb, to begin this analysis.

CS2PP22_Assessment.zip

── data/

│    ├── Task1/

   L── cardata.csv

L── Task2/

L─ < - empty - >

── CS2PP22_Assessment.pdf

── CS2PP22_Assessment_Task1.ipynb

LCS2PP22_Assessment_Task2.ipynb

Car Features and MSRP Data:             cardata.csv

This dataset includes car features such as make, model, year, and engine type, as scraped from Edmunds and Twitter.  It is often used to develop models to predict car prices based on their other characteristics.

Source:https://www.kaggle.com/datasets/CooperUnion/cardataset

Each row corresponds to a single kind of vehicle.

The columns correspond to:

Make

Car maker

Model

Car model

Year

Car year (Marketing)

Engine Fuel Type

Type of engine fuel category

Engine HP

Engine horsepower (HP)

Engine Cylinders

Number of engine cylinders

Transmission Type

Type of transmission category

Driven_Wheels

Drive wheel category

Number of Doors

Number of doors

Market Category

Market category

Vehicle Size

Vehicle size category

Vehicle Style

Vehicle style category

highway MPG

Highway fuel efficiency in miles per gallon

city mpg

City fuel efficiency in miles per gallon

Popularity

Twitter-based popularity metric

MSRP

Manufacturer suggested retail price (USD)

Twitter Data:

As noted in the Task 2 description above, you will extract the data from 3 accounts of your choice.  The format of this data will differ based on the method of extraction you choose and the specific data features you choose to extract.

Assignment Submission Requirements

Front page of the Submission

The following are compulsory.  Please add these items to at the top of your Jupyter notebooks in a        Markdown cell.  To be extra helpful, please repeat this information in the Add Comments section of the Blackboard submission page.

Module Code:

Assignment Report Title:

Student Number (e.g., 25098635):

Date (when work was completed):

Actual hours spent on assignment:

Assignment evaluation (3 key points):

We will use information about how long you spent on the assignment when we review and balance        coursework between modules for later years.  An exact answer is not necessary, but please try to give a reasonable approximation.

The assignment evaluation is an opportunity for you to provide feedback on your experience with the assignment.  We will use this to improve coursework for next year.  You might like to comment on the following concepts:

•    Were any parts of the assignment particularly fun, engaging, interesting, boring, or frustrating?

•    Was the assignment too long/short/easy/difficult, or were these features simply appropriate?

•    Were there any notable errors or technical problems with the materials supporting the assignment?

You will not be penalised for providing negative points of evaluation.

Content of the Required Work:

You must use Python (version 3.8 or above) Jupyter Notebooks (version 6.3.0 or above).  Where possible, use the packages included in the Anaconda3 distribution used in this module (2021.05).

If you find good reason to employ additional Python packages in the creation of your solution, please provide an excruciatingly detailed description of the package installation procedure that includes        specification of your Anaconda3, Python, and Jupyter Notebook versions, as well as the version            information for your additional Python packages.

As mentioned above, your submission should take the form of 3 items: a single archive file (based on the one downloaded for this project) and separate .pdf copies of the notebooks, one for each of the two     tasks.

You will find the submission point on the module’s Blackboard page under Assessment.  The name of the  archive and .pdfs should be formatted with your student ID, the module code, and the tag Assessment” (e.g., ce9201209_CS2PP22_Assessment.tar.gz).

While you might find it useful to include more material (e.g., modules containing functions or classes   used in the notebooks), the final content of your Blackboard submission should have, at minimum, the following structure and contents.  Items in orange represent new files that you will produce or modify.

cz9201209_CS2PP22_Assessment_Task1.pdf

cz9201209_CS2PP22_Assessment_Task2.pdf

cz9201209_CS2PP22_Assessment.zip

├─data/

│    ├── Task1/

    │   ├── cardata.csv

    │   L── cardata_modified.csv

│    L── Task2/

│        ├── twitter_user1.csv

│        ├── twitter_user2.csv

        L── twitter_user3.csv

── CS2PP22_Assessment.pdf

├── CS2PP22_Assessment_Task1.ipynb [completed and fully executed] ├── CS2PP22_Assessment_Task2.ipynb [completed and fully executed] ├── enhanced_boxplot.png

├── popularity.png

L── [any auxiliary modules, package version notes]

Code Plagiarism

Copying whole tutorials, scripts or images from other sources is not allowed. Any material you borrow from other sources to build upon should be clearly referenced (use comments to reference in Python  scripts); otherwise, it will be treated as plagiarism, which may lead to investigation and subsequent     action.

Marking Scheme

Task                                                     Element                                                Marks Available

 

 

 

 

 

Task 1

Organisation: Preparation and submission of all required files

5

1.0: Analysis Preparation

5

1.1: Data Cleaning

15

1.2: Creating New Columns

5

1.3: Exploratory Data Analysis

20

1.4: Fuel Efficiency Tournaments

40

Overall: Coding efficiency and structure, including comments and docstrings, where appropriate.

10

Task 1 Total

100

 

 

 

 

 

 

Task 2

Organisation: Preparation and submission of all required files

10

2.1: Extraction of tweet datasets

10

2.2: Exploratory data analysis

20

2.3: Data processing

10

2.4: Regression analysis

20

2.5: Model evaluation and testing

10

Overall: Coding efficiency and structure, including comments and docstrings, where appropriate.

10

Overall: Report structure and reasoning (format, clarity, logic, quality of written communication)

10

Task 2 Total

100

Total

Assessment Total

200


Additional Considerations

Task 2

To extract the tweets from the accounts you have selected with the tweepy package, you MUST:

•    Have or create a Twitter account.