


Department of Computer Science

Summative Coursework Set Front Page

Module Title: Applied Data Science with Python

Module Code: CSMAD21

Lecturer responsible: Miguel Angel Sanchez Razo

Type of Assignment (coursework / online test): Coursework

Individual / Group Assignment: Individual

Weighting of the Assignment: 100%

Page limit/Word count: NA

Expected hours spent for this assignment: 40

Items to be submitted: Zip file containing (more details in Assignment submission requirements):

•    Scenario 1 .ipynb notebook file and its HTML version

•    Scenario 2 .ipynb notebook file and its HTML version

•    Scenario 2 - Task 2 output file

•    Scenario 2 - Task 3 output file

Work to be submitted on-line via Blackboard Learn by: 10th of January 2022. Work will be marked and returned by: 31st of January 2022.

NOTES

By submitting this work, you are certifying that all the sentences, figures, tables, equations, code snippets, artworks, and illustrations in this report are original and have not been taken from any other person's work, except where the work of others has been explicitly acknowledged, quoted, and referenced. You understand that failing to do so will be considered a case of plagiarism. Plagiarism is a form of academic misconduct and will be penalised accordingly. The University's Statement of Academic Misconduct is available on the University web pages.

If your work is submitted after the deadline, 10% of the maximum possible mark will be deducted for each working day (or part of) it is late. A mark of zero will be awarded if your work is submitted more than 5 working days late. You are strongly recommended to hand work in by the deadline, as a late submission on one piece of work can impact on other work.

If you believe that you have a valid reason for failing to meet a deadline, then you should complete an Extenuating Circumstances form and submit it to the Student Support Centre before the deadline, or as soon as is practicable afterwards, explaining why.

1. Assessment classifications

First Class (>= 70%)

The coursework demonstrates:

Excellent knowledge and understanding of the concepts, evidence of independent research into methods used, and a thorough justification of methods

Upper Second (60-69%)

The coursework demonstrates:

Good knowledge of the core concepts, showing understanding, with few mistakes. Good explanations and justification of the methods used

Lower Second (50-59%)

The coursework demonstrates:

Knowledge of core concepts but with some mistakes. Explanations and justifications of the methods used are logical, but limited in depth.

Third (40-49%)

The coursework demonstrates:

Mistakes in the application of knowledge and some misunderstandings; the explanation and justification of the methods used are not clear or logical.

Pass (35-39%)

The coursework demonstrates:

Gaps in knowledge and many mistakes, little evidence of understanding. Methods used are not explained or justified.

Fail (0-34%)

The coursework demonstrates:

Large gaps in knowledge and significant mistakes, also showing limited understanding. Lack of logical explanations behind the methods used.

2. Assignment description

The coursework consists of two scenarios to assess the implementation of the Data Science process with Python as the main tool.

Scenario 1 of 2: Twitter network map data extraction, pre-processing, and analysis

You have been asked to analyse information from the social media platform Twitter, such as the network of certain accounts, hashtags, and other data that can be extracted from it. You are required to implement a full Data Science workflow covering data gathering, cleaning, pre-processing, implementation of a model (network), and analysis of different statistics (e.g. degree distribution, clustering coefficient, etc.); you are also required to provide justification of the process, analysis of the findings, and the reasoning behind the design and implementation, decisions, and assumptions.

Your Tasks

Your overall task is to implement the data science process on data collected from Twitter for at least three accounts and the three hundred most recent tweets of each account. The tasks need to be developed in a Jupyter notebook.

Task 1 Data Gathering, Pre-processing and EDA

Implement a process/workflow to extract information from Twitter. Your solution must consider:

•    API connection and data extraction from the data source.

•    Data Pre-processing from the data source to transform the original data into a Pandas dataframe.

•    Perform a data cleansing activity considered relevant for the process.

•    Provide the explanation of the process, the justification behind it, lessons learned, and findings.

•    Exploratory Data Analysis of the accounts, e.g. number of followers, whether the accounts are producing original tweets or mostly retweeting, etc.

For more details of the data extraction from Twitter, please review section 5. Additional Considerations below in this document. An illustrative sketch of one possible extraction workflow follows.
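
For illustration only, the following is a minimal sketch of such an extraction using the Tweepy library (v4 or later) and the Twitter API v2. The bearer token, account handles, and column names are placeholders and not part of the assignment specification; you are free to structure your own workflow differently.

```python
import tweepy
import pandas as pd

# Assumption: a Twitter API v2 bearer token from the developer portal.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"                          # remove before submitting
ACCOUNTS = ["account_one", "account_two", "account_three"]  # illustrative handles

client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)

rows = []
for handle in ACCOUNTS:
    user = client.get_user(username=handle, user_fields=["public_metrics"]).data
    # Paginate to collect roughly the 300 most recent tweets per account.
    tweets = tweepy.Paginator(
        client.get_users_tweets, id=user.id,
        tweet_fields=["created_at", "public_metrics", "referenced_tweets"],
        max_results=100).flatten(limit=300)
    for tweet in tweets:
        rows.append({
            "account": handle,
            "followers": user.public_metrics["followers_count"],
            "tweet_id": tweet.id,
            "created_at": tweet.created_at,
            "text": tweet.text,
            "retweet_count": tweet.public_metrics["retweet_count"],
            "is_retweet": any(ref.type == "retweeted"
                              for ref in (tweet.referenced_tweets or [])),
        })

df = pd.DataFrame(rows)   # raw data ready for cleansing and EDA
```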

Task 2 Network analysis

The goal of this task is to create a network that represents the area of influence of the selected accounts/influencers. For this you need to consider the network as bidirectional; there are two ways to do it: you can extract the accounts that the influencer is following and/or create the links from the accounts that were retweeted. You need to provide the following:

•    Provide a sample (max 10 records) of the edge list and the neighbour list of the network.

•    Produce a visualisation of the network topology and discuss the output.

•    Calculate statistics of the network, plot them where relevant, and discuss the results, explaining the meaning of any statistics you have calculated.

o Statistics of the network such as

▪    Degree Distribution

▪    Cluster coefficient

▪    Betweenness Centrality

▪    Assortativity

•    Conclusions and lessons learned.

Use NetworkX (Python library) to calculate the statistics of the network, rather than implementing your own Python code to do so. The visualisation may be hard to interpret at first; experimenting with different settings for the layout may help.
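
For illustration only, the following is a minimal sketch of these calculations with NetworkX, assuming an edge list has already been derived from the retweet and/or following relationships gathered in Task 1; the example edges are placeholders.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Assumption: `edges` is the list of (account, connected_account) pairs built in
# Task 2, e.g. from retweeted accounts; the pairs below are placeholders.
edges = [("account_one", "account_two"), ("account_one", "account_three"),
         ("account_two", "account_three"), ("account_three", "account_four")]

G = nx.Graph(edges)                        # undirected ("bidirectional") network

# Sample (max 10 records) of the edge list and the neighbour list
print(list(G.edges())[:10])
print({node: list(G.neighbors(node)) for node in list(G.nodes())[:10]})

# Required statistics
degrees = [degree for _, degree in G.degree()]
avg_clustering = nx.average_clustering(G)
betweenness = nx.betweenness_centrality(G)
assortativity = nx.degree_assortativity_coefficient(G)

# Degree distribution plot and a simple drawing of the topology
plt.hist(degrees, bins=range(1, max(degrees) + 2))
plt.xlabel("Degree"); plt.ylabel("Number of nodes"); plt.show()

nx.draw_spring(G, node_size=20, with_labels=False)   # try other layouts as well
plt.show()
```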

Scenario 2 of 2: Travel time to Uni

An internal department in the university is looking to create a full and comprehensive list of travel times from some English postcodes to the university's Whiteknights campus. They are expecting the output to be an Excel file or similar.

The postcode areas for which they are asking to have this information are the following:

•    RG, OX, SN, SP, GU, PO, SO, DH, DT, all London, SL, HP, MK, LU, AL, SG, GL, CV, B, GL, WR, HR, NP, DY, BA, BS, NN, LE and RH.

The university campuses are:

•    Whiteknights RG6 6AH

The output expected (but not limited to) is the following:

Postcode | Car travel time to Whiteknights | Public transport time to Whiteknights | Walking time to Whiteknights
SW8 1DL  |                                 |                                       |

They have provided the following data source for the postcodes: https://www.doogal.co.uk/UKPostcodes.php . Even though it is not an official data source, it is reliable, regularly updated, cost free, and has relevant additional data such as geographical data.

The client knows that there are several providers from which to obtain the data, but they prefer Google Maps. An additional constraint is that they would like to be cost effective. Given that every API request has a cost involved, and following the constraints of the project, the requests need to be made effectively. There are cases where postcodes are close to each other or even in the same building; considering this, one strategy is to select the most representative points in the data and execute the API requests only on this representative sample; the data can then be replicated to the rest of the data points.

Your task

They have contacted you to deliver this solution, but initially they have requested a feasibility analysis, the solution proposal, and the strategy to follow to implement it. The tasks need to be developed in a Jupyter notebook. Consider just one postcode area as the scope for the coursework.

Task 1: Feasibility analysis

•    Analyse the postcodes data source and whether the information requested can be extracted from third-party solutions.

•    Select one postcode area, download it from www.doogal.co.uk, and perform an EDA on the data. Highlight your findings and the relevant attributes that can be considered to develop the solution.

•    Select a sample of postcodes (up to 5) and perform the API requests to Google Maps to extract the data requested by the user (a hedged sketch of such a request is given after this list).

•    Manipulate the data to follow the requirements.

•   Elaborate your conclusions.
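
For illustration only, the sketch below shows one way to make such a request with the googlemaps Python client and the Distance Matrix API; the API key, sample postcodes, and column names are placeholders, and calling the REST endpoint directly (e.g. with the requests library) would be equally acceptable.

```python
import googlemaps
import pandas as pd

# Assumption: a Google Maps Platform API key with the Distance Matrix API enabled.
gmaps = googlemaps.Client(key="YOUR_API_KEY")       # remove before submitting

DESTINATION = "RG6 6AH, UK"                         # Whiteknights campus
sample_postcodes = ["RG1 1AA, UK", "RG2 7AG, UK"]   # illustrative sample (up to 5)

records = []
for postcode in sample_postcodes:
    row = {"Postcode": postcode}
    # Each transport option is a separate (billed) request.
    for mode in ("driving", "transit", "walking"):
        result = gmaps.distance_matrix(origins=[postcode],
                                       destinations=[DESTINATION],
                                       mode=mode)
        element = result["rows"][0]["elements"][0]
        row[f"{mode}_time_to_whiteknights"] = (
            element["duration"]["text"] if element["status"] == "OK" else None)
    records.append(row)

sample_df = pd.DataFrame(records)   # shape the columns to match the required output
```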

Task 2: Data Extraction Strategy

Once you have validated that the extraction and manipulation of the data is possible:

•    Select one postcode area and apply an unsupervised machine learning algorithm (clustering) to extract the representative data points (a sketch of one possible approach is given after this list).

•    Define the extraction strategy, providing the justification of the algorithm implemented, the number of representative points suggested, and how these will benefit the project. Provide the assumptions made. Add some visual support showing the postcodes and the representative points.
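
For illustration only, the sketch below applies scikit-learn's KMeans to the latitude/longitude columns of the Doogal data and picks one representative postcode per cluster; the file name, the number of clusters, and all variable names are placeholders, and you should justify your own choice of algorithm and number of representative points.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumption: the CSV for the chosen postcode area, downloaded from doogal.co.uk,
# contains "Postcode", "Latitude" and "Longitude" columns.
postcodes_df = pd.read_csv("RG_postcodes.csv")     # illustrative file name

coords = postcodes_df[["Latitude", "Longitude"]]
n_clusters = 50                                    # illustrative; justify your own choice
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(coords)
postcodes_df["cluster"] = kmeans.labels_

# Take the postcode closest to each cluster centre as the representative point.
representatives = []
for label, centre in enumerate(kmeans.cluster_centers_):
    members = postcodes_df[postcodes_df["cluster"] == label]
    squared_dist = ((members["Latitude"] - centre[0]) ** 2 +
                    (members["Longitude"] - centre[1]) ** 2)
    representatives.append(members.loc[squared_dist.idxmin(), "Postcode"])

# Visual support: all postcodes coloured by cluster, centres marked with crosses.
plt.scatter(postcodes_df["Longitude"], postcodes_df["Latitude"],
            s=2, c=postcodes_df["cluster"])
plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 0],
            c="red", marker="x")
plt.xlabel("Longitude"); plt.ylabel("Latitude"); plt.show()
```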

Task 3: Solution implementation

Create the workflow to apply your strategy solution in one postcode area.

•    Create a workflow that controls the API requests to Google Maps for the representative-point postcodes you have established.

•    Replicate the data of the representative points to the remaining postcodes (all postcodes of the selected postcode area must have distance and time information for each transport option). Manipulate the data to fulfil the requirements. Provide the output file. (A sketch of this replication step is given after this list.)

•    Elaborate your conclusions and lessons learned.
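
For illustration only, the sketch below shows the replication step, assuming postcodes_df, kmeans and representatives from the clustering sketch in Task 2, and a results_df holding the Google Maps travel times for the representative postcodes (one row per representative, with a "Postcode" column); the output file name is a placeholder.

```python
# Assumption: `results_df` was produced by running the Task 1 style requests over
# the representative postcodes only, one row per representative postcode.
rep_df = postcodes_df[postcodes_df["Postcode"].isin(representatives)]
cluster_to_rep = dict(zip(rep_df["cluster"], rep_df["Postcode"]))

# Map every postcode to its cluster's representative, then copy the
# representative's travel-time columns onto it.
postcodes_df["representative"] = postcodes_df["cluster"].map(cluster_to_rep)
output_df = postcodes_df.merge(results_df, left_on="representative",
                               right_on="Postcode", suffixes=("", "_rep"))

output_df.to_excel("S2T3_RG_output.xlsx", index=False)   # illustrative file name
```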

Important:

•    Google Cloud provides $200 USD of free credit, and each request has a cost involved (approx. $0.005 USD). Each transport option is considered a request. Do your tests with a small data sample and, once you are familiar with the solution, execute bigger batches.

•    The scenario is to be implemented in just one postcode area requested by the client.

•    Additional information on how to interact with the Google Maps API can be found in Additional Considerations.

3. Assignment submission requirements

•    You must create a Python 3.6 or above Jupyter notebook; when possible, use the packages included in Anaconda (Python 3.6 or above) in your notebook. If you have a good reason to use a Python package not included in Anaconda, please contact the lecturer ([email protected]) first to check before using it (except the libraries mentioned in 5. Additional considerations).

•    Before submitting, please remove the Twitter API connection credentials that you used to extract the data, as they are confidential.

•    Your notebook should be submitted on Blackboard Learn, under the Assignments section, as one archive containing:

o A zip file with your student ID followed by the module code and the legend "Coursework" (e.g. "ce9201209_CSMAD21_Coursework.zip") containing:

▪    Scenario 1 .ipynb notebook file and its HTML version stating your student ID and task abbreviation (e.g. ce920109_S1.ipynb/html)

▪    Scenario 2 .ipynb notebook file and its HTML version stating your student ID and task abbreviation (e.g. ce920109_S2.ipynb/html)

▪    Scenario 2 - Task 2 output file stating your student ID, task abbreviation and the postcode area analysed (e.g. ce920109_S2T2_RG.xls)

▪    Scenario 2 - Task 3 output file stating your student ID, task abbreviation and the postcode area analysed (e.g. ce920109_S2T3_RG.xls)

o Note: The HTML version can be saved from the Jupyter interface under File -> Download as.

•    At the beginning of the submission, please add the following (in a markdown cell in the notebook):

o Module Code:

o Assignment report Title:

o Student Number (e.g. 25098635):

o Date (when the work was completed):

o Actual hrs spent for the assignment:

o Assignment evaluation (3 key points):

•    Include your student ID number in the name of the file containing your work.