CSMAD21 Applied Data Science with Python
Department of Computer Science
Summative Coursework Set Front Page
Module Title: Applied Data Science with Python
Module Code: CSMAD21
Lecturer responsible: Miguel Angel Sanchez Razo
Type of Assignment (coursework / online test): Coursework
Individual / Group Assignment: Individual
Weighting of the Assignment: 100%
Page limit/Word count: NA
Expected hours spent for this assignment: 40
Items to be submitted: Zip file containing (more details in Assignment submission requirements):
• Scenario 1 .ipynb notebook file and its HTML version
• Scenario 2 .ipynb notebook file and its HTML version
• Scenario 2 - Task2 output file
• Scenario 2 - Task 3 output file
Work to be submitted on-line via Blackboard Learn by: 10th of January 2022
Work will be marked and returned by: 31st of January 2022
NOTES
By submitting this work, you are certifying that all sentences, figures, tables, equations, code snippets, artworks, and illustrations in this report are your own original work and have not been taken from any other person's work, except where the works of others have been explicitly acknowledged, quoted, and referenced. You understand that failing to do so will be considered a case of plagiarism. Plagiarism is a form of academic misconduct and will be penalised accordingly. The University’s Statement of Academic Misconduct is available on the University web pages.
If your work is submitted after the deadline, 10% of the maximum possible mark will be deducted for each working day (or part of) it is late. A mark of zero will be awarded if your work is submitted more than 5 working days late. You are strongly recommended to hand work in by the deadline as a late submission on one piece of work can impact on other work.
If you believe that you have a valid reason for failing to meet a deadline then you should complete an Extenuating Circumstances form and submit it to the Student Support Centre before the deadline, or as soon as is practicable afterwards, explaining
why.
1. Assessment classifications
Classification | The coursework demonstrates
First Class (>= 70%) | Excellent knowledge and understanding of the concepts, evidence of independent research into methods used, and a thorough justification of methods
Upper Second (60-69%) | Good knowledge of the core concepts, showing understanding, with few mistakes. Good explanations and justification of the methods used
Lower Second (50-59%) | Knowledge of core concepts but with some mistakes. Explanations and justifications of methods used are logical, but limited in depth
Third (40-49%) | Mistakes in application of knowledge, and some misunderstandings; explanation and justification of methods used is not clear or logical
Pass (35-39%) | Gaps in knowledge and many mistakes, little evidence of understanding. Methods used are not explained or justified
Fail (0-34%) | Large gaps in knowledge and significant mistakes, showing limited understanding. Lack of logical explanations behind the methods used
2. Assignment description
The coursework consists of two scenarios that assess the implementation of the Data Science process with Python as the main tool.
Scenario 1 of 2: Twitter network map data extraction, pre-processing, and analysis
You have been asked to analyse information of the social media Twitter, such as the network of certain accounts, hashtags and some other data that can be extracted from it. You are required to implement a full Data Science Workflow going from the data gathering, cleaning, pre-processing, implementation of a model (network), and analysis of different statistics (e.g. Degree Distribution, Cluster coefficient, etc.); you are also required to provide justification of the process, analysis of the findings, reasoning behind the design and implementation, decisions, and assumptions.
Your Tasks
Your overall task is to implement the data science process on data collected from Twitter for at least three accounts, using the three hundred most recent tweets of each account. The tasks need to be developed in a Jupyter notebook.
Task 1 – Data Gathering, Pre-processing and EDA
Implement a process/workflow to extract information from Twitter. Your solution must consider:
• API connection and data extraction from the data source.
• Data Pre-processing from the data source to transform the original data into a Pandas dataframe.
• Perform a data cleansing activity considered relevant for the process.
• Provide the explanation of the process, the justification behind it, lessons learned and findings.
• Exploratory Data Analysis of the accounts, e.g. number of followers, whether the accounts produce original tweets or mostly retweet, etc.
For more details of the data extraction from Twitter please review section 5. Additional Considerations below in this document.
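As a hedged illustration of the pre-processing step, the sketch below assumes the raw tweets have already been fetched (e.g. with tweepy) and shows one way to flatten them into a pandas DataFrame with a simple cleansing step. The sample records and account names are invented; field names follow Twitter's standard tweet JSON.

```python
import pandas as pd

# Invented sample of raw tweet records, shaped like the JSON that the
# Twitter API returns (only the fields used below are included).
raw_tweets = [
    {"id": 1, "full_text": "RT @other: shared post", "created_at": "2021-12-01",
     "user": {"screen_name": "acct_a", "followers_count": 1200}},
    {"id": 2, "full_text": "An original tweet", "created_at": "2021-12-02",
     "user": {"screen_name": "acct_a", "followers_count": 1200}},
]

def tweets_to_frame(tweets):
    """Flatten raw tweet dicts into a tidy DataFrame."""
    rows = []
    for t in tweets:
        rows.append({
            "tweet_id": t["id"],
            "account": t["user"]["screen_name"],
            "followers": t["user"]["followers_count"],
            "text": t["full_text"],
            "is_retweet": t["full_text"].startswith("RT @"),
            "created_at": pd.to_datetime(t["created_at"]),
        })
    df = pd.DataFrame(rows)
    # Example cleansing step: drop exact duplicates by tweet id.
    return df.drop_duplicates(subset="tweet_id")

df = tweets_to_frame(raw_tweets)
print(df[["account", "is_retweet"]])
```

The `is_retweet` flag supports the EDA question above about original tweets versus retweets; add whatever fields your analysis needs.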
Task 2 – Network analysis
The goal of this task is to create a network that represents the area of influence of the selected accounts/influencers. For this you need to consider the network as bidirectional. There are two ways to do it: you can extract the accounts that the influencer is following, and/or create the links from the accounts that were retweeted. You need to provide the following:
• Provide a sample (max 10 records) of the edge list and the neighbour list of the network.
• Produce a visualisation of the network topology and discuss the output.
• Calculate statistics of the network, plot them where relevant, and discuss the results, explaining the meaning of any statistics you have calculated.
o Statistics of the network such as
▪ Degree Distribution
▪ Cluster coefficient
▪ Betweenness Centrality
▪ Assortativity
• Conclusions and lessons learned.
Use NetworkX (Python library) to calculate statistics of the network, rather than implementing your own Python code to do so. The visualisation may be hard to interpret at first; experimenting with different settings for the layout may help.
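The statistics listed above can be sketched with NetworkX on a toy retweet network; the edge list below is invented for illustration, and the same library calls apply to the real network.

```python
import networkx as nx
from collections import Counter

# Toy retweet network: an edge (A, B) means account A retweeted account B.
# Account names are invented; build the real edge list from your data.
edges = [("infl1", "acct_a"), ("infl1", "acct_b"), ("infl2", "acct_b"),
         ("acct_a", "acct_b"), ("infl2", "acct_c")]
G = nx.Graph(edges)

# Degree distribution: number of nodes having each degree.
degree_counts = Counter(d for _, d in G.degree())

# Library implementations, as required, rather than hand-rolled code.
clustering = nx.average_clustering(G)
betweenness = nx.betweenness_centrality(G)
assortativity = nx.degree_assortativity_coefficient(G)

print(dict(degree_counts), round(clustering, 3))
```

For the edge list and neighbour list samples, `list(G.edges())[:10]` and `dict(G.adjacency())` give the raw material.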
Scenario 2 of 2: Travel time to Uni
An internal department in the university is looking to create a full and comprehensive list of travel times from some English postcodes to the university's Whiteknights campus. They are expecting the output to be an Excel file or similar.
The postcode areas for which they are asking to have this information are the following:
• RG, OX, SN, SP, GU, PO, SO, DH, DT, all London, SL, HP, MK, LU, AL, SG, GL, CV, B, WR, HR, NP, DY, BA, BS, NN, LE and RH.
The university campuses are:
• Whiteknights RG6 6AH
The output expected (not limited) is the following:
Postcode | Car travel time to Whiteknights | Public transport time to Whiteknights | Walking to Whiteknights
SW8 1DL | | |
They have provided the following data source for the postcodes: https://www.doogal.co.uk/UKPostcodes.php . Even though it is not an official data source, it is reliable, properly updated, cost free, and has relevant additional data such as geographical data.
The client knows that there are several providers from which to obtain the data, but they prefer Google Maps. An additional constraint is that they would like to be cost effective: every API request has a cost involved, so, following the constraints of the project, requests must be used efficiently. In some cases the postcodes are close to each other or even in the same building; considering this, one strategy is to select the most representative points in the data and execute the API request only on this representative sample, then replicate the data to the rest of the data points.
Your task
They have contacted you to deliver this solution, but initially they have requested a feasibility analysis, a solution proposal, and the strategy to follow to implement it. The tasks need to be developed in a Jupyter notebook. Consider just one postcode area as the scope for the coursework.
Task 1: Feasibility analysis
Analyse the postcodes data source and whether the information requested can be extracted from third-party solutions.
• Select one postcode area, download it from www.doogal.co.uk and perform an EDA on the data. Highlight your findings and relevant attributes that can be considered to develop the solution.
• Select a sample of postcodes (up to 5) and perform the API request to Google Maps to extract the data requested by the user.
• Manipulate the data to follow the requirements.
• Elaborate your conclusions.
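A minimal sketch of one such request is shown below; the endpoint and parameters follow the public Google Maps Distance Matrix API, but the API key is a placeholder, the live call is commented out, and the response used for parsing is a hard-coded sample with illustrative values so the cell runs without spending credit.

```python
# import requests  # uncomment for the live call below

URL = "https://maps.googleapis.com/maps/api/distancematrix/json"
params = {
    "origins": "SW8 1DL, UK",
    "destinations": "RG6 6AH, UK",  # Whiteknights campus
    "mode": "driving",              # one billed request per transport mode
    "key": "YOUR_API_KEY",          # placeholder - never commit a real key
}
# response = requests.get(URL, params=params).json()  # ~$0.005 per request

# Hard-coded sample with the JSON shape the API returns (values invented).
response = {"rows": [{"elements": [{
    "status": "OK",
    "duration": {"text": "1 hour 5 mins", "value": 3900},
    "distance": {"text": "52 km", "value": 52000},
}]}]}

element = response["rows"][0]["elements"][0]
if element["status"] == "OK":
    minutes = element["duration"]["value"] / 60
    print(f"Driving time to Whiteknights: {minutes:.0f} min")
```

Requesting `mode="transit"` and `mode="walking"` in the same way gives the other two columns of the required output.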
Task 2: Data Extraction Strategy
Once you have validated that the extraction and manipulation of the data is possible:
• Select one postcode area and apply an unsupervised machine learning algorithm (clustering) to extract the representative data points.
• Define the extraction strategy, providing the justification of the algorithm implemented, the number of representative points suggested, and how these will benefit the project. State the assumptions made. Add visual support showing the postcodes and the representative points.
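One possible clustering sketch, assuming k-means from scikit-learn (included in Anaconda) over latitude/longitude. The coordinates below are invented stand-ins for the doogal.co.uk download, and the choice of k is yours to justify.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented coordinates standing in for the doogal.co.uk postcode data.
postcodes = pd.DataFrame({
    "postcode": ["RG1 1AA", "RG1 1AB", "RG2 2AA", "RG2 2AB", "RG4 4AA", "RG4 4AB"],
    "lat": [51.455, 51.456, 51.430, 51.431, 51.480, 51.481],
    "lon": [-0.973, -0.972, -0.955, -0.956, -0.990, -0.991],
})

k = 3  # number of representative points; justify this choice in the report
model = KMeans(n_clusters=k, n_init=10, random_state=0)
postcodes["cluster"] = model.fit_predict(postcodes[["lat", "lon"]])

# Representative per cluster: the postcode closest to its centroid.
representatives = []
for label, group in postcodes.groupby("cluster"):
    c = model.cluster_centers_[label]
    dist = np.hypot(group["lat"] - c[0], group["lon"] - c[1])
    representatives.append(group.loc[dist.idxmin(), "postcode"])

print(representatives)
```

A scatter plot of `lat`/`lon` coloured by `cluster`, with the representatives highlighted, would serve as the visual support requested.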
Task 3: Solution implementation
Create the workflow to apply your solution strategy to one postcode area.
• Create a workflow that controls the API requests to Google Maps for the representative-point postcodes you have established.
• Replicate the data of the representative points to the remaining postcodes (all postcodes of the selected postcode area must have distance and time information for each transport option). Manipulate the data to fulfil the requirements. Provide the output file.
• Elaborate your conclusions and lessons learned.
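The replication step can be sketched as a pandas merge on the cluster label assigned in Task 2; the travel times below are invented, and the output filename follows the submission example.

```python
import pandas as pd

# All postcodes of the area, each assigned to a cluster in Task 2
# (values invented for illustration).
postcodes = pd.DataFrame({
    "postcode": ["RG1 1AA", "RG1 1AB", "RG2 2AA", "RG2 2AB"],
    "cluster": [0, 0, 1, 1],
})

# Travel times fetched from the API for each cluster's representative only.
rep_times = pd.DataFrame({
    "cluster": [0, 1],
    "car_mins": [12, 18],
    "transit_mins": [25, 33],
    "walk_mins": [70, 95],
})

# Replicate each representative's times to every postcode in its cluster.
output = postcodes.merge(rep_times, on="cluster", how="left")
# output.to_excel("ce920109_S2T3_RG.xlsx", index=False)  # needs openpyxl
print(output)
```

The left merge guarantees every postcode row survives, so a quick check that no travel-time column contains nulls validates the replication.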
Important:
• Google Cloud provides $200 USD of free credit, and each request has a cost involved (approx. $0.005 USD). Each transport option counts as a separate request. Do your tests with a small data sample and, once you are familiar with the solution, execute bigger batches.
• The scenario is to be implemented in just one postcode area requested by the client.
• Additional information on how to interact with the Google Maps API can be found in Additional Considerations.
3. Assignment submission requirements
• You must create a Python 3.6 (or above) Jupyter notebook; where possible, use the packages included in Anaconda (Python 3.6 or above) in your notebook. If you have a good reason to use a Python package not included in Anaconda, please contact the lecturer ([email protected]) first to check before using it (except the libraries mentioned in 5. Additional considerations).
• Before submitting, please remove the Twitter API connection credentials that you used to extract the data, as they are confidential.
• Your notebook should be submitted on Blackboard Learn, under the Assignments section, as one archive containing:
o A zip file named with your student ID followed by the module code and the word “Coursework” (e.g. “ce9201209_CSMAD21_Coursework.zip”) containing:
▪ Scenario 1 .ipynb notebook file and its HTML version stating your student ID, task abbreviation (e.g. ce920109_S1.ipynb/html )
▪ Scenario 2 .ipynb notebook file and its HTML version stating your student ID, task abbreviation (e.g. ce920109_S2.ipynb/html )
▪ Scenario 2 - Task2 output file stating your student ID, task abbreviation and the postcode area analysed (e.g. ce920109_S2T2_RG.xls)
▪ Scenario 2 - Task 3 output file stating your student ID, task abbreviation and the postcode area analysed (e.g. ce920109_S2T3_RG.xls)
o Note: The HTML version can be saved from the Jupyter interface under File -> Download as.
• At the beginning of the submission, please add the following (in a markdown cell in the notebook):
o Module Code:
o Assignment report Title:
o Student Number (e.g. 25098635):
o Date (when the work completed):
o Actual hrs spent for the assignment:
o Assignment evaluation (3 key points):
• Include your student ID number in the name of the file containing your work.