
City Analytics on the Cloud

Published: 2021-05-19

Cluster and Cloud Computing Assignment 2 - City Analytics on the Cloud


Background

In development and delivery of non-trivial software systems, working as part of a team is generally (typically!) the norm. This assignment is very much a group project. Students will be put into software teams to work on the implementation of the system described below. These will be teams of up to 5 students. In this assignment, students need to organize their team and their collective involvement throughout. There is no team leader as such, but teams may decide to set up processes for agreeing on the work and who does what. Understanding the dependencies between individual efforts and their successful integration is key to the success of the work and for software engineering projects more generally. If teams have “issues”, then please let me know asap and I will help resolve them.


Assignment Description

The software engineering activity builds on the lecture materials describing Cloud systems, especially the UniMelb Research Cloud and its use of OpenStack; on data from the Twitter APIs; on CouchDB and the kinds of data analytics (e.g. MapReduce) that CouchDB supports; and on data from the Australian Urban Research Infrastructure Network (AURIN – https://portal.aurin.org.au). The focus of this assignment is to harvest tweets from across the cities of Australia on the UniMelb Research Cloud and to undertake a variety of social media data analytics scenarios that tell interesting stories of life in Australian cities and, importantly, show how the Twitter data can be used alongside, compared with, or to augment the data available within the AURIN platform to improve our knowledge of life in the cities of Australia. Teams can download data from the AURIN platform, e.g. as JSON, CSV or Shapefiles, or using the AURIN openAPI (https://aurin.org.au/aurin-apis/). This data can/should be included in the team’s CouchDB database for analysis alongside the Twitter data.
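By way of illustration only, a minimal sketch of loading an AURIN CSV extract into CouchDB through its HTTP API might look as follows; the CouchDB URL, credentials, database name and file name are assumptions, not part of the specification.

```python
import csv
import requests

# Assumed placeholder values -- adjust to your own instance and AURIN extract.
COUCHDB_URL = "http://admin:password@localhost:5984"   # hypothetical credentials/host
DB_NAME = "aurin_demographics"                         # hypothetical database name
CSV_FILE = "aurin_population_by_sa2.csv"               # hypothetical AURIN CSV export

# Create the database if it does not already exist (412 means it is already there).
resp = requests.put(f"{COUCHDB_URL}/{DB_NAME}")
assert resp.status_code in (201, 412), resp.text

# Read the CSV rows and bulk-load them into CouchDB as JSON documents.
with open(CSV_FILE, newline="", encoding="utf-8") as f:
    docs = [dict(row) for row in csv.DictReader(f)]

resp = requests.post(f"{COUCHDB_URL}/{DB_NAME}/_bulk_docs", json={"docs": docs})
resp.raise_for_status()
print(f"Loaded {len(docs)} AURIN records into {DB_NAME}")
```

The same approach works for AURIN JSON/GeoJSON downloads; only the parsing step changes.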

The teams should develop a Cloud-based solution that exploits a multitude of virtual machines (VMs) across the UniMelb Research Cloud for harvesting tweets through the Twitter APIs (using both the Streaming and the Search API interfaces). The teams should produce a solution that can be run (in principle) across any node of the UniMelb Research Cloud to harvest and store tweets and scale up/down as required. Teams have been allocated 4 servers (instances) with 8 virtual CPUs and 500GB of volume storage. All students have access to the UniMelb Research Cloud as individual users and can test/develop their applications using their own (small) VM instances, e.g. personal instances such as pt-1234. (Remember that there is no persistence in these small, free and dynamically allocated VMs.)

The solution should include a Twitter harvesting application for any/all of the cities of Australia. The teams are expected to have multiple instances of this application running on the UniMelb Research Cloud together with an associated CouchDB database containing the amalgamated collection of tweets from the harvester applications. The CouchDB setup may be a single node or a cluster. The system should be designed so that duplicate tweets will not arise.
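One way to meet the no-duplicates requirement (illustrative only) is to use the tweet id as the CouchDB document id, so that a second write of the same tweet, even from a harvester on another VM, is rejected with a 409 Conflict. The sketch below assumes the Tweepy 3.x streaming interface; the CouchDB URL, credentials, database name and Melbourne bounding box are placeholders.

```python
import requests
import tweepy

COUCHDB_URL = "http://admin:password@localhost:5984"   # hypothetical credentials/host
DB_NAME = "twitter"                                    # hypothetical database name
CONSUMER_KEY = CONSUMER_SECRET = ACCESS_TOKEN = ACCESS_SECRET = "..."  # placeholders

class CityListener(tweepy.StreamListener):
    """Store each incoming tweet in CouchDB, keyed by the tweet id."""

    def on_status(self, status):
        doc = status._json
        r = requests.put(f"{COUCHDB_URL}/{DB_NAME}/{status.id_str}", json=doc)
        # 201 = stored; 409 = already stored (possibly by another harvester VM),
        # so duplicate tweets never enter the database.
        if r.status_code not in (201, 409):
            print("Unexpected CouchDB response:", r.status_code, r.text)

    def on_error(self, status_code):
        if status_code == 420:   # rate limited: disconnect and let the harvester back off
            return False

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
stream = tweepy.Stream(auth, CityListener())
# Bounding box roughly covering greater Melbourne: [sw_lon, sw_lat, ne_lon, ne_lat].
stream.filter(locations=[144.5, -38.5, 145.6, -37.4])
```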

Students may want to explore other sources of data they find on the Internet, e.g. information on weather, sport events, TV shows, visiting celebrities, stock market rises/falls, or official statistics on Covid-19; however, these are not compulsory to complete the work. A large corpus of Twitter posts will be made available for data analytics, but again teams may decide that they only wish to focus on Twitter data that they collect.

Teams are expected to develop a range of analytic scenarios, e.g. using the MapReduce capabilities offered by CouchDB for social media analytics and comparing the data with official data from AURIN. Teams are free to explore any scenarios that connect “in some way” to the AURIN data. Teams are encouraged to be creative here. A prize will be awarded for the most interesting scenarios identified! For example, teams may look at scenarios such as:

● How many tweets mention Covid-19 or coronavirus, and are these clustered in certain areas, e.g. rich vs poor suburbs, or in statistical areas where there are more/fewer hospitals etc?

● What do the movement patterns of people look like before Covid-19, during and after lockdown etc?

● Which suburb has the most tweeters and does this correlate with what we might expect from the population demographic of the suburb from AURIN, e.g. more young people live in a given area so we might expect a proportionate increase in the number of tweets (assuming young people tweet more)?

● Do the different languages used when tweeting correlate with the cultures we would expect to find in those areas, e.g. more Chinese speakers live in Box Hill in Melbourne, hence we would expect to see more tweets tagged as Chinese from those suburbs, or Italian tweets from Carlton etc?

● Is there a correlation between crime-related tweets and official crime statistics across the suburbs of Melbourne?

● Is there a correlation between alcohol-related tweets or crime and the locations of places to buy alcohol (bottleshops)?

● Does language use, e.g. vulgar words used on Twitter, occur more or less in wealthy or poor areas?

The above are examples – students may decide to create their own analytics based on the data they obtain. Students are not expected to build advanced “general purpose” data analytic services that can support any scenario, but rather to show how tools like CouchDB, with targeted data analysis capabilities like MapReduce and suitable inputs, can be used to capture the essence of life in Australia.
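As a concrete, illustrative example of such targeted analysis, the sketch below pushes a CouchDB design document containing a single MapReduce view that counts tweets per language tag and then queries it grouped by key; the CouchDB URL, credentials and database name are assumptions.

```python
import requests

COUCHDB_URL = "http://admin:password@localhost:5984"   # hypothetical credentials/host
DB_NAME = "twitter"                                    # hypothetical database name

# Design document with one MapReduce view: count tweets per language tag.
design_doc = {
    "_id": "_design/analytics",
    "views": {
        "tweets_by_lang": {
            "map": "function (doc) { if (doc.lang) { emit(doc.lang, 1); } }",
            "reduce": "_count",   # built-in reduce function
        }
    },
}

resp = requests.put(f"{COUCHDB_URL}/{DB_NAME}/_design/analytics", json=design_doc)
assert resp.status_code in (201, 409), resp.text  # 409 if the design doc already exists

# Query the view, grouping results by key (i.e. by language).
rows = requests.get(
    f"{COUCHDB_URL}/{DB_NAME}/_design/analytics/_view/tweets_by_lang",
    params={"group": "true"},
).json()["rows"]
for row in rows:
    print(row["key"], row["value"])
```

The same pattern (a map function keyed by suburb, hashtag, time of day, etc., with a built-in reduce) covers most of the scenarios listed above, and the view output can then be joined against the AURIN data loaded earlier.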

A front-end web application is required for visualising these data sets/scenarios.

For the implementation, teams are recommended to use a language commonly understood across team members – most likely Java or Python. Information on building and using Twitter harvesters can be found on the web, e.g. see https://dev.twitter.com/ and related links to resources such as Tweepy and Twitter4J. Teams are free to use any pre-existing software systems that they deem appropriate for the analysis and visualisation capabilities, e.g. Javascript libraries, Google Maps etc.


Error Handling

Issues and challenges in using the UniMelb Research Cloud for this assignment should be documented. You should describe the limitations of mining Twitter content and of language processing (e.g. sarcasm). You should outline any solutions developed to tackle such scenarios. The database may, however, contain re-tweets. You should demonstrate how you tackled working within the quotas imposed by the Twitter APIs through the use of the Cloud.
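One simple illustration of working within the Twitter API quotas is to let the client library sleep until the rate-limit window resets rather than failing on HTTP 429 responses, with each harvester VM using its own set of credentials. The sketch below assumes the Tweepy 3.x Search API interface; the credentials, example query, search radius and the store_tweet() helper are all placeholders.

```python
import requests
import tweepy

COUCHDB_URL = "http://admin:password@localhost:5984"   # hypothetical credentials/host
DB_NAME = "twitter"                                    # hypothetical database name
CONSUMER_KEY = CONSUMER_SECRET = ACCESS_TOKEN = ACCESS_SECRET = "..."  # placeholders

def store_tweet(doc):
    # Hypothetical helper: PUT the tweet into CouchDB keyed by its id (see earlier sketch),
    # so re-harvested tweets are rejected as duplicates.
    requests.put(f"{COUCHDB_URL}/{DB_NAME}/{doc['id_str']}", json=doc)

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Tweepy can sleep automatically until the rate-limit window resets,
# instead of the harvester crashing when the quota is exhausted.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Search API example: recent tweets within ~30 km of Melbourne's CBD.
for tweet in tweepy.Cursor(api.search, q="covid OR coronavirus",
                           geocode="-37.81,144.96,30km").items(1000):
    store_tweet(tweet._json)
```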


Final packaging and delivery

You should collectively write a team report on the application developed and include the architecture, the system design and the discussions that led to the design. You should describe the roles of the team members in the delivery of the system, where the team worked well, and where issues arose and how they were addressed. The team should illustrate the functionality of the system through a range of scenarios and explain why you chose the specific examples. Teams are encouraged to write this report in the style of a paper that can ultimately be submitted to a conference/journal.

Each team member is expected to complete a confidential report on their role in the project and their experiences in working with their individual team members. This will be handed in separately from the final team report. (This is not to be used to blame people, but to ensure that all team members are able to provide feedback and to ensure that no team has any member that does nothing!!!)

The length of the team report is not fixed. Given the level of complexity and total value of the assignment, a suitable estimate is a report in the range of 20-25 pages. A typical report will comprise:

● A description of the system functionalities, the scenarios supported and why, together with graphical results, e.g. pie-charts/graphs of Tweet analysis and snapshots of the web apps/maps displaying certain Tweet scenarios;

● A simple user guide for testing (including system deployment and end user invocation/usage of the systems);

● System design and architecture and how/why this was chosen;

● A discussion on the pros and cons of the UniMelb Research Cloud and tools and processes for image creation and deployment;

● Teams should also produce a video of their system that is uploaded to YouTube (these videos can last longer than the UniMelb deployments unfortunately!);

● Reports should also include a link to the source code (github or bitbucket). It is recommended that all students commit their code to the code repository rather than delegate this to a single team member. This can provide an evidence base if teams have “issues”.

It is important to put your collective team details (team, city, names, surnames, student ids) in:

● the head page of the report;

● as a header in each of the files of the software project.

Individual reports describing your role and your teams’ contributions should be submitted through a Qualtrics link that will be sent through in due course.


Implementation Requirements

Teams are expected to use:

● a version-control system such as GitHub or Bitbucket for sharing source code.

● MapReduce-based implementations for analytics where appropriate, using CouchDB’s built-in MapReduce capabilities.

● The entire system should have scripted deployment capabilities. This means that your team will provide a script which, when executed, will create and deploy one or more virtual machines and orchestrate the set up of all necessary software on said machines (e.g. CouchDB, the Twitter harvesters, web servers etc.) to create a ready-to-run system. Note that this setup need not populate the database, but should demonstrate your ability to orchestrate the necessary software environment on the UniMelb Research Cloud. Teams should use Ansible (http://www.ansible.com/home) for this task.

● Teams may wish to utilise container technologies such as Docker, but this is not mandatory.

● The server side of your analytics web application may expose its data to the client through a RESTful design (a minimal sketch follows this list). Authentication or authorization is NOT required for the web front end.
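For illustration only, a RESTful endpoint of this kind might look as follows, assuming Flask and the tweets_by_lang MapReduce view from the earlier sketch; the CouchDB URL, credentials, database name and route are placeholders.

```python
import requests
from flask import Flask, jsonify

app = Flask(__name__)

COUCHDB_URL = "http://admin:password@localhost:5984"   # hypothetical credentials/host
DB_NAME = "twitter"                                    # hypothetical database name

@app.route("/api/languages")
def tweet_counts_by_language():
    """Expose per-language tweet counts from the CouchDB MapReduce view as JSON."""
    resp = requests.get(
        f"{COUCHDB_URL}/{DB_NAME}/_design/analytics/_view/tweets_by_lang",
        params={"group": "true"},
    )
    resp.raise_for_status()
    return jsonify({"rows": resp.json()["rows"]})

if __name__ == "__main__":
    # No authentication, as per the specification; bind to all interfaces on the VM.
    app.run(host="0.0.0.0", port=5000)
```

A JavaScript front end (e.g. a charting library or Google Maps overlay) can then consume such endpoints directly.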

Teams are also encouraged to describe:

● How fault-tolerant is your software setup? Is there a single point-of-failure?

● Can your application and infrastructure dynamically scale out to meet demand?


Deadline

One copy of the team assignment is to be submitted through Canvas. The zip file must be named with your team, i.e. <CCC2021-TeamN>.zip.

Individual reports describing your role and individual team member contributions should be submitted through a Qualtrics link that will be distributed in due course. These individual reports will be completed as web-based forms, i.e. they do not require Word/PDF documents etc.

The deadline for submitting the team assignment is Wednesday 19th May (by 12 noon!).


Marking

The marking process will be structured by evaluating whether the assignment (application + report) is compliant with the specification given. This implies the following:

● A working demonstration of the Cloud-based solution with dynamic deployment – 25% of marks

● A working demonstration of tweet harvesting and CouchDB utilization for specific analytics scenarios – 25% of marks

● Detailed documentation on the system architecture and design – 20% of marks

● Report and write-up discussion including pros and cons of the UniMelb Research Cloud and supporting Twitter data analytics – 20% of marks

● Proper handling of errors and removal of duplicate tweets – 10% of marks

The (confidential) assessment by your peers in your team on the Qualtrics system will be used to weight your individual scores accordingly. Timeliness in submitting the assignment in the proper format is important. A 10% deduction per day will be made for late submissions.


Demonstration Schedule and Venue

The student teams are required to give a virtual presentation (with a few slides) and a demonstration of the working application via zoom. Note that a single representative from each team should give the presentation, i.e. not all students are required to present. The presentation should include the key data analytics scenarios supported as well as the design and implementation choices made. Each team has up to 15 minutes to present their work. This will take place on Wednesday 20th May (8 teams present), Thursday 21st May (4 teams present), Friday 22nd May (4 teams present), Wednesday 27th May (4 teams present). Note that given the numbers of teams this year, not all teams will be able to present – however all teams should be prepared to present on 20th May!!! I will randomly identify a team on the day (using a random number generator for fairness!!!). Note this is the same day as submission hence the deadline for submission is a hard one!!!

As a team, you are free to develop your system(s) wherever you are most comfortable (at home, on your PC/laptop, in the labs...), but obviously the demonstration should work on the UniMelb Research Cloud.