COMP42315 Assignment – Web Scraping, Data Analysis, and Visualization
Content and skills covered by the assignment
Understand advanced concepts of programming in Python.
Have a critical appreciation of the main strengths and weaknesses of a range of Python packages and understand how to use them.
Have a critical appreciation of how to acquire and clean datasets for analysis.
Understand how to manipulate potentially large datasets efficiently.
Be able to write computer programs in Python using industry-standard packages.
Be able to select appropriate data structures for modelling various data science scenarios.
Be able to select the appropriate algorithm and programming package for a given problem.
Be able to write a computer program in Python to collect or read data from available sources, and clean these datasets using the appropriate packages.
Effective written communication.
Planning, organising, and time-management.
Problem solving and analysis.
Requirements
Students are expected to work on the coursework individually.
In this assignment, you are asked to scrape data from a website and to perform data analysis and visualization. You will implement a programming solution and write a report that explains the implementation and justifies the design.
What the examiners expect from program implementation:
Your program must be runnable on the Durham NCC server – a program that partially works or does not run at all will receive no mark.
You are asked to use Python and the Python libraries taught in this module to complete this part. If you wish to use other libraries, you should ask for permission from your tutors first and provide a strong reason.
Your source code should be documented with comments so that it is as easy to follow as possible.
Apart from performing the requested functionality, your design should aim at a clear programming logic. Your proposed solution should also be as robust as possible, such that it works in different situations and would hopefully work in the future when the site owner updates the webpage (i.e. as future-proof as possible).
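As an illustration of what "robust" and "future-proof" could mean in practice, the sketch below collects links without relying on where they sit in the page layout, resolves them against the page URL, and removes duplicates. It is only a sketch under stated assumptions, not a model answer: it uses the Python standard library only, and the names `LinkCollector`, `unique_links`, `SAMPLE_HTML` and the `example.org` URL are hypothetical and are not taken from the target website.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect href values from every <a> tag, wherever it sits in the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def unique_links(html, base_url):
    """Return absolute, de-duplicated links in first-seen order."""
    parser = LinkCollector()
    parser.feed(html)
    seen = {}  # dict keys keep insertion order, so this acts as an ordered set
    for href in parser.hrefs:
        # Resolve relative paths and drop "#fragment" parts so that
        # different spellings of the same page collapse to one URL.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        seen.setdefault(absolute, None)
    return list(seen)


# Hypothetical markup: the same page is linked twice in different ways.
SAMPLE_HTML = """
<div><a href="pub/one.html">One</a></div>
<p><a href="./pub/one.html#abstract">One again</a></p>
<a href="pub/two.html">Two</a>
"""
links = unique_links(SAMPLE_HTML, "https://example.org/site/index.html")
```

Because the collector does not depend on a particular nesting of tags, a change to the page layout is less likely to break it. A real solution would still need to filter the collected links down to the pages of interest.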
What the examiners expect from the report:
Your report should explain your solution with reference to your source code. You are NOT encouraged to copy the whole source code to your report, but you may refer to/quote important lines if you believe that is helpful.
If there are any features that you wish to highlight, you are encouraged to do so, so that your examiner can pay attention to them.
You are welcome to use visualizations, figures, tables, organization structures, etc. to help you explain your design ideas and showcase the results.
You should also provide support and justification for your design.
Questions
You are asked to perform the following tasks based on the following target website, which contains artificial content designed for this assignment: https://community.dur.ac.uk/hubert.shum/comp42315/
1. Please design and implement the solution to crawl all the unique URLs for the detailed publication pages. Explain your design and highlight any features with no more than 150 words. (10%)
2. Please design and implement the solution to crawl all the text-based information of each publication from the website, to convert such information into a suitable data format, and to store it in a data file. Explain your design and highlight any features with no more than 250 words. (20%)
3. Please design and implement a solution to find out the 100 most popular words used for the title and the abstract of the publications. You should define what a “word” means under your design. For example, such “words” can be of an arbitrary length (single word/double word) and/or they should be as meaningful as possible. Explain your design and highlight any features with no more than 250 words. (20%)
4. Please design and implement the solution to use data analysis and visualization for analysing which authors collaborate (or appear) as co-authors in the publications. Explain your design, highlight any features, and showcase your findings with no more than 300 words. (20%)
5. Please design and implement the solution to use data analysis and visualization for analysing how the features of a publication would affect its “citation” (a value that can be found in the publication detail pages). Explain your design, highlight any features, and showcase your findings with no more than 400 words. (30%)
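To make the "single word/double word" remark in Question 3 concrete, the sketch below counts single words and adjacent word pairs together. It is one possible definition of a "word", given purely for illustration and not as the expected answer. The stop-word list, the function names `tokenize` and `count_terms`, and the sample titles are all assumptions made for this example.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def count_terms(texts, top_n=100):
    """Count single words and adjacent word pairs across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        counts.update(tokens)  # single words
        # Pairs are built after stop-word removal, so a pair may bridge
        # a removed word; this is a design choice, not a requirement.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(top_n)


# Hypothetical titles, not taken from the target website.
SAMPLE_TEXTS = [
    "Deep learning for motion analysis",
    "Motion analysis with deep learning",
]
top_terms = count_terms(SAMPLE_TEXTS, top_n=5)
```

Counting pairs alongside single words lets a phrase such as "deep learning" surface as one term. Whether that is more "meaningful" than the separate words is exactly the kind of design decision the report is asked to justify.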
Word Limit policy
The word count as mentioned in individual questions will:
• Include all the text, including title, preface, introduction, in-text citations, quotations, footnotes, and any other item not specifically excluded below.
• Exclude diagrams, tables (including tables/lists of contents and figures), equations, executive summary/abstract, acknowledgments, declaration, bibliography/list of references, and appendices. However, it is not appropriate to use diagrams or tables merely as a way of circumventing the word limit. If a student uses a table or figure as a means of presenting his/her own words, then this is included in the word count.
Examiners will stop reading once the word limit has been reached, and work beyond this point will not be assessed. Checks of word counts may be carried out on submitted work. Checks may take place manually and/or with the aid of the word count provided via electronic submission.
Plagiarism and collusion
Your assignment will be put through the plagiarism detection service on Learn Ultra.
Students suspected of plagiarism, either of published work or work from unpublished sources, including the work of other students, or of collusion will be dealt with according to the Computer Science Department and University guidelines.
2022-02-08