
COMSM0089

Introduction to Data Analytics

Coursework Resit

Summer 2022

Overview

This coursework will take you through the data analytics process for an example scenario, from processing text data to visualising information. As well as implementing data analytics methods and obtaining results, you should aim to demonstrate your understanding of the methods you use and critically evaluate these methods. Your work should also incorporate ideas from the lecture videos and lectorials.

We recommend that you first get a basic implementation for all parts of the required assignment, then start writing your report with some results for all tasks. You can then gradually improve your implementation and results.

Total time required: 40 hours.

Support

The lecturers are available to answer clarification questions if you are unsure what to do for any part of the coursework. It is best to contact us directly by email or Teams: for tasks 1 and 2, contact Edwin (edwin.simpson@bristol.ac.uk), and for task 3, contact Ian ([email protected]).

Task 1: Text Classification (max. 29%)

Product reviews contain masses of useful information for online shopping websites and product manufacturers about what is good and bad about particular products. Your task is to design, implement and evaluate a text classifier for Amazon product reviews that can predict star ratings, ranging from 1 to 5. Each review has a text body, some metadata and a star rating. The dataset can be accessed through the HuggingFace datasets library. Please see the Jupyter notebook data_loader_demo_resit.ipynb (available on Blackboard) for example code for loading the dataset.

The data is described in this paper:

Phillip Keung, Yichao Lu, György Szarvas and Noah A. Smith. “The Multilingual Amazon Reviews Corpus.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.

1.1. Implement and train a method for automatically predicting star ratings for Amazon product reviews. You can choose to use the data for any of the languages in the dataset (English, Japanese, German, French, Chinese or Spanish). You can work with more than one language, but this is not required. Refer to the labs, lecture materials and textbook to identify a suitable method. Include the following in your report:

•    Briefly explain how your chosen sentiment analysis method works and its main strengths and limitations (3 marks)

•    Describe the features you have chosen and why you chose them, and hypothesise how they will affect your results (3 marks)

•    Explain the preprocessing steps your method requires (1 mark)
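As an illustration of what "method", "features" and "preprocessing" can mean in 1.1, here is a minimal sketch of one possible approach: a multinomial naive Bayes classifier over bag-of-words counts with add-one smoothing. It is not the required method. The class name `NaiveBayes`, the `tokenize` function and the toy reviews are all hypothetical, and the tokeniser assumes English text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Preprocessing (assumed): lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words features (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing: every vocabulary word gets one extra count.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:             # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up toy reviews, only to show the interface.
train_texts = [
    "terrible broke after one day",
    "awful waste of money",
    "excellent quality love it",
    "great product works perfectly",
]
train_labels = [1, 1, 5, 5]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("love it, great quality")
```

A sketch like this makes the report questions concrete: the features are word counts, the preprocessing is lowercasing and tokenisation, and one limitation is that word order and negation ("not great") are ignored.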

1.2. Implement, train, and test your method. Briefly document this process in the report. If you find the training dataset is too large to work with, you can use a random subset of the training dataset. (6 marks)
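If you do use a random subset, fixing the random seed keeps the experiment reproducible. The following is a minimal sketch; the helper name `sample_subset`, the field names `review_body` and `stars`, and the placeholder data are assumptions made for illustration, not the contents of the provided notebook.

```python
import random


def sample_subset(examples, size, seed=0):
    """Return `size` examples drawn without replacement, reproducibly."""
    if size >= len(examples):
        return list(examples)
    rng = random.Random(seed)  # local generator, so the global seed is untouched
    return rng.sample(examples, size)


# Placeholder data standing in for the real training split.
full_train = [{"review_body": f"review {i}", "stars": i % 5 + 1} for i in range(1000)]
subset = sample_subset(full_train, 100, seed=42)
```

Reporting the subset size and seed in your report lets a reader see exactly which data the results are based on.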

1.3. Evaluate, interpret and discuss your results, making sure to include the following points:

•    Define your performance metrics and state their limitations (2 marks)

•    Show your results using suitable plots, tables and/or a confusion matrix (4 marks)

•    How could you improve the method or experimental process? Consider the errors that your method makes. (3 marks)
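The metrics mentioned above can be computed directly from gold and predicted labels. The sketch below uses plain Python so that each definition is visible. The function names and the toy label lists are hypothetical, and accuracy and macro-averaged F1 are only two of several reasonable choices.

```python
from collections import Counter


def confusion_matrix(gold, pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    f1_values = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_values.append(f1)
    return sum(f1_values) / len(f1_values)


# Made-up labels for illustration only.
labels = [1, 2, 3, 4, 5]
gold = [1, 1, 2, 3, 4, 5, 5, 5]
pred = [1, 2, 2, 3, 5, 5, 5, 4]

matrix = confusion_matrix(gold, pred, labels)
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, labels)
```

Note that both metrics treat a 5-star review predicted as 4 stars the same as one predicted as 1 star, which is one limitation worth discussing for an ordinal target such as star ratings.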

1.4. Consider the probabilities or weights learned by your model for different features. Are there any features that your trained model strongly associates with 1-star or 5-star ratings? Show some examples of such features and explain how you identified them. Can you see any problems with relying on these features? (7 marks)
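One possible way to identify such features, assuming a count-based model such as naive Bayes, is to rank words by the log ratio of their smoothed probabilities under the 5-star and 1-star classes. This is a sketch of that idea only. The function name `log_odds` and the toy reviews are hypothetical, and a linear model's learned weights could be inspected in a similar way.

```python
import math
from collections import Counter


def log_odds(texts, labels, pos_label=5, neg_label=1, smoothing=1.0):
    """log P(word | pos_label) - log P(word | neg_label), with additive smoothing."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        if label == pos_label:
            pos.update(tokens)
        elif label == neg_label:
            neg.update(tokens)
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        word: math.log((pos[word] + smoothing) / pos_total)
        - math.log((neg[word] + smoothing) / neg_total)
        for word in vocab
    }


# Made-up reviews for illustration only.
texts = ["great great phone", "great battery", "terrible phone", "terrible terrible battery"]
labels = [5, 5, 1, 1]

scores = log_odds(texts, labels)
ranked = sorted(scores, key=scores.get)          # most 1-star-like first
most_negative, most_positive = ranked[0], ranked[-1]
```

On real data, a ranking like this may surface product or brand names alongside genuine sentiment words, which is one kind of problem the question asks you to consider.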

High performance figures are less important for getting high marks than motivating your method well and implementing and evaluating it correctly.

Suggested length of report for task 1: 2.5 pages.

Task 2: Named Entity Recognition (max. 21%)

Many applications require a way to extract information about organisations, places, and people. This task is therefore to design and implement a named entity recognition method for Wikipedia text (link to HuggingFace page). The dataset is labelled with the entity tags location (LOC), person (PER), and organisation (ORG). Code to load the dataset is provided in the Jupyter notebook data_loader_demo_resit.ipynb (available on Blackboard).

The data is presented in this paper:

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively Multilingual Transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164, Florence, Italy. Association for Computational Linguistics.

2.1. Design a method for tagging named entities in the WikiANN dataset. Refer to the labs, lecture materials and textbook to identify a suitable method. Include the following in your report:

•    Briefly explain how your chosen named entity recognition method works and its main strengths and limitations (3 marks)

•    Describe the features you have chosen and why you chose them, and hypothesise how they will affect your results (3 marks)

•    Explain the tagging scheme for labelling entities in this dataset, i.e., what the labels in the dataset mean and how they are used to identify entities (1 mark)
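To show how token-level tags can be turned into entities, the sketch below decodes a tag sequence under the assumption that the dataset uses an IOB2-style scheme (B-X opens an entity of type X, I-X continues it, O is outside any entity). Please check this assumption against the dataset itself. The function name `extract_entities` and the example sentence are hypothetical.

```python
def extract_entities(tokens, tags):
    """Group tokens into (text, type) entities from IOB2-style tags (assumed scheme)."""
    entities = []
    current_tokens, current_type = [], None

    def close():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()                                  # a new entity ends any open one
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_tokens.append(token)             # continue the open entity
        else:
            # "O", or an I- tag with no matching open entity (treated as outside here).
            close()
            current_tokens, current_type = [], None
    close()
    return entities


# Made-up sentence for illustration only.
tokens = ["Ada", "Lovelace", "visited", "the", "University", "of", "Bristol", "in", "Bristol"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
entities = extract_entities(tokens, tags)
```

How to treat an I- tag that does not follow a matching B- tag is a design choice; this sketch discards it, whereas some evaluation tools start a new entity instead.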

2.2. Implement, train, and test your method. Briefly document this process in the report. (6 marks)

2.3. Evaluate, interpret, and discuss your results, making sure to include the following points:

•    Explain your choice of performance metrics and their limitations (2 marks)

•    Show your results using suitable plots and/or tables (3 marks)

•    How could you improve the method or experimental process? Consider the errors your method makes (3 marks)
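One common choice of metric for named entity recognition is span-level precision, recall and F1, where a predicted entity counts as correct only if its boundaries and type both match the gold entity exactly. The sketch below assumes IOB2-style tags. The function names and the toy tag sequences are hypothetical, and this is one option rather than the required metric.

```python
def tag_spans(tags):
    """Return a set of (start, end, type) spans from IOB2-style tags (assumed scheme)."""
    spans = set()
    start, entity_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel "O" closes a final span
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if start is not None and not continues:
            spans.add((start, i, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return spans


def span_prf(gold_sentences, pred_sentences):
    """Micro-averaged span-level precision, recall and F1 with exact matching."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold_sentences, pred_sentences):
        gold, pred = tag_spans(gold_tags), tag_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up tag sequences for illustration only.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
pred = [["B-PER", "O", "O", "B-LOC"], ["B-ORG", "I-ORG", "O"]]
precision, recall, f1 = span_prf(gold, pred)
```

In this toy example six of the seven tokens are tagged correctly, yet only two of the three entities match exactly, which illustrates why token accuracy and span-level scores can tell different stories.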

Suggested length of report for task 2: 2 pages.

Task 3: Information Visualisation (50%)

3.1. Use Tableau to create plots that enable the user to explore two datasets relating to child health included with this document. If you prefer, they can be downloaded from the Global Health Observatory (by age: https://apps.who.int/gho/data/node.main.nHE-1559AGE?lang=en and wealth quintile data: https://apps.who.int/gho/data/node.main.nHE-1559?lang=en). Note that each figure comes with a confidence interval based on the fact that sampling was used to gather the data. For the purposes of this task, you may just use the main figure.
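If you choose to clean the data before loading it into Tableau, the main figure can be separated from its confidence interval with a small script. The sketch below assumes each cell looks like `23.4 [20.1-26.9]`, that is, the main figure followed by the interval in square brackets. This format is an assumption to check against the actual files, and the function name `main_figure` is hypothetical. The same split can also be done inside Tableau.

```python
import re


def main_figure(cell):
    """Return the leading number of a cell such as '23.4 [20.1-26.9]' (assumed format)."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", cell)
    return float(match.group(1)) if match else None  # None when no number is present


value = main_figure("23.4 [20.1-26.9]")
missing = main_figure("No data")
```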

You should enable the user to answer these questions:

•    In which countries has child malnutrition improved over the period and in which countries has malnutrition got worse?

•    Is there a link between wealth and child malnutrition?

•    Show the values on a world map with information on both 0-1 years and 2-5 years appropriately presented.

In about two pages, write a description of the visualization techniques you used and a justification for your choices. You should refer to the principles of info vis, relevant aspects of human perception and cognition, and the scientific literature where appropriate. (40 marks: 30 marks for the visualization; 10 marks for the description and justification).

3.2. For two of the visualisations you have produced in 3.1, identify and describe the visual queries that the user makes to answer the question. Your report on this should be no more than one page (10 marks).

Implementation

Text Analytics: The lab notebooks provide useful example code and we recommend using Python 3 with the libraries used in the labs. You may use other libraries if preferred and you can write your code in either Jupyter notebooks or standard Python files.

Information Visualisation: We recommend using Tableau and applying what you have learned in the labs and lectorials.

Report Formatting

•    Maximum of 10 pages

•    References do not count toward the page limit

•    We recommend using the template from COLING 2020 if writing the report in LaTeX, or following the same formatting style if using Word or another application. You don’t need to include the abstract or use the same section headings as this template.

•    No less than 11pt font

•    Single line spacing

•    A4 page format

•    Aim for quality rather than quantity: you do not have to use the maximum number of pages and will receive higher marks if you write concisely and clearly.

•    The text in your figures must be big enough to read without zooming in.

Citations and References

Make sure to cite a relevant source when you introduce a method or discuss results from previous work. The preferred style is given in the COLING 2020 style guide above. The details of the cited papers must be given at the end in the references section (no page limits on the references list). Please only include papers that you discuss in the main body of the report.

Google Scholar and similar tools are useful for finding relevant papers. The ‘cite’ link provides BibTeX code for use with LaTeX and references that you can copy, but beware that these often contain errors.

Submission

•    Deadline: 13.00 (GMT+1) on 8th August.

•    On Blackboard under the “assessment, submission and feedback” link.

Please upload the following three files:

1.   Your report as a PDF with filename <student_number>.pdf, where <student_number> is your student number (not your username).

2.   Your code inside a single zip file with filename <student_number>.zip. Please remove datasets and other large files to minimise the upload size – we only need the code itself.

3.   A packaged Tableau workbook (use this link to find out more) with filename <student_number>.twbx containing your solution to Task 3. This enables us to run the workbook in Tableau reliably.

We will briefly review your Python code by eye; we do not need to run it. Your marks will be based on the contents of your report, with the code used to check how you carried out the experiments described in your report. We will not assess the coding style, comments, or organisation of the code.

Please do not include your name in the report text itself: to ensure fairness, we mark the reports anonymously.

Please check that your submission follows these guidelines before uploading, otherwise you may lose marks.

Assessment Criteria

Your coursework will be evaluated based on your submitted report containing the presentation of methods, results and discussions for each task. To gain high marks your report will need to demonstrate a thorough understanding of the tasks and the methods used, backed up by a clear explanation (including figures) of your results and error analysis. The exact structure of the report and what is included in it is your decision and you should aim to write it in a professional and objective manner. The report will be assessed based on how well it addresses each of the tasks in the required and open assignments, with the percentage of marks available for each task shown above. Marks will be awarded for appropriately including concepts and techniques from the lectures.

Avoiding Academic Offences

Please re-read the university’s plagiarism rules to make sure you do not unknowingly break any rules. Do not copy text directly from your sources; always rewrite in your own words and provide a citation.

Academic offences include submission of work that is not your own, falsification of data/evidence or the use of materials without appropriate referencing. Note that sharing your report with others is also not allowed. These offences are all taken very seriously by the University.

Suspected offences will be dealt with in accordance with the University’s policies and procedures. If an academic offence is suspected in your work, you will be asked to attend an interview with senior members of the school, where you will be given the opportunity to defend your work. The plagiarism panel can apply a range of penalties, depending on the severity of the offence. These include a requirement to resubmit work, capping of grades and the award of no mark for an element of assessment.

Extenuating Circumstances

If the completion of your assignment has been significantly disrupted by serious health conditions, personal problems, periods of quarantine, or other similar issues, you can apply for consideration of extenuating circumstances in accordance with the normal university policy and processes. Students should apply for consideration of extenuating circumstances as soon as possible when the problem occurs. Please see the details here.