Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

COMP20008 - Elements of Data Processing, Semester 1, 2024

Assignment 2 – Who else likes this book?

1. Overview

In this project, you will undertake an analysis of a collection of datasets containing detailed information about books and their reviews by users of an online bookstore. Your overall objective is to analyse the data and extract insights. These insights are intended to help managers of the bookstore decide which kinds of books they should buy (and not buy) in the future for best sales, and which books they should recommend to buyers as possible additional purchases. The outcomes of your analysis will be communicated through both a presentation and a written technical report targeted towards a managerial audience.

This assessment presents an opportunity for you to gain experience in data wrangling, processing and analysis for an open-ended task.

You will deliver a brief technical report summarising your findings which should be comprehensible to a reader with a reasonable level of understanding of data analysis. Through this report, you will communicate your insights and discoveries on the landscape of book reviews.

2. Assignment Structure

Group Contract – 2 marks (Due: Friday 26 Apr at 5 PM)

You must submit a group contract outlining your team's goals, expectations, and policies for working on the project. A group contract template is provided. You are welcome to work with the provided template or customize it according to your preference. Submit as a single PDF file via Canvas (Assignment 2: Group Contract).

You may vary your group contract throughout the semester, but proposed changes should be agreed to by all members. There are no marks directly allocated to the content of the Group Contract, but we may refer to it when assessing the relative contribution of each group member to resolve any dispute.

Code and Report Submission – 22 marks (Due: Friday, 10 May at 5 PM)

1. Report: Your report should consist of ten to twelve single-column A4 pages.  Maintain a line spacing of exactly 1 with normal margins and ensure that the font size is 11pt or above. Please note that if your report exceeds twelve pages, only the content within the first 12 pages will be reviewed and assessed. Any additional pages will be disregarded. Conversely, submitting a report shorter than eight pages will result in a penalty. The page limit includes all the text  including  references,  captions,  and  any table  or  image. Tables  and  image content should be readable and sensible in size.

The group name W[XX]G[X] and all group members’ names should appear on the first page after  the  title  of  the   report. Submit   as a  single   PDF  file through Canvas/Turnitin (Assignment 2: Group Report)

2. Code: One or more programs, written in Python, including all the code necessary to reproduce the results in your report (model    implementation, data processing, visualization, and evaluation). Your code should be executable  and   have  enough comments to make it understandable. You should also include a README file that briefly details your implementation and describes how to run your code to reproduce the results in the report. Submit as a single zip file through Canvas/Turnitin (Assignment 2: Code and Comments).

Slides Submission (Due: Monday, 13 May at 9 AM)

You will need to submit the slides you are going to use for delivering your oral presentation. These  slides  should  illustrate  your insights derived from the data analysis task you've undertaken. Submit as a single PowerPoint (.pptx) or PDF file  through  Canvas/Turnitin. (Assignment 2: Oral Presentation Slides) No other format is acceptable.

You will be required to use the exact slides that you have submitted for your presentation.

Oral Presentation and assessment – 8 marks (Due: from Monday 13 May to Friday 17 May)

During week 11 all teams should deliver an oral presentation of their work and findings for assignment 2. Some of the presentations will be conducted in the students' usual workshop room and some in other venues which will be announced shortly. Two markers will assess the oral presentations. See section 6 for more details.

Teamwork evaluation – 2 marks (Due: Friday 24 May at 5 PM)

For this  part  of the assessment, every team  member  needs  to evaluate  both their  own contributions to the assignment and the contributions of their teammates. This evaluation should align with the expectations you set in your submitted “group contract” .

The evaluation should be delivered via Feedback Fruit available on Canvas (Assignment 2: Teamwork Evaluation).

Your group members' evaluations will determine individual group member evaluation scores worth 2 marks. If any member is identified as a non-contributor, these scores may be used to adjust those individual’s marks for the report and oral presentation (worth 30 marks).

3. Data Sets

3.1 Main Data sets

The provided files contain data regarding various books, users, and their corresponding book ratings. You will find this information distributed across three distinct CSV files.

-     ‘BX-Books.csv’   dataset    comprises   information    on   18,185    books,   including   their International  Standard  Book  Number  (ISBN),  Title,  Author,  Year  of  publication  and Publisher.

-     ‘BX-Users.csv’ dataset comprises anonymised information on 48,299 users of the online bookstore including their ID, City, State, Country and Age.

-     ‘Bx-Ratings.csv’ dataset includes the reviews of the provided users on the given books. The columns include the user ID, book ISBN and the rating associated with that review.

What datasets you use will depend on your research question and the analysis approach your group agrees on. Details about using these text features are provided in the README file.

3.2 Recommendation Data Sets

Considering the nature of these files, there is an opportunity to develop a recommendation system  capable  of  predicting  the  ratings  that  users  might  assign  to  new  books.  While incorporating this into your research question is an optional challenge, groups opting to implement  the  recommendation  system  can  substitute  it  with  the  two  supervised  or unsupervised models outlined in section 4.3.

To  assist  you  with   implementing  a   recommendation  system  we   have   provided  three separate CSV files:

-     ‘ BX-NewBooks.csv’ dataset information on 8,924 new books, including their ISBN, Title, Author, Year of publication and Publisher.

-     ‘ BX-NewBooks-Users.csv’  dataset comprises  information on 8,520 users of the online bookstore including their ID, City, State, Country and Age. These users are not new and they have a history of rating books in the system. Your goal can be to predict the ratings that these users can provide for the books in the ‘ BX-New-Books.csv’ dataset.

-     ‘ BX-NewBooks-Ratings.csv’ dataset contains the  real ratings provided by users for the new books listed in the 'BX-NewBooks.csv' dataset, which are associated with the users' information in the 'BX-NewBooks-Users.csv' dataset. You can utilize this information to compare the predicted ratings generated by your recommendation systems against the actual ratings provided by users, allowing for comprehensive evaluation and validation of your models.

Please keep in mind that if you are not implementing a recommendation system you are not allowed to use these datasets.

4. Data Analysis Tasks

4.1. Research Question

The research question clarifies the purpose of your analysis. It identifies the problem or question  being  addressed,  sets  the  context,  and  explains  why  the  analysis  is  being conducted.

In  your  report,  it  is  essential  to  introduce  (at  least)  one  research  question  clearly  and explicitly.  We   have  presented  a  few   examples  of  possible  research  questions  in  the accompanying video to provide you with some inspiration. However, each team needs to independently formulate their own research question based on the provided dataset.

While the possibility exists to explore more than one research question, it's important to note that the pursuit of several questions is not necessarily desirable or likely to lead to greater marks (i.e. full marks are obtainable for one well-studied research question). We will primarily evaluate the quality of your work by assessing the depth of your analysis, and the insights it yields, rather than simply covering a larger quantity of content or material.

4.2. Data Pre-processing

So far  in the subject, you've  learned various ways to  prepare  and  organize data. These include  techniques  like  filling  in   missing  data  (data  imputation),   reshaping  data  (data manipulation), adjusting data  ranges  (scaling), converting data  (encoding), and grouping data into categories (discretizing). You've also explored methods to simplify complex data (dimensionality  reduction)  and  handle  text  data  (text  processing)  using  tools  like  text vectorization and TF-IDF.

For  this  project,  you're  encouraged  to  consider  applying  any  of  these  methods  to  the provided datasets.  Your objective is to implement a minimum of three data pre-processing techniques, though you're welcome to utilize as many data pre-processing techniques as you see fit. The methods you select should logically support the research question(s) you have picked, and in your report and presentation, you should explain the reason for your selection of each method.

In your report and presentation, ensure you provide justifications and explanations for all methods  you  employ  (for  both  pre-processing  and  supervised/unsupervised  models). Present the results, and highlight any interesting discoveries. It would be good if you also describe the importance (effect) of these discoveries in terms of sales.

Remember, there's no single expected solution here. The more deeply you engage with and understand your data, the better set-up you will be for subsequent stages of your project.

4.3. Use of supervised and unsupervised models

In this  subject,  we  explore  certain  Machine  Learning  related  techniques.  These  include identifying  relationships  between  variables  (correlation),  predicting  outcomes  based  on known  data  (supervised  models  like  Decision  trees  and  linear  regression),  and  finding patterns in data without prior labels (unsupervised methods like k-means and agglomerative clustering). Many other techniques are possible too.

Feel free to choose any Machine Learning method(s) that are suitable for answering your research question. Your choices should be substantiated and clarified in both your report and  presentation. The objective  is  to  implement a minimum of  two Machine Learning techniques, though you're welcome to  utilise  more  if  you  so  choose. You  might  opt to employ two supervised models, or two unsupervised methods, or one of each. As highlighted earlier, you have the flexibility to incorporate a recommendation system as your machine learning model implementation. Implementing a recommendation system will satisfy the minimum expectations of section 4.3.

In  your  report  and  presentation,  it's  important  to  articulate  your  rationale  behind  the machine learning methods you chose. Provide a concise overview of your approach and outline how you assessed the effectiveness of your chosen methods. Equally important is your interpretation of the results and their implications.

NOTE: You are welcome (and indeed strongly encouraged) to make use of any relevant existing Python libraries (such as sklearn or scipy) in your work on this assignment.

5. Report

Your  primary  submission  for  this  assignment  is  your  report.  The  report  should  follow  the structure of a technical paper. It should describe your approach and observations, both in data preparation, and the machine learning algorithms you tried. Its main aim is to provide the reader with knowledge about the problem, in particular critical analysis of your results and discoveries.

The following is the expected structure of the report for this assignment.

Executive Summary: A concise overview of the entire report, summarizing the objectives, methods used, key findings, and recommendations. This section provides a high-level snapshot of what you have done.

Introduction: This section introduces the purpose of the report, the problem or question being addressed, and introduces the data sources used. It sets the context and explains why the analysis was conducted.

Methodology: Detailed explanation of the methods, techniques, and tools employed for data preparation, analysis, and interpretation. When writing this section, you can assume that the reader is familiar with the technical terms.

Data Exploration and Analysis: Present the results of your data analysis. This section may include descriptive statistics, visualizations, and insights gained from exploring the data. Use charts, graphs, and tables to illustrate patterns, trends, and relationships.

Results: Summarize the most important insights obtained from the supervised and/or unsupervised learning models you have used. Focus on answering the main questions or addressing the problem you have introduced in the introduction. Present the results, in terms of evaluation metric(s) and, ideally, illustrative examples and diagrams.

Discussion  and  Interpretation: Provide  a  list  of  interesting  findings  and  an  in-depth interpretation of them. Bullet points or numbered lists can help highlight these findings. Explain  the  significance  of  the   patterns  observed.   Explain  why  these  findings  are interesting and valuable. Discuss any unexpected or interesting insights that emerged. (This is the most important section of your report)

Remember we are more interested in seeing evidence that you have thought about the task and can identify reasons behind your different results in different experiments. You should think beyond simple numbers to the reasons that underlie them and connect them back to your research question. You can also add complementary experiments and their results in this section.

Limitations and improvement opportunities: Address the limitations of the analysis, such as data constraints, potential biases, or assumptions made. Explain what needs to be done to improve your analysis.

Conclusion: Summarize the main points of the report and reiterate the key findings and recommendations. Emphasise the value and potential impact of the analysis.

References: List  any  sources,  references,  or  citations  used  in the  report,  especially  if you've drawn upon external research or literature to inform your analysis.

We've supplied a template for the report via the assignment page. You are welcome to work with the provided template or customize it according to your preference.

6.  Oral Presentation and Assessment

You need to conduct an oral presentation explaining what you have done for assignment 2. Your presentation should encompass the key components below:

1. Introduction of Research Question: Begin by introducing the research question that guided your assignment. Explain briefly why it is relevant to the managers of the bookstore.

2. Methods,  Techniques,  and  Tools: Elaborate on the methods, techniques, and tools you employed for both data preparation and data analysis. Explain how you gathered, cleaned, and  structured  the  data,  as  well  as  the  analytical  techniques  and  machine  learning techniques you utilized.

3. Presentation of Results: Share the outcomes derived from your data analysis.  Provide a concise overview of the insights you gained through your analytical process.

4. List  of  Findings and  In-Depth  Interpretation:  Present  a  list  of  the  findings  from  your analysis.  Then   provide  an  interpretation  of  these  findings,  shedding   light  on  the significance and implications they hold in relation to your research question.

5. Limitations and Improvement Opportunities: Address the limitations encountered during your study, discussing any constraints or challenges that might have influenced the results. Furthermore, demonstrates suggested potential areas for improvement and development.

The presentation requirements are as follows:

Timing: Your presentation should take exactly 9 minutes. If your presentation doesn't finish on time the markers will interrupt and stop you and it will also negatively impact your  mark.  There  may  be  a  further 10 minutes of  questions  and  answers  from  the markers.

Presenters: Attendance at the presentation is mandatory for all team members unless they have been granted an exemption by the teaching staff. Each member of the group is expected to contribute to the presentation content.

Slides: To ensure fairness for all groups and prevent last-minute modifications based on other teams’ work, when presenting you will be asked to use the exact version of the slides that you submitted to Canvas.

6.1. Oral Assessment

After the presentation, there will be an oral assessment of all team members’ knowledge of the assignment. During this Q&A session, each member will be evaluated individually. Tutors will ask questions about the entire report, rather than focusing on your specific sections. All members are required to respond independently to oral questions regarding both the report and the presentation. Our findings from the oral assessment can impact your report marks.

7. Teamwork

As mentioned previously, 2 marks for this assignment are determined by the results of your teamwork evaluation task.  However,  based  on  these  assessments  and  past  records,  we  will identify any non-contributing members and adjust the overall assignment grade accordingly.

The group contract outlines the expectations and responsibilities of each group member. It's crucial that every member actively participates in this assignment. Remember, your comprehension of the entire project will be assessed during the oral evaluation.

If you encounter any challenges with inactive team members who aren't responsive to your inquiries, please reach out to Hasti for assistance in finding a solution.

8. Assessment Criteria

The report will be marked according to the rubric published via the assignment page. The oral presentations and oral assessments will also be marked according to their published rubric.

Although your code is not assessed directly, you have to submit the code that produced the results  presented  in  your  report.  If  you  do  not  submit  executable  code  that  supports  your findings, we reserve the right to give your team zero marks for the report section.

9. Terms and Conditions

9.1 Changes/Updates to the Assignment Specifications

We will use Canvas to advertise any (hopefully small-scale) changes or clarifications in the assignment  specifications. Any addendums made  to  the  assignment  specifications  via Canvas will supersede the information contained in this version of the specifications.

It is your responsibility to ensure you are adhering to the latest iteration of  these specifications should updates be announced.

9.2 Late Submissions

There will be no extensions granted, and no late submissions allowed to ensure a smooth run of the oral presentations.

For students who are demonstrably  unable to submit  in time, we  may  be  able to  offer alternative  arrangements,  but  these  could  involve  not  being  able  to  complete  the  oral presentation component, with the associated work being reweighted. The arrangement will be sought on a case-by-case basis. Please email Hasti ([email protected]) with documentation of the reasons for the delay.

9.3 Academic Honesty

While it is acceptable to discuss the assignment with others in general terms, excessive collaboration with students outside of your group is considered cheating. Your submissions will be examined for originality and will invoke the University’s Academic Misconduct Policy where either an inappropriate level of collaboration or plagiarism appears to have taken place.

We highly recommend (re)taking the academic honesty training module in this subject's Canvas. We will  be  checking  submissions  for  originality  and will  invoke the  University's Academic Misconduct policy where inappropriate levels of collusion or plagiarism appear to have taken place. Content produced by generative AI (including, but not limited to, ChatGPT) is not your own work, and submitting such content will be treated as a case of academic misconduct, in line with the University's academic integrity policy and specifically recent guidance on the use of ChatGPT and other Large Language Models in student work.

9.4 Data Acknowledgement

The data used in this assignment is extracted from the datasets provided on this Kaggle page under the Creative Commons CC0 license.