MSIN0010: Data Analytics I

2023/24

Assignment 3

Submission deadlines: Students should submit all work by the published deadline date and time. Students experiencing sudden or unexpected events beyond their control which impact their ability to complete assessed work by the set deadlines may request mitigation via the extenuating circumstances procedure. Students with disabilities or ongoing, long-term conditions should explore a Summary of Reasonable Adjustments.

Return and status of marked assessments: Students should expect to receive feedback within one calendar month of the submission deadline, as per UCL guidelines. The module team will update you if there are delays through unforeseen circumstances (e.g. ill health). All results when first published are provisional until confirmed by the Examination Board.

Copyright Note to students: Copyright of this assessment brief is with UCL and the module leader(s) named above. If this brief draws upon work by third parties (e.g. Case Study publishers) such third parties also hold copyright. It must not be copied, reproduced, transferred, distributed, leased, licensed or shared with any other individual(s) and/or organisations, including web-based organisations, without permission of the copyright holder(s) at any point in time.

Academic Misconduct: Academic Misconduct is defined as any action or attempted action that may result in a student obtaining an unfair academic advantage. Academic misconduct includes plagiarism, obtaining help from or sharing work with others (be they individuals and/or organisations), and any other form of cheating. Refer to Academic Manual Chapter 6, Section 9: Student Academic Misconduct Procedure - 9.2 Definitions.

Referencing: You must reference and provide a full citation for ALL sources used, including AI sources, articles, textbooks, lecture slides and module materials. This includes any direct quotes and paraphrased text. If in doubt, reference it. If you need further guidance on referencing, please see UCL’s referencing tutorial for students. Failure to cite references correctly may result in your work being referred to the Academic Misconduct Panel.

Use of Artificial Intelligence (AI) Tools in your Assessment: Your module leader will explain to you if and how AI tools can be used to support your assessment. In some assessments, the use of generative AI is not permitted at all. In others, AI may be used in an assistive role which means students are permitted to use AI tools to support the development of specific skills required for the assessment as specified by the module leader. In others, the use of AI tools may be an integral component of the assessment; in these cases the assessment will provide an opportunity to demonstrate effective and responsible use of AI. See page 3 of this brief to check which category use of AI falls into for this assessment. Students should refer to the UCL guidance on acknowledging use of AI and referencing AI. Failure to correctly reference use of AI in assessments may result in students being reported via the Academic Misconduct procedure. Refer to the section of the UCL Assessment success guide on Engaging with AI in your education and assessment.

Content of this assessment brief

Section    Content

A          Core information

B          Coursework brief and requirements

C          Module learning outcomes covered in this assessment

D          Groupwork instructions (if applicable)

E          How your work is assessed

F          Additional information

Section A: Core information

Submission date: 06/12/2023

Submission time: 10am, UK time

Assessment is marked out of: 20

% weighting of this assessment within total module mark: 6

Maximum word count/page length/duration: No maximum.

Footnotes, appendices, tables, figures, diagrams, charts included in/excluded from word count/page length? No specified maximum.

Bibliographies, reference lists included in/excluded from word count/page length? No specified maximum.

Penalty for exceeding word count/page length: No specified maximum (and thus no penalty for exceeding).

Penalty for late submission: Standard UCL penalties apply. Students should refer to https://www.ucl.ac.uk/academic-manual/chapters/chapter-4-assessment-framework-taught-programmes/section-3-module-assessment#3.12

Artificial Intelligence (AI) category: Assistive

Submitting your assessment: The assessment needs to be submitted on Moodle. See the online coursework submission guide.

Anonymity of identity: Normally, all submissions are anonymous unless the nature of the submission is such that anonymity is not appropriate, illustratively as in presentations or where minutes of group meetings are required as part of a group work submission. The nature of this assessment is such that anonymity is required.

Section B: Assessment Brief and Requirements

Exercise 1: Use the Credit dataset from the ISLR package in R to solve the following exercises.

a) (1 point) Estimate a linear regression model with Rating as the response variable and Income, Education, and Student as explanatory variables. Show the R code and the model output.

b) (2 points) Interpret the coefficient estimate and p-value for Education shown in the model output. Be clear about what each number means in the present context and what we can learn from them.

c) (3 points) Perform model selection to analyze whether a regression tree would be a better model to predict the credit rating in the future. Use a 75% training and 25% testing split of the data, and use the Mean Squared Error (MSE) as a model fit statistic. Choose whether to focus on the in-sample or out-of-sample fit and justify this choice. Provide the R code you use and the output.
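For reference, the Mean Squared Error named in part c) is conventionally defined as below. This is the standard textbook definition rather than anything stated in this brief; the symbols n, y_i and \hat{y}_i are notation introduced here for illustration, with y_i the observed Rating and \hat{y}_i the model's prediction for observation i.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Evaluating the sum over the training observations gives the in-sample fit; evaluating it over the held-out testing observations gives the out-of-sample fit.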

Exercise 2: Load the purchases.csv dataset from Moodle. The dataset contains information on the search sessions of consumers on different days. Each observation is a randomly sampled search session. For each observation, we have data on how many clicks and purchases the consumer made during the session. Moreover, we have information on the duration of the session in minutes, on which weekday the session took place (1=Monday, 2=Tuesday, etc.), and an identifier for the city.

a) (1 point) An analyst wants to use a linear regression model to understand why some sessions lead to more purchases. What is the response variable in the data, and what are possible explanatory variables the analyst could use?

b) (1 point) Load the data and estimate a linear regression with the response and the explanatory variables you determined in a).

c) (3 points) What do you conclude from the output? Clearly state how you come to the conclusion based on the output.

d) (1 point) A colleague argues that a classification tree would be better suited in this case. Do you agree with your colleague? If yes, why? If not, why not?

Exercise 3: Load the clicks.csv dataset from Moodle. The dataset contains information on clicks and purchases for products in a product category of an online retailer. Each observation is a product shown to a randomly sampled consumer on a product list similar to this one: https://www.amazon.co.uk/s?k=headphones. For each observation, we have data on whether the consumer clicked on it and purchased it. For each product, we have information on the position at which it appeared on the list (lower position means higher on the list), the price, the review score, whether the product was on promotion, and whether the product is from a well-known brand.

a) (1 point) An analyst wants to use a regression tree to predict which products a consumer is going to click on. Is this model appropriate for the task? If yes, explain why. If not, propose a better-suited alternative model and explain why it is better-suited.

b) (1 point) Load the data and estimate the model you determined in a) using the following explanatory variables: position, price, review_score, and brand_indicator. Show your R code and the model output or tree graph, depending on the model you estimate.

c) (1 point) The analyst now wants to perform a statistical test to check whether cheaper products tend to get more clicks, conditional on the other explanatory variables in b). Which model discussed in class is best-suited for this task, and why?

d) (3 points) Perform the test in c). For this, clearly formulate the hypotheses underlying the test and draw a conclusion. Show your code and output, and be clear what in the output you use to draw your conclusion.

Exercise 4 (2 points): Consider the following data set on bank transfers. There are three variables: fraud indicates whether the transfer was fraudulent or not, amount indicates the transfer amount in GBP (thousands), and number_transactions indicates how many transfers to the same recipient were made in the past.

ID    fraud    amount    number_transactions

1     no       6         4

2     yes      4         0

3     no       8         3

4     no       2         2

Suppose you observed the following data on a new bank transfer:

ID    fraud    amount    number_transactions

5     ?        5         0

Use the 1-nearest neighbor algorithm (WITHOUT rescaling the data) to classify whether the transfer is fraudulent or not. Show your work.
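The brief does not name a distance measure. Assuming the usual Euclidean distance on the two unscaled features, the distance between the new transfer (ID 5) and an existing transfer i could be written as follows; the symbols d_i, a and t are shorthand introduced here for the distance, amount and number_transactions.

```latex
d_i = \sqrt{ \left( a_i - a_5 \right)^2 + \left( t_i - t_5 \right)^2 }
```

Under this assumption, the 1-nearest neighbor rule assigns the new transfer the fraud label of the observation with the smallest d_i.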

Section C: Module Learning Outcomes covered in this Assessment

This assessment contributes towards the achievement of the following stated module Learning Outcomes as highlighted below:

- Use selected tools to analyze and visualize data.

- Understand and apply founding probability and statistical theory to data analysis.

- Understand key elements of the theory, technology, and algorithms that underpin the tools used.

Section D: Groupwork Instructions (where relevant/appropriate)

Section E: How your work is assessed

Within each section of this assessment you may be assessed on the following aspects, as applicable and appropriate to this assessment, and should thus consider these aspects when fulfilling the requirements of each section:

- The accuracy of any calculations required.

- The strengths and quality of your overall analysis and evaluation.

- Appropriate use of relevant theoretical models, concepts and frameworks.

- The rationale and evidence that you provide in support of your arguments.

- The credibility and viability of the evidenced conclusions/recommendations/plans of action you put forward.

- Structure and coherence of your considerations and reports.

- Appropriate and relevant use of, as and where relevant and appropriate, real world examples, academic materials and referenced sources. Any references should use either the Harvard OR Vancouver referencing system (see References, Citations and Avoiding Plagiarism).

- Academic judgement regarding the blend of scope, thrust and communication of ideas, contentions, evidence, knowledge, arguments, conclusions.

- Each assessment requirement(s) has allocated marks/weightings.

Student submissions are reviewed/scrutinised by an internal assessor and are available to an External Examiner for further review/scrutiny before consideration by the relevant Examination Board.

It is not uncommon for some students to feel that their submissions deserve higher marks (irrespective of whether they actually deserve higher marks). To help you assess the relative strengths and weaknesses of your submission please refer to SOM Assessment Criteria Guidelines, located on the Assessment tab of the SOM Student Information Centre Moodle site.

The above is an important link as it specifies the criteria for attaining the pass/fail bandings shown below:

At UG Levels 4, 5 and 6:

80% to 100%: Outstanding Pass - 1st; 70% to 79%: Excellent Pass - 1st; 60% to 69%: Very Good Pass - 2.1; 50% to 59%: Good Pass - 2.2; 40% to 49%: Satisfactory Pass - 3rd; 20% to 39%: Insufficient to Pass - Fail; 0% to 19%: Poor and Insufficient to Pass - Fail.

At PG Level 7:

86% to 100%: Outstanding Pass - Distinction; 70% to 85%: Excellent Pass - Distinction; 60% to 69%: Good Pass - Merit; 50% to 59%: Satisfactory - Pass; 40% to 49%: Insufficient to Pass - Fail; 0% to 39%: Poor and Insufficient to Pass - Fail.

You are strongly advised to review these criteria before you start your work and during your work, and before you submit.

You are strongly advised not to compare your mark with the marks of other submissions from your student colleagues. Each submission has its own range of characteristics which differ from others in terms of breadth, scope, depth, insights, and subtleties and nuances. On the surface one submission may appear to be similar to another but invariably, digging beneath the surface reveals a range of differing characteristics.

Students who wish to request a review of a decision made by the Board of Examiners should refer to the UCL Academic Appeals Procedure, taking note of the acceptable grounds for such appeals. Note that the purpose of this procedure is not to dispute academic judgement – it is to ensure correct application of UCL’s regulations and procedures. The appeals process is evidence-based and circumstances must be supported by independent evidence.

Section F: Additional information from module leader (as appropriate)