ETX2250 / ETF5922 Data visualisation and analytics

Published: 2023-02-10

2022 Example Exam

ETX2250 / ETF5922

Data visualisation and analytics

2 hours 10 mins

Rules

During an exam, you must not have in your possession any item/material that has not been authorised for your exam. This includes books, notes, paper, electronic device/s, mobile phone, smart watch/device, calculator, pencil case, or writing on any part of your body. Any authorised items are listed above. Items/materials on your desk, chair, in your clothing or otherwise on your person will be deemed to be in your possession.

You must not retain, copy, memorise or note down any exam content for personal use or to share with any other person by any means following your exam.

You must comply with any instructions given to you by an exam supervisor.

As a student, and under Monash University’s Student Academic Integrity procedure, you must undertake your in-semester tasks, and end-of-semester tasks, including exams, with honesty and integrity. In exams, you must not allow anyone else to do work for you and you must not do any work for others. You must not contact, or attempt to contact, another person in an attempt to gain unfair advantage during your exam session. Assessors may take reasonable steps to check that your work displays the expected standards of academic integrity.

Failure to comply with the above instructions, or attempting to cheat or cheating in an exam may constitute a breach of instructions under regulation 23 of the Monash University (Academic Board) Regulations or may constitute an act of academic misconduct under Part 7 of the Monash University (Council) Regulations.

Background for Questions 1 and 2 (old style)

The following scenario is used in Questions 1 and 2.

A researcher is interested in comparing economic and health indicators across countries in Africa, Asia and the Middle East, based on data from the World Bank. The data are held in a data frame, countries.df, with the following variables:

• GDP:	Per capita Gross Domestic Product, in adjusted 2011 U.S. Dollars
• LaborRate:	Labor force participation rate.
• HealthExp:	Health expenditures in U.S. Dollars.
• InfMortality:	Infant mortality per 1000 live births.
• RegionName:	Region, taking the values Africa, Asia or Middle East.
• Name:	Name of country

Question 1 [3 + 3 + 4 + 5 + 5 + 5 = 25 marks] (old style)

Use Figures 1.1 through 1.3 to answer the following questions.

a) What is the correlation between Infant mortality per 1000 live births and GDP in Asia?

b) Approximately what are the median infant mortality per 1000 live births and the median GDP in Asia?

c) Discuss a prominent outlier in the Africa data which is apparent in Figure 1.3. Explain what you can determine about this outlier using information from any relevant graphs.

d) Using Figure 1.1, discuss the relationship between Health Expenditure and Infant Mortality.

e) Which graph would you use to highlight the difference in infant mortality between Africa and the other two regions? Discuss your chosen graph in detail.

f) Write down the ggplot command for creating Figure 1.2, using the variable names from the background of this section.

Question 2 [5 + 5 + 5 = 15 marks] (old style)

a) Table 2.1 represents a subset of the data for 3 countries in Africa. Suppose subtab.df is the data frame containing the columns of Table 2.1. Implement by hand the following command, thus rewriting this data in long form:

gather(data = subtab.df, key = 'Measurement', value = 'Quantvalue', -Name )

b) Suppose you have Table 2.2 as input, stored in a data frame called “subtab.df”. Write down the dplyr commands for calculating the average value of each measurement across the regions.

c) Suppose you have Table 2.2 as input, stored in the database under the table name public.subset. Write down the SQL commands for calculating the average value of each measurement across the regions.

Table 2.1

Table 2.2

Question 3 [(5 + 2 + 3) + 5 + 5 + 5 = 25 marks]

a) The data set europe.csv provides the values of economic indicators in Europe, as shown in Table 3.1. Answer the following questions about the code in Figure 3.1.

i) Describe the results of implementing the code in Figure 3.1.

ii) Why is the scale command used? What does it do?

iii) Figure 3.2 is a plot of ss.df$ss against ss.df$k. What does this graph tell you? Which k would you choose, and why?

Figure 3.1

Table 3.1

b) In the context of hierarchical cluster analysis, explain what a linkage method measures. Explain the two linkage methods “single” and “complete”.

c) The Euclidean distances between points A, B, C, D, E and F are shown in Table 3.2. Draw a reasonably accurate dendrogram, including a vertical scale, that corresponds to a hierarchical collection of clusters for these points, using single linkage.

Table 3.2

	A	B	C	D	E	F
A	0	2	9	12	6	8.5
B	2	0	7	10	5	7
C	9	7	0	3	6	4
D	12	10	3	0	8	6
E	6	5	6	8	0	2
F	8.5	7	4	6	2	0

Please answer this question on your blank piece of paper.

After your exam finishes, you’ll have extra time to access your phone to scan a QR code and upload your answer.

Clearly label each page with your Student ID and this question number (and sub-part if applicable), for example 'Question 7a'.

Do not write your name on it.

No. of answer sheets: 1
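
The single-linkage merge order for the distance table in part (c) can be checked with a short script. The following is a minimal Python sketch, not a model answer: the helper names `single_link` and `agglomerate` are hypothetical, and ties are broken by taking the first minimal pair encountered, which is an assumption. The merge heights it reports are the heights at which the dendrogram branches join.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E", "F"]
# Euclidean distances copied from the table in part (c).
dist = [
    [0, 2, 9, 12, 6, 8.5],
    [2, 0, 7, 10, 5, 7],
    [9, 7, 0, 3, 6, 4],
    [12, 10, 3, 0, 8, 6],
    [6, 5, 6, 8, 0, 2],
    [8.5, 7, 4, 6, 2, 0],
]
index = {name: i for i, name in enumerate(labels)}


def single_link(a, b):
    """Single linkage: smallest pairwise distance between clusters a and b."""
    return min(dist[index[p]][index[q]] for p in a for q in b)


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Assumption: on ties, the first minimal pair in iteration order wins.
        a, b = min(combinations(clusters, 2), key=lambda pair: single_link(*pair))
        merges.append((sorted(a), sorted(b), single_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


merges = agglomerate(labels)
heights = [h for _, _, h in merges]
for left, right, h in merges:
    print(left, "+", right, "at height", h)
```

Replacing `min` with `max` inside `single_link` would give complete linkage instead, which makes the difference between the two methods in part (b) easy to compare on the same table.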

d) Suppose you are performing a cluster analysis using a data set consisting of ten binary variables, V1 to V10. Two of the cases are shown in rows C1 and C2 of the table below.

a. Calculate the simple matching and Jaccard measures of similarity for these two cases.

b. Also explain how you would decide which of these two measures to use.

c. How do the similarity measures relate to measures of distance between data points?

	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10
C1	1	1	1	0	1	1	1	1	0	0
C2	0	1	0	0	1	0	1	0	1	0
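
The two similarity measures in part (d) differ only in how they treat joint absences (0, 0). As a minimal Python sketch of the arithmetic for rows C1 and C2 above (the variable names are hypothetical and chosen for illustration):

```python
# Rows C1 and C2 copied from the table above (V1 to V10).
c1 = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
c2 = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(c1, c2))
both_present = sum(1 for x, y in pairs if x == 1 and y == 1)  # (1, 1) matches
both_absent = sum(1 for x, y in pairs if x == 0 and y == 0)   # (0, 0) matches
mismatches = sum(1 for x, y in pairs if x != y)               # (1, 0) or (0, 1)

# Simple matching counts every agreement, including joint absences.
simple_matching = (both_present + both_absent) / len(pairs)

# Jaccard ignores joint absences entirely.
jaccard = both_present / (both_present + mismatches)

print("simple matching:", simple_matching)  # 0.5
print("Jaccard:", jaccard)                  # 0.375
```

A corresponding dissimilarity is commonly taken as one minus the similarity, which is one way to relate these measures to a distance in part (c).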

Question 4 [(2 + 2 + 2) + (3 + 3 + 4 + 4) = 20 marks]

We consider data from the U.S. Bureau of Transportation Statistics and aim to predict whether an accident will result in injuries, based on five initial factors recorded in the emergency call. The goal is to decide more effectively when to send an ambulance and when to send only the fire brigade.

The variables from the codebook are:

• vehl_invl: Number of vehicles involved

• alchl_i: Alcohol involved = 1, not involved = 2

• mancol_i_r: 0=no collision, 1=head-on, 2=other form of collision

• rel_rwy_r: 1=accident on roadway, 0=not on roadway

• spd_lim: Speed limit, miles per hour

Half the data was randomly selected as training data, and the tree in Figure 4.1 was constructed:

Figure 4.1

(a) Consider the process of building the initial tree.

i) At the root node, for example, the classification tree algorithm splits the data based on whether the number of cars involved is below or above 3. How is this choice made?

ii) Explain how we choose the value of the target variable to assign to a leaf.

iii) What does this mean for any accident report with at least 5 cars involved?

(b) Now consider the completed tree.

i) How many terminal nodes are there in the tree?

ii) How would you reach the second last terminal node from the right? (TRUE .41 .59)

iii) Which variables, and which values of those variables, lead to the decision node spd_lim >= 48?

iv) Write down the decision rules for this tree.

v) Calculate the accuracy for the confusion matrix on the test set. Is it a good classifier? Explain why.

	Predicted False	Predicted True
Actual False	414	84
Actual True	347	155
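
The accuracy asked for in part (v) is the share of test cases on the diagonal of the confusion matrix. A minimal Python sketch follows, assuming rows are actual classes, columns are predicted classes, and "True" (injury) is the positive class. The sensitivity and specificity lines are included only as supporting figures for the discussion, and the variable names are hypothetical.

```python
# Counts from the confusion matrix above.
# Assumption: rows = actual, columns = predicted, "True" = positive class.
tn, fp = 414, 84    # actual False: predicted False / predicted True
fn, tp = 347, 155   # actual True:  predicted False / predicted True

total = tn + fp + fn + tp
accuracy = (tn + tp) / total     # correct predictions over all cases
sensitivity = tp / (tp + fn)     # share of actual injuries that were caught
specificity = tn / (tn + fp)     # share of non-injuries correctly identified

print("n =", total)              # 1000
print("accuracy =", accuracy)    # 0.569
print("sensitivity =", round(sensitivity, 3))
print("specificity =", round(specificity, 3))
```

Comparing accuracy with the share of the larger actual class (here 502 of 1000 cases) gives a baseline for judging whether the classifier is useful.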

Question 5 [6 + 6 + 3 = 15 marks]

An online provider of statistics courses is interested in assessing alternative sequencing and combinations of courses, and therefore wishes to conduct association analysis on its data for past students. Table 5.1 shows a sample of their data, with each row representing an individual student and each column representing a statistics course that they offer as identified by the column headings.