

CS430/CS910 Exercise Sheet 3:

Regression


From the previous set of exercises, we know that the abalone data set has a number of attributes that are well-correlated with each other. For questions 2-4, we will use regression to study models that predict other attributes in the abalone data set.


These exercises are best completed using Weka. The raw data is available from: http://archive.ics.uci.edu/ml/datasets/Abalone. A version in the Weka format is available at: http://www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/exercises/abalone.arff.


1. For each of the following examples, select which type of regression (from Simple Linear Regression, Multiple Linear Regression, and Logistic Regression), if any, would be most appropriate. If none of these options would be appropriate, say why.

(a) Predicting the probability of a person getting accepted into the University of Warwick based on their score in a single entrance exam    [4]

(b) For training a system to play Super Mario Bros.    [4]

(c) Predicting the weight of a person based only on their height     [4]

(d) Predicting the income of a person based on their years of experience, age, and percentage score in their undergraduate degree (you can assume all people in the data set have an undergraduate degree)     [4]

(e) Dividing a data set into n groups, where data in the same group are more similar to each other than to data in other groups.     [4]


2. For the abalone data set, fit a simple linear regression model to give diameter as a function of length. Give the parameters of the model, and the correlation coefficient. Comment briefly on what the parameters of the model tell you about abalone (see http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names for a description of the attributes).    [20]
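
As a reminder of what a simple linear regression fit computes, here is a minimal Python sketch using only the standard library. It is not the Weka procedure and is not fitted to the abalone data: the function name simple_linear_regression and the four data points are made up for illustration. It returns the slope and intercept of the least-squares line together with Pearson's correlation coefficient.

```python
from math import sqrt

def simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Made-up illustrative values, NOT taken from the abalone data.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.1, 3.9, 6.1, 7.9]
slope, intercept, r = simple_linear_regression(toy_x, toy_y)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

To apply the same idea to the exercise, xs would hold the length values and ys the diameter values read from the data file.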


3. The dataset includes information about the total weight of each specimen, along with the weight of different pieces (e.g. shell). Fit a multiple linear regression model to give the whole weight as a function of the shucked weight, viscera weight, and shell weight, and give its parameters.

Common sense would suggest that the whole weight should be related to the sum of these weights. Looking at the model and the data, comment on whether this relation holds for the observations.    [20]
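
The following Python sketch illustrates how a multiple linear regression can be fitted by solving the normal equations. It is only an illustration under stated assumptions: the function name fit_multiple_linear_regression and the toy rows are hypothetical, the toy targets were constructed as 0.5 plus the sum of the three inputs, and the method assumes the attributes are not perfectly collinear. Weka may use a different numerical procedure.

```python
def fit_multiple_linear_regression(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk via the normal equations.

    Returns [b0, b1, ..., bk]. Assumes the attributes are not perfectly collinear.
    """
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Build the augmented system [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for row in range(k):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][k] for i in range(k)]

# Made-up illustrative rows (three part weights), NOT taken from the abalone data.
toy_rows = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
            (1.0, 1.0, 1.0), (2.0, 1.0, 0.0)]
# Toy targets constructed as 0.5 + x1 + x2 + x3.
toy_targets = [1.5, 1.5, 1.5, 3.5, 3.5]
coefficients = fit_multiple_linear_regression(toy_rows, toy_targets)
print([round(c, 3) for c in coefficients])
```

If the whole weight were exactly the sum of the three part weights, a fit of this kind would give coefficients close to 1 for each part and an intercept close to 0; comparing the fitted parameters against that pattern is one way to approach the second part of the question.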


4. The male and female abalone are quite hard to tell apart, so for this question we will try to build a model to tell whether a specimen is an infant (I) or an adult (M/F).

Build a logistic regression model to predict this feature based on the following attributes:

(a) Length only    [5]

(b) Whole Weight only    [5]

(c) Class Rings only    [5]

(d) Length, whole weight, and class rings together.    [5]

For each model, give the accuracy (percentage of training examples predicted correctly).

Hint: You may find it helpful to modify the input dataset to recode the new class value.
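
One possible reading of the hint, sketched in Python: recode the three-valued sex attribute into a two-valued class (adult or infant), fit a logistic regression on a single attribute, and report the training accuracy. Everything named here (recode_sex, fit_logistic_regression, training_accuracy, the learning rate, and the six records) is a made-up illustration, and plain batch gradient descent is used for brevity; it is not claimed to be the optimiser Weka uses.

```python
from math import exp

def recode_sex(sex):
    """Map the three-valued sex attribute to 1 (adult: M or F) or 0 (infant: I)."""
    return 0 if sex == "I" else 1

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic_regression(xs, labels, learning_rate=1.0, epochs=5000):
    """Fit P(label = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every example under the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        # Step against the gradient of the mean log-loss.
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

def training_accuracy(xs, labels, w, b):
    """Fraction of examples whose predicted class (threshold 0.5) matches the label."""
    predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Made-up illustrative records (sex, length), NOT taken from the abalone data.
records = [("I", 0.1), ("I", 0.2), ("I", 0.3), ("M", 0.7), ("F", 0.8), ("M", 0.9)]
labels = [recode_sex(sex) for sex, _ in records]
lengths = [length for _, length in records]
w, b = fit_logistic_regression(lengths, labels)
print(f"training accuracy: {100 * training_accuracy(lengths, labels, w, b):.1f}%")
```

The accuracy computed here is measured on the same examples used for fitting, which matches the definition given in the question.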


5. For the last question, we return to the familiar adult data set. (Please download from http://www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/exercises/adult.arff). Build a logistic regression model for the attribute sex (M/F) using combinations of attributes from adult.data (try adding and removing attributes to see what happens). The aim is to find a model that balances simplicity with accuracy, so try to include as few variables as possible while giving an accurate result. Describe the final model you obtain, the steps you followed to reach it, and its accuracy for the task.

(a) Which attributes can be removed from the data set without affecting the accuracy of the resulting model significantly (say, by at most 1%)? Give an argument why this might be the case for the attributes in question.    [8]

(b) Why is relationship-status helpful?    [6]

(c) Why is country=Holand-Netherlands weighted heavily?    [6]