AMS 315

Data Analysis, Spring 2021

First Computing Assignment


        The first report is due on Tuesday, April 13, but can be submitted without penalty by April 20. This report is worth 100 examination points. Please remember that there is a second project coming, so you should finish the first project as soon as possible. Please submit your Project 1 report in PDF form, as instructed on the class Blackboard. Each student has one chance to resubmit the report before the deadline. Detailed submission information is given below.

        Project 1 has two parts. There are three files for this project. Two of the files are for part A, and one file is for part B. The files are labeled with the last four digits of your Stony Brook ID number.


Part A

        The model for the Part A assignment is a first data and statistical processing task that a newly hired statistician might be given. Your report should address the issues that your future supervisor would want to know about. Part A is worth 40 points. The two files for Part A each contain a column for subject ID and a column for either the dependent variable value or the independent variable value. Your first task is to sort the two files by subject ID and merge them. You should not just use “cut and paste” to merge your data. As part of this task, you are expected to report on the missing data. Your report should include the count of subject IDs that had an independent variable value, the count of subject IDs that had a dependent variable value, the count of subject IDs that had both an independent and a dependent variable value, and the count of subject IDs that had at least one independent variable value or dependent variable value.
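The merge and the missing data counts described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the two small data frames are made-up stand-ins for your Part A files, and the column names ID, IV, and DV are hypothetical, so adapt them to the names in your own files.

```r
# Made-up stand-ins for the two Part A files (hypothetical column names).
# In practice you would read your own files, for example with read.csv().
iv_file <- data.frame(ID = c(3, 1, 2, 5, 6, 7),
                      IV = c(2.1, NA, 3.4, 5.0, 1.7, NA))
dv_file <- data.frame(ID = c(2, 4, 1, 3, 6),
                      DV = c(7.2, 6.1, 5.5, NA, NA))

# Full outer join on subject ID; merge() also sorts the result by ID.
merged <- merge(iv_file, dv_file, by = "ID", all = TRUE)

# Logical flags for which subjects have each value.
has_iv <- !is.na(merged$IV)
has_dv <- !is.na(merged$DV)

# The four counts requested for the missing data report.
counts <- c(
  with_iv     = sum(has_iv),
  with_dv     = sum(has_dv),
  with_both   = sum(has_iv & has_dv),
  with_either = sum(has_iv | has_dv)
)
print(merged)
print(counts)
```

Using all = TRUE keeps subject IDs that appear in only one of the two files, which is what makes the counts of missing values meaningful.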

        Your second task is to impute the missing values. There are a number of missing data procedures, and statistical packages often include imputation algorithms. For example, R has a package called mice that offers a number of options. You may not choose listwise deletion, mean imputation, or the equivalent median imputation. Specify your choice in your report. Often, the choice of imputation method has little effect on the results if the fraction of missing data is 30% or less.
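As one possible illustration, multiple imputation with the mice package might look like the sketch below. The data are synthetic, the column names IV and DV are hypothetical, and predictive mean matching ("pmm") is just one of the package's methods, not a required choice. The imputation step runs only if mice is installed.

```r
# Synthetic data standing in for a merged Part A data set (assumption).
set.seed(315)
n  <- 200
iv <- rnorm(n, mean = 10, sd = 2)
dv <- 3 + 1.5 * iv + rnorm(n)
merged <- data.frame(IV = iv, DV = dv)

# Knock out some values to mimic missing data.
merged$IV[sample(n, 20)] <- NA
merged$DV[sample(n, 25)] <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  # Create 5 imputed data sets using predictive mean matching.
  imp <- mice::mice(merged, m = 5, method = "pmm", seed = 315,
                    printFlag = FALSE)

  # Fit the regression in each imputed data set, then pool the results.
  fits   <- with(imp, lm(DV ~ IV))
  pooled <- mice::pool(fits)
  print(summary(pooled, conf.int = TRUE))
}
```

Whatever method you use, name it in your report, including the number of imputations and any settings you changed.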


Part B

Part B is worth 60 points. The data file for part B contains a column for subject ID, a column for the value of the independent variable, and a column for the value of the dependent variable. A transformation of either IV or DV or both may be required. You should read the text for suggestions on fitting a model. An approximate lack of fit (LOF) test should be applied. It is your responsibility to find repeated (or near repeated) independent variable values.
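One rough way to screen candidate transformations is to fit each one and compare the r-squared values, as sketched below. The data are synthetic and the list of candidates is only an assumption for illustration; your text may suggest others. R-squared values on different response scales are only a rough guide, so residual plots should be checked as well.

```r
# Synthetic data whose true relation is linear in log(y) (assumption).
set.seed(315)
x <- runif(150, min = 1, max = 20)
y <- exp(0.5 + 0.2 * x + rnorm(150, sd = 0.1))
dat <- data.frame(x = x, y = y)

# Candidate models: transformations of the IV, the DV, or both.
candidates <- list(
  "y ~ x"           = y ~ x,
  "log(y) ~ x"      = log(y) ~ x,
  "y ~ log(x)"      = y ~ log(x),
  "log(y) ~ log(x)" = log(y) ~ log(x),
  "sqrt(y) ~ x"     = sqrt(y) ~ x
)

# Fit each candidate and collect its r-squared value.
r2 <- sapply(candidates, function(f) summary(lm(f, data = dat))$r.squared)
print(round(sort(r2, decreasing = TRUE), 4))
```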

At one point, an approximate lack of fit test was available in R through the alr3 library. This library can no longer be installed in the usual way with install.packages, but you can still install it from its source code.


In your file, please replace


install.packages('alr3')

library(alr3)

fit_b <- lm(y ~ x, data = data_bin)

pureErrorAnova(fit_b)


with


install.packages('remotes')

library(remotes)

install_github("cran/alr3")

library(alr3)

fit_b <- lm(y ~ x, data = data_bin)

pureErrorAnova(fit_b)


You may be prompted to update other existing packages in R during this step. If you type the number '3' when you receive this prompt, only the alr3 library will be installed. After doing this, you should be able to perform the lack of fit test successfully.
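If alr3 cannot be installed, the same idea can be sketched in base R: group near-repeated independent variable values into bins, then compare the straight-line model with a model that fits a separate mean for each bin. This is a minimal sketch under stated assumptions, not a replacement for the method in your text. The data are synthetic, and the choice of 12 equal-width bins is arbitrary.

```r
# Synthetic data that are truly linear (assumption for illustration).
set.seed(315)
x <- runif(120, min = 0, max = 10)
y <- 2 + 0.8 * x + rnorm(120)
dat <- data.frame(x = x, y = y)

# Group near-repeated x values into bins, and use each bin's mean as the
# shared x value so that the observations in a bin act as replicates.
dat$bin   <- cut(dat$x, breaks = 12)
dat$x_bin <- ave(dat$x, dat$bin)

# Reduced model: straight line in the binned x.
fit_line <- lm(y ~ x_bin, data = dat)

# Full model: one mean per bin, whose residuals estimate pure error.
fit_means <- lm(y ~ factor(bin), data = dat)

# F test of lack of fit: a small p-value suggests the line is inadequate.
lof <- anova(fit_line, fit_means)
print(lof)
```

The number of bins affects the result: too few bins hide curvature, and too many leave few replicates per bin, so report the binning you used.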


Report

You must submit a one-page report on Part A and a one-page report on Part B. Each report should have four sections.

1. Introduction. The introduction should contain a statement of the problem and the objective of the paper. Some of the questions that you should answer are: What is the objective of your effort? What are your research questions? What is the background of this work? The introduction is easy: your problem is to recover the function that was used to generate the dependent variable value based on the value of the independent variable.

2. Methods. The second section should describe your methodology. Specifically, how were the files merged? What was the program used to perform the statistical analysis? What were the statistical techniques used? Did you use linear regression? Did you use additional procedures such as an approximate lack of fit test? How much missing data was present in the data? What procedure did you use to deal with missing data?

3. Results. The third section should contain your results: What fraction of the variation of the dependent variable was explained? What was the analysis of variance table? What was the fitted function? What was the confidence interval for the slope? What was the conclusion to the test of the null hypothesis that the slope was zero?

4. Conclusions and Discussion. The fourth section should be conclusions and discussion. This section should focus on “big picture” issues. Was there an association between the variables? How important was it? That is, what was the r-squared value? What is your fitted function? You may submit a longer appendix of computer work and programs.

If you include a table or figure, you must discuss it. Tables and figures should be numbered and titled.
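The quantities requested in the Results section can be pulled from a fitted lm object roughly as follows. This is a sketch on synthetic data with hypothetical variable names x and y. Your own analysis would use your assigned data and your chosen model.

```r
# Synthetic data for illustration only (assumption).
set.seed(315)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(100, sd = 2)
fit <- lm(y ~ x)

s <- summary(fit)

# Fraction of the variation of the dependent variable explained.
r_squared <- s$r.squared

# Analysis of variance table.
anova_tab <- anova(fit)

# Fitted function: estimated intercept and slope.
coefs <- coef(fit)

# 95% confidence interval for the slope.
slope_ci <- confint(fit, "x", level = 0.95)

# p-value for the test of the null hypothesis that the slope is zero.
slope_p <- s$coefficients["x", "Pr(>|t|)"]

print(r_squared)
print(anova_tab)
print(coefs)
print(slope_ci)
print(slope_p)
```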


Grading of a past semester’s Project 1:

These are the grading penalties for Project 1 from a past semester, presented in decreasing order of point deduction.

Part A

-40 no report other than compilation of computer code

-40 no reported function or statistics

-40 inconsistent reported functions or statistics

-40 incorrect missing data report

-40 used only complete data points (used listwise deletion)

-40 results not consistent with assigned data


-30 used median imputation (or mean or other single valued imputation method)

-30 no specification of imputation method

-30 incorrect report of significance of association

-30 incomplete missing data report

-30 incorrect number of observations in analysis


-20 “99.9% of variance explained”

-20 99.9% independent variable

-20 “linear regression represents 99% of data”

-20 incomplete specification of imputation method

-10 incorrect interpretation of CI

-10 low r-squared does not mean that transformation will help

-10 inconsistent reports of number of observations (792 vs. 791)

-5 no r or r-squared reported

Part B

-60 no report

-60 no report of function or function parameter estimates

-40 correct transformation but no report of function parameter estimates

-30 incorrect transformation selection--the r-squared for your selected transformation was one of the lowest values obtained

-30 incorrect interpretation of lack of fit results

-30 incorrect number of observations

-30 did not pick a final model

-30 incorrect report of corr(IV,DV); correlation values reported are too small in absolute value for this data set


Important note:

Simply submitting your computer output is not acceptable and will receive a grade of 0. You must submit a formal report to get non-zero credit for this assignment.


How a student should submit the project 1 reports


1. The report should be uploaded as a PDF file and submitted via the link for the first project assignment on Blackboard.

2. (Not recommended) An alternative way to submit your report is to send an email, with your report attached, to the TA. The file must be named in the form "last five digits of your Stony Brook ID_your last name_Project1", with the extension .pdf, .doc, or .docx. The email address is [email protected]

3. The report should be in a single file. Both the one-page report for Part A and the one-page report for Part B should be submitted in the same file.


Signs of Plagiarism in Your Report


1. Plagiarism is a serious issue. My expectation of you is that the work that you present in your report is yours alone.

2. Results. If you analyze the wrong data set, the grade for your report will be 0, whether or not plagiarism is involved. If you have been working jointly with other students, compare your results with their results. If they are the same, then there may be a plagiarism problem.

3. Code. You may attach your computer code in an appendix to your report. If two students have the same code, there may be a plagiarism problem.

4. Two students who submit the same report except for statistical results have engaged in plagiarism. The enabler (the originator of the paper) is more guilty in my eyes than the plagiarizer.