
AMS 315

Data Analysis, Spring 2022

First Computing Assignment

The first report is due on Tuesday, April 12, but can be submitted without penalty by April 19. This report is worth 100 examination points. Remember that a second project is coming, so you should finish the first project as soon as possible. Submit your report for Project 1 (both parts) as a single PDF file on the class Blackboard. Each student has one chance to resubmit the report before the deadline. Detailed submission information is given below.

Project 1 has two parts. There are three files for this project: two for Part A and one for Part B. The files are labeled with the last six digits of your Stony Brook ID number.

Part A

The model for the Part A assignment is a first data and statistical processing task that a newly hired statistician might be given. Your report should address the issues that your future supervisor would want to know about: the number of observations, the fraction of missing data in the independent variable and in the dependent variable, and the imputation of missing data. Part A is worth 40 points. The two files for Part A each contain a column for subject ID and a column for either the dependent variable value or the independent variable value. Your first task is to sort the two files by subject ID and merge them. You should not just use “cut and paste” to merge your data. Second, you are expected to deal with missing data. Your report should contain four counts: the number of subject IDs that had at least one independent variable value or dependent variable value, the number of subject IDs that had an independent variable value, the number of subject IDs that had a dependent variable value, and the number of subject IDs that had both an independent and a dependent variable value.
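The merge and the four counts can be sketched as follows. This is a minimal illustration in Python, not a required approach: the file contents, the column headers, and the use of a blank or `NA` to mark a missing value are all assumptions made for the example, and the function names are hypothetical.

```python
import csv
import io

# Hypothetical file contents (assumed layout: header row, then "ID,value").
IV_TEXT = "ID,IV\n3,1.5\n1,0.7\n2,NA\n5,2.2\n"
DV_TEXT = "ID,DV\n1,10.0\n2,12.5\n4,9.1\n5,NA\n"

def read_column(text):
    """Parse 'ID,value' CSV text into {id: float or None}.
    Assumption: a blank field or 'NA' marks a missing value."""
    values = {}
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip().isdigit():
            continue  # skip blank lines and the header row
        raw = row[1].strip() if len(row) > 1 else ""
        values[int(row[0])] = None if raw in ("", "NA") else float(raw)
    return values

def merge_and_count(iv, dv):
    """Merge on subject ID (sorted) and count the IDs with observed values."""
    merged = [(sid, iv.get(sid), dv.get(sid)) for sid in sorted(set(iv) | set(dv))]
    counts = {
        "either": sum(1 for _, x, y in merged if x is not None or y is not None),
        "iv": sum(1 for _, x, _ in merged if x is not None),
        "dv": sum(1 for _, _, y in merged if y is not None),
        "both": sum(1 for _, x, y in merged if x is not None and y is not None),
    }
    return merged, counts

merged, counts = merge_and_count(read_column(IV_TEXT), read_column(DV_TEXT))
print(counts)
```

Merging on the union of IDs, rather than the intersection, keeps subjects that appear in only one file, which is what the missing data counts require.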


Your second task is to impute the missing values. There are many missing data procedures, and a statistical package often has imputation algorithms built in. For example, R has a package called MICE that has several options. You may not choose listwise deletion or mean imputation (or its equivalent, median imputation). Specify your choice in your report. Often, the choice of imputation method has little effect on the results if the fraction of missing data is 30% or less.
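As one illustration of an imputation method that is neither listwise deletion nor mean imputation, the sketch below fills a missing value from a regression on the other variable plus random noise (often called stochastic regression imputation). It is not the MICE algorithm, the data are made up, and the function names are hypothetical; whether this particular method suits your data is a choice you must make and justify in your report.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (n - 2)) ** 0.5

def impute(rows, seed=0):
    """rows: list of (id, iv, dv) with None for a missing value.
    Returns a filled copy. Rows missing both values are dropped here,
    since there is nothing to predict from (an assumption of this sketch)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    complete = [(x, y) for _, x, y in rows if x is not None and y is not None]
    xs, ys = zip(*complete)
    a_y, b_y, s_y = fit_line(xs, ys)  # regression of DV on IV
    a_x, b_x, s_x = fit_line(ys, xs)  # regression of IV on DV
    filled = []
    for sid, x, y in rows:
        if x is None and y is None:
            continue
        if y is None:
            y = a_y + b_y * x + rng.gauss(0.0, s_y)
        elif x is None:
            x = a_x + b_x * y + rng.gauss(0.0, s_x)
        filled.append((sid, x, y))
    return filled

# Made-up data for illustration only.
rows = [(1, 1.0, 2.1), (2, 2.0, 3.9), (3, 3.0, 6.2), (4, 4.0, 7.8),
        (5, 5.0, None), (6, None, 12.0), (7, None, None)]
filled = impute(rows)
print(filled)
```

Adding noise with the residual standard deviation keeps the imputed points from lying exactly on the fitted line, which would otherwise overstate the strength of the association.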

Part B

Part B is worth 60 points. The data file for Part B contains one line for each subject ID. The line will contain the subject ID, the value of the independent variable (IV), and the value of the dependent variable (DV). A transformation of either IV or DV or both may be required. You should read the text for suggestions on fitting a model. An approximate lack of fit (LOF) test should be applied. It is your responsibility to find repeated (or near repeated) independent variable values. You will have very few exact repeats of an independent variable value, so you should bin near repeated data into one level. For example, suppose that $x_1 = 1.01$, $x_2 = 1.02$, $x_3 = 1.03$ and $y_1 = 2$, $y_2 = 3$, $y_3 = 4$. While there are no exactly repeated x values, you could bin these points into one group of nearly repeated points. That is, choose the average x-value as the value of x after binning. Then your binned data would be $x_1 = 1.02$, $x_2 = 1.02$, $x_3 = 1.02$ and $y_1 = 2$, $y_2 = 3$, $y_3 = 4$. Now perform a LOF test on the data set after binning all near repeated values. There is software in R that performs an approximate lack of fit test.
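The binning step and the lack of fit F statistic can be sketched as follows. This is an illustration under stated assumptions: the tolerance `tol`, the gap-based grouping rule, and the data beyond the first three points are all made up for the example, and the function names are hypothetical. The p-value is not computed here; compare the statistic with an F table or use a statistical package.

```python
import statistics
from collections import defaultdict

def bin_near_repeats(xs, ys, tol):
    """Sort by x, start a new group whenever the gap to the previous x
    exceeds tol, and replace each x by its group mean.
    Note: gap-based grouping can chain many close points into one wide bin."""
    pairs = sorted(zip(xs, ys))
    groups = [[pairs[0]]]
    for p in pairs[1:]:
        if p[0] - groups[-1][-1][0] <= tol:
            groups[-1].append(p)
        else:
            groups.append([p])
    binned = []
    for g in groups:
        xbar = statistics.fmean(x for x, _ in g)
        binned.extend((xbar, y) for _, y in g)
    return binned

def lack_of_fit(binned):
    """Return (F, df_lof, df_pure_error) for a straight-line fit."""
    xs = [x for x, _ in binned]
    ys = [y for _, y in binned]
    n = len(binned)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in binned) / sxx
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in binned)
    levels = defaultdict(list)
    for x, y in binned:
        levels[x].append(y)
    c = len(levels)  # number of distinct x levels after binning
    ss_pe = sum(sum((y - statistics.fmean(g)) ** 2 for y in g)
                for g in levels.values())  # pure error sum of squares
    ss_lof = sse - ss_pe                   # lack of fit sum of squares
    df_lof, df_pe = c - 2, n - c
    return (ss_lof / df_lof) / (ss_pe / df_pe), df_lof, df_pe

# The first three points are the example above; the rest are made up.
xs = [1.01, 1.02, 1.03, 2.00, 2.01, 3.00, 3.02, 3.01]
ys = [2.0, 3.0, 4.0, 5.0, 5.4, 7.9, 8.3, 8.1]
binned = bin_near_repeats(xs, ys, tol=0.05)
f_stat, df_lof, df_pe = lack_of_fit(binned)
print(f_stat, df_lof, df_pe)
```

The test needs more than two distinct x levels and at least one level with repeated observations; otherwise one of the degrees of freedom is zero and the statistic is undefined.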

Report

You must submit a one-page report on Part A and a one-page report on Part B in a single PDF file. Each report should have four sections.

1.  Introduction. The introduction should contain a statement of the problem and the objective of the paper. Some of the questions that you should answer are: What is the objective of your effort? What are your research questions? What is the background of this work? The introduction is easy: your problem is to recover the function that was used to generate the dependent variable value based on the value of the independent variable.

2.  Methods. The second section should describe your methodology. Specifically, how were the files merged? What was the program used to perform the statistical analysis? What were the statistical techniques used? Did you use linear regression? Did you use additional procedures such as an approximate lack of fit test? How much missing data was present in the data? What procedure did you use to deal with missing data?

3.  Results. The third section should contain your results: What fraction of the variation of the dependent variable was explained? What was the analysis of variance table? What was the fitted function? What was the confidence interval for the slope? What was the conclusion of the test of the null hypothesis that the slope was zero?

4.  Conclusions and Discussion. The fourth section should be conclusions and discussion. This section should focus on “big picture” issues. Was there an association between the variables? How important was it? That is, what was the r-squared value? What is your fitted function? You may submit a longer appendix of computer work and programs.
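The quantities named in the Results section can be computed from a simple linear regression as sketched below. The data are made up, the function name is hypothetical, and the critical value `t_crit` is an input that you must look up yourself for n - 2 degrees of freedom (3.182 is the two-sided 95% value for 3 degrees of freedom).

```python
import statistics

def regression_summary(xs, ys, t_crit):
    """Fit y = intercept + slope * x and return the usual summary values."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    sst = sum((y - my) ** 2 for y in ys)  # total sum of squares
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    ssr = sst - sse                       # regression sum of squares
    mse = sse / (n - 2)
    se_slope = (mse / sxx) ** 0.5
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": ssr / sst,
        # ANOVA table rows: (source, df, sum of squares)
        "anova": [("regression", 1, ssr), ("error", n - 2, sse),
                  ("total", n - 1, sst)],
        "f": ssr / mse,
        "t_slope": slope / se_slope,      # tests H0: slope = 0
        "ci": (slope - t_crit * se_slope, slope + t_crit * se_slope),
    }

# Made-up data for illustration only (n = 5, so 3 error degrees of freedom).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
summary = regression_summary(xs, ys, t_crit=3.182)
print(summary)
```

If the confidence interval for the slope excludes zero, the null hypothesis that the slope is zero is rejected at the matching significance level.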

 

If you include a table or figure, you must discuss it. Tables and figures should be numbered and titled.


Grading of a past semester’s Project 1:

These are the grading penalties for Project 1 from a past semester, presented in order of point deduction.

Part A

-40 no report other than compilation of computer code

-40 no reported fitting function or statistics

-40 inconsistent reported functions or statistics

-40 incorrect missing data report

-40 used only complete data points (used listwise deletion)

-40 results not consistent with assigned data

 

 

-30 used median imputation (or mean or other related imputation method)

-30 no specification of imputation method

-30 incorrect report of significance of association

-30 incomplete missing data report

-30 incorrect number of observations in analysis

 

 

-20 “99.9% of variance explained”

-20 99.9% independent variable

-20 “linear regression represents 99% of data”

-20 incomplete specification of imputation method

-10 incorrect interpretation of CI

-10 low r-squared does not mean that transformation will help

-10 inconsistent reports of number of observations (792 vs. 791)

-5 no r or r-squared reported


Part B

-60 no report

-60 no report of function or function parameter estimates

-40 correct transformation but no report of function parameter estimates

-30 incorrect transformation selection--the r-squared for your selected transformation was one of the lowest values obtained

-30 incorrect interpretation of LOF results

-30 incorrect number of observations

-30 did not pick a final model

-30 incorrect report of corr(IV,DV); correlation values reported are too small in absolute value for this data set

 

Important note:

Simply submitting your computer output is not acceptable and will receive a grade of 0. You must submit a formal report to get non-zero credit for this assignment.

How a student should submit the Project 1 report

1. The report should be uploaded as a PDF file and submitted via the link for the first project assignment on Blackboard.

2. The report should be in a single file. Both the one-page report for Part A and the one-page report for Part B should be submitted in the same file.

Signs of Plagiarism in Your Report

1. Plagiarism is a serious issue. My expectation of you is that the work that you present in your report is yours alone.

2. Results. If you analyze the wrong data set, the grade for your report will be 0, whether or not plagiarism is involved. If you have been working jointly with other students, compare your results with their results. If they are the same, then there may be a plagiarism problem.

3. Code. You may attach your computer code in an appendix to your report. If two students have the same code, there may be a plagiarism problem.

4. Two students who submit the same report except for statistical results have engaged in plagiarism. The enabler (originator of the paper) is more guilty in my eyes than the plagiarizer.