
HW2

General Notes

• You will submit a minimum of three files. The core files must conform to the following naming conventions (including capitalization and underscores). 123456789 is a placeholder; please replace these nine digits with your nine-digit Bruin ID. The files you must submit are:

1. 123456789_stats102a_hw1.R: An R script file containing all of the functions you wrote for the homework. The first line of your .Rmd file, after loading libraries, should source this script file.

2. 123456789_stats102a_hw1.Rmd: Your markdown file which generates the output file of your submission.

3. 123456789_stats102a_hw1.html/pdf: Your output file, either a PDF or an HTML file depending on the output you choose to generate.

4. Included image files: You may name these what you choose, but you must include all of the image files you generated for your structured flowcharts, otherwise your file will not knit. One way to attach an image file is to use ![title](Images/example1.png).

5. Please place all of your files into a single folder named 123456789_stats102a_hw2 and compress the folder into 123456789_stats102a_hw2.zip.

If you fail to submit any of the required core files you will receive ZERO points for the assignment. If you submit any files which do not conform to the specified naming convention, you will receive half credit for the assignment.

• Your .Rmd file must be knittable. If your .Rmd file does not knit on the graders' computers, you will receive at most half credit for the assignment.

The two most common reasons files fail to knit are workspace/directory structure issues and missing include files. To remedy the first, ensure all of the file paths in your document are relative paths pointing at the current working directory. To remedy the second, make sure you upload any and all files you source or include in your .Rmd file.

• Your coding should adhere to the tidyverse style guide: https://style.tidyverse.org/.

my_function <- function(x, y, ...) {
  # A short description of the function
  # Args:
  #   x: Variable type and dimension
  #   y: Variable type and dimension
  # Return:
  #   Variable type and dimension

  # Your code begins here
}

NOTE: Everything you need to do this assignment is here, in your class notes, or was covered in discussion or lecture.

• DO NOT look for solutions online.

• DO NOT collaborate with anyone inside (or outside) of this class.

• DO NOT print out the whole dataset in the markdown.

• Work INDEPENDENTLY on this assignment.

• EVERYTHING you submit MUST be 100% your own original work product. Any student suspected of plagiarizing, in whole or in part, any portion of this assignment will be immediately referred to the Dean of Students' office without warning.

1: Teacher's Gradebook

The formatting takes 15% of this homework grade. The efficiency, accuracy, and robustness of your function account for 60%. The total is 75% for this question. If your function violates any instruction of this assignment, the function will not be graded.

Requirements:

1. Draw flowcharts (or write algorithms) for the main functions that are complete, clear, and concise enough to enable a person to accurately implement the functions in any programming language they are adept with.

2. Write the function(s) that accurately implement the algorithm(s) as described or requested.

3. Include error handling to ensure your function(s) work properly.

Note: The requirements apply to your messy_impute() and tidy_impute() functions.

One scenario which naturally creates non-tidy data is a teacher's gradebook. Table 1 shows an example with five homework assignments and five quizzes.

Table 1: An example of a teacher’s gradebook

UID        Homework_1  Homework_2  ...  Homework_5  Quiz_1  ...  Quiz_5
123456787  70          90          ...  80          76      ...  70
123456788  91          85          ...  73          90      ...  80
123456789  60          71          ...  78          88      ...  73

(a) Create a simulated dataset in R called gradebook that represents a possible gradebook in the same basic format as Table 1:

• Each row of the gradebook should contain all measurements for a student.

• Each column should contain scores for one assignment.

• The last column should be “Quiz_5.”

The simulated gradebook should contain the grades for 100 students and scores (out of 100) for 5 homework and 5 quizzes. Set the seed for simulating your data with your UID.
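One possible way to build such a data frame is sketched below. This is only an illustration under stated assumptions: the seed 123456789 is the placeholder UID, the scores are drawn uniformly from the integers 0 to 100 (the assignment does not prescribe a distribution), and the student UIDs are arbitrary nine-digit numbers.

```r
# Hypothetical seed: replace 123456789 with your own nine-digit UID.
set.seed(123456789)

n_students <- 100

# Assumption: integer scores drawn uniformly from 0 to 100.
scores <- matrix(
  sample(0:100, n_students * 10, replace = TRUE),
  nrow = n_students
)
colnames(scores) <- c(paste0("Homework_", 1:5), paste0("Quiz_", 1:5))

# Assumption: student UIDs are arbitrary distinct nine-digit numbers.
gradebook <- data.frame(
  UID = 100000000 + sample.int(899999999, n_students),
  scores
)

dim(gradebook)                      # rows and columns
names(gradebook)[ncol(gradebook)]   # name of the last column
```

Building the scores as a matrix first keeps the column order fixed, so the last column is Quiz_5 as required.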

(b) Write R code in your R markdown file to randomly replace 10% of the Homework_4 scores and 10% of the Quiz_4 scores with NA, and then use is.na() in conjunction with sum() to show your results.
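A minimal sketch of this step follows. It assumes that "respectively" means the missing rows are drawn independently for each of the two columns, and it uses a small stand-in gradebook (with placeholder UIDs 1 to 100) so that it runs on its own.

```r
set.seed(123456789)  # hypothetical seed; use your own UID

# Stand-in gradebook so the sketch is self-contained.
gradebook <- data.frame(
  UID = 1:100,
  Homework_4 = sample(0:100, 100, replace = TRUE),
  Quiz_4 = sample(0:100, 100, replace = TRUE)
)

n_missing <- round(0.10 * nrow(gradebook))

# Draw the rows to blank out separately for each column.
gradebook$Homework_4[sample(nrow(gradebook), n_missing)] <- NA
gradebook$Quiz_4[sample(nrow(gradebook), n_missing)] <- NA

sum(is.na(gradebook$Homework_4))  # 10
sum(is.na(gradebook$Quiz_4))      # 10
```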

(c) Write a function messy_impute() that imputes missing values in the gradebook. Please also present your algorithm or flowchart to answer this question in the R markdown file.

Note:

• Imputation is the process of replacing missing values with estimated values. The simplest (far from preferred) method is to replace missing values with the most typical value, say the mean.

• Assume the format of the gradebook is fixed (five homework assignments and five quizzes), but NA values may occur in any column except UID.

• The messy_impute() function should have at least three arguments and ...:

– A data frame that contains the gradebook as specified in the example. However, the number of rows and the locations of the NA values may differ during grading.

– A center argument should be a character object indicating the imputation function (Mean or Median). The default center is "Mean".

– A margin argument should be an integer (1 or 2) indicating whether the function imputes the missing values by row/student (1) or by column/assignment (2). If choosing by column, the function should process homework and quizzes separately.

• The function should return the imputed data frame (or tibble).

• Toy example with three homework assignments only: Suppose the input data frame contains the following data.

UID        Homework_1  Homework_2  Homework_3
111111111  41.00       27.00       52.00
222222222  43.00       94.00       88.00
333333333  2.00        79.00       51.00
444444444  46.00       45.00       54.00
555555555  60.00       23.00       NA

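To impute the NA value by row with the mean, we average that student's other scores and compute (60 + 23) / 2 = 41.5 to replace the NA. If we do it by column, the function should average the other students' Homework_3 scores and compute (52 + 88 + 51 + 54) / 4 = 61.25 to replace the NA.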
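The two candidate values for the toy example can be checked directly in R. This is only a verification sketch: the name toy is an illustrative choice, and the values are copied from the table above.

```r
toy <- data.frame(
  UID = c(111111111, 222222222, 333333333, 444444444, 555555555),
  Homework_1 = c(41, 43, 2, 46, 60),
  Homework_2 = c(27, 94, 79, 45, 23),
  Homework_3 = c(52, 88, 51, 54, NA)
)

# Mean of the observed scores in the Homework_3 column.
mean(toy$Homework_3, na.rm = TRUE)
# 61.25

# Mean of the observed homework scores in student 555555555's row.
mean(unlist(toy[5, -1]), na.rm = TRUE)
# 41.5
```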

(d) Select two students missing Homework_4 and two students missing Quiz_4. Please use these cases to demonstrate your function in the R markdown file. Here are some suggested cases; the cases used while grading are not limited to these.

• messy_impute(gradebook, "mean", 1)

• messy_impute(gradebook, "median", 2)

• messy_impute(gradebook, "mean", 1, trim = 0.25)
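The third call passes trim = 0.25 through the ... argument. The sketch below shows one way extra arguments might be forwarded to mean(). The helper center_value() is hypothetical and not part of the assignment, and treating center as case-insensitive is an assumption made because the default is written "Mean" while the calls above use "mean".

```r
# Hypothetical helper: picks the summary function named by `center`
# and forwards extra arguments (such as `trim`) through `...`.
center_value <- function(x, center = "Mean", ...) {
  center <- tolower(center)  # assumption: case-insensitive matching
  if (!center %in% c("mean", "median")) {
    stop("`center` must be \"Mean\" or \"Median\".")
  }
  if (center == "mean") {
    mean(x, na.rm = TRUE, ...)
  } else {
    # median() has no trim argument, so extras are not forwarded here.
    median(x, na.rm = TRUE)
  }
}

scores <- c(2, 41, 43, 46, 60, NA)

center_value(scores, "mean")               # plain mean of observed values
center_value(scores, "mean", trim = 0.25)  # trimmed mean
center_value(scores, "median")             # median of observed values
```

With trim = 0.25 and five observed values, mean() drops the lowest and highest value before averaging, which is why the trimmed result differs from the plain mean.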

(e) Write R code using the main function in the tidyr package to convert the gradebook into the tidy format. Name this object gradebook_tidy. You may directly write code in the R markdown file.
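As a rough illustration, assuming the function meant here is tidyr's pivot_longer(), the reshaping could look like the sketch below. The stand-in gradebook has only two assignment columns, and the new column names Assignment_Type, Assignment_Number, and Score are illustrative choices, not names required by the assignment.

```r
library(tidyr)

# Stand-in gradebook with two assignments so the sketch runs on its own.
gradebook <- data.frame(
  UID = c(123456787, 123456788, 123456789),
  Homework_1 = c(70, 91, 60),
  Quiz_1 = c(76, 90, 88)
)

# Split each column name at the underscore into type and number,
# and stack the scores into a single column.
gradebook_tidy <- pivot_longer(
  gradebook,
  cols = -UID,
  names_to = c("Assignment_Type", "Assignment_Number"),
  names_sep = "_",
  values_to = "Score"
)

gradebook_tidy
```

Each row of the result holds one score for one student on one assignment.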

(f) Write a function tidy_impute() that imputes missing values in the gradebook_tidy object. The tidy_impute() function should have the same arguments as the messy_impute() function. You should return an imputed gradebook_tidy object. Please also present your algorithm or flowchart to answer this question in the R markdown file.

Note: Do not convert the gradebook_tidy object to a messy format or reuse messy_impute() in any step of tidy_impute().
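One way to impute without leaving the tidy format is a grouped computation. The sketch below uses dplyr, which is an assumption (the assignment does not name a package for this step), and the data frame tidy_toy with its Assignment and Score columns is a hypothetical stand-in. It shows only the grouping idea, not a full tidy_impute().

```r
library(dplyr)

# Hypothetical tidy data: one row per student per assignment.
tidy_toy <- data.frame(
  UID = rep(c(111111111, 222222222, 333333333), each = 2),
  Assignment = rep(c("Homework_1", "Homework_2"), times = 3),
  Score = c(41, 27, 43, 94, 2, NA)
)

# Replace each NA with the mean of the observed scores in its group.
imputed <- tidy_toy %>%
  group_by(Assignment) %>%
  mutate(Score = ifelse(is.na(Score), mean(Score, na.rm = TRUE), Score)) %>%
  ungroup()

imputed
```

Grouping by Assignment averages over students for one assignment; grouping by UID instead would average over one student's scores.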

(g) Please use the cases you selected in (d) to demonstrate your function in the R markdown file.

2: Short Answers

The formatting takes 10% of your homework grade. The accuracy takes 15% of your grade. The total is 25% for this question.

(a) Please find three examples in which data might naturally be presented in a messy (non-tidy) way. For each example, include the context, observations, and variables that would be recorded. Showcase a small sample, say 10 observations, for each example. You may search online for context, but you MUST cite your sources. Please do not make up data on your own.

(b) For each of the three examples in (a), describe how the data might be better presented in a tidy way. Please use tidyverse functions to reorganize the small sample datasets in (a) into tidy format.