
DS1000B – Assignment #4

Due: Apr 7, 2024 @ 11:55pm

•    Submissions must be done via Gradescope. You must carefully assign pages to their corresponding questions. You will receive a grade of zero in each case below:

a)    Submission is not in PDF format.

b)   Questions with no pages assigned to them.

c)   Submission that is blurry or too small to read easily. We will not use the zoom tool or download your submission to grade it.

•    Please submit a single PDF file. Here is a recommended way to achieve this:

a)    If you write your derivations on paper, scan them into a PDF file (if they are images, paste the images into a Word document, then save it as a PDF file).

b)   Write your Python code (e.g. in a Jupyter notebook), then save it as a PDF file.

c)   Combine all the PDF files above into one PDF file.

•    If you have difficulty formatting your submission, please see the “Lab1-preparation” file, or attend TA office hours as soon as possible.

•    You may work with a partner for this assignment. If you choose to do so, only one of you should submit the assignment. Be sure to include your partner's name in the designated place in Gradescope so that the grade is linked to both of you. If you forget this step, make certain that both names are on the submitted PDF file. If you are not linked to an assignment or your name is not on a submission file, you will receive a grade of zero.

•    Each assignment submission, whether it be an individual submission or a partnered submission, must be your own work. Scholastic offences are taken seriously. Please refer to this website for details:

http://www.uwo.ca/univsec/pdf/academic_policies/appeals/scholastic_discipline_undergrad.pdf

Grade Breakdown:

Part 1: Written Answer

Question 1          14

Question 2          13

Question 3          10

Total Points =     37

Part 2: Python

Question 4         12

Question 5         10

Question 6          11

Total Points =     33

Total Points: 70

Part 1 – Written Answer (Be sure to show all your work by default)

Question 1 [14 Points]

Movies at a theatre can be rated 1 star, 2 stars, 3 stars, 4 stars or 5 stars, all with an equally likely chance. Assume you go to this theatre to see two movies successively. Then you go back home to check the ratings of these two movies.

a)   [2 points] Write out the sample space of all outcomes of the ratings of the two movies (e.g. outcome “12” means the rating of the first movie is 1 star and the rating of the second one is 2 stars).

b)   [3 points] Let A be the event that both movies have the same rating. List the outcomes in A. What is P(A)?

c)   [3 points] Let B be the event that one movie is rated 2 stars higher than the other movie. List the outcomes in B. What is P(B)?

d)   [3 points] Let C be the event that the first movie has a lower rating than the second movie. List the outcomes in C. What is P(C)?

e)   [3 points] Are A and B disjoint events? What does this tell you about P(A or B)? Calculate P(A or B).
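
Hand-written answers to Question 1 can be cross-checked by enumerating the sample space by machine. The following is a minimal sketch, not a substitute for the written work; the helper `prob` and all variable names are illustrative assumptions, and "2 stars higher" is read as a difference of exactly 2.

```python
from fractions import Fraction
from itertools import product

# Each outcome is a pair (rating of first movie, rating of second movie).
ratings = range(1, 6)
sample_space = list(product(ratings, repeat=2))

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

def same_rating(o):          # event A
    return o[0] == o[1]

def two_stars_apart(o):      # event B (assumed to mean exactly 2 apart)
    return abs(o[0] - o[1]) == 2

def first_lower(o):          # event C
    return o[0] < o[1]

p_a = prob(same_rating)
p_b = prob(two_stars_apart)
p_c = prob(first_lower)
p_a_or_b = prob(lambda o: same_rating(o) or two_stars_apart(o))

# A and B are disjoint exactly when no outcome belongs to both.
disjoint = not any(same_rating(o) and two_stars_apart(o) for o in sample_space)

print(len(sample_space), p_a, p_b, p_c, p_a_or_b, disjoint)
```

Using `Fraction` keeps every probability exact, which makes it easy to compare against a fraction obtained by hand.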

Question 2 [13 Points]

Let A, B, C be the events corresponding to the following transit options used in the past 30 days:

A = car

B = bus

C = train

Suppose the probabilities that a randomly selected Western student used these transit options in the past 30 days are:

P(A) = 0.45                                   P(B) = 0.55                      P(C) = 0.55

P(A and C) = 0.25                        P(A and B) = 0.30          P(B and C) = 0.20          P(A and B and C) = 0.15

a)   [5 points] Sketch a Venn diagram for events A, B, C and the sample space S, and be sure to label the probability of each disjoint subset (the middle calculation steps are optional).

b)   [1 point] From the Venn diagram, what is the probability that a randomly selected Western student did not use any of these transit options in the past 30 days? (The middle steps are optional.)

c)   [1 point] From the Venn diagram, what is the probability that a randomly selected Western student used the bus, but no other forms of transit? (The middle steps are optional.)

d)   [2 points] If a student used the train, what is the probability that this student also used a car?

e)   [2 points] If a student only used one type of transit, what is the probability that it wasn’t the train?

f)    [2 points] If a student used exactly two types of transit, what is the probability that one of the options was the bus?
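
The disjoint regions of the Venn diagram in Question 2 follow from the given probabilities by repeated subtraction. The sketch below is one way to check the arithmetic, assuming the probabilities are stored as whole percentages so that the results stay exact; the dictionary keys and variable names are illustrative only.

```python
from fractions import Fraction

# Given probabilities, expressed in percent to avoid floating-point noise.
p = {"A": 45, "B": 55, "C": 55, "AB": 30, "AC": 25, "BC": 20, "ABC": 15}

# Work from the centre of the diagram outwards.
regions = {"ABC": p["ABC"]}
regions["AB only"] = p["AB"] - p["ABC"]
regions["AC only"] = p["AC"] - p["ABC"]
regions["BC only"] = p["BC"] - p["ABC"]
regions["A only"] = p["A"] - regions["AB only"] - regions["AC only"] - p["ABC"]
regions["B only"] = p["B"] - regions["AB only"] - regions["BC only"] - p["ABC"]
regions["C only"] = p["C"] - regions["AC only"] - regions["BC only"] - p["ABC"]
# Whatever is left of the sample space lies outside all three events.
regions["none"] = 100 - sum(regions.values())

# Conditional probabilities are ratios of regions.
p_car_given_train = Fraction(p["AC"], p["C"])

exactly_one = regions["A only"] + regions["B only"] + regions["C only"]
p_not_train_given_one = Fraction(regions["A only"] + regions["B only"], exactly_one)

exactly_two = regions["AB only"] + regions["AC only"] + regions["BC only"]
p_bus_given_two = Fraction(regions["AB only"] + regions["BC only"], exactly_two)

for name, value in regions.items():
    print(f"{name}: {value}%")
print(p_car_given_train, p_not_train_given_one, p_bus_given_two)
```

All eight regions must be non-negative and sum to 100%, which is a quick sanity check on the sketched diagram.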

Question 3 [10 Points]

To estimate the mean μ of the DS1000 midterm scores, you obtain a simple random sample (SRS) of scores from n = 50 students. From previously published information, you know that the midterm scores are approximately Normal, with a mean of 75 and a standard deviation of 10.

a)   [3 points] What is the approximate distribution of the sample mean test score, x̄, according to the central limit theorem?

b)   [2 points] What is the approximate probability that x̄ is above 78? Use Table A.

c)   [2 points] What is the sample size you need to make the standard deviation of the sample mean equal to 1? Why?

d)   [3 points] Suppose it turns out your SRS gives an x̄ of 74. Do you have enough statistical evidence to reject the hypothesis that the true population mean is 75?
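
Table A look-ups for Question 3 can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library. The two-sided alternative and the 5% significance level in the last step are assumptions, since the question does not state them, and software values can differ slightly from Table A because the table rounds z to two decimals.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 75, 10, 50

# Standard deviation of the sample mean: sigma / sqrt(n).
se = sigma / sqrt(n)
xbar_dist = NormalDist(mu, se)

# Upper-tail probability that the sample mean exceeds 78.
p_above_78 = 1 - xbar_dist.cdf(78)

# Sample size that makes sigma / sqrt(n) equal to a target of 1.
target_sd = 1
n_needed = (sigma / target_sd) ** 2

# z statistic and two-sided p-value for an observed mean of 74
# (two-sided test at the 5% level is an assumption, not given in the text).
z = (74 - mu) / se
p_value = 2 * NormalDist().cdf(-abs(z))
reject_at_5_percent = p_value < 0.05

print(round(se, 4), round(p_above_78, 4), n_needed, round(z, 4), round(p_value, 4))
```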

Part 2 – Python (All numbers and graphs need to be produced using Python by default)

Question 4 [12 points]

[Scores.csv] Suppose we have the final scores of students from our DS1000 class that form the data file Scores.csv. This dataset consists of three variables:

-     ID: the student ID

-     Score: the final score

-     Program: the program of the student.

A researcher named Bob wants to draw a small sample from this big dataset. Let us assist Bob in this procedure. For each question, to show your sampling result, you only need to print the IDs of the selected students in the sample. Set all random seeds to 123 where applicable. You are allowed to reuse existing code from the labs.

a)   [2 points] Perform a simple random sampling by only using the ID variable to draw a sample with size 40.

b)   [3 points] Perform a systematic random sampling by only using the ID variable. You need to randomly select one from the first 10 IDs. Then choose every 10th ID after that until you get a sample with size 40. (e.g. if you select ID = 2 in the first 10 IDs, the next one should be ID = 12.)

c)   [3 points] Perform a cluster sampling based on the “Program” variable (Recall the meaning of cluster sampling). Randomly select three clusters and combine them to form a sample. For this question, you only need to print the first 10 IDs of each selected cluster (no need to print all the IDs in the sample).

d)   [4 points] Perform a stratified sampling based on the “Program” variable. In each stratum, randomly select 10 students.
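
The four sampling schemes in Question 4 differ only in how the IDs are chosen. The sketch below illustrates each one with the standard-library `random` module on a synthetic stand-in for Scores.csv. The 400 records, the five program names, and all function names are hypothetical, so the printed IDs will not match the real data, and the labs may use different tools for the same steps.

```python
import random

# Synthetic stand-in for Scores.csv (hypothetical IDs and program names).
programs = ["Program A", "Program B", "Program C", "Program D", "Program E"]
records = [{"ID": i, "Program": programs[i % 5]} for i in range(1, 401)]
ids = [r["ID"] for r in records]

rng = random.Random(123)  # fixed seed, as the question asks

def simple_random_sample(ids, k):
    """Every subset of size k is equally likely."""
    return rng.sample(ids, k)

def systematic_sample(ids, step, k):
    """Random start among the first `step` IDs, then every `step`-th ID."""
    start = rng.randrange(step)
    return ids[start::step][:k]

def ids_by_program(records):
    """Group IDs by the Program variable."""
    groups = {}
    for r in records:
        groups.setdefault(r["Program"], []).append(r["ID"])
    return groups

def cluster_sample(records, n_clusters):
    """Pick whole programs at random and keep everyone in them."""
    groups = ids_by_program(records)
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: groups[name] for name in chosen}

def stratified_sample(records, per_stratum):
    """Draw the same number of students at random from every program."""
    groups = ids_by_program(records)
    return {name: rng.sample(groups[name], per_stratum) for name in sorted(groups)}

srs = simple_random_sample(ids, 40)
systematic = systematic_sample(ids, 10, 40)
clusters = cluster_sample(records, 3)
strata = stratified_sample(records, 10)

print(srs)
print(systematic)
for name, members in clusters.items():
    print(name, members[:10])  # first 10 IDs of each selected cluster
for name, members in strata.items():
    print(name, members)
```

The key contrast is between the last two functions: cluster sampling randomizes which groups are kept, while stratified sampling keeps every group and randomizes within each.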

Question 5 [10 points]

This is a continuation of Question 2. You can use the results from Question 2. We are going to draw Venn diagrams in Python using the same setup. Suppose the size of the sample space S is 100.

a)   [2 points] What is the size of set A? What is the size of the set “A and not B”? (You can compute it either in Python or by hand. Hints: P(A) = size of A/size of S. To get “A and not B”, you may first draw a diagram to assist your understanding.)

The following Venn diagrams need to be drawn in Python. By default, set labels using capital letters (e.g. “A”, “B” and “C”). Set the title to “Venn diagram for A, B” (or “ … for A, C”). The degree of transparency is 0.7.

b)   [3 points] Draw a Venn diagram for two events A, B. Set colors as (orange, blue). (Hints: use `venn2` and specify `subsets` accordingly. The middle steps to compute the sizes of subsets are optional.)

c)   [5 points] Draw a Venn diagram for three events A, B, C. Set colors as (orange, blue, red). (Hints: use `venn3` and specify `subsets` accordingly. The middle steps to compute the sizes of subsets are optional.)
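
The `subsets` argument of `venn2` and `venn3` takes region sizes in a fixed order, which is easy to get wrong. The sketch below computes the tuples from the Question 2 probabilities with S of size 100 and shows one possible way to pass them on. It assumes the third-party `matplotlib` and `matplotlib_venn` packages are installed; the function names are illustrative, and the three-set title is an assumed extension of the two-set title.

```python
# Set sizes for a sample space of 100 (probability x 100).
size = {"A": 45, "B": 55, "C": 55, "AB": 30, "AC": 25, "BC": 20, "ABC": 15}

# venn2 order: (A only, B only, A and B)
venn2_subsets = (size["A"] - size["AB"], size["B"] - size["AB"], size["AB"])

# Pairwise overlaps that exclude the third set.
ab = size["AB"] - size["ABC"]
ac = size["AC"] - size["ABC"]
bc = size["BC"] - size["ABC"]

# venn3 order: (A only, B only, A&B only, C only, A&C only, B&C only, A&B&C)
venn3_subsets = (
    size["A"] - ab - ac - size["ABC"],
    size["B"] - ab - bc - size["ABC"],
    ab,
    size["C"] - ac - bc - size["ABC"],
    ac,
    bc,
    size["ABC"],
)

def draw_two_sets(subsets):
    """Draw the A, B diagram (imports kept local to the plotting step)."""
    import matplotlib.pyplot as plt
    from matplotlib_venn import venn2
    venn2(subsets=subsets, set_labels=("A", "B"),
          set_colors=("orange", "blue"), alpha=0.7)
    plt.title("Venn diagram for A, B")
    plt.show()

def draw_three_sets(subsets):
    """Draw the A, B, C diagram; the title wording is an assumption."""
    import matplotlib.pyplot as plt
    from matplotlib_venn import venn3
    venn3(subsets=subsets, set_labels=("A", "B", "C"),
          set_colors=("orange", "blue", "red"), alpha=0.7)
    plt.title("Venn diagram for A, B, C")
    plt.show()

print(venn2_subsets)
print(venn3_subsets)
```

Calling `draw_two_sets(venn2_subsets)` or `draw_three_sets(venn3_subsets)` in a notebook would then render the figures.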

Question 6 [11 points]

[Woods.csv] How heavy a load (in pounds) is needed to pull apart pieces of Douglas fir 4 inches long and 1.5 inches square? The file wood.csv contains the data collected from students doing a laboratory exercise. It has only one variable called “load”.

a)   [1 point] Compute the mean and standard deviation (sd) of the data.

b)   [4 points] Perform the following steps:

1.   Randomly select n = 5 observations from the data (no need to set a random seed) to form a sample.

2.   Compute the sample mean of the selected sample.

3.   Repeat the procedure above 2500 times to get a sequence of sample means.

c)   [3 points] Repeat part b) with n changed to 50, 500, and 5000 to produce three extra sequences. Draw a histogram for each of the sequences (with a fitted density) and overlay them together using different colors.

d)   [3 points] From the histograms in the previous two parts, can you spot any pattern in the shapes of the histograms as n increases? Can you recall a related theorem we mentioned in the course?
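
The resampling loop in Question 6 can be sketched with the standard library alone. The example below uses synthetic placeholder values instead of the real “load” column, draws with replacement (an assumption, made so that n may exceed the number of observations), fixes a seed only to make the sketch reproducible, and stops at n = 500 to keep the run short. The histograms themselves are left out.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic placeholder for the "load" variable (not the real data).
loads = [rng.gauss(100, 15) for _ in range(1000)]

def sample_means(data, n, reps):
    """Draw `reps` samples of size n (with replacement) and return their means."""
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]

# One sequence of 2500 sample means per sample size.
means_by_n = {n: sample_means(loads, n, 2500) for n in (5, 50, 500)}

# Spread of each sequence, next to the value sd / sqrt(n) for comparison.
data_sd = statistics.stdev(loads)
spread = {n: statistics.stdev(means) for n, means in means_by_n.items()}

for n in means_by_n:
    print(n, round(spread[n], 3), round(data_sd / n ** 0.5, 3))
```

Comparing the two printed columns shows how the spread of the sample means changes with n, which is the pattern part d) asks about.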