LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA
This lab will assist you in learning how to summarize and display categorical and quantitative data in R Commander. In particular, you will learn how to obtain frequency and contingency tables for categorical data and display the data with bar charts and pie charts. You will also learn how to obtain the appropriate measures of center and spread for quantitative data and display the data with histograms and boxplots. Finally, you will study how to display data over time with a time plot. The document should be used as a reference in your work on the Lab 1 assignment.
1. Summarizing and Displaying Categorical Data
Categorical variables such as Sex (possible values: male, female) and Smoker (possible values: smoker, non-smoker), both described below, can be summarized by providing the counts (frequencies) or proportions (relative frequencies) of observations falling into each category.
To demonstrate the graphical and numerical tools in R Commander, we will use the Framingham Heart Study data file introduced in the Introductory Lab; however, we will add one more column: Smoker (column 4) to the introlabdata.txt data file defined below. For your convenience, we will also provide the definitions of the other three variables in the data file along with the extended data file below:
Column | Variable | Description of Variable
1 | Sex | M-Male, F-Female
2 | Age | 30-64 years
3 | Systolic | Systolic blood pressure (82-300 mmHg)
4 | Smoker | 0 if not a current smoker, 1 if current smoker
Sex | Age | Systolic | Smoker
F | 59 | 170 | 1
M | 35 | 130 | 0
M | 46 | 136 | 0
F | 43 | 96 | 0
M | 53 | 120 | 0
M | 50 | 110 | 0
M | 33 | 100 | 0
M | 57 | 145 | 1
F | 41 | 132 | 0
F | 40 | 112 | 0
M | 54 | 140 | 0
M | 53 | 148 | 1
F | 53 | 165 | 1
M | 49 | 100 | 0
Add the entries in a new column (Smoker) to the introlabdata.txt data file used in the Introductory Lab.
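For readers who prefer to work from the R Script window, the extended data set can also be typed in directly. The sketch below is one possible way to do this; it assumes the data set name Lab1_Data (the name that appears in the script commands later in this document) and enters the 14 rows by hand rather than reading introlabdata.txt.

```r
# Enter the 14 observations by hand. Lab1_Data is assumed as the data set
# name because the script commands later in this lab use it.
Lab1_Data <- data.frame(
  Sex      = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Age      = c(59, 35, 46, 43, 53, 50, 33, 57, 41, 40, 54, 53, 53, 49),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100),
  Smoker   = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
  stringsAsFactors = TRUE   # store Sex as a factor rather than as plain text
)

str(Lab1_Data)    # structure: 14 observations of 4 variables
head(Lab1_Data)   # first six rows
```

Smoker is left as a 0/1 numeric column here, which is how it appears in the data file before it is converted to a factor.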
1.1 Summaries for Categorical Data: Frequency and Contingency Tables
Frequency and Relative Frequency Tables
Select Statistics > Summaries > Frequency distributions …
Select the variable Sex and click Ok.
The feature provides the frequency and relative frequency for each distinct value within selected columns.
counts:
Sex
F M
5 9
percentages:
Sex
F M
35.71 64.29
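The same counts and percentages can be reproduced with a few lines of base R. This is only a sketch of the calculation, not necessarily the command R Commander writes to the R Script window; the Sex values are retyped from the data table so that the example runs on its own.

```r
# Sex column of the 14-row data set listed earlier in this lab
Sex <- factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"))

counts <- table(Sex)                                  # frequencies
percentages <- round(100 * counts / sum(counts), 2)   # relative frequencies in %

print(counts)
print(percentages)
```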
Suppose we were interested in obtaining the frequencies and relative frequencies of females and males with systolic blood pressure greater than 135 mmHg. First, we will subset our data to include only those observations with systolic blood pressure over 135. To do this, go to Data > Active data set > Subset active data set …
Set the subset expression to be Systolic > 135, rename the data set and click Ok. Your new data set should only include observations with systolic blood pressure over 135 mmHg.
Now, we can obtain the frequency and relative frequency tables for the Sex variable like we did previously, but only considering the observations with Systolic > 135.
counts:
Sex
F M
2 4
percentages:
Sex
F M
33.33 66.67
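In script form, the subsetting step and the frequency table for the subset can be sketched with the base R function subset(). The name HighBP for the new data set is a hypothetical choice; use whatever name you entered in the dialog.

```r
# Only the two columns needed for this example, retyped from the data table
Lab1_Data <- data.frame(
  Sex      = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
)

# Keep only the rows that satisfy the subset expression Systolic > 135
HighBP <- subset(Lab1_Data, subset = Systolic > 135)   # HighBP is a hypothetical name

counts <- table(HighBP$Sex)
print(counts)                                  # frequencies within the subset
print(round(100 * counts / sum(counts), 2))    # relative frequencies in %
```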
Contingency Tables
The association between two categorical variables can be summarized with a contingency table. The rows in this table list the categories of one variable and the columns list the categories of the other variable. Each cell in the table is the frequency of observations for a specific combination of values of the two variables.
First, ensure you have switched back to the original data set, not the subset that includes only observations with Systolic > 135. To create a two-way contingency table, R needs both variables to be classified as a “factor”. Currently, the Smoker variable is treated as an integer when it really represents the levels of a categorical variable. To change this variable into a factor, go to Data > Manage variables in active data set > Convert numeric variables to factors …
Select the Smoker variable and, if you wish, give the new factor a different name. Here we kept the original variable name, so you will be asked whether you want to overwrite the original variable; answer Yes.
Define the level names for Smoker as No (for 0) and Yes (for 1). Now, go to Statistics > Contingency Tables > Two-way table …
Choose your row and column variable.
Under the Statistics tab, select your desired percentage option. You do not need to select any of the hypothesis test options for this lab.
Selecting “Row percentages”, the following output will be generated:
Frequency table:
Smoker
Sex No Yes
F 3 2
M 7 2
Row percentages:
Smoker
Sex No Yes Total Count
F 60.0 40.0 100 5
M 77.8 22.2 100 9
If instead we clicked “Column percentages”, we would get the following results:
Column percentages:
Smoker
Sex No Yes
F 30 50
M 70 50
Total 100 100
Count 10 4
Alternatively, selecting “Percentages of total”, we would obtain the following output:
Total percentages:
No Yes Total
F 21.4 14.3 35.7
M 50.0 14.3 64.3
Total 71.4 28.6 100.0
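The factor conversion and the three kinds of percentages can be checked by hand with base R. The sketch below uses factor(), xtabs() and prop.table(); it illustrates the arithmetic behind the tables and is not claimed to be the exact command R Commander generates.

```r
# Sex and Smoker columns, retyped from the data table
Lab1_Data <- data.frame(
  Sex    = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Smoker = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0)
)

# Convert the 0/1 codes into a factor with the level names No and Yes
Lab1_Data$Smoker <- factor(Lab1_Data$Smoker, levels = c(0, 1),
                           labels = c("No", "Yes"))

tab <- xtabs(~ Sex + Smoker, data = Lab1_Data)   # two-way frequency table
print(tab)

row_pct   <- round(100 * prop.table(tab, margin = 1), 1)   # each row sums to 100
col_pct   <- round(100 * prop.table(tab, margin = 2), 1)   # each column sums to 100
total_pct <- round(100 * prop.table(tab), 1)               # all cells sum to 100

print(row_pct)
print(col_pct)
print(total_pct)
```

The margin argument decides which total each count is divided by: 1 for row totals, 2 for column totals, and no margin for the grand total of 14.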
1.2 Graphs for Categorical Data
Bar Plots
Bar plots use vertical bars to display the frequency or relative frequency for all distinct values (categories) of selected columns. The height of each bar is equal to the frequency or relative frequency for the corresponding value (category). Bar plots can be used to examine the association between two categorical variables like sex and smoking status. For example, to explore the association between the Sex and Smoker variables in the Framingham Heart Study data file, we can obtain a bar plot for the variable Smoker for each sex category. Click Graphs > Bar graph …
Select Sex as the variable and, where it says “Plot by groups…”, select Smoker.
Under Options, you can make a few adjustments to your bar plot:
● Axis Scaling: you have the option to select either Frequency counts or Percentages depending on what you would like to display.
● Color selection: if you would like to change the colours of your graph you can do so by manually changing the code in the R Script or by using the color palette.
● Style of Group Bars: you can choose between side-by-side (recommended) or stacked bar plots.
● Percentages for Group Bars: choose between conditional or total percentages.
● Position of legend: decide where the legend will appear on your graph.
● Axis labels and title: you can edit the axis labels and title of your bar plot.
Click Ok when you are ready to make your bar plot.
Here is an example of what would happen if we selected Frequency counts instead of Percentages:
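In script form, a side-by-side bar plot of this kind can be sketched with the base R function barplot(). This is an assumed equivalent for illustration, not the command R Commander itself issues; the colours and titles are arbitrary choices.

```r
# Sex and Smoker columns, retyped from the data table
Sex    <- factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"))
Smoker <- factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                 levels = c(0, 1), labels = c("No", "Yes"))

# Rows are the groups (Smoker); columns are the categories on the x-axis (Sex)
counts <- table(Smoker, Sex)

# Side-by-side bars of frequency counts
barplot(counts, beside = TRUE, legend.text = TRUE,
        args.legend = list(title = "Smoker"),
        col = c("lightblue", "dodgerblue"),
        xlab = "Sex", ylab = "Frequency",
        main = "Smoker Status by Sex")

# Conditional percentages: smoker status within each sex
pct <- 100 * prop.table(counts, margin = 2)
barplot(pct, beside = TRUE, legend.text = TRUE,
        args.legend = list(title = "Smoker"),
        col = c("lightblue", "dodgerblue"),
        xlab = "Sex", ylab = "Percent",
        main = "Smoker Status by Sex (percent within sex)")
```

Setting beside = FALSE instead would give stacked bars.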
Pie Charts
A pie chart consists of several slices, one for each distinct value of a categorical variable, and the size of each slice corresponds to the percentage (relative frequency) of observations in the category. To create a pie chart, select Graphs > Pie chart …
Select the variable, say Smoker, and then you can adjust the titles/axis labels as needed.
Click Ok to create the following pie chart.
Here we can see that 71% of the people in the study were non-smokers and 29% were smokers.
Unfortunately, R Commander does not have an option to plot pie charts by group. For example, if you wanted to plot pie charts of smoker status for each level of sex, you would have to subset the data set for each sex and make individual pie charts of smoker status for males and females. Note that to subset the data based on a categorical variable, we use two equals signs (==) to test whether a value equals a given category. For example, the subset expression to look at only the females in the data set would be: Sex == "F".
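The subset-then-plot workaround can be sketched in the R Script window as follows. The data set names Females and Males are hypothetical, and the base R function pie() is used here for illustration.

```r
# Sex and Smoker columns, retyped from the data table
Lab1_Data <- data.frame(
  Sex    = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Smoker = factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                  levels = c(0, 1), labels = c("No", "Yes"))
)

# One subset per sex; note the double equals sign and the straight quotes
Females <- subset(Lab1_Data, subset = Sex == "F")   # hypothetical name
Males   <- subset(Lab1_Data, subset = Sex == "M")   # hypothetical name

# Draw the two pie charts next to each other
op <- par(mfrow = c(1, 2))
pie(table(Females$Smoker), main = "Smoker status: females")
pie(table(Males$Smoker),   main = "Smoker status: males")
par(op)   # restore the previous plot layout
```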
2. Summarizing and Displaying Quantitative Data
Now you will learn how to obtain the measures of center and spread for quantitative data and how to display the data with histograms and boxplots.
2.1 Summaries for Quantitative Data
R Commander provides several descriptive statistics for single variables as well as measures that indicate the extent to which two variables co-vary (tend to rise or fall together).
Numerical Summaries
The numerical summaries option reports the following descriptive statistics in the output shown below: mean, standard deviation (sd), interquartile range (IQR), the quantiles (0%, 25%, 50%, 75% and 100%, that is, the minimum, first quartile Q1, median, third quartile Q3 and maximum), and sample size (n). Other statistics, such as the standard error of the mean, can be selected as well. To obtain numerical summary statistics for a certain variable, click on Statistics > Summaries > Numerical summaries …
First, select a variable (here, Systolic) and summarize by Smoker if you want separate summary statistics for each Smoker category. Under the Statistics tab you can select which statistics you would like to output. If you do not wish to compute one of the default statistics, remove the check mark next to it. Your output will look like this:
mean sd IQR 0% 25% 50% 75% 100% Systolic:n
No 117.6 16.26995 29 96 102.50 116.0 131.50 140 10
Yes 157.0 12.35584 19 145 147.25 156.5 166.25 170 4
We can see that the mean systolic blood pressure is higher for smokers (157 mmHg) than non-smokers (117.6 mmHg). The median (50th percentile) is 116 mmHg and 156.5 mmHg for non-smokers and smokers, respectively. For measures of spread, we can see that the IQRs are 29 mmHg and 19 mmHg and the standard deviations are 16.27 mmHg and 12.36 mmHg for non-smokers and smokers, respectively. Also, the sample size for non-smokers is 10 while the sample size for smokers is 4. Note that 0% corresponds to the minimum observation and 100% corresponds to the maximum observation for each group.
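The figures in this table can be verified with base R functions. The sketch below uses a small helper, summarize_group(), which is a hypothetical name introduced only for this example; it relies on the default quantile method of quantile().

```r
# Systolic and Smoker columns, retyped from the data table
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
Smoker   <- factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                   levels = c(0, 1), labels = c("No", "Yes"))

# Hypothetical helper: the statistics shown in the output above
summarize_group <- function(x) {
  c(mean = mean(x), sd = sd(x), IQR = IQR(x),
    quantile(x),          # 0%, 25%, 50%, 75%, 100%
    n = length(x))
}

# One row of statistics per Smoker category
result <- t(sapply(split(Systolic, Smoker), summarize_group))
print(round(result, 2))
```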
2.2 Displaying Quantitative Data: Histograms and Boxplots
Now we will discuss the graphical tools to display quantitative data.
Histograms
The histogram is one of the most important statistical tools for displaying quantitative data. To obtain a histogram, we divide the range of the data into non-overlapping intervals of equal width, count the number of observations falling into each interval, and create a bar with height equal to the frequency (frequency histogram) or relative frequency (relative frequency histogram) for each interval. In a density histogram, the bar heights are instead calculated by dividing the relative frequency by the interval width so that the total area of the bars equals 1.
The intervals are called bins, and the bins are uniquely specified once the starting endpoint and the common interval width are provided. An observation that falls exactly on an endpoint shared by two bins must be counted in only one of them, so each interval includes one endpoint and excludes the other: either left-inclusion, [a,b), or right-inclusion, (a,b].
Suppose we would like to compare the distributions of systolic blood pressure for the two sex groups in the Framingham Heart Study example. Click Graphs > Histogram …
Select the Systolic variable, click Plot by groups… and choose Sex to generate separate histograms for males and females.
Under the Options tab, you can customize the number of bins, the axis labels and title, and whether to create a frequency or relative frequency (percent) histogram.
To change the colour of your histogram from the default of dark gray, manually change the colour in the R Script to whatever colour you desire, highlight the command and resubmit the code:
with(Lab1_Data, Hist(Systolic, groups=Sex, scale="percent", col="lightblue", xlab="Systolic BP", main="Relative Frequency Histograms of Systolic BP by Sex"))
If you are plotting a single histogram for a variable (i.e., not plotting by groups), you can manually set the bin widths by adjusting the breaks in the R Script file. For example, the following R code will plot a histogram of systolic blood pressure with the bins ranging from 95 to 175 with a bin width of 10.
with(Lab1_Data, Hist(Systolic, scale="percent", breaks=seq(95,175,10),
col="lightblue", xlab="Systolic BP", ylab="Percent", main="Relative Frequency Histogram of Systolic BP"))
Also, the default for histograms is right-inclusion (i.e., (a,b]). If you would like to change this to left-inclusion (i.e., [a,b)), please add the argument right=FALSE into the R script for the histogram.
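To see what the inclusion rule changes, the bin counts can be computed without drawing anything. The sketch below uses the base R function hist() with plot = FALSE (rather than Hist(), so that it runs without R Commander loaded) and the same breaks as in the example above. It also checks that density heights, relative frequency divided by bin width, give a total area of 1.

```r
# Systolic column, retyped from the data table
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
breaks   <- seq(95, 175, 10)   # 95, 105, ..., 175

# Count observations per bin under each inclusion rule (no plot is drawn)
h_right <- hist(Systolic, breaks = breaks, plot = FALSE)                 # (a, b]
h_left  <- hist(Systolic, breaks = breaks, right = FALSE, plot = FALSE)  # [a, b)

# Only observations sitting exactly on a break (145 and 165) change bins
print(rbind(right_inclusion = h_right$counts,
            left_inclusion  = h_left$counts))

# Density heights: relative frequency divided by bin width
width   <- diff(breaks)
density <- (h_left$counts / sum(h_left$counts)) / width
print(sum(density * width))   # total area of the bars
```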
Boxplots
The boxplot is a graph of the five-number summary: minimum value, first quartile (Q1), median, third quartile (Q3), and maximum value. The distance from Q1 to Q3 is called the interquartile range (IQR). We will demonstrate the feature using the Framingham Heart Study data. Suppose we wish to obtain side-by-side boxplots of systolic blood pressure for males and females. Click Graphs > Boxplot …
Select the Systolic variable and click “Plot by groups…” and select Sex.
Adjust axis labels and the title under the Options tab.
The lower and upper fences are located 1.5*IQR below Q1 and above Q3, respectively. A point beyond the lower or upper fence is considered an outlier.
We can adjust the colours of the boxplots manually in the R Script. If we want to specify a different colour for each boxplot, we need to store the colour names in a vector. In R, we use the c() function to create a vector and separate its elements with commas.
Boxplot(Systolic~Sex, data=Lab1_Data, id=list(method="y"), ylab="Systolic BP",
main="Boxplots of Systolic BP by Sex", col=c("lightblue","dodgerblue"))
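The fence rule described above can be worked through numerically. The sketch below computes Q1, Q3, the IQR and both fences for each sex with a hypothetical helper, fences(), and then asks the base R boxplot() function which points it would flag. Note that boxplot() computes the box edges as hinges, which can differ slightly from quantile() for some sample sizes; for these two groups they coincide.

```r
# Sex and Systolic columns, retyped from the data table
Sex      <- factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"))
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)

# Hypothetical helper: quartiles, IQR and the 1.5*IQR fences
fences <- function(x) {
  q   <- quantile(x, c(0.25, 0.75))
  iqr <- q[[2]] - q[[1]]
  c(Q1 = q[[1]], Q3 = q[[2]], IQR = iqr,
    lower = q[[1]] - 1.5 * iqr,
    upper = q[[2]] + 1.5 * iqr)
}

result <- t(sapply(split(Systolic, Sex), fences))
print(result)

# Points that boxplot() would draw individually as outliers (none are drawn here)
outliers <- boxplot(Systolic ~ Sex, plot = FALSE)$out
print(outliers)
```

Every observation lies inside the fences of its group, so neither boxplot shows an outlier for this data set.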
Scatterplots
Scatterplots allow the user to obtain a plot of one numerical variable versus another numerical variable.
We will demonstrate how to construct a scatterplot by plotting Systolic (y) vs. Age (x). Click on Graphs > Scatterplot …
Select the x- and y-variables. Note that we also have the options to plot separate scatterplots by group (e.g., male/female) if desired.
You can customize aspects such as the labels, title and plotting characters under the Options tab. Setting plotting characters equal to 19 results in filled circles.
Overall, we can see that there seems to be a positive relationship between Age and Systolic, meaning that as age increases, systolic blood pressure typically increases as well.
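A comparable scatterplot can be drawn from the R Script window with the base R function plot(), shown here as an assumed equivalent of the menu command. The correlation coefficient computed at the end is one numerical measure of how the two variables co-vary; a positive value agrees with the upward pattern in the plot.

```r
# Age and Systolic columns, retyped from the data table
Age      <- c(59, 35, 46, 43, 53, 50, 33, 57, 41, 40, 54, 53, 53, 49)
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)

# pch = 19 draws filled circles, as noted for the plotting characters option
plot(Age, Systolic, pch = 19,
     xlab = "Age (years)", ylab = "Systolic BP (mmHg)",
     main = "Systolic BP vs. Age")

# Sample correlation between the two variables
r <- cor(Age, Systolic)
print(r)
```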
2024-01-27