
LAB 1 INSTRUCTIONS

DESCRIBING AND DISPLAYING DATA

This lab will help you learn how to summarize and display categorical and quantitative data in R Commander. In particular, you will learn how to obtain frequency and contingency tables for categorical data and display the data with bar charts and pie charts. You will also learn how to obtain the appropriate measures of center and spread for quantitative data and display the data with histograms and boxplots. Finally, you will study how to display data over time with a time plot. This document should be used as a reference as you work on the Lab 1 assignment.

1.          Summarizing and Displaying Categorical Data

Categorical variables such as Sex (possible values: male, female) and Smoker (possible values: smoker, non-smoker) can be summarized by providing the counts (frequencies) or proportions (relative frequencies) of observations falling into each category.

To demonstrate the graphical and numerical tools in R Commander, we will use the Framingham Heart Study data file introduced in the Introductory Lab; however, we will add one more column: Smoker (column 4) to the introlabdata.txt data file defined below. For your convenience, we will also provide the definitions of the other three variables in the data file along with the extended data file below:

Column            Variable            Description of Variable

1                        Sex                   M-Male, F-Female,

2                        Age                    30-64 years,

3                        Systolic              Systolic blood pressure (82-300 mmHg),

4                        Smoker              0 if not a current smoker, 1 if current smoker.

Sex      Age      Systolic      Smoker
F        59       170           1
M        35       130           0
M        46       136           0
F        43       96            0
M        53       120           0
M        50       110           0
M        33       100           0
M        57       145           1
F        41       132           0
F        40       112           0
M        54       140           0
M        53       148           1
F        53       165           1
M        49       100           0

Add the entries in a new column (Smoker) to the introlabdata.txt data file used in the Introductory Lab.
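For readers who prefer to work in the R Script window, the extended data set can also be typed in directly. The sketch below is one way to do this; it reuses the data set name Lab1_Data that appears in the R commands later in this document, and the commented-out read.table() line is an assumption about how the text file is laid out (a header row in the first line).

```r
# Build the extended data set by hand (values copied from the table above).
Lab1_Data <- data.frame(
  Sex      = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Age      = c(59, 35, 46, 43, 53, 50, 33, 57, 41, 40, 54, 53, 53, 49),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100),
  Smoker   = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0)
)

# Alternative (assumption: the file has a header row and sits in the working directory):
# Lab1_Data <- read.table("introlabdata.txt", header = TRUE)

str(Lab1_Data)   # check the variable types
head(Lab1_Data)  # look at the first few rows
```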

1.1        Summaries for Categorical Data: Frequency and Contingency Tables

Frequency and Relative Frequency Tables

Select Statistics > Summaries > Frequency distributions 

 

Select the variable Sex and click Ok.

 

The feature provides the frequency and relative frequency for each distinct value within selected columns.

counts:
Sex
F M
5 9

percentages:
Sex
    F     M
35.71 64.29
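The same counts and percentages can be reproduced with base R functions. This is a minimal sketch, not necessarily the exact command R Commander writes to the R Script window; the Sex values are typed in from the data table.

```r
# Sex values for the 14 observations in the data table.
Sex <- factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"))

counts <- table(Sex)                               # frequencies
percentages <- round(100 * prop.table(counts), 2)  # relative frequencies in percent

print(counts)
print(percentages)
```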

Suppose we were interested in obtaining the frequencies and relative frequencies of females and males with systolic blood pressure greater than 135 mmHg. First, we will subset our data to include only those observations with systolic blood pressure over 135. To do this, go to Data > Active data set > Subset active data set

 

Set the subset expression to be Systolic > 135, rename the data set and click Ok. Your new data set should only include observations with systolic blood pressure over 135 mmHg.

 

Now, we can obtain the frequency and relative frequency tables for the Sex variable like we did previously, but only considering the observations with Systolic > 135.

counts:
Sex
F M
2 4

percentages:
Sex
    F     M
33.33 66.67
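In script form, the subsetting step corresponds to the subset() function. The sketch below assumes the new data set is called High_BP (a name chosen here for illustration; you may use any name in the dialog).

```r
Lab1_Data <- data.frame(
  Sex      = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
)

# Keep only the observations with systolic blood pressure over 135 mmHg.
High_BP <- subset(Lab1_Data, Systolic > 135)

counts <- table(High_BP$Sex)
percentages <- round(100 * prop.table(counts), 2)

print(counts)
print(percentages)
```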

Contingency Tables

The association between two categorical variables can be summarized with a contingency table. The rows in this table list the categories of one variable and the columns list the categories of the other variable. Each cell in the table is the frequency of observations for a specific combination of values of the two variables.

First, ensure you have switched back to the original data set, not the subset that includes only observations with Systolic > 135. To create a two-way contingency table, R needs to have two variables classified as a “factor”. Currently, the Smoker variable is treated as an integer, although it really represents the levels of a categorical variable. To change this variable into a factor, go to Data > Manage variables in active data set > Convert numeric variables to factors

 

Select the Smoker variable and, if you wish, give the new factor a different name. Here we kept the original variable name, so you will be asked whether you want to overwrite the original variable; answer Yes.

 

Define the level names for Smoker as 0-No and 1-Yes. Now, go to Statistics > Contingency Tables > Two-way table 

 

Choose your row and column variable.

 

Under the Statistics tab, select your desired percentage option. You do not need to select any hypothesis tests options for this lab.

Selecting “Row percentages”, the following output will be generated:

Frequency table:
   Smoker
Sex No Yes
  F  3   2
  M  7   2

Row percentages:
   Smoker
Sex   No  Yes Total Count
  F 60.0 40.0   100     5
  M 77.8 22.2   100     9

If instead we selected “Column percentages”, we would get the following results:

Column percentages:
       Smoker
Sex      No Yes
  F      30  50
  M      70  50
  Total 100 100
  Count  10   4

Alternatively, selecting “Percentages of total”, we would obtain the following output:

Total percentages:
        No  Yes Total
F     21.4 14.3  35.7
M     50.0 14.3  64.3
Total 71.4 28.6 100.0
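The factor conversion and the three kinds of percentages can also be sketched in base R. This is an illustration only: R Commander generates its own commands for these menu items, and the object names tab, row_pct, col_pct and total_pct are chosen here for readability.

```r
Lab1_Data <- data.frame(
  Sex    = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Smoker = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0)
)

# Convert the 0/1 variable to a factor with level names No and Yes.
Lab1_Data$Smoker <- factor(Lab1_Data$Smoker, levels = c(0, 1), labels = c("No", "Yes"))

# Two-way table of counts: rows are Sex, columns are Smoker.
tab <- xtabs(~ Sex + Smoker, data = Lab1_Data)

row_pct   <- round(100 * prop.table(tab, 1), 1)  # each row sums to 100
col_pct   <- round(100 * prop.table(tab, 2), 1)  # each column sums to 100
total_pct <- round(100 * prop.table(tab), 1)     # all cells sum to 100

print(tab)
print(row_pct)
print(col_pct)
print(total_pct)
```

The second argument of prop.table() selects the margin: 1 divides each cell by its row total, 2 by its column total, and leaving it out divides by the grand total.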

1.2        Graphs for Categorical Data

Bar Plots

Bar plots use vertical bars to display the frequency or relative frequency for all distinct values (categories) of selected columns. The length of each bar is equal to the frequency or relative frequency for the corresponding value (category). Bar plots can be used to examine the association between two categorical variables like sex and smoking status. For example, to explore the association between Sex and Smoker variables in the Framingham Heart Study data file, we can obtain a bar plot for the variable Smoker for each sex category. Click Graphs > Bar graph 

 

Select Sex as the variable and, where it says “Plot by groups…”, select Smoker.

Under Options, you can make a few adjustments to your bar plot:

●    Axis Scaling:  you have the option to select either Frequency counts or Percentages depending on what you would like to display.

●    Color selection: if you would like to change the colours of your graph you can do so by manually changing the code in the R Script or by using the color palette.

●    Style of Group Bars: you can choose between side-by-side (recommended) or stacked bar plots.

●    Percentages for Group Bars: choose between conditional or total percentages.

●    Position of legend: decide where the legend will appear on your graph.

●    Axis labels and title: you can edit the axis labels and title of your bar plot.

Click Ok when you are ready to make your bar plot.
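A comparable side-by-side bar plot of frequency counts can be drawn with the base R barplot() function. This is a sketch under the assumption that Smoker has already been converted to a factor with levels No and Yes; the colours, labels and title are illustrative choices, and the plot R Commander produces may look slightly different.

```r
Lab1_Data <- data.frame(
  Sex    = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Smoker = factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                  levels = c(0, 1), labels = c("No", "Yes"))
)

# Rows (Smoker) become the bars within each group; columns (Sex) become the groups.
counts <- table(Lab1_Data$Smoker, Lab1_Data$Sex)

barplot(counts, beside = TRUE,              # beside = TRUE gives side-by-side bars
        col = c("lightblue", "dodgerblue"),
        legend.text = TRUE,
        args.legend = list(title = "Smoker", x = "topleft"),
        xlab = "Sex", ylab = "Frequency",
        main = "Smoking Status by Sex")
```

Setting beside = FALSE would give the stacked version of the same plot.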

 

Here is an example of what would happen if we selected Frequency counts instead of Percentages:

Pie Charts

A pie chart consists of several slices corresponding to all distinct values of a categorical variable, and the size of each slice corresponds to the percentage (relative frequency) of observations in the category. To create a pie chart, select Graphs > Pie chart

Select the variable, say Smoker, and then you can adjust the titles/axis labels as needed.

 


Click Ok to create the following pie chart.

 

Here we can see that 71% of the people in the study were non-smokers and 29% were smokers.

Unfortunately, R Commander does not have an option to plot pie charts by group. For example, if you wanted to plot pie charts of smoker status for each level of sex, you would have to subset the data set for each sex and make individual pie charts of smoker status for males and females. Note that to subset the data based on a categorical variable, we use two equals signs to test whether the condition is true or false. For example, the subset expression to look at only the females in the data set would be: Sex == "F".
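The subsetting approach described above could look like the following in the R Script window. This is a sketch: the data set names Females and Males are chosen here for illustration, and Smoker is assumed to have been converted to a factor with levels No and Yes.

```r
Lab1_Data <- data.frame(
  Sex    = factor(c("F","M","M","F","M","M","M","M","F","F","M","M","F","M")),
  Smoker = factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                  levels = c(0, 1), labels = c("No", "Yes"))
)

# One subset per sex; note the double equals sign in the condition.
Females <- subset(Lab1_Data, Sex == "F")
Males   <- subset(Lab1_Data, Sex == "M")

old_par <- par(mfrow = c(1, 2))  # two plots side by side
pie(table(Females$Smoker), col = c("lightblue", "dodgerblue"),
    main = "Smoker Status: Females")
pie(table(Males$Smoker), col = c("lightblue", "dodgerblue"),
    main = "Smoker Status: Males")
par(old_par)                     # restore the previous plot layout
```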

2.          Summarizing and Displaying Quantitative Data

Now you will learn how to obtain the measures of center and spread for quantitative data and how to display the data with histograms and boxplots.

2.1        Summaries for Quantitative Data

R Commander provides several descriptive statistics for single variables as well as measures that indicate the extent to which two variables co-vary (tend to rise or fall together).

Numerical Summaries

By default, the numerical summaries option reports the mean, standard deviation (sd), interquartile range (IQR), quantiles (0%, 25%, 50%, 75%, 100%) and sample size (n). The 0%, 25%, 50%, 75% and 100% quantiles are the minimum, first quartile (Q1), median, third quartile (Q3) and maximum, respectively. To obtain numerical summary statistics for a certain variable, click on Statistics > Summaries > Numerical summaries

 

First, select a variable (here, Systolic) and summarize by Smoker if you want separate summary statistics for each Smoker category. Under the Statistics tab, you can select which statistics to output. If you do not wish to compute one of the default statistics, remove the check mark next to it. Your output will look like this:

     mean       sd IQR  0%    25%   50%    75% 100% Systolic:n
No  117.6 16.26995  29  96 102.50 116.0 131.50  140         10
Yes 157.0 12.35584  19 145 147.25 156.5 166.25  170          4

We can see that the mean systolic blood pressure is higher for smokers (157 mmHg) than non-smokers (117.6 mmHg). The median (50th percentile) is 116 mmHg and 156.5 mmHg for non-smokers and smokers, respectively. For measures of spread, we can see that the IQRs are 29 mmHg and 19 mmHg and standard deviations are 16.27 mmHg and 12.36 mmHg for non-smokers and smokers, respectively. Also, the sample size for non-smokers is 10 while the sample size for smokers is 4. Note that 0% corresponds to the minimum observation and 100% corresponds to the maximum observation for each group.
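The grouped summary can be reproduced with base R functions, which may help in checking the output by hand. The sketch below is not the command R Commander itself generates; it splits Systolic by Smoker and applies the same statistics to each group, and the object names are chosen here for illustration.

```r
Lab1_Data <- data.frame(
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100),
  Smoker   = factor(c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                    levels = c(0, 1), labels = c("No", "Yes"))
)

# Compute the statistics separately for each Smoker category.
stats_by_group <- sapply(
  split(Lab1_Data$Systolic, Lab1_Data$Smoker),
  function(x) c(mean = mean(x), sd = sd(x), IQR = IQR(x), quantile(x), n = length(x))
)

# Transpose so that each row is a group, as in the R Commander output.
summary_table <- t(stats_by_group)
print(round(summary_table, 2))
```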

2.2        Displaying Quantitative Data: Histograms and Boxplots

Now we will discuss the graphical tools to display quantitative data.

Histograms

The histogram is one of the most important graphical tools for displaying quantitative data. To obtain a histogram, we divide the range of the data into non-overlapping intervals of equal width, count the number of observations falling into each interval, and draw a bar over each interval with height equal to the frequency (frequency histogram) or relative frequency (relative frequency histogram). In a density histogram, the bar heights are instead calculated by dividing the relative frequency by the interval width, so that the total area of the bars equals 1.

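A common convention is that the left endpoint of each interval is included and the right endpoint is excluded; note, however, that R uses the opposite convention by default (see the end of this section). The intervals are called bins, and their endpoints are called breaks in R. The bins are uniquely specified if the starting point and the common interval width are provided.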
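To make the counting and scaling concrete, the sketch below bins the 14 systolic values into intervals of width 10 starting at 95, with the left endpoint included, and checks that dividing the relative frequencies by the interval width gives bars whose total area is 1. It uses the base R function hist() with plot = FALSE so that only the numbers are returned; the object names are illustrative.

```r
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)

breaks <- seq(95, 175, by = 10)  # interval endpoints: 95, 105, ..., 175

# right = FALSE gives left-inclusive intervals [a, b); plot = FALSE returns the numbers only.
h <- hist(Systolic, breaks = breaks, right = FALSE, plot = FALSE)

rel_freq    <- h$counts / sum(h$counts)  # relative frequency of each interval
bar_heights <- rel_freq / diff(breaks)   # relative frequency divided by interval width

print(h$counts)
print(sum(bar_heights * diff(breaks)))   # total area of the bars
```

The heights computed by hand agree with the density component that hist() returns.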

Suppose we would like to compare the distributions of systolic blood pressure for the two sex groups in the Framingham Heart Study example. Click Graphs > Histogram 


Select the Systolic variable, click Plot by groups… and choose Sex to generate separate histograms for males and females.

 

Under the options tab, you can customize the number of bins, the axis labels and title and whether we want to create a frequency or relative frequency (percent) histogram.

 

To change the colour of your histogram from the default of dark gray, manually change the colour in the R Script to whatever colour you desire, highlight the command and resubmit the code:

with(Lab1_Data, Hist(Systolic, groups=Sex, scale="percent", col="lightblue", xlab="Systolic BP", main="Relative Frequency Histograms of Systolic BP by Sex"))

If you are plotting a single histogram for a variable (i.e., not plotting by groups), you can manually set the bin widths by adjusting the breaks in the R Script file. For example, the following R code will plot a histogram of systolic blood pressure with the bins ranging from 95 to 175 with a bin width of 10.

with(Lab1_Data, Hist(Systolic, scale="percent", breaks=seq(95,175,10),
col="lightblue", xlab="Systolic BP", ylab="Percent", main="Relative Frequency Histogram of Systolic BP"))

Also, the default for histograms is right-inclusion (i.e., (a,b]). If you would like to change this to left-inclusion (i.e., [a,b)), add the argument right=FALSE to the R script for the histogram.
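The choice between right- and left-inclusion only matters when observations fall exactly on an interval endpoint. The sketch below illustrates this with the base R hist() function and a set of breaks chosen for illustration (90, 100, ..., 170), on which several systolic values lie exactly.

```r
Systolic <- c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)

breaks <- seq(90, 170, by = 10)  # several observations fall exactly on these endpoints

# Default: intervals of the form (a, b], so a value of 100 is counted in 90-100.
right_counts <- hist(Systolic, breaks = breaks, plot = FALSE)$counts

# right = FALSE: intervals of the form [a, b), so a value of 100 is counted in 100-110.
left_counts <- hist(Systolic, breaks = breaks, right = FALSE, plot = FALSE)$counts

print(rbind(right_inclusive = right_counts, left_inclusive = left_counts))
```

In both cases the outermost endpoint is still included (90 in the first interval by default, 170 in the last interval with right = FALSE), so no observation is dropped.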

Boxplots

The boxplot is a graph of the five-number summary: minimum value, first quartile (Q1), median, third quartile (Q3), and maximum value. The distance from Q1 to Q3 is called the interquartile range (IQR). We will demonstrate the feature using the Framingham Heart Study data. Suppose we wish to obtain side-by-side boxplots of systolic blood pressure for males and females. Click Graphs > Boxplot

 

Select the Systolic variable and click “Plot by groups…” and select Sex.

 

Adjust axis labels and the title under the Options tab.

 

The lower fence is located 1.5*IQR below Q1 and the upper fence is located 1.5*IQR above Q3. A point beyond the lower or upper fence is considered an outlier.
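The fence calculation can be carried out by hand in R. The sketch below uses a small helper function named fences (a name chosen here for illustration, not part of R or R Commander) and the quantile() function to compute the fences for each sex. Note that R's boxplot routine computes the quartiles in a slightly different way (as hinges), so in general the fences it uses may differ a little from these; for this data set the two methods agree.

```r
Lab1_Data <- data.frame(
  Sex      = c("F","M","M","F","M","M","M","M","F","F","M","M","F","M"),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
)

# Illustrative helper: lower and upper fences, 1.5*IQR beyond the quartiles.
fences <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))  # Q1 and Q3
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# One column of fences per sex.
fence_table <- sapply(split(Lab1_Data$Systolic, Lab1_Data$Sex), fences)
print(fence_table)

# Flag observations beyond the fences of their own group.
is_outlier <- with(Lab1_Data,
                   Systolic < fence_table["lower", Sex] |
                   Systolic > fence_table["upper", Sex])
print(Lab1_Data[is_outlier, ])  # no rows are printed if there are no outliers
```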

We can adjust the colours of both boxplots manually in the R Script. If we want to specify two different colours for each separate boxplot, we need to store the colour names in a vector. In R, we use the c() function to indicate a vector and separate each argument with a comma.

Boxplot(Systolic~Sex, data=Lab1_Data, id=list(method="y"), ylab="Systolic BP",

main="Boxplots of Systolic BP by Sex", col=c("lightblue","dodgerblue"))

Scatterplots

Scatterplots allow the user to obtain a plot of one numerical variable versus another numerical variable.

We will demonstrate how to construct a scatterplot by plotting Systolic (y) vs. Age (x). Click on Graphs > Scatterplot...

 

Select the x- and y-variables. Note that we also have the options to plot separate scatterplots by group (e.g., male/female) if desired.


You can customize aspects such as the labels, title and plotting characters under the Options tab. Setting plotting characters equal to 19 results in filled circles.

 

 

 

Overall, we can see that there seems to be a positive relationship between Age and Systolic, meaning that as age increases, systolic blood pressure typically increases as well.
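A similar scatterplot can be drawn from the R Script window with the base R plot() function, and the direction of the relationship can be checked numerically with the correlation coefficient. This is a sketch; the axis labels and title are illustrative, and R Commander's own scatterplot command adds further options.

```r
Lab1_Data <- data.frame(
  Age      = c(59, 35, 46, 43, 53, 50, 33, 57, 41, 40, 54, 53, 53, 49),
  Systolic = c(170, 130, 136, 96, 120, 110, 100, 145, 132, 112, 140, 148, 165, 100)
)

# pch = 19 draws filled circles.
plot(Systolic ~ Age, data = Lab1_Data, pch = 19,
     xlab = "Age (years)", ylab = "Systolic BP (mmHg)",
     main = "Systolic BP vs. Age")

# A positive correlation coefficient indicates a positive linear relationship.
r <- cor(Lab1_Data$Age, Lab1_Data$Systolic)
print(r)
```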