STAT 3690 Lecture 01
2022
Outline
● Topics to be covered
- Multivariate normal distribution
- Inference on a mean vector
- Comparisons of several multivariate means
- Multivariate linear regression
- Principal component analysis
- Factor analysis
- Canonical correlation analysis
- and so forth
R basics
● Installation
- download and install base R from https://cran.r-project.org
- download and install RStudio from https://www.rstudio.com
- download and install packages via RStudio
● Working directory
- When you ask R to open a certain file, it will look in the working directory for this file.
- When you tell R to save a data file or figure, it will save it in the working directory.
getwd()
mainDir <- "c:/"
subDir <- "stat3690Lec01"
dir.create(file.path(mainDir, subDir), showWarnings = FALSE)
setwd(file.path(mainDir, subDir))
● Packages
- installation: install.packages()
- loading: library()
install.packages("nlme")
library(nlme)
● Help manual: help(), ?, Google, Stack Overflow, etc.
● R is free but not cheap
- Open-source
- Citing packages
- NO quality control
- Requiring statistical sophistication
- Time-consuming to become a master
● References for R
- M. L. Rizzo (2019) Statistical Computing with R, 2nd Ed. (forthcoming)
- O. Jones, R. Maillardet, A. Robinson (2014) Introduction to Scientific Programming and Simulation Using R, 2nd Ed.
- . . . . . .
● Courses online
- https://www.pluralsight.com/search?q=R
- . . . . . .
● Data types: let str() or class() tell you
- numbers (integer, real, or complex)
- characters (“abc”)
- logical (TRUE or FALSE)
- date & time
- factor (commonly encountered in this course)
- NA (different from Inf, "", 0, NaN, etc.)
● Data structures: let str() or class() tell you
- vector: an ordered collection of the same data type
- matrix: two-dimensional collection of the same data type
- array: collection of the same data type with more than two dimensions
- data frame: collection of vectors of same length but of arbitrary data types
- list: collection of arbitrary objects
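A minimal sketch of the types and structures listed above, using class(), is.na() and str(). The object names (v, m, a, d, l) are made up for illustration.

```r
# Basic data types, as reported by class()
class(3L)                        # "integer"
class(3.14)                      # "numeric" (real)
class(2+1i)                      # "complex"
class("abc")                     # "character"
class(TRUE)                      # "logical"
class(Sys.Date())                # "Date"
class(factor(c("a", "b", "a")))  # "factor"

# NA marks a missing value; it is not the same as NaN
is.na(NA)    # TRUE
is.na(NaN)   # TRUE: is.na() also flags NaN
is.nan(NA)   # FALSE: NA is not NaN

# Basic data structures
v <- c(1, 2, 3)                                   # vector
m <- matrix(1:6, nrow = 2)                        # 2 x 3 matrix
a <- array(1:24, dim = c(2, 3, 4))                # 3-dimensional array
d <- data.frame(id = 1:3,
                colour = c("red", "white", "black"))  # columns of different types
l <- list(v, m, "text")                           # arbitrary objects

class(v)   # "numeric"
class(a)   # "array"
class(d)   # "data.frame"
class(l)   # "list"
str(d)     # compact overview of the data frame
```

str() is usually the quicker check for data frames and lists because it prints the type of every component at once.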
● Data input and output
- create
* vector: c(), seq(), rep()
* matrix: matrix(), cbind(), rbind()
* data frame
- output: write.table(), write.csv(), write.xlsx()
- import: read.table(), read.csv(), read.xlsx()
* the .xlsx functions are not in base R; they come from an add-on package such as xlsx or openxlsx
* header: whether the first row contains the variable names
* stringsAsFactors: whether to convert character strings to factors
- scan(): a more general way to input data
- save.image() and load(): save and reload workspace
- source(): run R script
● Brackets in R
- parentheses () to enclose inputs for functions
- square brackets [], [[]] for indexing
- braces {} to enclose the body of a for loop or of a statement such as if or else
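The three kinds of brackets can be seen side by side in a short sketch. The objects x, lst, sub1, sub2 and total are hypothetical names chosen for this illustration.

```r
# Parentheses: pass inputs to a function
x <- c(10, 20, 30)

# Single square brackets keep the container; double brackets extract one element
lst <- list(a = 1:3, b = "text")
sub1 <- lst["a"]     # still a list (of length 1)
sub2 <- lst[["a"]]   # the integer vector 1:3 itself

# Braces: group the statements of a loop body or an if/else branch
total <- 0
for (xi in x) {
  if (xi > 15) {
    total <- total + xi
  } else {
    total <- total - xi
  }
}
total   # -10 + 20 + 30 = 40
```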
# Create numeric vectors
v1 = c(1, 2, 3); v1
v2 = seq(4, 6, by=0.5); v2
v3 = c(v1, v2); v3
v4 = rep(pi, 5); v4
v5 = rep(v1, 2); v5
v6 = rep(v1, each=2); v6
# Create character vector
v7 <- c("one", "two", "three"); v7
# Select specific elements
v1[c(1, 3)]
v7[2]
# Create matrices
m1 = matrix(-1:4, nrow=2); m1
m2 = matrix(-1:4, nrow=2, byrow=TRUE); m2
m3 = cbind(m1, m2); m3
(m4 = cbind(m1, m2))
# Create a data frame
e <- c(1, 2, 3, 4)
f <- c("red", "white", "black", NA)
g <- c(TRUE, TRUE, TRUE, FALSE)
mydata <- data.frame(e, f, g)
names(mydata) <- c("ID", "Color", "Passed") # name variables
mydata
# Output
write.csv(mydata, file="mydata.csv", row.names=FALSE)
# Import
(simple = read.csv("mydata.csv", header=TRUE, stringsAsFactors=TRUE))
class(simple)
class(simple[[1]])
class(simple[[2]])
class(simple[[3]])
(simple = read.csv("mydata.csv", header=FALSE, stringsAsFactors=FALSE))
class(simple[[3]])
# EXERCISE
# Create a matrix with 2 rows and 6 columns such that it contains the numbers 1,4,7,...,34.
# Make sure the numbers are increasing row-wise; i.e., 4 should be in the second column.
# Use the seq() function to generate the numbers. Do NOT type them out by hand!
# ANSWER
# byrow=TRUE fills the matrix row by row, so that 4 lands in the second column
matrix(seq(from=1, to=34, by=3), nrow=2, byrow=TRUE)
● Elementary arithmetic operators
- +, -, *, /, ^
- log, exp, sin, cos, tan, sqrt
- FALSE and TRUE are coerced to 0 and 1, respectively, in arithmetic
- sum(), mean(), median(), min(), max(), var(), sd(), summary()
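A short sketch of these operators and summary functions on a small made-up vector z (the data are illustrative only):

```r
z <- c(2, 4, 4, 5, 10)

z^2            # element-wise power
sqrt(z)        # element-wise square root
log(exp(1))    # 1

# Logical values are coerced in arithmetic: TRUE counts as 1, FALSE as 0
n_big <- sum(z > 3)       # number of elements above 3: 4
prop_big <- mean(z > 3)   # proportion of elements above 3: 0.8

mean(z)      # 5
median(z)    # 4
var(z)       # 9 (sample variance, divisor n - 1)
sd(z)        # 3
summary(z)   # min, quartiles, mean, max
```

Summing or averaging a logical condition is a common way to count cases or compute proportions without writing a loop.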
● Matrix calculation
- element-wise multiplication: A * B
- matrix multiplication: A %*% B
- eigendecomposition: eigen(A)
- singular value decomposition: svd(A)
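The difference between the two products, and what eigen() returns, can be sketched on two small matrices. A and B are made-up examples; A is chosen symmetric so that its eigenvectors are orthogonal, and svd() is shown only for comparison.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # symmetric: rows (2, 1) and (1, 3)
B <- matrix(c(1, 0, 2, 1), nrow = 2)   # rows (1, 2) and (0, 1)

elementwise <- A * B     # entry-by-entry product: rows (2, 2) and (0, 3)
product <- A %*% B       # matrix product:         rows (2, 5) and (1, 5)

# Eigendecomposition: A = V diag(values) V^T for symmetric A
ev <- eigen(A)
A_rebuilt <- ev$vectors %*% diag(ev$values) %*% t(ev$vectors)

# Singular value decomposition; for a symmetric positive definite matrix
# the singular values coincide with the eigenvalues
sv <- svd(A)
```

For a general (non-symmetric) matrix the eigenvalues and singular values differ, so the two functions are not interchangeable.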
● Loops: for() and while()
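A minimal sketch of both loop types; the variable names and the toy tasks are chosen only for illustration.

```r
# for(): run the body once per element of 1:10
total <- 0
for (i in 1:10) {
  total <- total + i
}
total   # 55

# while(): repeat as long as the condition holds
value <- 100
steps <- 0
while (value >= 1) {
  value <- value / 2
  steps <- steps + 1
}
steps   # 7 halvings bring 100 below 1
```

Use for() when the number of iterations is known in advance and while() when it depends on a condition evaluated during the loop.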
● Probabilities
- normal distribution: dnorm(), pnorm(), qnorm(), rnorm()
- uniform distribution: dunif(), punif(), qunif(), runif()
- multivariate normal distribution: dmvnorm(), rmvnorm() (not in base R; provided by the mvtnorm package)
# Generate two datasets
set.seed(100)
x = rnorm (250 , mean=0 , sd=1)
y = runif(250 , -3 , 3)
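The prefixes d, p, q and r stand for density, cumulative probability, quantile and random generation. A brief sketch with base R functions only (the multivariate functions need an add-on package and are left out here); the object names are made up for illustration.

```r
d0   <- dnorm(0)        # N(0,1) density at 0, about 0.3989
p196 <- pnorm(1.96)     # P(Z <= 1.96), about 0.975
q975 <- qnorm(0.975)    # 97.5% quantile of N(0,1), about 1.96

# p and q functions are inverses of each other
back <- qnorm(pnorm(1.3))                # 1.3

# Uniform distribution on (-3, 3)
p_unif <- punif(0.5, min = -3, max = 3)  # (0.5 + 3) / 6
q_unif <- qunif(0.25, min = -3, max = 3) # -3 + 0.25 * 6 = -1.5

# Random draws are reproducible once the seed is fixed
set.seed(100)
draws <- rnorm(5)
```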
● Basic plots
- strip chart, histogram, box plot, scatter plot
- Package ggplot2 (RECOMMENDED)
# Strip chart
stripchart(x)
# Histogram
hist(x)
# Box plot
boxplot(x)
# Side-by-side box plot
xy = data.frame(normal=x, uniform=y)
boxplot(xy)
# Scatter Plot with fitted line
plot(x, y, xlab="x", ylab="y", main="scatter plot between x and y")
abline(lm(y~x))
# EXERCISE
# Play with a data set called "Gasoline" included in the package "nlme".
# 1. How many variables are contained in this data set? What are they?
# 2. Generate a histogram of yield and calculate the five number summary for it.
# What is the shape of the histogram?
# 3. Generate side-by-side boxplots,
# comparing the temperature at which all the gasoline is vaporized (endpoint) to sample.
# Does it seem that the temperatures at which all the gasoline is vaporized differ by sample?
# 4. Generate a plot that illustrates the relationship between yield and endpoint.
# Describe the relationship between these two variables.
# 5. What if the plot created in Q4 were separated by sample?
# Generate a plot of yield vs. endpoint, separated by sample.
# ANSWER
attach(nlme::Gasoline)
# 1. Six variables: yield, endpoint, Sample, API, vapor, ASTM
# 2.
summary(yield)
hist(yield, nclass=50)
# 3.
boxplot(endpoint ~ Sample)
anova(lm(endpoint ~ Sample))
# 4.
plot(x=endpoint, y=yield, xlab="endpoint", ylab="yield",
     main="scatter plot between endpoint and yield")
abline(lm(yield ~ endpoint))
# 5. One panel per sample, arranged in a 2 x 5 grid
par(mfrow=c(2, 5))
for (i in 1:10) {
  plot(x=endpoint[Sample==i], y=yield[Sample==i], xlab="", ylab="", main=paste("Sample=", i))
  abline(lm(yield[Sample==i] ~ endpoint[Sample==i]))
}
# Do not forget to detach the dataset after using it.
detach(nlme::Gasoline)
Matrix properties
● Determinant and trace
- Applicable only to square matrices
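- Properties of determinant and trace
* $|A^\top| = |A|$
* $|A^{-1}| = |A|^{-1}$
* $|aA| = a^n |A|$ for $n \times n$ matrix $A$ and scalar $a$
* $|AB| = |A||B|$ if $A$ and $B$ are square matrices of identical dimension
* $|A| = \prod_i \lambda_i$, where $\lambda_i$ are the eigenvalues of $A$
* $\mathrm{tr}(aA) = a\,\mathrm{tr}(A)$ for scalar $a$
* $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ if $A$ and $B$ are square matrices of identical dimension
* $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for $m \times n$ $A$ and $n \times m$ $B$
* $(\mathrm{tr}(AA^\top))^{1/2} = (\sum_{i,j} A_{ij}^2)^{1/2}$: Frobenius norm (a generalization of the Euclidean norm)
* $\mathrm{tr}(A) = \sum_i \lambda_i$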
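The properties above can be checked numerically in R. The following is only a sketch under stated assumptions: A is built as a random symmetric positive definite matrix (so its eigenvalues are real and it is safely invertible), B is a random square matrix, and tr() is a helper defined here because base R has no trace function.

```r
set.seed(3690)
n <- 3
X <- matrix(rnorm(n * n), n, n)
A <- crossprod(X) + diag(n)        # X^T X + I: symmetric positive definite
B <- matrix(rnorm(n * n), n, n)    # general square matrix
a <- 2

tr <- function(M) sum(diag(M))     # helper: trace = sum of diagonal entries
lambda <- eigen(A)$values          # eigenvalues of A
same <- function(x, y) isTRUE(all.equal(x, y))   # equality up to rounding

checks <- c(
  det_transpose = same(det(t(B)), det(B)),
  det_inverse   = same(det(solve(A)), 1 / det(A)),
  det_scalar    = same(det(a * A), a^n * det(A)),
  det_product   = same(det(A %*% B), det(A) * det(B)),
  det_eigen     = same(det(A), prod(lambda)),
  tr_scalar     = same(tr(a * A), a * tr(A)),
  tr_sum        = same(tr(A + B), tr(A) + tr(B)),
  tr_commute    = same(tr(A %*% B), tr(B %*% A)),
  frobenius     = same(sqrt(tr(A %*% t(A))), norm(A, type = "F")),
  tr_eigen      = same(tr(A), sum(lambda))
)
checks   # every entry should be TRUE
```

A numerical check on one matrix is not a proof, but it is a quick way to catch a misremembered identity.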
● Exercise: Prove that
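1. $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ for $m \times n$ $A$ and $n \times m$ $B$.
2. $\mathrm{tr}(A_1 \cdots A_k) = \mathrm{tr}(A_{k'+1} \cdots A_k A_1 \cdots A_{k'})$ for $1 \le k' \le k$.
3. $\mathrm{tr}(A) = \sum_i \lambda_i$.
4. $|A| = \prod_i \lambda_i$.
● Remark: $|A|$ and $\mathrm{tr}(A)$ can be taken as measures of the size of $A$ when $A$ is positive definite.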
● Proof:
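1. $\mathrm{tr}(AB) = \sum_i \sum_j A_{ij} B_{ji} = \sum_j \sum_i B_{ji} A_{ij} = \mathrm{tr}(BA)$.
2. Treat $A_1 \cdots A_{k'}$ and $A_{k'+1} \cdots A_k$ each as a single matrix and apply part 1.
3. Write $A = UJU^{-1}$, where $J$ is the Jordan form of $A$ (upper triangular, with the eigenvalues $\lambda_i$ on its diagonal). Then $\mathrm{tr}(A) = \mathrm{tr}(UJU^{-1}) = \mathrm{tr}(JU^{-1}U) = \mathrm{tr}(J) = \sum_i \lambda_i$.
4. $|A| = |UJU^{-1}| = |U||J||U^{-1}| = |J| = \prod_i \lambda_i$, since $|U||U^{-1}| = 1$ and the determinant of a triangular matrix is the product of its diagonal entries.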
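The cyclic property in part 2 can also be illustrated numerically, as a sketch rather than a proof. The three matrices below are random and non-square, with dimensions chosen (as an assumption for this example) so that every cyclic product is defined; tr() is again a helper written here.

```r
set.seed(1)
A1 <- matrix(rnorm(2 * 3), 2, 3)   # 2 x 3
A2 <- matrix(rnorm(3 * 4), 3, 4)   # 3 x 4
A3 <- matrix(rnorm(4 * 2), 4, 2)   # 4 x 2

tr <- function(M) sum(diag(M))     # helper: trace of a square matrix

t123 <- tr(A1 %*% A2 %*% A3)       # trace of a 2 x 2 product
t231 <- tr(A2 %*% A3 %*% A1)       # trace of a 3 x 3 product (k' = 1)
t312 <- tr(A3 %*% A1 %*% A2)       # trace of a 4 x 4 product (k' = 2)

c(t123, t231, t312)                # the three values agree up to rounding
```

The three products have different sizes, yet their traces coincide, which is exactly what the cyclic property states.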
2022-03-17