ECO 2150 Descriptive Statistics 2022
1 Introduction
Statistics that summarize numerical information
❼ Later in this course and in ECO 2151 will learn a variety of tests and procedures that can
be applied to numerical data
❼ BUT: before applying such procedures, it is a good idea to just look at one’s data
❼ Examination of the data can reveal a variety of “stylized facts,” and important anomalies
that your economic theory needs to explain
❼ Graphs are one way to explore properties of data
❼ Another way to summarize numerical data is to compute various summary measures, known
as descriptive statistics
Aspects of data to summarize
❼ Two basic types of descriptive statistics:
1. measures of central tendency
2. measures of dispersion
❼ Will look at both types of measure
❼ Note: some statistical theory underlies measures we will look at and explains why they are
used
– Statisticians have examined their properties
❼ For now, will take statisticians’ word for it that these measures have good properties
What is a statistic?
A statistic is defined as follows:
Definition. A statistic is any function of the sample information.
In contrast:
Definition. A parameter is a numerical measure that describes a specific characteristic of a population.
Descriptive statistics are estimates of unobservable population parameters, based on sample data
Population versus sample
Recall:
❼ A population is the complete set of all items in which the investigator is interested
❼ A sample is a subset of a population
❼ An observation is simply one element of the sample or population
Notation: Let $x_i$ be the value of variable $x$ for observation $i$
❼ Population consists of $x_1, \dots, x_N$
❼ Sample consists of a set of observations $x_1, \dots, x_n$, with $n < N$
2 The Summation Operator
Summation operator
Before continuing, need to review/introduce the summation operator:
Definition.
$$\sum_{i=1}^{n} X_i = \sum_{i} X_i = X_1 + X_2 + \cdots + X_n$$
Properties of $\sum$
1. Let $k$ be a constant. Then
$$\sum_{i=1}^{n} k = nk.$$
❼ IMPORTANT: When dealing with summations, a constant is anything that does not
depend on i, the index of the summation
2. Suppose again that $k$ is a constant. Then
$$\sum_{i=1}^{n} kX_i = k\sum_{i=1}^{n} X_i.$$
More properties of $\sum$
3. Let both $a$ and $b$ be constants. Then
$$\sum_{i=1}^{n} (a + bX_i) = na + b\sum_{i=1}^{n} X_i.$$
4. Sum of a sum of terms:
$$\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i.$$
5. Product of two sums:
$$\sum_{i=1}^{m}\sum_{j=1}^{n} X_i Y_j = \left(\sum_{i=1}^{m} X_i\right)\left(\sum_{j=1}^{n} Y_j\right).$$
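The five properties can be checked numerically. The following is a minimal sketch in Python; the lists X, Y, and W and the constants a, b, and k are arbitrary illustrative values, not data from the course.

```python
# Numerical check of the summation properties (all values are hypothetical).
X = [2, 5, 7, 1]      # X_1, ..., X_n
Y = [3, 0, 4, 6]      # Y_1, ..., Y_n (same length as X, used for property 4)
W = [1, 2, 3]         # a second series of a different length, used for property 5
n = len(X)
a, b, k = 10, 3, 4    # constants: they do not depend on the summation index

# Property 1: summing a constant n times gives n * k.
prop1 = sum(k for _ in range(n)) == n * k

# Property 2: a constant factor can be moved outside the sum.
prop2 = sum(k * x for x in X) == k * sum(X)

# Property 3: combines properties 1 and 2.
prop3 = sum(a + b * x for x in X) == n * a + b * sum(X)

# Property 4: the sum of a sum is the sum of the separate sums.
prop4 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)

# Property 5: a double sum of products equals the product of the two sums.
prop5 = sum(x * w for x in X for w in W) == sum(X) * sum(W)

print(prop1, prop2, prop3, prop4, prop5)
```

Integers are used so that the equalities hold exactly; with floating-point values the two sides could differ by rounding error.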
3 Measures of Central Tendency
What are measures of central tendency?
❼ Measures of central tendency are methods of defining the centre or middle of a set of numbers
❼ Three commonly-used measures of central tendency:
1. Mean
2. Median
3. Mode
❼ Will look at each measure in turn
The Mean
Definition. The mean of a set of numerical observations is the sum of the data values divided by the number of observations; that is, their average. Mathematically, we can write this as
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i \quad \text{for the population;}$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{for the sample.}$$
The Mean
❼ Also known as the arithmetic mean
❼ Mathematically, the two formulae are the same – main difference is that in one case we use
N to indicate the number of observations, while in the other case we use n
❼ Also use a different variable name on the left-hand side
❼ Conventional in statistics to use the Greek letter µ to represent population mean of a variable
❼ Also a convention to represent the sample mean of a variable by placing a bar over the name
of the variable – e.g., $\bar{x}$, $\bar{y}$
❼ May add a subscript to µ to indicate which variable it pertains to – e.g., $\mu_X$, $\mu_Y$
❼ Usually will be computing the sample mean
The Median
Definition. The median of a set of observations is the middle observation if the number of observations is odd; it is the average of the middle pair if the number of observations is even. Alternatively, the median can be defined as the value such that 50% of the observations lie above it and 50% of the observations lie below it.
To find the median:
1. Order the observations in either ascending or descending order.
2. If the number of observations is odd, the median will be the value of observation (n + 1)/2 (the middle observation).
3. If n is even, the median will be the average of the values of observations n/2 and (n + 2)/2 (the two middle observations).
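The three steps translate directly into code. This is a minimal sketch, assuming the observations are held in a Python list; the function name `median` is chosen here for illustration.

```python
def median(values):
    """Median of a list of numbers, following the three steps in the text."""
    ordered = sorted(values)          # step 1: order the observations
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # step 2: observation (n + 1)/2; Python lists are 0-based, so subtract 1
        return ordered[(n + 1) // 2 - 1]
    # step 3: average of observations n/2 and (n + 2)/2
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


print(median([3, 1, 2]))       # odd number of observations
print(median([4, 1, 3, 2]))    # even number of observations
```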
The Mode
Definition. The mode of a set of observations is the value that occurs most frequently.
❼ Some data sets do not have a mode
❼ Some data sets have more than one mode
The Geometric Mean
In calculation of growth rates or return on assets, analysts often use the geometric mean
Definition. The geometric mean of a set of values is the nth root of the product of the n values:
$$\bar{x}_g = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}.$$
Take natural log of geometric mean:
$$\ln \bar{x}_g = \frac{1}{n}\sum_{i=1}^{n} \ln x_i$$
The Geometric Mean Rate of Return
❼ Geometric mean rate of return is used to compute average percentage return of investment
over time:
$$\bar{r}_g = (x_1 x_2 \cdots x_n)^{1/n} - 1$$
❼ Geometric mean and geometric mean rate of return take compounding into account
❼ THIS IS ALL THE ATTENTION WE WILL PAY TO THE GEOMETRIC MEAN IN THIS
COURSE!
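For readers who want to see the compounding logic at work, here is a minimal sketch. It assumes that each value is a growth factor (one plus the return for that period); this interpretation and the three figures used are illustrative assumptions, not part of the course material.

```python
import math


def geometric_mean(values):
    """n-th root of the product of the n values."""
    n = len(values)
    return math.prod(values) ** (1 / n)


def geometric_mean_via_logs(values):
    """Same quantity, computed as exp of the average natural log."""
    n = len(values)
    return math.exp(sum(math.log(x) for x in values) / n)


# Hypothetical growth factors: +10%, -5%, +20% in three consecutive years.
growth_factors = [1.10, 0.95, 1.20]

g = geometric_mean(growth_factors)
rate = g - 1    # geometric mean rate of return

# Compounding the average rate for three years reproduces total growth.
total_growth = math.prod(growth_factors)
compounded = (1 + rate) ** len(growth_factors)
print(g, rate, total_growth, compounded)
```

The log version is the one usually preferred for long series, since it avoids multiplying many numbers together.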
Example: Measures of central tendency
Suppose that you have a sample of final exam grades for 10 students:
Obs. No. | Grade (%)
1 | 71
2 | 85
3 | 44
4 | 66
5 | 71
6 | 95
7 | 56
8 | 78
9 | 81
10 | 79
Find the mean, median, and mode.
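One way to check hand calculations for this example is with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

# Final exam grades (%) for the 10 students in the example.
grades = [71, 85, 44, 66, 71, 95, 56, 78, 81, 79]

mean = statistics.mean(grades)          # sum of the values divided by n
median = statistics.median(grades)      # average of the two middle values, since n is even
modes = statistics.multimode(grades)    # list of all most frequent values

print(mean, median, modes)
```

The mean is 72.6, the median is 74.5 (the average of 71 and 78), and the only mode is 71. `multimode` returns a list because a data set may have more than one mode.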
4 Measures of Dispersion
Measures of Dispersion
❼ Measures of dispersion look at how spread out the data are
– Are observations clustered closely around the central value, or do they lie far apart from each other?
❼ We will look at four measures of dispersion
The Range
Definition. The range of a set of data is the difference between the values of the largest and the smallest observations; that is,
$$\text{Range} = x_{MAX} - x_{MIN}.$$
❼ Range is simplest and most straightforward measure of spread of data
❼ PROBLEM: range doesn’t distinguish between situations where there are only one or two
outliers, and situations where the observations are fairly evenly distributed over the range
The Range and Outliers
❼ An outlier is an observation that is very different from most of the other observations in the
data set
❼ Intuitively, would think there is less dispersion if observations are evenly distributed over
range
❼ Outliers are given too much weight by range
Deviations from the mean
❼ Two alternative measures of dispersion are based on deviations from the mean
– Deviations from mean are differences $x_i - \bar{x}$ (sample) or $x_i - \mu$ (population)
❼ Deviations themselves are not that helpful – positive deviations cancel out negative ones
when they are summed
❼ Measures are
1. The mean absolute deviation
2. The standard deviation
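The cancellation of positive and negative deviations is exact, not approximate. A short derivation, using the definition of the sample mean and the properties of the summation operator, shows that the deviations from the mean always sum to zero:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.$$

This is why both measures transform the deviations (by taking absolute values or by squaring) before averaging them.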
Mean Absolute Deviation
Definition. The mean absolute deviation of a set of observations is the average of the absolute deviations.
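$$\text{MAD} = \frac{1}{N}\sum_{i=1}^{N} |x_i - \mu| \quad \text{for the population;}$$
$$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}| \quad \text{for the sample.}$$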
Mean Absolute Deviation
❼ Taking absolute values before summing ensures that positive and negative deviations do not
cancel each other out
❼ MAD is used less frequently in practice, primarily because absolute values are relatively
difficult to work with mathematically – this makes it more difficult to derive the properties of the MAD
– The absolute value function is continuous, but it is not differentiable at zero
The Variance
Definition. Let $x_1, x_2, \dots, x_n$ be a sample of $n$ observations. The sample variance, denoted $s^2$,
is defined as follows:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
If by chance we had data for the entire population, we could compute the population variance,
$\sigma^2$:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2.$$
The Standard Deviation
Definition. The standard deviation is simply the square root of the variance:
$$s = \sqrt{s^2} \quad \text{(sample)},$$
$$\sigma = \sqrt{\sigma^2} \quad \text{(population)}.$$
The Variance and the Standard Deviation
❼ Squaring deviations is another means of ensuring that positive and negative deviations do
not cancel each other out
❼ Squaring the deviations attaches a higher weight to large deviations than taking the absolute
value does
❼ Note that the formulae used to compute the population and sample variances are not
mathematically the same (will discuss why later)
– For population, divide by number of observations
– For sample, divide by the number of observations less one
Units of measurement of descriptive statistics
❼ Units of measurement of MAD, standard deviation, range, mean, median, and mode are all
the same as those of original variable
❼ Units of measurement of the variance are the original units squared
❼ Be careful with units!
The Coefficient of Variation
❼ Problem with MAD and variance: cannot be used to compare the degree of variation of
variables that are measured in different units
– Both are sensitive to units of measurement
❼ Coefficient of variation is a unit-free measure of dispersion
Definition. The coefficient of variation is the standard deviation divided by the mean; that is,
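$$CV = \frac{\sigma}{\mu} \quad \text{for the population;}$$
$$CV = \frac{s}{\bar{x}} \quad \text{for the sample.}$$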
The Coefficient of Variation
❼ CV is a measure of relative, not absolute, dispersion
❼ Can be used to compare dispersion relative to the mean of variables measured in different
units, or with very different means and variances
❼ Invariant to scaling of the data because scaling will simply adjust both the numerator and
the denominator by the same factor
❼ Measures of relative dispersion are not better than measures of absolute dispersion; they are
just different
– Income inequality!
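The invariance of the coefficient of variation to scaling can be illustrated with a small sketch. The heights below are hypothetical values chosen for illustration, and the helper name `coefficient_of_variation` is not from the course.

```python
import statistics


def coefficient_of_variation(values):
    """Sample coefficient of variation: standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical heights in centimetres
heights_m = [h / 100 for h in heights_cm]          # the same heights in metres

# The standard deviation changes with the units of measurement ...
sd_ratio = statistics.stdev(heights_cm) / statistics.stdev(heights_m)

# ... but the coefficient of variation does not.
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(sd_ratio, cv_cm, cv_m)
```

Dividing every observation by 100 divides both the standard deviation and the mean by 100, so the factor cancels in the ratio.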
Example: Measures of dispersion
Suppose that you have a sample of final exam grades for 10 students:
Obs. No. | Grade (%)
1 | 71
2 | 85
3 | 44
4 | 66
5 | 71
6 | 95
7 | 56
8 | 78
9 | 81
10 | 79
Find the range, variance, standard deviation, mean absolute deviation, and coefficient of variation.
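The hand calculations for this example can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the MAD is computed directly from the sample formula because the module has no built-in function for it.

```python
import statistics

# Final exam grades (%) for the 10 students in the example.
grades = [71, 85, 44, 66, 71, 95, 56, 78, 81, 79]
n = len(grades)
mean = statistics.mean(grades)

data_range = max(grades) - min(grades)          # largest minus smallest observation
variance = statistics.variance(grades)          # sample variance: divides by n - 1
std_dev = statistics.stdev(grades)              # square root of the sample variance
mad = sum(abs(x - mean) for x in grades) / n    # mean absolute deviation
cv = std_dev / mean                             # coefficient of variation

print(data_range, variance, std_dev, mad, cv)
```

The range is 51, the sample variance is about 215.38, the standard deviation is about 14.68, the MAD is 11.0, and the coefficient of variation is about 0.202. Note that `statistics.pvariance` and `statistics.pstdev` are the population versions, which divide by the number of observations rather than by one less.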
Use of the Standard Deviation
❼ Standard deviation can be used in at least two ways:
1. To compare the degree of variability of two data sets (of variables measured in similar units)
2. To construct an interval that contains a specified proportion of the population data
❼ Also used in hypothesis testing, but will not get to that for some time
❼ To construct an interval containing a given proportion of the population data, can use
Chebychev’s Theorem
Chebychev’s Theorem
Theorem. Chebychev’s Theorem: For any population with mean µ and standard deviation σ , the percent of observations that lie within the interval [µ ± kσ] is
$$\text{at least } 100\left[1 - \frac{1}{k^2}\right]\%,$$
where k > 1 is the number of standard deviations.
In other words, we can construct an interval in which a certain proportion of the population lies within a certain number of standard deviations of the mean
Chebychev’s Theorem: Examples
❼ Example: Choose k = 1.5. Then Chebychev’s Theorem implies that at least
$$100\left[1 - \frac{1}{1.5^2}\right]\% = 55.6\%$$
of the population members lie within 1.5 standard deviations of the mean
❼ Can also construct an estimate of the interval using the sample mean and sample standard
deviation
❼ Example: Exam grades for 10 students
– $\bar{x} = 72.6$ and $s = 14.68$
– Therefore, the approximate boundaries of the interval containing at least 55.6% of the population are $\bar{x} + 1.5s = 94.62$ and $\bar{x} - 1.5s = 50.58$
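The two examples can be reproduced in a few lines. This is a minimal sketch; the function names `chebyshev_lower_bound` and `interval` are chosen here for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the theorem is stated for k > 1")
    return 100 * (1 - 1 / k**2)


def interval(mean, sd, k):
    """Lower and upper boundaries of the interval mean ± k * sd."""
    return (mean - k * sd, mean + k * sd)


bound = chebyshev_lower_bound(1.5)
low, high = interval(72.6, 14.68, 1.5)   # sample mean and standard deviation of the exam grades

print(round(bound, 1), round(low, 2), round(high, 2))
```

Larger values of k give wider intervals and a higher guaranteed percentage: k = 2 guarantees at least 75%, and k = 3 at least 88.9%.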
More precise intervals
❼ Chebychev’s Theorem applies to ALL populations, regardless of what the actual distribution
of the data is
❼ Given information about distribution of observations in the population, could construct more
precise intervals
❼ In real world, many large populations have a symmetric, bell-shaped distribution
❼ For such populations, we can specify more precise intervals based on an “Empirical Rule”
– Will learn later where the empirical rule comes from
Empirical Rule
Rule. For large populations with a symmetric, bell-shaped distribution,
❼ approximately 68% of the population members will lie within the interval µ ± σ;
❼ approximately 95% of the population members will lie within the interval µ ± 2σ;
❼ almost all of the population members will lie within the interval µ ± 3σ.
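The percentages in the rule can be previewed numerically. The sketch below assumes that the "symmetric, bell-shaped distribution" is the normal distribution; the notes have not named it at this point, so treat that as an assumption of this example rather than as part of the rule.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is the standard normal (mean 0, sd 1).
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal population lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(k, round(100 * share, 1))
```

Under this assumption the shares are roughly 68.3%, 95.4%, and 99.7%. For k = 2 and k = 3 these are well above the Chebychev lower bounds of 75% and 88.9%, which is what "more precise" means here.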
5 Numerical Summary of Grouped Data
Grouped Data
❼ Sometimes data for a large sample are made available to researchers in grouped form
– For example: Statistics Canada often reports age and income as categorical variables in Public Use Microdata Files (PUMFs)
❼ In such cases, not possible to apply the formulae that we have seen for the mean, variance,
etc., since we do not have individual data
❼ How, then, do we construct summary statistics?
❼ Solution is to make use of the frequency distribution for the variable
Mean and Variance of Grouped Data: Population
Definition. Suppose that one has data grouped into $K$ classes, with frequencies $f_1, f_2, \dots, f_K$. Let the midpoints of the range of each class be $m_1, m_2, \dots, m_K$. Then for a population of $N$ observations, where $N = \sum_{j=1}^{K} f_j$, the mean is
$$\bar{x} = \frac{\sum_{j=1}^{K} f_j m_j}{N} = \frac{1}{N}\sum_{j=1}^{K} f_j m_j$$
and the variance is
$$\tilde{s}^2 = \frac{1}{N}\sum_{j=1}^{K} f_j (m_j - \bar{x})^2.$$
Note on Notation
❼ Note that $\bar{x}$, not µ, is used for the population mean when the data are grouped
❼ This is because the grouped data formula can only approximate the true population mean
– Only exception is special case where each interval contains only one numerical value (i.e., $x_i$ is the same for everyone in interval)
❼ For similar reasons, use $\tilde{s}^2$ instead of $\sigma^2$ for the population variance for grouped data
Mean and Variance of Grouped Data: Sample
Definition. Suppose that one has data grouped into $K$ classes, with frequencies $f_1, f_2, \dots, f_K$. Let the midpoints of the range of each class be $m_1, m_2, \dots, m_K$. Then for a sample of $n$ observations, where $n = \sum_{j=1}^{K} f_j$, the mean is
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{K} f_j m_j$$
and the variance is
$$s^2 = \frac{1}{n-1}\sum_{j=1}^{K} f_j (m_j - \bar{x})^2.$$
Weighted Mean
❼ Note that mean for grouped data is a weighted mean
❼ Weights are actually relative frequencies: $f_j/n$
❼ Relative frequency weights must sum to 1
Median of Grouped Data
❼ Can easily determine which class contains the median value by examining the frequency
distribution
❼ But: cannot directly observe the median
❼ To estimate the median, one can make use of this formula for estimating the value of a
particular observation:
Rule. Estimating the Value of Observation i in class j: Suppose that class j contains $f_j$ observations, and let L be the lower boundary and U the upper boundary of class j. If these observations were to be arranged in ascending order, the value of the ith observation
is estimated to be
$$L + \left(i - \frac{1}{2}\right)\frac{U - L}{f_j}, \quad i = 1, \dots, f_j.$$
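The rule amounts to spreading the observations evenly across the class and placing each one at the middle of its own slice. A minimal sketch follows; the function name and the class used in the example (boundaries 8 and 12, frequency 8) are hypothetical.

```python
def estimate_observation(i, lower, upper, freq):
    """Estimated value of the i-th smallest observation in a class.

    lower and upper are the class boundaries L and U; freq is the class frequency.
    """
    if not 1 <= i <= freq:
        raise ValueError("i must lie between 1 and the class frequency")
    return lower + (i - 0.5) * (upper - lower) / freq


# Hypothetical class running from 8 to 12 that contains 8 observations.
first = estimate_observation(1, 8, 12, 8)
third = estimate_observation(3, 8, 12, 8)
last = estimate_observation(8, 8, 12, 8)
print(first, third, last)
```

To estimate the median, first use the cumulative frequencies to find which class contains the middle observation and its position i within that class, then apply the function to that class.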
The Modal Class
❼ When the data are grouped it is impossible to determine the mode
❼ Instead, we need to define a new concept: the modal class
Definition. The modal class is the class with the highest frequency.
Example: Question 2.31 of Newbold et al. (2013)
For a random sample of 25 students from a very large university, the accompanying table shows the amount of time (in hours) spent studying for final exams.
Study time | 0 < 4 | 4 < 8 | 8 < 12 | 12 < 16 | 16 < 20
Number of students | 3 | 7 | 8 | 5 | 2
a. Estimate the sample mean study time.
b. Estimate the sample standard deviation.
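The grouped-data formulas for a sample can be applied to this table as follows. This is a minimal sketch using class midpoints; the variable names are chosen here for illustration.

```python
import math

# Class boundaries (hours) and frequencies from the study-time table.
classes = [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]
frequencies = [3, 7, 8, 5, 2]

midpoints = [(low + high) / 2 for low, high in classes]
n = sum(frequencies)

# Grouped-data sample mean: frequency-weighted average of the midpoints.
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped-data sample variance: divide by n - 1.
variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
std_dev = math.sqrt(variance)

print(n, mean, variance, std_dev)
```

With midpoints 2, 6, 10, 14, and 18, the estimated mean is 9.36 hours and the estimated standard deviation is about 4.57 hours.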
6 Relationships Between Variables
Measuring Relationships Between Variables
❼ Earlier, saw that scatter plots can reveal relationships between variables
❼ Sometimes need a measure of the direction and/or strength of such relationships
❼ Two related measures of the linear relationship between two variables can be computed:
1. Covariance
2. Correlation coefficient
Population Covariance
Definition. The population covariance between two variables x and y is
$$\text{cov}(x,y) = \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)$$
where $x_i$ and $y_i$ are the observed values of the variables, $\mu_x$ and $\mu_y$ are the population means, and $N$ is the size of the population.
Sample Covariance
Definition. The sample covariance between two variables x and y is
$$\text{cov}(x,y) = s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where $x_i$ and $y_i$ are the observed values of the variables, $\bar{x}$ and $\bar{y}$ are the sample means, and $n$ is the size of the sample.
Interpretation of covariance
❼ Sign of the covariance tells us about the direction of the relationship between the two vari-
ables
– Positive sign implies an upward-sloping relationship
– Negative sign implies a downward-sloping relationship
❼ But: covariance does not really tell us how strong the relationship is because it is sensitive
to units of measurement
Correlation Coefficient
Definition. The population correlation coefficient between two variables x and y is given by
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.$$
The sample correlation coefficient between two variables x and y is given by
$$r_{xy} = \frac{s_{xy}}{s_x s_y}.$$
Note: This measure is also known as the Pearson correlation coefficient.
Interpretation of correlation coefficient
❼ Correlation coefficient always lies between -1 and 1
❼ If correlation coefficient equals -1, have a perfect negative relationship (a downward-sloping
straight line)
❼ If correlation coefficient equals 1, have a perfect positive relationship (an upward-sloping
straight line)
❼ The closer |r| or |ρ| is to 1, the stronger the linear relationship
❼ Value of 0 implies no linear relationship
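The contrast between covariance and correlation can be illustrated with a small sketch. The study hours and grades below are hypothetical values invented for this example, and the function names are chosen here for illustration.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: sum of cross-products of deviations, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample correlation: covariance divided by the product of the standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical data: hours studied and exam grade (%) for five students.
hours = [2, 4, 6, 8, 10]
grade = [50, 60, 65, 80, 85]

s_xy = sample_covariance(hours, grade)
r_xy = sample_correlation(hours, grade)

# Measuring study time in minutes changes the covariance but not the correlation.
minutes = [60 * h for h in hours]
s_xy_minutes = sample_covariance(minutes, grade)
r_xy_minutes = sample_correlation(minutes, grade)

print(s_xy, r_xy, s_xy_minutes, r_xy_minutes)
```

Here the covariance is 45 when time is measured in hours and 2700 when it is measured in minutes, while the correlation stays at about 0.988 in both cases.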
2022-06-29