
Assignment - A good example

High Dimensional Data Analysis

9 August 2018

Data description

Data are collected on five variables, each representing a different measure of welfare, for the 50 states of the United States of America (USA). These are per capita income (Income), the percentage of the population that is illiterate (Illiteracy), life expectancy (LifeExp), the number of murders per 100,000 people (Murder) and the percentage of the population that are high school graduates (HSGrad). An analysis of these data, and in particular a biplot based on principal components, can give insight into the quality of life in different regions of the USA.

Preliminary Analysis

Before carrying out a Principal Components Analysis it is worth exploring the features of the original variables themselves. Box plots of each variable are provided below.

Figure 1: Box plots of original variables. For a full description of abbreviated variable names see Data Description section.

These box plots reveal a few interesting characteristics. Most prominent is the presence of an outlier in the distribution of Income. On closer investigation, this outlier is the state of Alaska, which has a per capita income of $6315. Also, the distribution of Illiteracy is positively skewed while the distribution of Income is negatively skewed, which suggests that there may be a group of states that are particularly disadvantaged, especially with respect to these two variables.
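The two features read off the box plots, outliers and the direction of skew, can also be checked numerically. The sketch below assumes the box plots follow the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the function names and the toy numbers are illustrative only and are not the state data.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def sample_skewness(values):
    """Moment-based skewness: positive means a longer right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Toy numbers for illustration only, NOT the state data.
toy = [3.1, 3.6, 4.0, 4.4, 4.5, 4.8, 5.0, 5.3, 9.9]
print(tukey_outliers(toy))       # values flagged as outliers
print(sample_skewness(toy) > 0)  # True when the right tail is longer
```

Applied to each of the five variables in turn, a check of this kind would be expected to flag the same points that appear beyond the whiskers in Figure 1.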

Principal Components Analysis and Biplot

Principal components analysis finds a small number of linear combinations of the original variables that explain a large proportion of the overall variation in the data. Since the variables in the dataset under investigation are measured in different units, we standardise the data by dividing each variable by its standard deviation before conducting the analysis. By selecting two principal components we are able to visualise the data using a biplot, which is included below.
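As a rough illustration of the procedure just described, the following sketch standardises each variable and extracts principal components with a singular value decomposition. It is not the code used for this report: the function name `standardised_pca` and the random placeholder matrix are assumptions made for the example, and the variables are centred as well as scaled, as is usual in PCA.

```python
import numpy as np

def standardised_pca(X):
    """PCA of the columns of X after scaling each to unit variance.

    Returns (scores, weights, explained): weights[:, k] holds the weights
    of principal component k and explained[k] its share of total variance.
    """
    X = np.asarray(X, dtype=float)
    # Centre each variable and divide by its sample standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardised data: the rows of Vt are the component weights.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt.T
    scores = Z @ weights
    explained = s**2 / np.sum(s**2)
    return scores, weights, explained

# Placeholder data: 50 rows and 5 columns of random numbers, NOT the state data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5)) * [600.0, 0.6, 1.3, 3.7, 8.0]
scores, weights, explained = standardised_pca(X_demo)
print(explained[:2].sum())  # share of variance captured by PC1 and PC2
```

The first two columns of `scores` give the positions of the observations in a biplot, and the first two columns of `weights` give the directions of the variable arrows, up to the scaling convention of the plotting routine.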

Figure 2: Biplot based on principal components. For a full description of abbreviated variable names see Data Description section.

The biplot can be interpreted as follows. The first principal component is a measure of overall well-being, since it is positively correlated with variables that indicate improved well-being, such as income, high school graduation rates and life expectancy, while it is negatively correlated with variables that signify reduced well-being, namely the rates of both murder and illiteracy. Some states with low values of the first principal component are Louisiana (LA), South Carolina (SC) and Mississippi (MS). These states are also in close geographical proximity to one another in the Southern region of the USA. As such, the data suggest that the Southern region suffers from a lower socio-economic status compared to the rest of the USA.
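The correlations between a principal component and the original variables, which drive the interpretation above, can be computed directly from the weights. For standardised data, the correlation between variable j and component k equals the weight of j on k multiplied by the square root of the variance of component k. A minimal sketch, with a hypothetical function name and random placeholder data rather than the state data:

```python
import numpy as np

def component_correlations(X, n_components=2):
    """Correlation of each standardised variable with each leading component.

    Entry [j, k] is the correlation between variable j and component k.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Variance of each component (eigenvalues of the correlation matrix).
    eigenvalues = s**2 / (len(Z) - 1)
    return Vt[:n_components].T * np.sqrt(eigenvalues[:n_components])

# Placeholder data only, NOT the state data.
rng = np.random.default_rng(2)
X_corr_demo = rng.normal(size=(50, 5))
print(component_correlations(X_corr_demo).round(2))
```

Note that the sign of a principal component is arbitrary, so a component may come out with every correlation reversed; the interpretation as a well-being measure then applies to its negative.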

The biplot also highlights that the states of Alaska (AK) and, to a lesser extent, Nevada (NV) are outliers, particularly on the second principal component. The second principal component has a high weight of 0.732 on Income and, as mentioned earlier, Alaska has an unusually high income. Alaska and Nevada are unusual states in that both are sparsely populated due to extreme weather conditions, and the economy of each state is dominated by a successful industry (oil in the case of Alaska and gaming in the case of Nevada). This may explain their positions as outliers in the biplot.

Limitations of the Analysis

Any dimension reduction technique, such as principal components analysis, involves a loss of information. In this example 82.789% of the overall variation is explained by the first two principal components and is therefore depicted in the biplot; the remaining 17.211% is not shown. Finally, there is some concern that the outliers of Alaska and Nevada may lead to a misleading analysis. However, repeating the analysis with these two states excluded did not change the position of the remaining states, or of the variables, on the biplot. This biplot is included in the Appendix.
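A robustness check of this kind can be sketched as follows: fit the components on the full data and again with selected rows removed, then compare the weights. The function names, the placeholder data and the row indices below are hypothetical; this is one possible way to quantify the comparison, not the procedure used to produce the Appendix biplot.

```python
import numpy as np

def first_two_weights(X):
    """Weights of PC1 and PC2 for the standardised columns of X."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:2].T  # shape (n_variables, 2)

def weight_change_without(X, drop_rows):
    """Largest absolute change in the PC1/PC2 weights after dropping rows."""
    X = np.asarray(X, dtype=float)
    full = first_two_weights(X)
    reduced = first_two_weights(np.delete(X, drop_rows, axis=0))
    # The sign of each component is arbitrary, so align signs before comparing.
    signs = np.sign(np.sum(full * reduced, axis=0))
    signs[signs == 0] = 1.0
    return np.max(np.abs(full - reduced * signs))

# Placeholder data only, NOT the state data; rows 0 and 1 stand in for
# the two excluded states.
rng = np.random.default_rng(1)
X_robust_demo = rng.normal(size=(50, 5))
print(weight_change_without(X_robust_demo, [0, 1]))
```

A value close to zero indicates that the excluded observations have little influence on the first two components.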

Appendix

The weights used to form the first two principal components are provided below.

Table 1: Summary of weights on first two principal components