
Econ 390

Using STATA Part 2

Data Manipulation

New STATA Commands:

1) list

2) desc

3) table

4) gen

5) drop

6) if

Note: STATA commands are case sensitive and must be typed in lowercase. I capitalized commands and variables (e.g. CLEAR, DVDES25...) in the notes only because I want to highlight them. Variable names should be entered exactly as they appear in the original data (they could be either uppercase or lowercase; check the codebook).

1. Download the data econ390.dta from the course website.

The data is in the “lab” folder on Canvas. Save it to the desktop of the lab computer. The directory location is c:\Users\Arts User\Desktop

2. Open Stata.

Click on the Stata icon, or find STATA in the program menu.

3. Load the dataset into STATA.

This requires the USE command.

use "c:\Users\Arts User\Desktop\econ390.dta"

Error message? Don’t forget to CLEAR the previous dataset from the memory.

4. Saving datasets

To save a dataset, use the SAVE command.

save "c:\Users\Arts User\Desktop\econ390test.dta"

Now try saving it again.

save "c:\Users\Arts User\Desktop\econ390test.dta"

What happened? The problem is that a file with that name already exists. For this reason, you must use the ‘replace’ option, which tells STATA to replace the existing version.

save "c:\Users\Arts User\Desktop\econ390test.dta", replace

Now try the USE command to get back to the original dataset.

use "c:\Users\Arts User\Desktop\econ390.dta"

What happened? If there is already an active dataset with unsaved changes, STATA refuses to load a new one; you must first clear the memory using a CLEAR statement.
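The save and load behaviour in steps 3 and 4 can be sketched on a small scale as follows. This is only an illustration: the two-observation dataset is made up, demo_lab.dta is a hypothetical file name written to the current directory, and the error codes in the comments are the ones I would expect STATA to report.

```stata
* Toy data with made-up values, standing in for econ390.dta
clear
input age rrsp
30 0
70 500
end

* demo_lab.dta is a hypothetical file name in the current directory
save demo_lab.dta, replace

* A second plain save is refused because the file already exists (r(602))
capture noisily save demo_lab.dta

* Change the data, so that memory now holds unsaved changes
replace rrsp = 100 in 1

* A plain use is now refused (r(4)) ...
capture noisily use demo_lab.dta

* ... but the clear option discards the data in memory and loads the file
use demo_lab.dta, clear
list
```

The ‘clear’ option on USE does the same job as typing CLEAR on its own line first. The change made by REPLACE is lost, because it was never saved.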

5. Examine the data directly.

Use the LIST command. Try listing just some of the variables rather than all of them.
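A short sketch of LIST with a variable list, a range of observations, and a condition. The five observations are made-up values, not the real econ390.dta; only the variable names follow the codebook.

```stata
* Toy data with made-up values; variable names follow the codebook
clear
input faminc mard age rrsp region
20000 1 30    0 1
35000 0 70  500 2
50000 1 30 1500 3
28000 0 45    0 3
61000 1 66 2000 5
end

* List only two variables instead of all of them
list age rrsp

* List the same variables for the first three observations only
list age rrsp in 1/3

* List only the observations that satisfy a condition
list age rrsp if rrsp > 0
```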

6. Find the summary statistics of the dataset.

a) Use the SUM command.

b) Now try the DESC command. What’s the difference?

The DESC command in STATA will tell you whether a variable is stored as str (string = categorical) or as a number.

Regression and many other things in STATA can only be done with numerical variables. If the variable is categorical, you will get an error message from STATA indicating that your command cannot be completed.

Suppose a variable that you want to use in your regression is originally categorical in the data. To convert a categorical variable to a numerical variable, you need to generate a new variable from the categorical variable. The command to generate a new variable is GEN; it is discussed in step 8.

The new variable will be numerical, but the old one will still be categorical. You will use your newly created numerical variable when you run your commands.
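The conversion described above might look like this. The variables agetext and regname are hypothetical (econ390.dta may not contain any string variables), and ENCODE is a standard STATA command that is not covered in these notes.

```stata
* Toy data with made-up values: both variables are stored as text
clear
input str3 agetext str8 regname
"30" "Ontario"
"70" "Quebec"
"45" "Ontario"
end

* desc shows the storage type (str#) of each variable
desc

* summ reports 0 observations for string variables
summ

* Text that contains numbers: convert with real()
gen agenum = real(agetext)

* Text that contains category names: encode gives each name a number
encode regname, gen(regcode)

* agenum and regcode are numeric; agetext and regname are unchanged
desc
summ agenum regcode
```

ENCODE numbers the categories in alphabetical order and keeps the original text as value labels, so tables made with the new variable still show the names.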

7. Making tables.

a) Try making a table for the variable REGION.

table region

b) Now try finding the mean RRSP contribution for each region. This requires specifying a statistic for the table. Try the command:

table region, c(mean rrsp)

The ‘c’ tells STATA the ‘contents’ of the table.

c) You can choose statistics other than the mean. Use the ‘help’ feature of STATA to find out how to make a table with the median of rrsp by region.

d) You can also make two-way tables: try the following command to make a table that has the means by region and by marital status:

table region mard, c(mean rrsp)
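Newer STATA releases (17 and later, as far as I am aware) rewrote the TABLE command so that statistics are requested with statistic() rather than c(), so the commands in this step may return a syntax error there. If that happens, one alternative that should work in both old and new versions is TABSTAT or TABULATE with the summarize() option. The sketch below uses made-up data, and neither command is part of these notes.

```stata
* Toy data with made-up values; variable names follow the codebook
clear
input faminc mard age rrsp region
20000 1 30    0 1
35000 0 70  500 2
50000 1 30 1500 3
28000 0 45    0 3
61000 1 66 2000 5
end

* Mean and median of rrsp for each region
tabstat rrsp, by(region) statistics(mean median)

* Mean, standard deviation and frequency of rrsp for each region
tabulate region, summarize(rrsp)

* Two-way version: rrsp summarized by region and marital status
tabulate region mard, summarize(rrsp)
```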

8. Creating new variables.

a) The command to generate a new variable is GEN. Now let’s try, for example, to create a binary variable for those who are 65 or older. Note the double ‘=’ in the ‘if’ statement of the REPLACE line. STATA requires a double == whenever an ‘if’ statement tests for equality.

gen over65 = 1 if age>=65

replace over65 = 0 if over65==.

The condition ‘if over65==.’ in the above line means ‘if the value in over65 is missing’. STATA shows a missing numeric value as a period (.).

summ over65

label variable over65 "person is age 65 or more"
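One thing to watch for with this approach: STATA treats a missing value as larger than any number, so an observation with a missing age satisfies age>=65 and ends up with over65 equal to 1. I do not know whether econ390.dta has any missing ages; the sketch below uses made-up data with one missing age to show the problem and a one-line alternative.

```stata
* Toy data with made-up values; the third age is missing
clear
input age
30
70
.
end

* Two-step approach from the notes: the missing age gets over65 = 1
gen over65 = 1 if age>=65
replace over65 = 0 if over65==.

* One-line alternative: (age>=65) evaluates to 1 or 0, and the
* if qualifier leaves over65b missing whenever age is missing
gen over65b = (age>=65) if !missing(age)

list
```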

b) Now create a binary variable for those who are exactly 30 years old. Notice the ‘~=’ operator. This means ‘not equal to’ (‘!=’ means the same thing).

gen age30 = 1 if age==30

replace age30 = 0 if age~=30

summ

c) Now create a binary variable for those who are exactly 30 years old AND married.

gen age30mar = 1 if age==30 & mard==1

replace age30mar = 0 if age30mar==.

summ

The ‘&’ sign means ‘and’. Both of the conditions have to be true for the if statement to be satisfied. If you need an ‘or’ statement, you can use ‘|’.

d) Try finding the summary statistics of RRSP only for married families by using the IF command.

summ rrsp if mard==1
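A few more ‘if’ conditions, including an ‘or’, sketched on made-up data. When ‘&’ and ‘|’ appear in the same condition, STATA evaluates ‘&’ first, so parentheses are the safest way to say what you mean.

```stata
* Toy data with made-up values; variable names follow the codebook
clear
input faminc mard age rrsp region
20000 1 30    0 1
35000 0 70  500 2
50000 1 30 1500 3
28000 0 45    0 3
61000 1 66 2000 5
end

* | means 'or': people who are exactly 30 or at least 65
count if age==30 | age>=65

* Parentheses make the grouping explicit when & and | are combined
count if (age==30 | age>=65) & mard==1

* The same kind of condition works with other commands
summ rrsp if mard==0
list age rrsp if region==3 & rrsp>0
```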

9. Dropping observations.

You can drop observations that you don’t want. Imagine you are only interested in those who are under age 65. We have a variable called over65, so let’s use that.

drop if over65==1

count

STATA drops all observations that have over65=1, since we don’t want them.
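Dropped observations are gone from memory until you reload the file. If you only want to work with a subset temporarily, PRESERVE and RESTORE (standard STATA commands that are not covered in these notes) can bring the full data back. A sketch with made-up data:

```stata
* Toy data with made-up values; variable names follow the codebook
clear
input faminc mard age rrsp region
20000 1 30    0 1
35000 0 70  500 2
50000 1 30 1500 3
28000 0 45    0 3
61000 1 66 2000 5
end
gen over65 = (age>=65)
count

* Take a snapshot, drop the over-65 observations, then undo the drop
preserve
drop if over65==1
count
restore
count

* keep if is the mirror image of drop if
keep if over65==0
count
```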

10. Keeping/Dropping variables.

Keeping and dropping individual variables is different from dropping observations. Try dropping the age30 variable created in step 8.

drop age30

Try the summ command to see what effect this had on your data set.
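DROP and KEEP with variable names can be compared side by side in a small sketch (made-up data again):

```stata
* Toy data with made-up values; variable names follow the codebook
clear
input faminc mard age rrsp region
20000 1 30    0 1
35000 0 70  500 2
50000 1 30 1500 3
28000 0 45    0 3
61000 1 66 2000 5
end
gen age30 = (age==30)

* drop removes the listed variable; all observations remain
drop age30
desc

* keep removes every variable that is NOT listed
keep age rrsp
desc
count
```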

Codebook for econ390.dta:

The variables are derived from the 1984 version of the Family Expenditure Survey conducted by Statistics Canada.

| Variable | Description |
| --- | --- |
| FAMINC | Family income in 1984 dollars. |
| MARD | Marital status. Takes the value 1 if married or in common-law relationship; 0 otherwise. |
| AGE | The age of the head of the household. |
| RRSP | The dollar value of Registered Retirement Savings Plan contributions in 1984, in 1984 dollars. |
| MTR | Marginal tax rate. The combined (federal/provincial) rate of income tax payable on the last dollar of the head of household’s income. |
| REGION | Region of residence. Takes the value 1 for Atlantic provinces; 2 for Quebec; 3 for Ontario; 4 for Prairie provinces; 5 for BC. |

Some frequently used STATA commands:

| Command | Use |
| --- | --- |
| cd c:\ | Changes directory to c:\ |
| clear | Clears the memory so that a new dataset can be loaded |
| dir | Displays contents of current directory |
| count | Provides a count of the number of observations |
| desc | Shows variables currently in memory with description |
| summ | Shows means and standard deviations for all variables; add ', detail' to get percentiles |
| reg y x | Runs OLS using y as dependent variable and x as independent variable |
| gen | Creates new variable |
| replace | Replaces old value with new value |
| drop x | Drops variable x |
| keep x | Drops all variables except for x |
| drop if x==0 | Drops all observations with x=0 |
| keep if x==0 | Keeps only those observations with x=0 |
| dprobit | Runs a probit regression, reports marginal effects on the probability |
| use data.dta | Brings dataset data.dta into memory |
| save data.dta | Saves dataset data.dta to disk (add ', replace' to overwrite an existing data.dta) |
| table x | Shows a table with the frequency distribution for variable x |
| table x, c(mean y) | Shows the mean of y for each value of x |
| compress | Compresses the dataset to take up the minimum amount of memory possible |
| insheet x y using data.dat, t | Brings variables x and y from ascii file data.dat (tab delimited) into memory |
| log using econ390.log, t | Creates a log file called econ390.log, text format |
| log close | Closes any open log file |
| capture log close | Closes any open log file, but doesn’t crash if there is no open log file |
| real(x) | Converts variable x from text to numeric form, e.g. gen numbvar = real(textvar) |
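Several of these commands are typically used together at the top and bottom of a do-file. The following is a minimal sketch of that pattern, not a required template; the data are made up and the log is written to the current directory.

```stata
* Close any log left open by an earlier run; capture keeps
* this line from stopping the do-file when no log is open
capture log close

* Start a text log; replace overwrites an earlier econ390.log
log using econ390.log, text replace

* Toy data with made-up values, standing in for econ390.dta
clear
input age rrsp
30 0
70 500
45 1500
end

* Everything from here on is recorded in the log
count
summ rrsp, detail
compress

log close
```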