Econ 390 Using STATA Part 2 Data Manipulation
New STATA Commands:
1) list
2) desc
3) table
4) gen
5) drop
6) if
Note: STATA commands and variable names are case sensitive. I capitalize commands and variables (e.g. CLEAR, DVDES25...) in these notes only to highlight them. Type commands in lowercase, and type variable names exactly as they appear in the original data (they may be uppercase or lowercase; check the codebook).
1. Download the data econ390.dta from the course website.
The data is in the “lab” folder on Canvas. Save it to the desktop of the lab computer. The directory location is c:\Users\Arts User\Desktop
2. Open Stata.
Click on the Stata icon, or find STATA in the program menu.
3. Load the dataset into STATA.
This requires the USE command.
use "c:\Users\Arts User\Desktop\econ390.dta"
Error message? Don’t forget to CLEAR the previous dataset from the memory.
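If you would rather not type the full path each time, one option is to change the working directory first. A minimal sketch, assuming the file was saved to the desktop location given in step 1:

```stata
* Move to the folder that holds the data (path from step 1)
cd "c:\Users\Arts User\Desktop"
* Confirm that econ390.dta is there
dir
* Load it; the clear option discards whatever dataset is already in memory
use econ390.dta, clear
```

After CD, file names typed without a path refer to this folder.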
4. Saving datasets
To save a dataset, use the SAVE command.
save "c:\Users\Arts User\Desktop\econ390test.dta"
Now try saving it again.
save "c:\Users\Arts User\Desktop\econ390test.dta"
What happened? The problem is that a file with that name already exists. For this reason, you must use the ‘replace’ option, which tells STATA to overwrite the existing version.
save "c:\Users\Arts User\Desktop\econ390test.dta", replace
Now try the USE command to get back to the original dataset.
use "c:\Users\Arts User\Desktop\econ390.dta"
What happened? If STATA refuses to load the file, it is because there is already an active dataset in memory. You must first clear the memory using the CLEAR command, or add the ‘clear’ option to USE.
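Either of the following should get you back to the original dataset. This is a sketch of two equivalent ways to clear the memory; the second combines both steps in one line.

```stata
* Option 1: clear the memory, then load
clear
use "c:\Users\Arts User\Desktop\econ390.dta"

* Option 2: let USE clear the memory itself
use "c:\Users\Arts User\Desktop\econ390.dta", clear
```

Either way, any unsaved changes to the dataset in memory are lost, so save first if you need them.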
5. Examine the data directly.
Use the LIST command. Try listing just some of the variables rather than all of them.
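For example, the LIST command accepts a variable list, an ‘in’ range and an ‘if’ condition. The lines below are a sketch using variable names from the codebook, typed in lowercase on the assumption that this is how they are stored in econ390.dta:

```stata
* First 10 observations, three variables only
list age faminc rrsp in 1/10
* Only married households living in region 5 (BC)
list age rrsp if region==5 & mard==1
```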
6. Find the summary statistics of the dataset.
a) Use the SUM command.
b) Now try the DESC command. What’s the difference?
The DESC command tells you the storage type of each variable, including whether it is str (string, i.e. text, which is how categorical variables are often stored) or numeric.
Regression and many other procedures in STATA can only be done with numerical variables. If the variable is stored as a string, you will get an error message from STATA indicating that your command cannot be completed.
Suppose a variable that you want to use in your regression is stored as a string in the data. To convert it to a numerical variable, you need to generate a new variable from the string variable. The command to generate a new variable is GEN; it is discussed in step 8.
The new variable will be numerical, but the old one will still be a string. Use your newly created numerical variable when you run your commands.
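As an illustration of the conversion, suppose the data contained a string variable. The names provname, provcode, textvar and numbvar below are hypothetical (econ390.dta may not contain any string variables), so treat this as a sketch rather than commands to run on the lab data:

```stata
* Check the storage type: str# means string
desc provname
* Text categories (e.g. "Ontario"): encode assigns a numeric code to each category
encode provname, gen(provcode)
* Numbers stored as text (e.g. "1250"): real() converts them to numeric values
gen numbvar = real(textvar)
```

Both commands leave the original string variable in place and add a new numeric one.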
7. Making tables.
a) Try making a table for the variable REGION.
table region
b) Now try finding the mean RRSP contribution for each region. This requires specifying a statistic for the table. Try the command:
table region, c(mean rrsp)
The ‘c’ is short for ‘contents’: it tells STATA which statistic to put in the table.
c) You can choose statistics other than the mean. Use the ‘help’ feature of STATA to find out how to make a table with the median of rrsp by region.
d) You can also make two-way tables: try the following command to make a table that has the means by region and by marital status:
table region mard, c(mean rrsp)
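The c() option above is the syntax of older STATA releases. In Stata 17 and later the TABLE command was redesigned and uses a statistic() option instead, so if the commands above return an error, a sketch of the equivalents is:

```stata
* Stata 17 or later: statistic() takes the place of c()
table region, statistic(mean rrsp)
table region mard, statistic(mean rrsp)
* Or run the original syntax under version control
version 16: table region mard, c(mean rrsp)
* TABSTAT gives means by group in old and new releases alike
tabstat rrsp, by(region) statistics(mean)
```

Which form you need depends on the version installed on the lab computers.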
8. Creating new variables.
a) The command to generate a new variable is GEN. Now let’s try, for example, to create a binary variable for those who are 65 or older. Note the double ‘=’ in the ‘if’ condition of the REPLACE line. STATA requires a double == whenever an ‘if’ condition tests for equality; a single = is used only for assignment.
gen over65 = 1 if age>=65
replace over65 = 0 if over65==.
The condition ‘if over65==.’ in the line above means ‘if the value of over65 is missing’; STATA shows a missing numeric value as a period (.).
summ over65
label variable over65 "person is age 65 or more"
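One thing to keep in mind: STATA treats a missing value as larger than any number, so ‘age>=65’ is also true when age is missing. If age had missing values, the commands above would code those people as 1. A sketch of a one-line alternative that leaves them missing (over65b is a hypothetical name chosen so that it does not clash with over65):

```stata
* (age>=65) evaluates to 1 when true and 0 when false;
* the if condition leaves over65b missing where age is missing
gen over65b = (age>=65) if !missing(age)
* Number of observations where the two versions disagree
count if over65 != over65b
```

If COUNT reports 0, age has no missing values and the two approaches give the same variable.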
b) Now create a binary variable for those who are exactly 30 years old. Notice the ‘~=’ operator, which means ‘not equal to’ (‘!=’ works the same way).
gen age30 = 1 if age==30
replace age30 = 0 if age~=30
summ
c) Now create a binary variable for those who are exactly 30 years old AND married.
gen age30mar = 1 if age==30 & mard==1
replace age30mar = 0 if age30mar==.
summ
The ‘&’ sign means ‘and’. Both conditions have to be true for the ‘if’ condition to be satisfied. If you need ‘or’, use ‘|’.
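To illustrate the ‘|’ operator, the sketch below flags people who are exactly 30 or exactly 40. The variable name age30or40 is made up for this example:

```stata
* 1 if either condition is true, 0 otherwise
gen age30or40 = (age==30 | age==40)
summ age30or40
* & and | can be combined; use parentheses to make the grouping explicit
count if (age==30 | age==40) & mard==1
```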
d) Try finding the summary statistics of RRSP only for married families by using the IF qualifier.
summ rrsp if mard==1
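The same idea works for any group. As a sketch, the lines below compare married and unmarried families, first with two IF conditions and then with the BYSORT prefix, which repeats a command for each value of a variable:

```stata
summ rrsp if mard==1
summ rrsp if mard==0
* Same two summaries in one command
bysort mard: summ rrsp
```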
9. Dropping observations.
You can drop observations that you don’t want. Imagine you are only interested in those who are under age 65. We have a variable called over65, so let’s use that.
drop if over65==1
count
STATA drops all observations that have over65=1, since we don’t want them.
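Dropping observations cannot be undone except by reloading the data. If you only want to experiment, one option is to wrap the command in PRESERVE and RESTORE, as in this sketch:

```stata
* Take a temporary snapshot of the data in memory
preserve
drop if over65==1
count
* Bring back the snapshot, including the dropped observations
restore
count
```

Writing ‘keep if over65==0’ in place of the DROP line would have the same effect here, since over65 only takes the values 0 and 1.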
10. Keeping/Dropping variables.
Keeping and dropping individual variables is different from dropping observations. Try dropping the age30 variable created in step 8.
drop age30
Try the summ command to see what effect this had on your data set.
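KEEP is the mirror image of DROP for variables: you list the variables you want and everything else is removed. A sketch, using PRESERVE and RESTORE (which take and bring back a temporary copy of the data) so that the full dataset comes back afterwards:

```stata
preserve
* Keep three variables and drop all the others
keep region age rrsp
desc
restore
```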
Codebook for econ390.dta:
The variables are derived from the 1984 version of the Family Expenditure Survey conducted by Statistics Canada.
Variable | Description
FAMINC | Family income in 1984 dollars.
MARD | Marital status. Takes the value 1 if married or in common-law relationship; 0 otherwise.
AGE | The age of the head of the household.
RRSP | The dollar value of Registered Retirement Savings Plan contributions in 1984, in 1984 dollars.
MTR | Marginal tax rate. The combined (federal/provincial) rate of income tax payable on the last dollar of the head of household’s income.
REGION | Region of residence. Takes the value 1 for Atlantic provinces; 2 for Quebec; 3 for Ontario; 4 for Prairie provinces; 5 for BC.
Some frequently used STATA commands:
Command | Use
cd c:\ | Changes directory to c:\
clear | Clears the memory so that a new dataset can be loaded.
dir | Displays contents of current directory
count | Provides a count of the number of observations
desc | Shows variables currently in memory with description
summ | Shows means and standard deviations for all variables (add ', detail' to get percentiles)
reg y x | Runs OLS using y as dependent variable and x as independent variable
gen | Creates new variable
replace | Replaces old value with new value
drop x | Drops variable x
keep x | Drops all variables except for x
drop if x==0 | Drops all observations with x=0
keep if x==0 | Keeps only those observations with x=0
dprobit | Runs a probit regression, reports marginal probabilities
use data.dta | Brings dataset data.dta into memory
save data.dta | Saves dataset data.dta to disk (add ', replace' to overwrite existing data.dta)
table x | Shows a table with the frequency distribution for variable x
table x, c(mean y) | Shows the mean of y for each value of x
compress | Compresses the dataset to take up the minimum amount of memory possible
insheet x y using data.dat, t | Brings variables x and y from ascii file data.dat (tab delimited) into memory
log using econ390.log, t | Creates a log file called econ390.log, text format
log close | Closes any open log file
capture log close | Closes any open log file, but doesn’t crash if there is no open log file
real(x) | Converts variable x from text to numeric form, e.g. gen numbvar = real(textvar)
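Several of these commands are typically combined in a do-file. The following is a sketch of one possible layout, assuming the data is on the desktop as in step 1; the regression is only an illustration and not part of this lab:

```stata
* Close any log left open by an earlier run
capture log close
cd "c:\Users\Arts User\Desktop"
* Start a text log, overwriting the previous one
log using econ390.log, text replace
use econ390.dta, clear
desc
summ rrsp, detail
* OLS of RRSP contributions on family income and age
reg rrsp faminc age
log close
```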
2022-10-26