Smith College Department of Mathematics and Statistics

Statistical Data Analysis Tools

Using Stata at Smith

Contents

Availability
Starting
Reading data
Basic statistics
Data recoding
Graphics
Regression
Summary of commands
Useful links

Introduction

This tutorial introduces Stata, a general-purpose statistical software package widely used in business and academia. We demonstrate the basic functionality of the program using a dataset of U.S. crime rates. The analysis includes the calculation of summary statistics, data recoding, graphical displays, and the estimation of multiple regression models.

Availability

Stata can be installed on all Smith-owned Windows, Mac, and Scinix machines, and is installed on all classroom computers. Stata can be purchased for personal use by students (contact Nicholas Horton for more details).

Starting Stata

Stata can be opened from Windows by clicking on Start > All Programs > Science Applications > Stata SE. On a Mac, Stata can be opened by navigating to Applications > Stata > Stata SE. In Linux, Stata can be opened by typing stata or xstata at the shell prompt.

In all cases, you should see a welcome message.

The Stata (GUI) window has four sections: "Review" keeps track of past commands; "Variables" shows all the available variables; "Stata Result" displays the output; and "Command" allows the user to enter commands.

Reading data

Data files in native Stata (.dta) format can be opened with File > Open. Files in comma-separated values (.csv) format can be read with the insheet command, and Stat/Transfer can be used to convert other formats into .dta or .csv format.
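
As a rough sketch of the CSV route, the commands below read a comma-separated file with insheet. The file name crime.csv is a hypothetical local copy, not a file supplied with this tutorial, and the sketch assumes its first row holds the variable names. In Stata 13 and later, import delimited serves the same purpose.

```stata
* Hypothetical file name: assumes crime.csv sits in the current working directory
* and that its first row contains the variable names.
* The clear option discards any dataset already in memory.
insheet using "crime.csv", comma names clear

* Stata 13 and later: the equivalent call with import delimited
* import delimited using "crime.csv", varnames(1) clear
```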

Stata can also load files over the network. To access the dataset used in the examples below remotely, enter


use "http://www.math.smith.edu/tutorial/crime.dta", clear

Be careful: the clear option discards whatever dataset is currently in memory, including any unsaved changes.

If you've downloaded this file to your local machine in .dta format, you can load it with File > Open and navigate to the file location. Once the dataset is loaded, its variables automatically appear in the "Variables" window.

After the data have loaded successfully, we can view a brief description of the dataset with the command describe.
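
A minimal sketch of a first look at the data, assuming the remote file above is reachable. The list and codebook commands are not used elsewhere in this tutorial; they are shown only as common companions to describe.

```stata
* Load the tutorial dataset (this replaces any data in memory)
use "http://www.math.smith.edu/tutorial/crime.dta", clear

* Variable names, storage types, and labels
describe

* The first five observations
list in 1/5

* A more detailed report on a single variable
codebook Murder
```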

Basic Statistics

The command summarize will display descriptive statistics.

[Screenshot: output of summarize]

To calculate summary statistics of a specific variable, we use the command summarize Murder. The command stem Murder generates a stem-and-leaf plot.

[Screenshot: output of summarize Murder]

[Screenshot: output of stem Murder]
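
When more than the default statistics are needed, the detail option and the stored results of summarize are often useful. The following is a small sketch using the Murder variable from the tutorial dataset; the display text is an illustrative choice.

```stata
* Detailed summary: adds percentiles, variance, skewness, and kurtosis
summarize Murder, detail

* summarize stores its results in r(); with the detail option,
* r(p50) holds the median
display "Median murder rate: " r(p50)
display "Mean murder rate:   " r(mean)
```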

The commands tabulate and table are useful for displaying frequency and contingency tables.

Data Recoding

First, we create a new variable, U_level, that records the level of UrbanPop: 0 if UrbanPop is below 66 and 1 otherwise. The commands are:

gen U_level = 0 if UrbanPop < 66.00
replace U_level = 1 if UrbanPop >= 66.00

The new variable U_level now indicates the level of urban population. To examine it, we can use tab U_level or table U_level.

[Screenshot: output of tab U_level]

[Screenshot: output of tab Murder U_level]

Both commands are also available from the "Statistics" menu in Stata: Statistics > Summaries, tables, and tests.
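
The two-step recode above can also be written in one step. The sketch below is an alternative, not the tutorial's own method: the names U_level2 and ulevel_lbl and the label text are illustrative assumptions. It guards against missing values, which Stata treats as larger than any number.

```stata
* One-step recode: the logical expression evaluates to 1 when true, 0 when false.
* The if qualifier leaves U_level2 missing wherever UrbanPop is missing;
* without it, a missing UrbanPop would satisfy UrbanPop >= 66.
gen byte U_level2 = (UrbanPop >= 66) if !missing(UrbanPop)

* Attach value labels so tables show text instead of 0 and 1
label define ulevel_lbl 0 "Below 66" 1 "66 or above"
label values U_level2 ulevel_lbl

* Frequency table of the labelled variable
tab U_level2
```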

Graphics

Stata offers a range of statistical graphs, which can be found under the "Graphics" menu. A common choice is the boxplot, which can also be generated with the command graph box variable1 variable2. Stata then displays the labelled boxplot in the graphics window. For example, to compare the murder and rape rates visually, the command graph box Murder Rape creates the plot.

A two-way scatter plot with a fitted line and a smoothed (lowess) curve can be produced from the "Graphics" menu or with the command:

twoway (scatter Murder UrbanPop)(lfit Murder UrbanPop)(lowess Murder UrbanPop), title(Scatter Plot of Murder Rate on UrbanPop)
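
Graphs can also be split by group and saved to disk. The sketch below assumes U_level was created as in the Data Recoding section. The title text and the file name murder_by_ulevel.png are hypothetical, and the export formats available can depend on the platform.

```stata
* Boxplots of Murder, one box per level of U_level
graph box Murder, over(U_level) title("Murder rate by urban population level")

* Save the graph currently displayed; replace overwrites an existing file
graph export "murder_by_ulevel.png", replace
```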

Linear regression

To fit a multiple linear regression model with Murder as the outcome and UrbanPop, Assault and Rape as predictors, we use the following command:


reg Murder UrbanPop Assault Rape

Stata then displays a table of results, including the coefficient estimates, standard errors, t statistics, and confidence intervals.

To check the residuals, we generate two new variables: yhat (the predicted values of Murder) and resid (the residuals). The commands are:

predict yhat
predict resid, resid

We can then examine the validity of the model with a histogram of the residuals or a residual-versus-fitted plot:

hist resid
rvfplot
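
A few further checks are sometimes run after a regression. The sketch below goes beyond the tutorial's own steps and is only a suggestion: it assumes resid was created with predict as above, and it uses Stata's standard postestimation commands for regress.

```stata
* Refit the model so the postestimation commands refer to it
regress Murder UrbanPop Assault Rape

* Normal quantile plot of the residuals (assumes resid already exists)
qnorm resid

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Variance inflation factors, a check for multicollinearity among the predictors
estat vif
```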

If you'd like to explore further, take a look at Stata's online help, which covers both basic and advanced functions.

Summary of commands

use "http://www.math.smith.edu/tutorial/crime.dta", clear
describe
summarize
summarize Murder
stem Murder
gen U_level = 0 if UrbanPop < 66
replace U_level = 1 if UrbanPop >= 66
table U_level
tab Murder U_level
graph box Murder Rape
twoway (scatter Murder UrbanPop, sort)(lfit Murder UrbanPop)(lowess Murder UrbanPop), title(Scatter Plot of Murder Rate on UrbanPop)
reg Murder Assault UrbanPop Rape
predict yhat
predict resid, resid
hist resid
rvfplot

Useful links

Introduction to Stata
Online Help from Stata Website
Resources to help you learn and use Stata
Stata Tutorial

Created by Zehui Chen and Nicholas Horton, July 27, 2009.
Updated by Sarah Anoke, July 24, 2011.